Abstract
A first-pass full-text search over PDF documents is implemented with the Lucene API. To locate search keywords more precisely, a new secondary-index algorithm is designed and implemented; the secondary index records each keyword's page number, coordinates, and surrounding context. Using this index, a search result can be resolved to a specific page of the PDF document and the keyword's exact position marked on that page, so that secondary retrieval over PDF documents achieves an effect similar to book search in Google Book. System tests indicate good retrieval performance, with high recall and precision, sufficient to meet users' need for fast retrieval. The system has been in use for two years as the full-text retrieval platform for Xi'an's digital local gazetteers and has achieved good results in application.
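The abstract does not give the algorithm itself, so the following is only a minimal sketch of what a secondary index carrying page number, coordinates, and context could look like. The input format (per-page lists of `(word, x, y)` tuples, as a PDF text extractor might produce), the `Hit` record, and the functions `build_secondary_index` and `locate` are all hypothetical names introduced for illustration; they are not taken from the paper or from the Lucene API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    page: int      # 1-based page number (assumed convention)
    x: float       # assumed horizontal position of the word on the page
    y: float       # assumed vertical position of the word on the page
    context: str   # a few neighbouring words around the keyword


def build_secondary_index(pages, window=2):
    """Build keyword -> list of Hit from per-page word positions.

    pages: list of pages; each page is a list of (word, x, y) tuples.
    window: number of neighbouring words kept on each side as context.
    """
    index = defaultdict(list)
    for page_no, words in enumerate(pages, start=1):
        tokens = [w for w, _, _ in words]
        for i, (word, x, y) in enumerate(words):
            lo = max(0, i - window)
            context = " ".join(tokens[lo:i + window + 1])
            index[word.lower()].append(Hit(page_no, x, y, context))
    return dict(index)


def locate(index, keyword):
    """Return every recorded position of keyword (case-insensitive)."""
    return index.get(keyword.lower(), [])


# Hypothetical sample data: two pages with made-up coordinates.
pages = [
    [("Xi'an", 72.0, 700.0), ("local", 110.0, 700.0), ("gazetteer", 150.0, 700.0)],
    [("The", 72.0, 700.0), ("gazetteer", 100.0, 700.0), ("index", 160.0, 700.0)],
]
index = build_secondary_index(pages)
hits = locate(index, "gazetteer")
```

In such a design, the first-pass search would select the matching document, and a lookup like `locate` would then supply the page to open and the coordinates at which to highlight the keyword.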
Source
《计算机技术与发展》 (Computer Technology and Development)
2011, Issue 10, pp. 121-124 (4 pages)
Funding
Ministry of Education Characteristic Specialty Construction Site (TS11772)
Keywords
full-text retrieval
secondary index
secondary retrieval
recall
precision