Abstract
This paper analyzes the functions and overall architecture that an enterprise search engine should provide, examines the system structure and retrieval principles of Lucene, and proposes an approach for handling common document formats such as HTML, PDF, and Word in a uniform way. Techniques for building a search engine are designed around the characteristics of Chinese text, covering the full process of source data collection, document parsing and word segmentation, indexing, information retrieval, and result ranking. A prototype system was implemented on top of the Lucene package and achieved good search results.
Author
李海丰
LI Hai-feng (College of Computer Science, Central South University of Forestry and Technology, Changsha 410004, China)
Source
《电脑知识与技术》
2009, No. 2, pp. 926-929 (4 pages)
Computer Knowledge and Technology
Keywords
Lucene
enterprise search engine
Chinese word segmentation
unstructured documents