Abstract
Chinese word segmentation strongly affects the precision and recall of Chinese-language search engines. Based on an analysis of the source code structure of the open source search engine Nutch, an extensible lexical analyzer was implemented with JavaCC and integrated with Nutch, yielding Nutch-Enhanced, a web search engine that supports intelligent Chinese word segmentation. It can serve as an experimental platform for evaluating how different Chinese word segmentation algorithms affect a search engine. The search quality of Nutch-Enhanced was compared with that of Nutch, Google, and Baidu. The results show that, for Chinese queries, it far outperforms Nutch: its recall reached 0.74 and the precision of its top 30 results reached 0.86, so its overall Chinese search quality is close to that of Google and Baidu.
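For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how recall and precision over the top k results are commonly computed. It is not taken from the paper: the function names, the document identifiers, and the relevance judgments are hypothetical, and the paper's exact evaluation protocol may differ.

```python
from typing import Iterable, Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked results that are judged relevant.

    Divides by k (a common convention), so a result list shorter
    than k is penalized for the missing positions.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical example data: four returned documents, three relevant ones.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}

p_at_2 = precision_at_k(ranked, relevant, 2)  # 1 of the top 2 is relevant
r = recall(ranked, relevant)                  # 2 of 3 relevant docs retrieved
```

Under this reading, "precision of the top 30 results" corresponds to calling `precision_at_k` with `k=30` for each test query and averaging over queries, which is an assumption about the evaluation rather than a detail stated in the abstract.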
Source
Computer Engineering and Design (《计算机工程与设计》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2006, Issue 23, pp. 4395-4398, 4407 (5 pages)
Funding
National High Technology Research and Development Program of China (863 Program), Grant No. 2004AA119030
Keywords
Chinese word segmentation
word segmentation algorithm
search engine
lexical analyzer
precision