
Construction of web search engine supporting intelligent Chinese word segmentation (支持智能中文分词的互联网搜索引擎的构建) Cited by: 8
Abstract: Chinese word segmentation has a major effect on the precision and recall of Chinese search engines. Based on an analysis of the source code structure of the open source search engine Nutch, an extensible lexical analyzer was implemented with JavaCC and integrated with Nutch, yielding NutchEnhanced, a web search engine that supports intelligent Chinese word segmentation. NutchEnhanced can serve as an experimental platform for evaluating how different Chinese word segmentation algorithms affect a search engine. Its search quality was compared with that of Nutch, Google, and Baidu. The results show that it far outperforms Nutch: its recall reaches 0.74 and the precision of its top 30 results reaches 0.86, giving it Chinese search quality that is, overall, close to that of Google and Baidu.
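The two figures quoted in the abstract (recall, and precision over the top 30 results) follow the usual information retrieval definitions. The sketch below shows how such numbers are typically computed for a single query. The function names, document IDs, and relevance judgments are hypothetical toy values, not the paper's evaluation data or code.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k returned results that are relevant.

    Divides by the number of results actually returned when fewer
    than k are available.
    """
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


def recall(ranked, relevant):
    """Fraction of all relevant documents that appear in the results."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


# Hypothetical toy data: a ranked result list and a set of relevant documents.
ranked = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d4", "d8"}

p_at_3 = precision_at_k(ranked, relevant, 3)  # d1 and d3 are relevant: 2 of 3
r = recall(ranked, relevant)                  # d1, d3, d4 found: 3 of 4
```

With `k=30`, `precision_at_k` corresponds to the "precision of the top 30 results" measure named in the abstract. A study would normally average both measures over a set of test queries.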
Source: Computer Engineering and Design (《计算机工程与设计》), indexed in CSCD and the Peking University Core Journals list, 2006, No. 23, pp. 4395-4398, 4407 (5 pages)
Funding: National High Technology Research and Development Program of China (863 Program), grant 2004AA119030
Keywords: Chinese word segmentation; word segmentation algorithm; search engine; lexical analyzer; retrieval precision
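To illustrate why the choice of word segmentation algorithm matters to a search engine's index terms, the sketch below contrasts per-character tokenization with dictionary-based forward maximum matching. This record does not say which algorithm NutchEnhanced uses, so forward maximum matching is only a stand-in technique here, and the dictionary, function names, and query are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word of at most
    max_len characters; fall back to a single character if none matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def per_character(text):
    """Baseline: every character becomes its own token."""
    return list(text)


# Hypothetical toy dictionary and query.
toy_dictionary = {"中文", "分词", "搜索", "引擎", "搜索引擎"}
query = "中文搜索引擎"

word_tokens = forward_max_match(query, toy_dictionary)
char_tokens = per_character(query)
```

Here `word_tokens` holds two word-level terms while `char_tokens` holds six single characters. Indexing and querying on single characters tends to match documents where the characters co-occur without forming the intended word, which is one way segmentation quality can influence precision.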

