Keywords Extraction Method for Chinese Web Text Based on Baidu Baike (一种基于百度百科的中文网络文本关键词抽取方法)
Abstract: The web holds a vast amount of Chinese text, much of which is sparse and nonstandard, so keyword extraction methods that rely on counting words and phrases perform poorly on it. Baidu Baike is a rich, dynamically updated Chinese encyclopedia that closely tracks current hot topics and popular web usage. In this paper, we propose a keyword extraction method for Chinese web text based on Baidu Baike. The method uses the knowledge relations in Baidu Baike to map a text from its set of surface entries (its extension) into a space of semantic topics that reflects its underlying meaning (its intension). It then adjusts the topic weights using the relationships among topics, and finally traces back to the keywords of the original text using Naive Bayes. This avoids statistics over exhaustively enumerated terms and largely addresses the inability of existing text mining methods to extract web slang and newly coined words. Experiments on two datasets show that the method performs well and stably on both standard and nonstandard text.
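The three stages named in the abstract (topic mapping, relation-based weight adjustment, and backtracking to keywords) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not give the formulas, so the `ENTRY_TOPICS` and `TOPIC_LINKS` tables, the `alpha` boost, and the Bayes-rule-style score (sum over topics of P(entry | topic) * P(topic | text)) are hypothetical stand-ins, not the paper's actual method or data.

```python
from collections import defaultdict

# Hypothetical stand-in for encyclopedia knowledge: entry -> {topic: strength}.
ENTRY_TOPICS = {
    "smartphone": {"technology": 1.0},
    "app": {"technology": 0.8, "business": 0.2},
    "startup": {"business": 0.7, "technology": 0.3},
    "weather": {"daily life": 1.0},
}

# Hypothetical topic-to-topic relations used for weight adjustment.
TOPIC_LINKS = {"technology": ["business"], "business": ["technology"]}


def normalize(weights):
    """Scale the values so they sum to 1 (empty dict if the total is zero)."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()} if total else {}


def map_to_topics(entries):
    """Project the text's entries onto topics by summing entry-topic strengths."""
    topic_weights = defaultdict(float)
    for entry in entries:
        for topic, strength in ENTRY_TOPICS.get(entry, {}).items():
            topic_weights[topic] += strength
    return normalize(topic_weights)


def adjust_weights(topic_weights, alpha=0.5):
    """Boost each topic by a fraction (alpha, assumed) of its related topics' weight."""
    adjusted = {}
    for topic, weight in topic_weights.items():
        support = sum(topic_weights.get(n, 0.0) for n in TOPIC_LINKS.get(topic, []))
        adjusted[topic] = weight + alpha * support
    return normalize(adjusted)


def backtrack_keywords(entries, topic_weights, top_k=2):
    """Rank entries by sum over topics of P(entry | topic) * P(topic | text)."""
    unique_entries = set(entries)
    # Estimate P(entry | topic) over the entries present in this text.
    per_topic_total = defaultdict(float)
    for entry in unique_entries:
        for topic, strength in ENTRY_TOPICS.get(entry, {}).items():
            per_topic_total[topic] += strength
    scores = {}
    for entry in unique_entries:
        scores[entry] = sum(
            strength / per_topic_total[topic] * topic_weights.get(topic, 0.0)
            for topic, strength in ENTRY_TOPICS.get(entry, {}).items()
        )
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda e: (-scores[e], e))[:top_k]


entries = ["smartphone", "app", "startup", "weather"]
topics = adjust_weights(map_to_topics(entries))
keywords = backtrack_keywords(entries, topics)
print(topics)
print(keywords)
```

In this sketch, an entry that touches two mutually related topics can outrank an entry tied to a single topic, which is the kind of effect the topic-relation adjustment is meant to produce. A real system would derive the entry-topic table from Baidu Baike rather than hard-coding it.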
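Author: Chen Yewang (陈叶旺)
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, PKU Core, 2014, No. 11, pp. 2422-2427 (6 pages)
Funding: National Natural Science Foundation of China (61202298); Natural Science Foundation of Fujian Province (2012J05117); Fundamental Research Funds for the Central Universities (JB-ZR1217); Xiamen Science and Technology Program (3502Z20133029)
Keywords: web text; Baidu Baike; semantic topic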