
Automatic Indexing Method Based on Ensemble Learning (基于集成学习的自动标引方法研究) Cited by: 10
Abstract Most current automatic indexing methods cannot make effective use of the multiple features contained in a text. Statistical machine learning models such as support vector machines and conditional random fields, by contrast, can exploit many kinds of text features for keyword extraction. Because individual automatic indexing models differ in performance, combining them through ensemble learning can improve indexing quality. To improve indexing quality further, this paper integrates the strengths of statistical machine learning models and ensemble learning, and indexes documents by combined voting over multiple classification models. Experimental results show that automatic indexing based on ensemble learning improves both the precision and the recall of the indexing results. In addition, within the ensemble indexing model, weighting the base classifiers gives better indexing results than leaving them unweighted.
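Author Zhang Chengzhi (章成志)
Source Journal of the China Society for Scientific and Technical Information (《情报学报》; indexed in CSSCI and the Peking University Core Journals list), 2010, No. 1, pp. 3-8 (6 pages)
Funding This research was supported by the China Postdoctoral Science Foundation (20080430463), the General Project of Humanities and Social Sciences Research of the Ministry of Education (08JC870007), and the Research Start-up Fund of Nanjing University of Science and Technology (AB41123). Acknowledgment: the author thanks the reviewers for their suggested revisions.
Keywords automatic indexing; keyword extraction; ensemble learning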
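
The abstract describes combining several indexing models by voting, with and without weights on the base classifiers. The record does not give the paper's actual voting rule or weighting scheme, so the following is only a minimal sketch under stated assumptions: each base model labels every candidate term as keyword (1) or not (0), a term is kept when its weighted share of keyword votes exceeds a threshold, and the model names, weights, and sample terms are hypothetical illustrations rather than values from the paper.

```python
from typing import Dict, List, Optional


def vote(predictions: Dict[str, Dict[str, int]],
         weights: Optional[Dict[str, float]] = None,
         threshold: float = 0.5) -> List[str]:
    """Combine per-model keyword decisions by (optionally weighted) voting.

    predictions maps model name -> {candidate term -> 1 if keyword else 0}.
    weights maps model name -> non-negative weight; None means equal weights.
    A term is kept when its weighted share of keyword votes exceeds threshold.
    """
    if weights is None:
        # Unweighted voting: every base classifier counts equally.
        weights = {name: 1.0 for name in predictions}
    total = sum(weights[name] for name in predictions)
    scores: Dict[str, float] = {}
    for name, labels in predictions.items():
        for term, label in labels.items():
            scores[term] = scores.get(term, 0.0) + weights[name] * label
    return sorted(term for term, s in scores.items() if s / total > threshold)


# Hypothetical outputs of three base models on three candidate terms.
predictions = {
    "svm": {"ensemble learning": 1, "automatic indexing": 1, "paper": 0},
    "crf": {"ensemble learning": 1, "automatic indexing": 0, "paper": 0},
    "maxent": {"ensemble learning": 0, "automatic indexing": 0, "paper": 1},
}
# Hypothetical weights, e.g. derived from each model's validation score.
weights = {"svm": 0.6, "crf": 0.3, "maxent": 0.1}

print(vote(predictions))           # unweighted majority vote
print(vote(predictions, weights))  # weighted vote
```

In this toy input, the weighted vote lets a single strongly weighted model carry a term that a plain majority vote would drop, which is one way weighting can change the final keyword set.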
