Research on Multi-Document Summarization Based on Wikipedia Semantics (cited by: 2)

Multi-documents summarization utilizing semantics in Wikipedia
Abstract: Multi-document summarization, an important technique in natural language processing, helps users acquire information efficiently. Because the documents in a collection usually come from different sources, the texts are linked by rich and complex semantic relations. The widely used word-based document representation cannot supply sufficient and accurate information for the semantic analysis that summarization requires. We therefore propose using Wikipedia, currently the world's largest online concept corpus, to provide semantic support for multi-document summary extraction. First, we extract the Wikipedia concepts that occur in the documents to produce an accurate and consistent concept-based sentence representation. Second, when computing sentence features, we use the first paragraphs of Wikipedia articles to guide summary extraction. Concept weights are obtained from both global and local information: the global relatedness of concepts is based on Wikipedia's link structure, while the local relatedness is computed from the co-occurrence of concepts in sentences of the current document collection. On top of this representation, three wiki-based features are proposed. The first is the widely used sentence salience feature based on a Markov chain. The other two are both based on the similarity between a sentence and the first paragraphs of concept articles in Wikipedia; one uses all concepts occurring in the collection, while the other uses only the concepts contained in the sentence itself. Finally, the features are linearly combined to select important sentences, which are then concatenated to form the summary. In the experiments, we generated summaries with each wiki feature independently and evaluated their quality with the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric. The comparison shows that the first paragraphs of related concepts' Wikipedia articles improve summary quality.
Source: Journal of Nanjing University (Natural Science), indexed in CAS, CSCD, and the Peking University Core Journals list; 2011, No. 4, pp. 398-406 (9 pages)
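Funding: Key Project of Science and Technology Research of the Ministry of Education (108126); National Natural Science Foundation of China (10871019/a0107)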
Keywords: automatic summarization; semantic analysis; concept representation; Wikipedia
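
The selection and evaluation steps named in the abstract (linear combination of sentence features, concatenation of the chosen sentences, ROUGE scoring) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function names, the toy sentences, the feature values, the weights, and the summary length `k` are hypothetical and are not taken from the paper. The scorer is a simplified ROUGE-1 recall (clipped unigram overlap, whitespace tokens, no stemming), which may differ from the ROUGE variant the authors used.

```python
from collections import Counter


def combine_features(features, weights):
    """Linearly combine per-sentence feature scores (one row per sentence)."""
    return [sum(w * f for w, f in zip(weights, row)) for row in features]


def select_sentences(sentences, scores, k):
    """Pick the k highest-scoring sentences, returned in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


def rouge_1_recall(candidate, reference):
    """Simplified ROUGE-1 recall: clipped unigram overlap / reference unigrams."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


# Hypothetical toy data: three sentences, three feature values per sentence
# (standing in for the three wiki-based features), and made-up weights.
sentences = [
    "Wikipedia links concepts to one another.",
    "The weather was pleasant.",
    "Concept weights guide sentence selection.",
]
features = [
    [0.8, 0.6, 0.7],
    [0.1, 0.2, 0.0],
    [0.6, 0.9, 0.5],
]
weights = [0.4, 0.3, 0.3]

scores = combine_features(features, weights)
summary = " ".join(select_sentences(sentences, scores, k=2))
print(summary)
print(rouge_1_recall(summary, "Wikipedia concepts guide sentence selection."))
```

Returning the selected sentences in document order rather than score order is a design choice of this sketch; the abstract only states that the selected sentences are concatenated.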

