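Abstract
Multi-document summarization, one of the important techniques in natural language processing, helps users acquire information efficiently from different perspectives. Because the content of a document collection often comes from different information sources, the texts usually have rich and complex semantic relations. The commonly used word-based document representation cannot easily provide sufficient and accurate data for the semantic analysis that summarization requires. We therefore propose using Wikipedia, currently the world's largest online concept corpus, to provide semantic support for multi-document summary extraction. On the one hand, we extract the Wikipedia concepts in the documents to generate an accurate and consistent sentence representation. On the other hand, when computing sentence features, we use the first paragraph of Wikipedia entries to guide summary extraction. We first obtain concept weights by computing each concept's global relatedness in Wikipedia and its local relatedness within the current document collection. Then, on the basis of the Wikipedia concept representation, we extract several Wikipedia-based features for the sentences in the documents and finally use them to generate the summary. In the experiments, we generate summaries with each Wikipedia feature independently in turn and evaluate summary quality with the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric. The comparison shows that the first paragraphs of Wikipedia entries improve summary quality fairly well.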
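The abstract names ROUGE as the evaluation metric but does not say which variant or which toolkit was used. As a rough illustration only, the following sketch computes ROUGE-N recall under the usual definition (clipped n-gram overlap divided by the number of n-grams in the reference). The function names `ngrams` and `rouge_n_recall` and the whitespace tokenization are assumptions made for this example, not details from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram overlap / n-grams in the reference.

    Tokenization by whitespace is an assumption of this sketch.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate (Counter returns 0 for missing n-grams).
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    system = "the cat sat on the mat"
    gold = "the cat lay on the mat"
    print(rouge_n_recall(system, gold, n=1))
    print(rouge_n_recall(system, gold, n=2))
```

A real evaluation would normally average over several reference summaries and apply the stemming and stop-word options of an established ROUGE package; this sketch omits all of that.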
As an important technique in natural language processing, multi-document summarization can facilitate users' information retrieval. Because the documents in a collection usually come from different sources, a document collection contains abundant and complex semantic relations. The widely used word-based text representation can hardly provide sufficient and accurate information for the semantic analysis involved in summarization. We therefore use Wikipedia, which has extensive concept coverage, to extract a concept-based representation of the documents. We assess the importance of concepts using both global and local information. The global relatedness of concepts is based on Wikipedia's link structure, while the local relatedness is calculated from the co-occurrence of concepts within sentences. Three wiki-based features are proposed. The first is the widely used sentence salience feature based on a Markov chain. The other two are both based on the similarity between a sentence and the first paragraphs of concept articles in Wikipedia: one uses all concepts occurring in the collection, while the other uses only those contained in the sentence itself. Finally, we linearly combine these features to select important sentences, which are then concatenated to form the summary. We compared these features in experiments and showed that the first paragraphs of the Wikipedia articles of related concepts lead to better summary quality.
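The final step described above (linearly combining the sentence features, selecting important sentences, and concatenating them) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_sentences`, the weight values, the word budget `max_words`, and the greedy selection rule are all assumptions, since the abstract gives neither the combination weights nor the length constraint.

```python
def select_sentences(sentences, features, weights, max_words):
    """Score sentences by a weighted sum of features and build a summary.

    sentences: list of sentence strings, in document order
    features:  one feature vector per sentence (e.g. three wiki-based values)
    weights:   one weight per feature (hypothetical values)
    max_words: word budget for the summary (an assumption of this sketch)
    """
    # Linear combination: score = sum_k weights[k] * features[i][k]
    scores = [sum(w * f for w, f in zip(weights, feats)) for feats in features]

    # Visit sentences from highest to lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length <= max_words:  # greedy: keep it if it still fits
            chosen.append(i)
            used += length

    # Concatenate the selected sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))


if __name__ == "__main__":
    sents = ["a b c", "d e", "f g h i"]
    feats = [[0.9, 0.1, 0.2], [0.2, 0.3, 0.1], [0.5, 0.8, 0.9]]
    print(select_sentences(sents, feats, weights=[1.0, 1.0, 1.0], max_words=7))
```

Restoring the original sentence order before concatenation is a design choice made here for readability of the output; the abstract only states that the selected sentences are concatenated.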
Source
Journal of Nanjing University (Natural Science)
CAS
CSCD
Peking University Core Journal
2011, No. 4, pp. 398-406 (9 pages)
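Funding
Key Project of Science and Technology Research of the Ministry of Education (108126)
National Natural Science Foundation of China (10871019/a0107)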
Keywords
automatic summarization
semantic analysis
concept representation
Wikipedia