Abstract
Understanding a document set through its semantic structure can improve the quality of multi-document summarization. In this paper, subject-predicate-object triples are first extracted from a Chinese multi-document summarization document set to construct a document semantic graph. The nodes of the graph are then semantically clustered using edit distance, and the PageRank algorithm is applied to assign weights to the vertices and links. Finally, triples containing highly weighted vertices and links are selected to form the multi-document summary. In the evaluation, summaries produced by sentence extraction and summaries generated from the document semantic graph were each compared, in terms of ROUGE scores, against summaries written manually by human assessors, and the results before and after edit distance-based clustering of the semantic graph were also compared. The experiments show that the semantic graph-based summaries overlap more with the manually created summaries, and that edit distance-based clustering of the graph further improves summary quality.
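The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the triples are assumed to be already extracted, the function names (`edit_distance`, `cluster_labels`, `pagerank`, `summarize`), the greedy clustering strategy, the `max_ratio` threshold, and the damping factor are all hypothetical choices, and triples are scored by vertex weights only, whereas the paper also weights links.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cluster_labels(labels, max_ratio=0.34):
    """Greedy clustering (an assumption): map each label to the first
    representative whose normalized edit distance is within max_ratio."""
    reps, mapping = [], {}
    for label in labels:
        for rep in reps:
            limit = max_ratio * max(len(label), len(rep))
            if edit_distance(label, rep) <= limit:
                mapping[label] = rep
                break
        else:
            reps.append(label)
            mapping[label] = label
    return mapping


def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; dangling mass is spread uniformly."""
    nodes = sorted({n for edge in edges for n in edge})
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not out[n])
        base = (1 - damping + damping * dangling) / len(nodes)
        new = {n: base for n in nodes}
        for src in nodes:
            for dst in out[src]:
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank


def summarize(triples, top_k=2):
    """Merge similar node labels, rank nodes, return the top_k triples."""
    labels = [x for s, _, o in triples for x in (s, o)]
    mapping = cluster_labels(labels)
    merged = list(dict.fromkeys((mapping[s], p, mapping[o])
                                for s, p, o in triples))
    rank = pagerank([(s, o) for s, _, o in merged])
    # Simplification: score a triple by the weights of its two vertices.
    ranked = sorted(merged, key=lambda t: rank[t[0]] + rank[t[2]],
                    reverse=True)
    return ranked[:top_k]


# Toy input; "goverment" is a deliberate spelling variant to be merged.
triples = [
    ("government", "announced", "policy"),
    ("goverment", "revised", "policy"),
    ("policy", "affects", "farmers"),
    ("reporter", "visited", "village"),
]
print(summarize(triples, top_k=2))
```

Clustering before ranking matters because near-identical node labels would otherwise split their link weight across several vertices. Merging them first concentrates that weight on one vertex, which is one plausible reading of why the clustering step improved the ROUGE results.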
Source
Journal of Chinese Information Processing (《中文信息学报》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2009, Issue 3, pp. 110-115 (6 pages)
Funding
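National Natural Science Foundation of China (60373095, 60673039)
National High Technology Research and Development Program of China (863 Program) (2006AA01Z151)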