Abstract
The rapid growth in the number of Web documents has made Web Document Summarization (WDS) an active research topic in text summarization. Because Web documents are hyperlinked texts with their own formatting conventions, WDS differs from traditional automatic text summarization. This paper first analyzes the features of Web documents, then gives a definition of WDS, and finally presents a WDS algorithm based on sentence extraction. The weight of each Web sentence is decomposed into a Web feature-word weight and a Web sentence-structure weight, and the proportion given to each is learned with a machine learning method. The feature-word weight is adjusted according to a document class tree graph, and the sentence-structure weight takes into account both layout formatting and hyperlink attributes. Summarization experiments on 1,000 Web documents show that the proposed algorithm is feasible.
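The weighted-sum scoring described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the names `Sentence`, `sentence_score`, `extract_summary`, the mixing parameter `alpha`, and the summary length `k` are all hypothetical. The sketch assumes the feature-word and structure weights have already been computed and normalized, and that `alpha` stands in for the proportion the paper learns by machine learning.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    word_weight: float       # assumed: feature-word weight, normalized to [0, 1]
    structure_weight: float  # assumed: layout/hyperlink weight, normalized to [0, 1]


def sentence_score(s: Sentence, alpha: float) -> float:
    """Weighted sum of the two component weights (alpha is a hypothetical mixing ratio)."""
    return alpha * s.word_weight + (1.0 - alpha) * s.structure_weight


def extract_summary(sentences: list[Sentence], alpha: float = 0.5, k: int = 2) -> list[str]:
    """Pick the k highest-scoring sentences and return them in document order."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sentence_score(sentences[i], alpha),
        reverse=True,
    )
    chosen = sorted(ranked[:k])  # restore original order for readability
    return [sentences[i].text for i in chosen]


# Toy input with made-up weights, for illustration only.
doc = [
    Sentence("Title-like heading sentence.", 0.6, 0.9),
    Sentence("Navigation link text.", 0.1, 0.2),
    Sentence("Body sentence with many feature words.", 0.8, 0.4),
]
summary = extract_summary(doc, alpha=0.5, k=2)
```

In this toy input the low-scoring navigation text is dropped and the two remaining sentences keep their original order. How the paper actually computes the component weights and learns their proportion is not specified in this record.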
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journal (北大核心)
2006, No. 6, pp. 54-60, 108 (8 pages)
Funding
Supported by a national ministry and commission fund project (2003WL01)
Keywords
computer application
Chinese information processing
Web document summarization
automatic text summarization
preprocessing of Web documents
postprocessing of summaries