Abstract
This paper proposes a method that combines web page segmentation with statistics to extract the main text from news web pages. First, after parsing the page, the method segments it into content blocks according to HTML tag information and computes the actual length of each block. Second, given the set of block lengths, it computes their mean and, exploiting the fact that variance reflects how widely a set of data fluctuates, sorts the blocks by length in descending order and repeatedly removes the largest block, tracking how the variance changes in order to find the blocks most likely to contain the main text. Finally, tests on a random selection of news web pages show a precision of up to 96%, which demonstrates the effectiveness of the method.
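The steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which tags delimit a block, how "actual length" is measured, or what stopping rule is applied to the variance changes. Here the set `BLOCK_TAGS`, the whitespace-normalized character count, and the "cut at the largest variance drop" rule are all hypothetical choices, as are the names `BlockSplitter`, `extract_blocks`, and `select_main_blocks`.

```python
from html.parser import HTMLParser
from statistics import pvariance

# Assumption: which tags start or end a content block (not given in the abstract).
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}
# Text inside these tags is never counted as page content.
SKIP_TAGS = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Collect the text between consecutive block-level tags as one block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Assumption: "actual length" is measured on whitespace-normalized text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_blocks(html):
    """Split an HTML string into a list of non-empty text blocks."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.blocks


def select_main_blocks(blocks):
    """Return the blocks judged most likely to hold the main text.

    Blocks are sorted by length in descending order. variances[k] is the
    variance of the lengths left after removing the k largest blocks.
    Assumption: the blocks removed up to and including the step with the
    largest drop in variance are taken as the main text.
    """
    ranked = sorted(blocks, key=len, reverse=True)
    lengths = [len(block) for block in ranked]
    if len(lengths) < 3:
        # Too few blocks to compare variance drops; keep the longest one.
        return ranked[:1]
    variances = [pvariance(lengths[k:]) for k in range(len(lengths) - 1)]
    drops = [variances[k] - variances[k + 1] for k in range(len(variances) - 1)]
    cut = drops.index(max(drops)) + 1
    return ranked[:cut]
```

Calling `select_main_blocks(extract_blocks(page_html))` on a page with a few long paragraphs surrounded by short navigation and footer blocks should return the long paragraphs, because removing the last long block is the step that collapses the variance of the remaining lengths. Other cut rules, such as a threshold relative to the mean length, would fit the abstract's description equally well.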
Source
Information Studies: Theory & Application (《情报理论与实践》)
CSSCI
Peking University Core Journal (北大核心)
2010, No. 1, pp. 117-120 (4 pages)
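Funding
One of the outcomes of "Research on an Automatic Quantitative Evaluation System for the Usage Status of Online Courses", a 2009 Ministry of Education Youth Special Project under the National Education Science "Eleventh Five-Year" Plan
Project No.: ECA090441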
Keywords
data mining
web page segmentation
mathematical expectation
text extraction