Abstract
This paper presents a novel approach to removing noise from Web pages. It exploits the differences in the distribution (density) of tags and anchor text across different parts of a page to decide whether a region carries main content. Based on the fluctuation of the tag distribution across content regions, the algorithm learns and adjusts the relevant thresholds adaptively, which allows it to remove Web page noise effectively. The method is simple to implement, and experiments on content extraction and Chinese Web page classification indicate that it is effective and feasible compared with other approaches.
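The abstract does not state the paper's formulas or threshold values, so the following is only a minimal Python sketch of the general idea, not the authors' implementation. It assumes that a page is split at block-level tags, that anchor text density is the share of a block's characters inside `<a>` elements, that tag density is the number of tags per text character, and that the adaptive tag threshold is the mean plus `k` standard deviations over blocks already accepted by the anchor test. The names `BlockStats`, `extract_content`, `BLOCK_TAGS`, `anchor_max`, and `k` are hypothetical.

```python
from html.parser import HTMLParser
from statistics import mean, pstdev

# Assumption: these tags delimit candidate regions of a page.
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "table", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockStats(HTMLParser):
    """Collect tag, text, and anchor-text counts for each block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.in_anchor = 0
        self.skip = 0
        self._new_block()

    def _new_block(self):
        self.cur = {"tags": 0, "chars": 0, "anchor_chars": 0, "text": []}
        self.blocks.append(self.cur)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        if tag in BLOCK_TAGS:
            self._new_block()
        self.cur["tags"] += 1
        if tag == "a":
            self.in_anchor += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip:
            self.skip -= 1
        if tag == "a" and self.in_anchor:
            self.in_anchor -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip:
            return
        self.cur["chars"] += len(text)
        self.cur["text"].append(text)
        if self.in_anchor:
            self.cur["anchor_chars"] += len(text)


def extract_content(html, anchor_max=0.3, k=2.0):
    """Return the text of blocks judged to be main content."""
    parser = BlockStats()
    parser.feed(html)
    blocks = [b for b in parser.blocks if b["chars"] > 0]
    for b in blocks:
        b["anchor_density"] = b["anchor_chars"] / b["chars"]
        b["tag_density"] = b["tags"] / b["chars"]
    # First pass: link-heavy blocks (navigation, footers) are treated as noise.
    seeds = [b for b in blocks if b["anchor_density"] <= anchor_max]
    if not seeds:
        return []
    # Second pass: derive the tag-density threshold from the surviving blocks.
    densities = [b["tag_density"] for b in seeds]
    tag_max = mean(densities) + k * pstdev(densities)
    return [" ".join(b["text"]) for b in seeds if b["tag_density"] <= tag_max]
```

Calling `extract_content(page_html)` returns a list of strings, one per retained block. The fixed `anchor_max` and the multiplier `k` stand in for the thresholds that the paper says are learned from the page itself, so they would need tuning on real data.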
Source
Journal of Zhengzhou University: Natural Science Edition (《郑州大学学报(理学版)》)
CAS
Peking University Core Journals (北大核心)
2009, No. 1, pp. 44-47 (4 pages)
Funding
National 863 Program project
Grant No. 2006AA012196
Keywords
tag density
anchor text density
content information
Web page denoising