
Adaptive Approach for Content Extraction Based on Tag Density

Cited by: 3
Abstract: A novel approach to removing Web page noise is presented. It exploits differences in the density of tags and anchor text across different parts of a Web page to decide whether a part is main content. Based on fluctuations in the tag distribution across different regions of the content, the algorithm adaptively learns and adjusts the relevant thresholds, which effectively removes Web page noise. The method is simple to implement. Experiments on content information extraction and Chinese Web page classification indicate that the denoising approach is effective and feasible compared with other approaches.
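Authors: 孙皓 (Sun Hao), 董守斌 (Dong Shoubin)
Source: Journal of Zhengzhou University: Natural Science Edition (《郑州大学学报(理学版)》), 2009, No. 1, pp. 44-47 (4 pages). Indexed in CAS and the Peking University Core Journals list (北大核心).
Funding: National 863 Program of China (国家863计划), grant no. 2006AA012196
Keywords: tag density; anchor text density; content information; Web denoising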
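
The abstract names two signals (tag density and anchor text density) and an adaptively adjusted threshold, but this record does not give the paper's formulas. The following is only a minimal Python sketch of the general idea under stated assumptions, not the authors' algorithm: the block segmentation by block-level tags, the density definitions (tags per visible character, anchor characters per visible character), the initial thresholds, and the "mean plus k standard deviations" update rule are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser
from statistics import mean, pstdev

# Assumption: these tags delimit candidate text blocks.
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol",
              "h1", "h2", "h3", "h4", "br", "body"}
SKIP_TAGS = {"script", "style"}  # their text is never visible content


class BlockCollector(HTMLParser):
    """Splits a page into blocks of (text, tag_count, anchor_chars)."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text, self._tags, self._anchor_chars = [], 0, 0
        self._in_anchor = 0
        self._skip = 0

    def flush(self):
        # Close the current block; keep it only if it has visible text.
        text = "".join(self._text).strip()
        if text:
            self.blocks.append((text, self._tags, self._anchor_chars))
        self._text, self._tags, self._anchor_chars = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        if tag in BLOCK_TAGS:
            self.flush()
        self._tags += 1  # every opening tag counts toward the block
        if tag == "a":
            self._in_anchor += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        if tag == "a" and self._in_anchor:
            self._in_anchor -= 1
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_anchor:
            self._anchor_chars += len(data.strip())


def extract_content(html, max_anchor_density=0.5, init_tag_density=0.1, k=2.0):
    """Return text blocks judged to be main content.

    Hypothetical rule: accept a block when both densities are low, then
    relax the tag-density threshold to mean + k * std of accepted blocks
    (never below the initial value).
    """
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()  # capture any trailing text after the last tag

    threshold = init_tag_density
    accepted = []  # tag densities of blocks accepted so far
    content = []
    for text, tags, anchor_chars in collector.blocks:
        tag_density = tags / len(text)
        anchor_density = anchor_chars / len(text)
        if anchor_density <= max_anchor_density and tag_density <= threshold:
            content.append(text)
            accepted.append(tag_density)
            threshold = max(init_tag_density,
                            mean(accepted) + k * pstdev(accepted))
    return content
```

Calling `extract_content(page_html)` on a page string returns the list of retained text blocks. Navigation bars and link lists tend to be rejected because almost all of their visible text sits inside `<a>` elements, while body paragraphs carry few tags per character. The floor on the threshold is a design choice for this sketch only, so that a single low-density paragraph does not make the rule too strict for the paragraphs that follow.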