
Survey of Web Page Purification and Deduplication Research (网页净化及去重研究综述)

Cited by: 1
Abstract With the rapid development of the Internet and the widespread use of search engines, web page data has become one of the major data sources for many applications and research efforts. However, because of the nature of web pages, not all of the information they contain is needed by those applications; advertisements and navigation bars are typical examples, and their presence can adversely affect the applications that consume the pages. In addition, web search results often contain redundant pages with identical content. Web page purification and web page deduplication are therefore a basic problem in applying web data, and also a current research hotspot, so it is worthwhile to summarize both fields to support deeper study. Starting from the necessity of web page purification and deduplication, this paper defines and classifies them, outlines a variety of purification and deduplication methods and frameworks, and summarizes them. It also discusses the existing problems and future directions in these fields.
Author 罗元
Source Modern Computer (《现代计算机》), 2013, No. 10, pp. 3-7, 12 (6 pages in total)
Keywords web page deduplication; web page purification; information retrieval; World Wide Web