Abstract
With the rapid development of the Internet and the widespread use of search engines, web page data has become an important data source for many applications and research efforts. However, because of the nature of web pages, not all of the information they contain is needed by these applications; advertisements and navigation bars are typical examples, and their presence adversely affects the applications that consume the data. In addition, web search results often contain redundant pages with identical content. Web page purification and web page deduplication are therefore fundamental problems in applying web data, and they are also active research topics. It is thus worthwhile to survey these two fields as a basis for further in-depth study. Starting from the necessity of web page purification and deduplication, this paper defines and classifies both tasks, surveys a range of purification and deduplication methods and frameworks and summarizes them, and discusses existing problems and future directions in these fields.
Source
Modern Computer (《现代计算机》)
2013, No. 10, pp. 3-7, 12 (6 pages)
Keywords
Deletion of Duplicated Web Pages
Web Page Purification
Information Retrieval
World Wide Web