Abstract
In web search results, users often receive redundant pages with identical content. These pages not only waste storage resources but also hinder information retrieval and other text-processing tasks. After extracting the news title, topic content, and issue date, this paper exploits the time-sensitive (perishable) nature of news to divide pages into groups by issue date, and conducts an exploratory study of duplicate web page removal within these groups. This greatly reduces computing time and improves deduplication accuracy.
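The grouping idea above can be sketched as follows. This is a minimal illustration, not the paper's actual method: the field names (`title`, `content`, `date`) are hypothetical, and the paper's weighted similarity computation is simplified here to an exact content fingerprint, so only the date-grouping strategy is shown.

```python
import hashlib
from collections import defaultdict

def dedup_by_date(pages):
    """Remove duplicate news pages, comparing only within same-date groups.

    pages: list of dicts with hypothetical 'title', 'content', 'date' keys.
    The paper's weight-based similarity measure is replaced by an exact
    MD5 fingerprint for the sake of a short sketch.
    """
    # Group pages by issue date, so comparisons stay inside each "group"
    groups = defaultdict(list)
    for page in pages:
        groups[page["date"]].append(page)

    unique = []
    for group in groups.values():
        seen = set()
        for page in group:
            # Fingerprint of title + body; duplicates share the same digest
            digest = hashlib.md5(
                (page["title"] + page["content"]).encode("utf-8")
            ).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(page)
    return unique
```

Because candidate comparisons are confined to pages sharing an issue date, the number of pairwise checks shrinks from the whole collection to each date group, which is the source of the claimed reduction in computing time.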
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal
2007, No. 6, pp. 119-121 (3 pages)
Funding
National Natural Science Foundation of China (Grant No. 60475022)
Natural Science Foundation of Shanxi Province of China (Grant No. 20041041)
Shanxi Province Foundation for Returned Overseas Scholars (No. 2002004)
Keywords
news webpages
topic content extraction
duplicate web page removal
weight calculation