Abstract
News pages are among the worst sources of redundant information on the Internet. Redundant pages increase the processing load on search engines and degrade the user experience, so duplicate news pages need to be eliminated. The proposed algorithm segments a news report into words according to the natural grammatical characteristics of news writing, and from each of 7 part-of-speech categories selects the word with the highest frequency to form the report's feature word group. A word-level inverted index is built to retrieve and compare feature word groups across pages, and a class inverted index is built to identify duplicate and near-duplicate pages and manage them by class. The implementation reuses existing modules of the search engine system, which avoids introducing new modules and keeps the system simple. Experiments show that the algorithm is effective: on the tested pages, recall reaches 93.5% and precision reaches 88.4%. Shortcomings in the fine-grained classification and identification of redundant pages are a major factor limiting further gains in precision.
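The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the seven part-of-speech categories, the similarity rule, or any threshold, so `POS_CATEGORIES`, `min_shared`, and the `DedupIndex` class are hypothetical choices made only for illustration, and the input is assumed to be already segmented and tagged as `(word, pos)` pairs.

```python
from collections import Counter, defaultdict

# Hypothetical category names: the abstract says 7 categories but does not list them.
POS_CATEGORIES = ("noun", "verb", "adjective", "adverb",
                  "numeral", "pronoun", "preposition")


def feature_word_group(tagged_words):
    """Return the most frequent word of each POS category (None if absent)."""
    counts = {pos: Counter() for pos in POS_CATEGORIES}
    for word, pos in tagged_words:
        if pos in counts:
            counts[pos][word] += 1
    return tuple(
        counts[pos].most_common(1)[0][0] if counts[pos] else None
        for pos in POS_CATEGORIES
    )


class DedupIndex:
    """Word-level inverted index plus a class index of near-duplicate pages."""

    def __init__(self, min_shared=5):
        # Assumed rule: pages sharing at least min_shared feature words
        # are placed in the same class.
        self.min_shared = min_shared
        self.word_index = defaultdict(set)    # (pos, word) -> page ids
        self.class_index = defaultdict(list)  # class id -> page ids
        self.page_class = {}                  # page id -> class id

    def add(self, page_id, tagged_words):
        """Index a page and return the id of the class it was assigned to."""
        group = feature_word_group(tagged_words)
        keys = [(pos, w) for pos, w in zip(POS_CATEGORIES, group) if w is not None]

        # Count how many feature words each indexed page shares with this one.
        shared = Counter()
        for key in keys:
            for other in self.word_index.get(key, ()):
                shared[other] += 1

        best = shared.most_common(1)
        if best and best[0][1] >= self.min_shared:
            class_id = self.page_class[best[0][0]]  # join the existing class
        else:
            class_id = page_id                      # start a new class

        self.page_class[page_id] = class_id
        self.class_index[class_id].append(page_id)
        for key in keys:
            self.word_index[key].add(page_id)
        return class_id
```

Keying the word index on `(pos, word)` rather than on the word alone is a design choice of this sketch: it keeps a word that is the top noun on one page from matching the same string used as the top verb on another.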
Source
Journal of Chengdu University of Information Technology (《成都信息工程学院学报》)
2012, No. 4, pp. 374-379 (6 pages)
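Funding
Soft Science Program of the Science and Technology Department of Sichuan Province (2011ZR0058)
Natural Science and Technology Development Fund of Chengdu University of Information Technology (CSRF201002)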
Keywords
computer application
duplicate web page elimination
part-of-speech classification
feature word group