Abstract
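To cope with massive volumes of Web page information, a Web page similarity processing algorithm suitable for search engines is proposed. Using concepts abstracted from the Web pages, the algorithm builds a similarity processing model on top of an inverted file. The model narrows the set of Web documents for which similarity must be computed, saves considerable time and space, and lays a good foundation for optimizing similarity computation.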
To detect near-replica documents among the Web pages collected by a search engine, a similarity processing algorithm is proposed. Based on concepts extracted from the Web pages and on an inverted file, the algorithm builds a model that reduces the number of Web pages that must be processed. The algorithm saves considerable time and space and provides a good foundation for near-replica detection.
Source
《计算机应用》
CSCD
Peking University Core Journals
2006, No. 12, pp. 3030-3032 (3 pages)
Journal of Computer Applications
Funding
Supported by the Graduate Starting Seed Fund of Northwestern Polytechnical University (Z200644)
Keywords
near-replica documents
concept extraction
cluster analysis
near-replica detection