Abstract
This paper details the key techniques of focused crawlers: topic description and definition, relevance calculation, and crawling strategy. It accounts for the weight of a feature term's position both within a single text and across different texts, computes term weights with a modified TF-IDF formula, and uses these position weights to improve the traditional vector space model (VSM). The relevance of a page to the topic is computed with the improved VSM. An improved Shark Search is then combined with the HITS algorithm, which both compensates for the lack of a global view of the Web and eliminates the "topic drift" phenomenon in HITS. Experimental results indicate that the scheme guides focused crawling with high flexibility and accuracy.
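The abstract does not give the modified TF-IDF formula, so the following is only a minimal sketch of the general idea: term counts are scaled by a weight for the position in which the term occurs, and page-topic relevance is the cosine similarity of the resulting VSM vectors. The position names, the values in `POSITION_WEIGHTS`, the smoothed IDF form, and all function names are assumptions for illustration, not the paper's definitions.

```python
import math
from collections import Counter

# Hypothetical position weights (assumed values, not taken from the paper).
POSITION_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def weighted_tf(doc):
    """Return position-weighted term counts.

    `doc` maps a position name (e.g. "title") to a list of tokens.
    Unknown positions fall back to a weight of 1.0.
    """
    tf = Counter()
    for position, tokens in doc.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for token in tokens:
            tf[token] += weight
    return tf


def idf(term, corpus_tfs):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    df = sum(1 for tf in corpus_tfs if term in tf)
    return math.log((1 + len(corpus_tfs)) / (1 + df)) + 1.0


def tfidf_vector(tf, corpus_tfs):
    """Build a sparse TF-IDF vector as a dict from term to weight."""
    return {term: count * idf(term, corpus_tfs) for term, count in tf.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relevance(page, topic, corpus):
    """Relevance of `page` to `topic`, with IDF estimated from `corpus`."""
    corpus_tfs = [weighted_tf(doc) for doc in corpus]
    page_vec = tfidf_vector(weighted_tf(page), corpus_tfs)
    topic_vec = tfidf_vector(weighted_tf(topic), corpus_tfs)
    return cosine(page_vec, topic_vec)
```

In a crawler, a score like this would typically be compared against a threshold to decide whether a fetched page is on topic; how the paper feeds it into Shark Search and HITS is not specified in the abstract.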
Source
Journal of Shandong Normal University (Natural Science)
CAS
2015, Issue 3, pp. 21-24 (4 pages)
Funding
Shandong Provincial Education Science Planning key research project (ZK1037123C023)
Keywords
focused crawler
VSM
relevance calculation
search strategy