Abstract
To counter topic drift in focused crawlers and to improve the precision and recall of search engines, this paper proposes a focused crawler design based on the PageRank algorithm and the Bagging algorithm. The crawler system is divided into two modules: a crawling module and a topic relevance analysis module. An improved PageRank algorithm refines the crawler's search strategy and drives page traversal and fetching. Page topics are represented with the vector space model, and the Bagging algorithm is used to build a page topic classifier that performs topic relevance analysis and filters out pages unrelated to the topic. Experimental results show that the method performs well both in page-fetching performance and in the precision of the topic-relevant pages retrieved.
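As a rough illustration of the two modules named in the abstract, the following Python sketch pairs a link-based ranking step with a bagged relevance classifier over vector space model representations. It rests on several assumptions: the abstract does not specify the paper's improved PageRank, so plain PageRank by power iteration stands in for it; the base learner (a nearest-centroid rule with cosine similarity on raw term-frequency vectors) is chosen only for brevity; and all function names and parameters (`pagerank`, `train_bagging`, `is_relevant`, `damping`, `n_learners`) are hypothetical rather than taken from the paper.

```python
import math
import random
from collections import Counter


def pagerank(links, damping=0.85, iterations=50):
    """Standard PageRank by power iteration over {page: [outlinks]}.

    Assumption: this is the textbook algorithm, not the paper's improved variant.
    """
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                new_rank[q] += damping * rank[p] / len(outs)
        rank = new_rank
    return rank


def tf_vector(text):
    """Vector space model: a page becomes a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Sum of term vectors; cosine similarity ignores the overall scale."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def train_bagging(samples, n_learners=11, seed=0):
    """Bagging: fit one base learner per bootstrap resample of (text, label) pairs.

    Assumption: each base learner is a pair of class centroids
    (on-topic, off-topic); label True means on-topic.
    """
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        boot = [rng.choice(samples) for _ in samples]  # sample with replacement
        pos = centroid(tf_vector(t) for t, y in boot if y)
        neg = centroid(tf_vector(t) for t, y in boot if not y)
        learners.append((pos, neg))
    return learners


def is_relevant(learners, text):
    """Majority vote of the base learners on whether a page is on-topic."""
    v = tf_vector(text)
    votes = sum(1 for pos, neg in learners if cosine(v, pos) > cosine(v, neg))
    return votes * 2 > len(learners)
```

In a crawler built along these lines, the frontier could be ordered by the `pagerank` scores of discovered URLs, and each fetched page would be kept or discarded according to `is_relevant`. How the paper actually couples the two modules is not stated in the abstract.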
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
PKU Core Journal (北大核心)
2010, No. 14, pp. 3309-3312 (4 pages)
Funding
National Natural Science Foundation of China (Grant No. 60573179)