Abstract
Topical crawlers are a core technology for building vertical search engines. This paper introduces two important crawling algorithms for topical crawlers: Shark-Search, which evaluates page content, and HITS, which is based on the link relationships between pages. After analyzing the strengths and weaknesses of each, it proposes a new topical crawling strategy that combines content-based evaluation with link-based analysis to judge the quality of URLs awaiting download, and implements a topical crawler based on this strategy. The new strategy compensates for the respective weaknesses of the two algorithms. In a comparison with topical crawlers built on Shark-Search alone and on HITS alone, the crawler using the new algorithm achieved higher precision than either.
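The abstract does not say how the content score and the link score are combined, so the following is only a minimal sketch under stated assumptions, not the paper's method. It assumes a weighted linear combination with a hypothetical weight `alpha`, uses plain cosine similarity over anchor text as a stand-in for the content score (only one ingredient of the full Shark-Search algorithm), and uses the authority score from a standard HITS power iteration as the link score. The names `cosine_similarity`, `hits_authority`, and `rank_frontier` are illustrative.

```python
import math
from collections import Counter


def cosine_similarity(text, topic):
    """Cosine similarity between term-frequency vectors of two strings."""
    a, b = Counter(text.lower().split()), Counter(topic.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hits_authority(graph, iterations=20):
    """Standard HITS power iteration.

    graph maps each page to the list of pages it links to.
    Returns the authority score of every page seen in the graph.
    """
    nodes = set(graph) | {dst for targets in graph.values() for dst in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub update: sum of authority scores of pages linked to.
        hub = {n: sum(auth[d] for d in graph.get(n, [])) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return auth


def rank_frontier(anchor_texts, graph, topic, alpha=0.5):
    """Order candidate URLs by alpha * content + (1 - alpha) * link score.

    alpha is an assumed mixing weight; the source does not specify one.
    """
    auth = hits_authority(graph)
    max_auth = max(auth.values(), default=0.0) or 1.0
    scored = []
    for url, text in anchor_texts.items():
        content = cosine_similarity(text, topic)
        link = auth.get(url, 0.0) / max_auth  # scale to [0, 1]
        scored.append((alpha * content + (1 - alpha) * link, url))
    # Highest combined score first; ties broken by URL for determinism.
    return [url for _, url in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

With `alpha=1.0` the ordering depends on content alone, and with `alpha=0.0` on link structure alone, so the two baseline behaviours the abstract compares against fall out as special cases of this sketch.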
Source
Computer Technology and Development (《计算机技术与发展》)
2010, Issue 11, pp. 76-79 (4 pages)
Funding
Natural Science Foundation of Hainan Province (609003)
Hainan University Research Project (hd09xm84)
Keywords
topical crawler
crawling strategy
vertical search engine