Abstract
Finding the information a user cares about in the vast amount of content on the Web is a major challenge for Web search engines. By restricting downloads to pages in a given domain, focused crawlers can save considerable work and improve the quality of the information they provide. We propose a focused crawling method, MatchLink. It uses a document vector model to evaluate the topic relevance of each link (anchor), and uses the Naive Bayes algorithm together with a multilayer classification method to compute the topic relevance of the Web page containing that link. Based on these two relevance scores, topic-relevant pages are given priority for download. Experiments show that MatchLink performs better than BestFirst and BreadthFirst.
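The link-ordering idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the two relevance scores are combined, so the weighted sum and the `alpha` parameter here are assumptions, as are the names `cosine_similarity`, `link_score`, and `Frontier`. The page relevance is passed in as a number in [0, 1], standing in for the output of a Naive Bayes classifier that is not shown.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_score(anchor_text, topic_terms, page_relevance, alpha=0.5):
    """Combine anchor relevance and parent-page relevance into one priority.

    Assumption: a weighted sum with weight `alpha`; the abstract only says
    that both relevance values are used to prioritize downloads.
    """
    anchor_vec = Counter(anchor_text.lower().split())
    topic_vec = Counter(topic_terms)
    anchor_relevance = cosine_similarity(anchor_vec, topic_vec)
    return alpha * anchor_relevance + (1 - alpha) * page_relevance


class Frontier:
    """Crawl frontier that returns the highest-scoring unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


topic = ["focused", "crawler", "search"]
frontier = Frontier()
# Both links sit on the same page, whose (hypothetical) classifier score is 0.9.
frontier.push("http://example.com/contact",
              link_score("contact us", topic, 0.9))
frontier.push("http://example.com/crawler",
              link_score("focused crawler tutorial", topic, 0.9))
first_url, first_score = frontier.pop()
```

In this sketch the link whose anchor text overlaps the topic terms is popped first, even though both links come from the same page. A breadth-first crawler would instead visit them in discovery order.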
Source
《北京工业大学学报》
EI
CAS
CSCD
PKU Core (北大核心)
2007, Issue 11, pp. 1227-1232 (6 pages)
Journal of Beijing University of Technology