Abstract
The core problem for a focused (topic-specific) crawler is judging topic relevance. How quickly and accurately the crawler can determine the topic relevance of each page it fetches is the key factor that decides the quality of its search strategy. This paper proposes a topic identification method based on a two-step vector space model computation, and compares a focused crawler built on the two-step vector space model with a traditional focused crawler built on a one-step vector space model. Experiments show that the two-step crawler performs better in both topic relevance judgment and execution efficiency, and that it also mitigates the "tunnel phenomenon" to some extent.