Abstract
To address two shortcomings of traditional crawler methods, namely that the search easily falls into local optima and that historical crawling experience is rarely used to correct the crawling path, a focused crawling method based on Wang-Landau (WL) sampling is proposed. The method uses the vector space model (VSM) and the PageRank algorithm to evaluate the relevance and importance of links, respectively, and adopts a regional competition strategy to select target links from the set of links that are topic-relevant or of potential value. Based on a probability density function, the WL sampling algorithm makes a sampling decision on each target link selected from the candidate set, and historical statistical experience then guides the crawler's subsequent crawling so as to optimize the search path. Experimental results show that the proposed WL-based focused crawler retrieves more topic-relevant webpages than other focused crawling methods, and that it has a clear advantage in crawling precision and in the standard deviation of the topic relevance of all downloaded pages.
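The two building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the paper's formulas, so the code below assumes plain term-frequency cosine similarity for the VSM relevance score and the textbook Wang-Landau acceptance rule applied to relevance bins. The names `cosine_relevance`, `WangLandauSampler`, `n_bins` and `ln_f` are hypothetical, and the PageRank and regional competition steps are omitted.

```python
import math
import random
from collections import Counter


def cosine_relevance(text_terms, topic_terms):
    """VSM relevance (assumed form): cosine of two term-frequency vectors."""
    a, b = Counter(text_terms), Counter(topic_terms)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class WangLandauSampler:
    """Textbook Wang-Landau step over relevance bins (an assumption,
    not the rule used in the paper)."""

    def __init__(self, n_bins=10, ln_f=1.0):
        self.n_bins = n_bins
        self.ln_f = ln_f              # log of the modification factor
        self.ln_g = [0.0] * n_bins    # running log density estimate per bin
        self.hist = [0] * n_bins      # visit histogram (crawl history)

    def bin_of(self, relevance):
        # Map a relevance score in [0, 1] to a bin index.
        return min(int(relevance * self.n_bins), self.n_bins - 1)

    def accept(self, current_rel, candidate_rel, rng=random):
        cur, cand = self.bin_of(current_rel), self.bin_of(candidate_rel)
        # Accept with probability min(1, g(cur) / g(cand)), so bins that
        # were visited often in the past become less likely to be chosen.
        p = math.exp(min(0.0, self.ln_g[cur] - self.ln_g[cand]))
        accepted = rng.random() < p
        visited = cand if accepted else cur
        self.ln_g[visited] += self.ln_f
        self.hist[visited] += 1
        return accepted
```

In this sketch, a crawler would score each candidate link with `cosine_relevance` and call `accept` to decide whether to follow it. Because the density estimate grows for bins that have already been visited, the accumulated history discourages repeated moves into the same relevance region, which is one way a search can be pushed out of a local optimum.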
Authors
LIU Jingfa (刘景发); CHEN Jinglan (陈靖岚); ZHAO Peng (赵鹏)
Affiliations: School of Information Science & Technology, Guangdong University of Foreign Studies, Guangzhou 510006; School of Computer & Software, Nanjing University of Information Science & Technology, Nanjing 210044
Source
Journal of University of Electronic Science and Technology of China (《电子科技大学学报》)
Indexed in: EI, CAS, CSCD, Peking University Core Journals
2023, No. 4, pp. 578-587 (10 pages)
Funding
Guangdong Basic and Applied Basic Research (2021A1515011974, 2023A1515011344).