Focused Crawler Method Based on Wang-Landau Sampling (original title: 基于Wang-Landau抽样的主题爬虫方法)

Cited by: 2
Abstract: Traditional crawler methods tend to get trapped in local optima during search and rarely use historical crawling experience to correct the crawling path. To address these shortcomings, a focused crawling method based on Wang-Landau (WL) sampling is proposed. The method uses the vector space model (VSM) and the PageRank algorithm to evaluate the relevance and the importance of links, respectively, and applies a regional competition strategy to select target links from the set of links that are topic-relevant or have potential value. Based on a probability density function, the WL sampling algorithm then makes a sampling decision on each target link selected from the candidate set, and accumulated historical statistics guide the crawler's subsequent crawling so that the search path is optimized. Experimental results show that the proposed WL-based focused crawler retrieves more topic-relevant webpages than the other focused crawler methods compared, with a clear advantage in crawling precision and in the standard deviation of topic relevance over all downloaded pages.
Authors: LIU Jingfa (刘景发); CHEN Jinglan (陈靖岚); ZHAO Peng (赵鹏) (School of Information Science & Technology, Guangdong University of Foreign Studies, Guangzhou 510006; School of Computer & Software, Nanjing University of Information Science & Technology, Nanjing 210044)
Source: Journal of University of Electronic Science and Technology of China (《电子科技大学学报》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2023, No. 4, pp. 578-587 (10 pages)
Funding: Guangdong Basic and Applied Basic Research program (2021A1515011974, 2023A1515011344).
Keywords: focused crawler; information retrieval; rainstorm disaster; typhoon disaster; Wang-Landau sampling
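
The abstract names two building blocks, VSM relevance scoring and Wang-Landau sampling, without giving formulas. The Python sketch below illustrates the general idea only and is not the authors' implementation. It assumes relevance is the cosine similarity of term-frequency vectors, that relevance scores in [0, 1] are split into equal-width bins that play the role of Wang-Landau energy levels, and that a link is accepted with the standard Wang-Landau probability min(1, g(old)/g(new)). The names `cosine_similarity`, `WangLandauSelector`, `n_bins`, and `ln_f` are hypothetical. PageRank scoring, the regional competition strategy, and the usual histogram-flatness check that shrinks the modification factor are omitted.

```python
import math
import random
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """VSM relevance: cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class WangLandauSelector:
    """Accepts or rejects candidate links by their relevance bin (assumed mapping)."""

    def __init__(self, n_bins: int = 10, ln_f: float = 1.0, seed: int = 0):
        self.n_bins = n_bins
        self.ln_f = ln_f                # log of the modification factor f
        self.ln_g = [0.0] * n_bins      # log density-of-states estimate per bin
        self.hist = [0] * n_bins        # visit histogram (the "historical statistics")
        self.current = None             # bin of the last accepted link
        self.rng = random.Random(seed)

    def _bin(self, score: float) -> int:
        score = max(0.0, min(1.0, score))
        return min(int(score * self.n_bins), self.n_bins - 1)

    def accept(self, score: float) -> bool:
        new = self._bin(score)
        if self.current is None:
            accepted = True
        else:
            # Standard WL rule: accept with probability min(1, g(old) / g(new)),
            # so bins that were visited often become harder to enter again.
            delta = self.ln_g[self.current] - self.ln_g[new]
            accepted = delta >= 0 or self.rng.random() < math.exp(delta)
        if accepted:
            self.current = new
        # Update the bin the walker ends up in, whether or not the move was accepted.
        self.ln_g[self.current] += self.ln_f
        self.hist[self.current] += 1
        return accepted


if __name__ == "__main__":
    topic = Counter("typhoon disaster rainstorm warning".split())
    page = Counter("typhoon warning issued for coastal cities".split())
    selector = WangLandauSelector()
    score = cosine_similarity(topic, page)
    print(round(score, 3), selector.accept(score))
```

Because the acceptance probability falls for bins that have already been visited many times, a selector of this kind pushes the crawl away from regions it has over-sampled, which is one plausible reading of how historical statistics could help the crawler avoid local optima.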