Abstract
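Addressing the characteristics of dual-structural networks and the challenges that URL deduplication faces in them, and building on the working principle of the Bloom Filter, this paper proposes a URL deduplication mechanism based on a scalable, dynamically splittable Bloom Filter, and implements and deploys it in a prototype system. Experimental results show that the mechanism is well suited to large-scale, high-performance, distributed crawler applications for dual-structural networks.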
This paper first introduces the concept of the Dual-Structural Network and surveys the principles of the Bloom Filter. It then analyzes the basic requirements for detecting duplicate URLs in a Dual-Structural Network. A dynamic splittable Bloom Filter for web crawlers is proposed, which can increase its capacity according to application requirements and suits large-scale, high-performance, distributed web crawlers. Finally, a series of experiments demonstrates the feasibility and efficiency of the proposed Bloom Filter.
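The abstract does not spell out the structure of the proposed filter, so the following is only a minimal Python sketch of the general idea of a Bloom Filter whose capacity grows on demand for URL deduplication. It is not the authors' design: the class names `BloomFilter` and `GrowingUrlFilter`, the `seen_before` method, the SHA-256 double-hashing scheme, and the strategy of appending a fresh fixed-size slice when the current one fills up are all assumptions made for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-capacity Bloom filter over strings (illustrative, hypothetical class)."""

    def __init__(self, capacity, error_rate=0.01):
        self.capacity = capacity
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)
        self.count = 0

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class GrowingUrlFilter:
    """Chain of Bloom filter slices; a new slice is appended when the last one is full.

    Assumption for illustration only: the paper's splitting scheme may differ.
    """

    def __init__(self, slice_capacity=1000, error_rate=0.01):
        self.slice_capacity = slice_capacity
        self.error_rate = error_rate
        self.slices = [BloomFilter(slice_capacity, error_rate)]

    def seen_before(self, url):
        """Return True if url was probably seen; otherwise record it and return False."""
        if any(url in s for s in self.slices):
            return True
        if self.slices[-1].count >= self.slice_capacity:
            self.slices.append(BloomFilter(self.slice_capacity, self.error_rate))
        self.slices[-1].add(url)
        return False


if __name__ == "__main__":
    dedup = GrowingUrlFilter(slice_capacity=100)
    for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
        print(url, "duplicate" if dedup.seen_before(url) else "new")
```

A crawler would call `seen_before` on each extracted URL and enqueue only those reported as new. In this sketch a lookup probes every slice, so the overall false-positive rate rises as slices are added; a production design would need to budget the per-slice error rate accordingly.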
Authors
袁志伟
杨鹏
刘旋
YUAN Zhiwei, YANG Peng, LIU Xuan (School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration of the Ministry of Education, Southeast University, Nanjing 211100, China)
Source
《太原理工大学学报》
CAS
Peking University Core Journals (北大核心)
2016, No. 1, pp. 68-74 (7 pages)
Journal of Taiyuan University of Technology
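Funding
National 863 Program project: Efficient content distribution technology based on content clustering and interest adaptation (2013AA013503)
National Natural Science Foundation of China project: Research on a new type of network with complementary dual structures and its key technologies (61472080)
Chinese Academy of Engineering consulting research project: Research on the national second network strategy (2015-XY-04)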