Abstract
How a web crawler should crawl under a bandwidth limit is a problem of great practical value, yet it has received little study. This paper presents a crawler bandwidth control policy based on polite crawling of sites. A model for predicting the download speed of different sites is built, and the maximum request frequency for each site is derived under the polite-crawling constraint. From these, a site-oriented crawl control algorithm is obtained. Experiments show that the method can make full use of the bounded bandwidth.
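The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes each site comes with a predicted download speed, an average page size, and a minimum politeness delay between requests (all hypothetical inputs), derives the maximum polite request frequency per site, and then greedily picks sites whose combined average bandwidth stays within the limit. The names `Site`, `site_bandwidth`, and `select_sites` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    predicted_speed: float  # assumed: predicted download speed, bytes/s
    avg_page_size: float    # assumed: average page size, bytes
    min_delay: float        # assumed: politeness delay between requests, s


def max_request_frequency(site: Site) -> float:
    """Highest request rate (requests/s) that respects the politeness delay.

    Requests to one site are assumed sequential, so a new request can start
    no sooner than the previous download finishes or the delay elapses.
    """
    download_time = site.avg_page_size / site.predicted_speed
    return 1.0 / max(site.min_delay, download_time)


def site_bandwidth(site: Site) -> float:
    """Average bandwidth (bytes/s) used when crawling at the maximum polite rate."""
    return site.avg_page_size * max_request_frequency(site)


def select_sites(sites, bandwidth_limit):
    """Greedily choose sites, largest consumers first, without exceeding the limit."""
    selected, used = [], 0.0
    for site in sorted(sites, key=site_bandwidth, reverse=True):
        bw = site_bandwidth(site)
        if used + bw <= bandwidth_limit:
            selected.append(site.name)
            used += bw
    return selected, used


sites = [
    Site("A", predicted_speed=100_000, avg_page_size=50_000, min_delay=1.0),
    Site("B", predicted_speed=20_000, avg_page_size=40_000, min_delay=1.0),
    Site("C", predicted_speed=200_000, avg_page_size=80_000, min_delay=2.0),
]
chosen, used = select_sites(sites, bandwidth_limit=100_000)
print(chosen, used)
```

A greedy selection is used here purely for brevity; a real controller would presumably re-estimate download speeds as measurements arrive and adjust the set of active sites over time.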
Source
《微计算机信息》 (Microcomputer Information)
Peking University Core Journal List
2008, No. 33, pp. 76-77, 106 (3 pages)
Control & Automation
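Funding
National Natural Science Foundation of China project "Research on Key Technologies for Focused Crawlers Based on Incremental Learning" (No. 60603066)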
Keywords
Web crawler
bounded bandwidth
polite crawling