Abstract
With the rapid growth of the Internet, the volume of information and related services it carries is also growing rapidly. Retrieving the required information quickly and accurately from this mass of data is becoming increasingly important, so web crawlers, which are responsible for collecting information from the Internet, face both great opportunities and great challenges. At present, large search engines in China and abroad offer users only non-customizable search services, while a single-machine web crawler is not up to the task. Distributed web crawlers that are highly customizable, fast at collecting information, and able to operate at large scale have emerged in response. Based on a study of the original Scrapy framework, this paper improves the original crawler framework by combining Scrapy with Redis, and designs and implements a distributed web crawler system based on the Scrapy framework. Second-hand housing listings crawled from sites such as Anjuke (www.anjuke.com), 58.com (www.58.com), and SouFun (www.fang.com) are stored in MongoDB to support further processing and analysis. The results show that the distributed crawler system based on the Scrapy framework is more efficient and more stable than a single-machine web crawler system.
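The abstract does not describe how Scrapy and Redis are combined. One common reading is that all crawler nodes share a single request queue and a single set of already-seen URL fingerprints, so that no page is fetched twice. The sketch below illustrates only that idea, under stated assumptions: `InMemoryFrontier`, `run_workers`, the node names, and the `example.com` URLs are hypothetical, and the in-memory deque and set merely stand in for what would be a Redis list and a Redis set in a real deployment. It is not the authors' implementation.

```python
import hashlib
from collections import deque


class InMemoryFrontier:
    """Hypothetical stand-in for a shared queue plus a seen-set.

    In a real deployment the deque would be a Redis list and the
    set a Redis set, so that every crawler node sees the same state.
    """

    def __init__(self):
        self._queue = deque()  # pending URLs, first in first out
        self._seen = set()     # fingerprints of URLs ever enqueued

    def push(self, url):
        """Enqueue url unless it was seen before; return True if enqueued."""
        fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        self._queue.append(url)
        return True

    def pop(self):
        """Return the next pending URL, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None


def run_workers(frontier, worker_names, fetch_links):
    """Simulate several crawler nodes taking turns on one shared frontier."""
    crawled = {name: [] for name in worker_names}
    progressed = True
    while progressed:
        progressed = False
        for name in worker_names:
            url = frontier.pop()
            if url is None:
                continue
            progressed = True
            crawled[name].append(url)
            # Newly discovered links go back to the shared frontier,
            # which silently drops any that were already seen.
            for link in fetch_links(url):
                frontier.push(link)
    return crawled


# Assumed toy link graph used in place of real HTTP requests.
LINK_GRAPH = {
    "https://example.com/list": [
        "https://example.com/item/1",
        "https://example.com/item/2",
    ],
    "https://example.com/item/1": ["https://example.com/list"],
    "https://example.com/item/2": ["https://example.com/item/1"],
}

frontier = InMemoryFrontier()
frontier.push("https://example.com/list")
result = run_workers(
    frontier, ["node-a", "node-b"], lambda url: LINK_GRAPH.get(url, [])
)
for node, urls in result.items():
    print(node, urls)
```

Because deduplication happens at the shared frontier rather than inside each node, adding nodes spreads the work without causing repeated fetches of the same page.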
Authors
李代祎
谢丽艳
钱慎一
吴怀广
LI Daiyi; XIE Liyan; QIAN Shenyi; WU Huaiguang (School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou 450002, China; Henan School of Administration of Industry and Commerce, Zhengzhou 450002, China)
Source
《湖北民族学院学报(自然科学版)》
CAS
2017, No. 3, pp. 317-322 (6 pages)
Journal of Hubei Minzu University(Natural Science Edition)
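Funding
National Natural Science Foundation of China (61672470)
Henan Province Science and Technology Research Project (162102410076)
Henan Province Major Science and Technology Special Project (161100110900)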