Abstract
With the rapid growth of the Internet, the volume of information has grown just as quickly. To retrieve specific, useful information efficiently, this paper studies the open-source crawler framework Scrapy and combines it with the Redis and MongoDB databases to design and implement a distributed web crawler system. The system crawls rental listings from 58同城 (58.com), storing page data in MongoDB and page links in Redis. The work concentrates on handling and optimizing for anti-crawler measures, and it replaces the traditional deployment environment with Docker containers. The results show that the Docker-based distributed crawler system runs more efficiently and more stably than the VM-based one.
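The split between a shared link store and a document store can be illustrated with a minimal sketch. It is not the paper's implementation: `SharedFrontier`, `DocumentStore`, `crawl`, `fake_fetch`, and `FAKE_SITE` are hypothetical names, and both stores are in-memory stand-ins, so the example runs without a Redis or MongoDB server. The idea shown is that every worker pulls URLs from one deduplicated queue (the role the abstract assigns to Redis) and writes parsed pages to one collection (the role assigned to MongoDB).

```python
import hashlib
from collections import deque


class SharedFrontier:
    """In-memory stand-in for the Redis-side link store (hypothetical)."""

    def __init__(self):
        self._queue = deque()  # pending URLs, playing the role of a Redis list
        self._seen = set()     # URL fingerprints, playing the role of a Redis set

    def push(self, url):
        """Enqueue url unless it was already seen; return True if it was added."""
        fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        self._queue.append(url)
        return True

    def pop(self):
        """Return the next pending URL, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


class DocumentStore:
    """In-memory stand-in for a MongoDB collection (hypothetical)."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(dict(doc))


def crawl(worker_name, frontier, store, fetch, max_pages):
    """Process up to max_pages URLs from the shared frontier; return the count."""
    done = 0
    while done < max_pages:
        url = frontier.pop()
        if url is None:
            break
        item, links = fetch(url)
        item["worker"] = worker_name
        store.insert_one(item)       # page data goes to the document store
        for link in links:
            frontier.push(link)      # new links go back to the shared queue
        done += 1
    return done


# A made-up three-page site used in place of real HTTP requests (assumption).
FAKE_SITE = {
    "/list": ["/room/1", "/room/2", "/room/1"],
    "/room/1": ["/list"],
    "/room/2": [],
}


def fake_fetch(url):
    """Return a parsed item and the outgoing links for url."""
    return {"url": url}, FAKE_SITE.get(url, [])


frontier = SharedFrontier()
store = DocumentStore()
frontier.push("/list")
pages_a = crawl("worker-a", frontier, store, fake_fetch, max_pages=2)
pages_b = crawl("worker-b", frontier, store, fake_fetch, max_pages=10)
```

Because both workers share one frontier, a page reached by several links is fetched only once, and the second worker simply continues where the first stopped. In a real deployment the two stand-in classes would be backed by actual Redis and MongoDB clients.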
Authors
FANG Qizhou (方奇洲)
CHENG Youqing (程友清)
Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China; FiberHome Telecommunication Technologies Co., Ltd., Wuhan 430074, China
Source
Electronic Design Engineering (《电子设计工程》)
2020, No. 8, pp. 61-65 (5 pages)