摘要
随着互联网技术的飞速发展,互联网信息和资源呈指数级爆炸式增长。如何快速有效的从海量的网页信息中获取有价值的信息,用于搜索引擎和科学研究,是一个关键且重要的基础工程。分布式网络爬虫较集中式网络爬虫具有明显的速度与规模优势,能够很好的适应数据的大规模增长,提供高效、快速、稳定的Web数据爬取。本文采用Redis设计实现了一个主从式分布式网络爬虫系统,用于快速、稳定、可拓展地爬取海量的Web资源。系统实现了分布式爬虫的核心框架,可以完成绝大多数Web内容的爬取,并且节点易于拓展,爬取内容可以定制,主从结构使得系统稳定且便于维护。
With the rapid development of Internet technology, the Internet information and resources are expo-nentially explosive growth. How to quickly and effectively obtain valuable information from a large amount of web pages for search engines and scientific research is a key and important infrastructure project. Distributed web crawler has obvious advantages in speed and scale, which can adapt to the massive growth of data, and provide effi-cient, fast and stable Web data crawling. In this paper, Redis is used to design and implement a master-slave distrib-uted network crawler system, which can be used for fast, stable and scalable crawling Web resources. The system realizes the core framework of the distributed crawler, which can complete the crawling of the vast majority of Web content, and the nodes are easy to expand, the crawling content can be customized, and the master-slave structure makes the system stable and easy to maintain.
出处
《软件》
2017年第10期83-87,共5页
Software
基金
湖北省自然科学基金资助项目"面向数字取证的数据约简技术研究"(2015CFB764)