Abstract
In the big data era, capturing the data users need from the Internet accurately and efficiently has become a critical problem. This paper proposes a configurable focused web crawler framework: driven by settings in configuration files, it builds a focused crawler that collects data accurately and is highly controllable. Building on this, the focused crawler's workflow is improved to support automatic Deep Web form submission and Deep Web data capture. Experiments that crawl the Institute of High Energy Physics (IHEP) website and the mobile version of Tencent Weibo, together with the crawler's performance in actual operation on the IHEP big data platform, demonstrate the effectiveness and practicality of the design.
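As a rough illustration of the configuration-driven idea described in the abstract, the sketch below shows how a small configuration dictionary might steer which links a focused crawler follows. It is not the authors' implementation: the configuration keys (`seeds`, `allowed_domains`, `url_patterns`, `max_depth`), the `fetch_links` callback, and the in-memory `FAKE_SITE` are all hypothetical, chosen so the example runs without network access.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical configuration; the abstract does not specify the real format.
CONFIG = {
    "seeds": ["http://example.org/"],
    "allowed_domains": ["example.org"],
    "url_patterns": [r"/news/"],  # only follow links matching these patterns
    "max_depth": 2,
}


def is_in_focus(url, config):
    """Return True if the host is allowed and the URL matches a focus pattern."""
    host = urlparse(url).hostname or ""
    domain_ok = any(
        host == d or host.endswith("." + d) for d in config["allowed_domains"]
    )
    pattern_ok = any(re.search(p, url) for p in config["url_patterns"])
    return domain_ok and pattern_ok


def crawl(config, fetch_links):
    """Breadth-first crawl; fetch_links(url) returns the raw hrefs on a page."""
    seen = set(config["seeds"])
    queue = deque((u, 0) for u in config["seeds"])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= config["max_depth"]:
            continue  # do not expand pages at the depth limit
        for href in fetch_links(url):
            link = urljoin(url, href)  # resolve relative links
            if link not in seen and is_in_focus(link, config):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Stand-in for real HTTP fetching and HTML parsing (an assumption for the demo).
FAKE_SITE = {
    "http://example.org/": ["/news/a", "/about", "http://other.com/news/x"],
    "http://example.org/news/a": ["/news/b", "/news/a"],
    "http://example.org/news/b": [],
}

result = crawl(CONFIG, lambda u: FAKE_SITE.get(u, []))
print(result)
```

Keeping the focus rules in data rather than in code means the same crawl loop can be retargeted at a different site by editing the configuration alone. Deep Web form submission would need an additional step that fills in and posts forms, which this sketch does not attempt.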
Source
《核电子学与探测技术》
CAS
CSCD
Peking University Core Journals (北大核心)
2014, No. 3, pp. 353-358 (6 pages)
Nuclear Electronics & Detection Technology