Abstract
To support personalized data collection, this paper proposes a lightweight web crawler framework comprising components such as a controller, downloader, parser, thread pool, and proxy pool. Within this framework, a crawler controller with real-time processing capability is designed that can automatically save and restore task state. The working principle and C# implementation of the controller are described in detail, and the controller is applied to collecting articles within a website. Experimental results show that the proposed framework is efficient and easy to use, and that the controller's real-time processing capability is very important in practical crawler development.
Authors
LI Jian (李健)
ZHANG Ke-liang (张克亮)
Luoyang Campus, Information Engineering University, Luoyang 471003
Source
Modern Computer (《现代计算机》)
2021, No. 5, pp. 91-96 (6 pages)
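Funding
Major Program of the National Natural Science Foundation of China: Research on the Acquisition, Annotation, and Analysis of Multilingual Speech Data (No. 11590771).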
Keywords
Web Crawler
Crawler Framework
Real-Time Controller
C#