Abstract
Collecting paper data from a single data source has two main problems: coverage is incomplete, and collection speed is capped by the website's access-frequency limits. To address these problems, this paper proposes a paper data crawling technique based on multiple data sources. First, using four Chinese literature service websites (CNKI, Wanfang Data, VIP, and Chaoxing Journals) as data sources, it crawls and parses the list pages returned for the search keywords. Next, a task scheduling strategy removes records duplicated across the data sources and balances the tasks among them. Finally, multiple threads crawl, parse, and store the detailed paper information from each data source, and a web page is built for search and display. Experiments show that, when the crawling and parsing speed for a single web page is the same, the technique completes the paper information collection task more comprehensively and efficiently, which confirms its effectiveness.
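The three stages named in the abstract (merge list-page records, schedule them across sources, then fetch details with multiple threads) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the scheduling strategy, so the title-based deduplication key, the greedy shortest-queue balancing rule, and the names `normalize`, `schedule`, `fetch_detail`, and `crawl` are all assumptions made for illustration, and `fetch_detail` is a stub that performs no network access.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def normalize(title):
    """Hypothetical dedup key: lowercased title with all whitespace removed."""
    return "".join(title.lower().split())


def schedule(list_records):
    """Merge duplicates across sources, then assign each paper to one source.

    list_records: iterable of (source, title) pairs parsed from list pages.
    Returns {source: [title, ...]} with every distinct paper assigned once.
    """
    candidates = {}  # dedup key -> (first-seen title, sources offering it)
    for source, title in list_records:
        key = normalize(title)
        if key not in candidates:
            candidates[key] = (title, [])
        if source not in candidates[key][1]:
            candidates[key][1].append(source)

    queues = defaultdict(list)
    # Assumed balancing rule: place papers with the fewest candidate sources
    # first, then give each paper to its source with the shortest queue.
    for title, sources in sorted(candidates.values(), key=lambda c: len(c[1])):
        target = min(sources, key=lambda s: len(queues[s]))
        queues[target].append(title)
    return dict(queues)


def fetch_detail(source, title):
    """Stub for the real detail-page request and parse (no network here)."""
    return {"source": source, "title": title}


def crawl(queues, workers_per_source=2):
    """Drain every source's queue in parallel, one small thread pool each."""
    def drain(source):
        with ThreadPoolExecutor(max_workers=workers_per_source) as pool:
            return list(pool.map(lambda t: fetch_detail(source, t), queues[source]))

    with ThreadPoolExecutor(max_workers=max(1, len(queues))) as outer:
        batches = list(outer.map(drain, list(queues)))
    return [row for batch in batches for row in batch]


if __name__ == "__main__":
    records = [
        ("CNKI", "Paper A"), ("Wanfang", "Paper A"), ("Wanfang", "paper  a"),
        ("VIP", "Paper B"), ("CNKI", "Paper B"),
        ("CNKI", "Paper C"),
        ("Chaoxing", "Paper D"), ("CNKI", "Paper D"),
    ]
    plan = schedule(records)
    for source, titles in plan.items():
        print(source, titles)
    print(len(crawl(plan)), "detail records fetched")
```

Giving each source its own pool keeps the request rate per website bounded by `workers_per_source`, which is one plausible way to respect per-site access-frequency limits while still working on all sources at once.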
Authors
侯晋升
张仰森
黄改娟
段瑞雪
Hou Jinsheng; Zhang Yangsen; Huang Gaijuan; Duan Ruixue (Institute of Intelligent Information, Beijing Information Science & Technology University, Beijing 100101, China; National Economic Security Early Warning Engineering Beijing Laboratory, Beijing 100044, China)
Source
《计算机应用研究》
CSCD
Peking University Core Journal (北大核心)
2021, Issue 2, pp. 517-521 (5 pages)
Application Research of Computers
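Funding
National Natural Science Foundation of China (61772081)
Science and Technology Innovation Service Capacity Building, Research Base Construction, Beijing Laboratory: National Economic Security Early Warning Engineering Beijing Laboratory project (PXM2018_014224_000010)
National Key Research and Development Program of China project (2018YFB1402901)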
Keywords
web crawler
multiple data sources
multithreading
information processing
data display