Abstract
To improve the efficiency and precision of web page acquisition, this paper proposes an extraction method that supports visual template configuration. The user clicks elements in the target page, and the method automatically generates an extraction template based on DOM paths. The paper describes the principle of DOM-path-based extraction in detail, examines the key techniques of visual template configuration, and applies the method to a news acquisition system to test its practical effect.
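The idea of a DOM-path template can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `dom_path` and `extract`, the path format, and the sample pages are all assumptions made for illustration, and the "click" is simulated by looking up an element in a well-formed sample page using Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages that share one layout (illustrative data only).
PAGE_A = "<html><body><div>nav</div><div><h1>Title A</h1><p>Body A</p></div></body></html>"
PAGE_B = "<html><body><div>nav</div><div><h1>Title B</h1><p>Body B</p></div></body></html>"


def dom_path(root, target):
    """Build an index-based path from root down to target.

    Each step is tag[n], where n is the 1-based position of the node
    among its siblings that have the same tag.
    """
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "/".join(reversed(steps))


def extract(root, path):
    """Apply a stored path to a page and return the text found, or None."""
    node = root.find(path)
    return None if node is None else "".join(node.itertext()).strip()


# Simulated click: the user selects the headline on a sample page.
root_a = ET.fromstring(PAGE_A)
clicked = root_a.find(".//h1")
template = {"title": dom_path(root_a, clicked)}

# The template is then reused on another page with the same layout.
root_b = ET.fromstring(PAGE_B)
title_b = extract(root_b, template["title"])
```

Real pages are rarely well-formed XML, so a practical system would need a tolerant HTML parser and some way to handle layout variations between pages; the sketch only shows how a path recorded from one page can serve as a template for others.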
Authors
李健
马延周
LI Jian; MA Yan-zhou (Basic Department of Luoyang Campus, the PLA Information Engineering University, Luoyang 471003)
Source
Modern Computer (《现代计算机》)
2018, No. 7, pp. 56-60 (5 pages)
Funding
Major Program of the National Natural Science Foundation of China (No. 11590771)
Keywords
Web Crawler
Webpage Extraction
DOM Template
Visual Configuration