Abstract
Traditional wrappers are defined by hand, and a separate wrapper must be built for each type of Web page, so wrapper maintenance is costly: once the style of a page changes, its wrapper has to be redefined. To address the problems that existing methods require wrappers to be defined and maintained manually and that their accuracy still needs improvement, this paper proposes a feasible technique for automatic wrapper generation based on DOM tree abstraction. The technique consists of two parts: first, DOM tree abstraction for pages of the target type; second, target node location and wrapper generation. It can generate wrappers automatically for many types of Web pages. Experiments were conducted on mainstream shopping websites (Jingdong, Amazon, Suning, Dangdang) and a mainstream book information website (Douban Books). The experimental results show that the average precision and recall of the method reach 96% and 99%, respectively.
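The abstract names the two stages but does not give their details, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes that "abstraction" means keeping the tag paths shared by all sample pages of one type, and that a "wrapper" is the tag path of the located target node. The function names and sample pages are hypothetical.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they must not be pushed onto the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Records (tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.records.append(("/" + "/".join(self.stack), text))


def text_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.records


def abstract_paths(pages):
    """Stage 1 (assumed): keep only tag paths present in every sample page."""
    path_sets = [{path for path, _ in text_paths(page)} for page in pages]
    return set.intersection(*path_sets)


def generate_wrapper(sample_page, target_text, shared_paths):
    """Stage 2 (assumed): the path of the node holding target_text is the wrapper."""
    for path, text in text_paths(sample_page):
        if text == target_text and path in shared_paths:
            return path
    return None


def apply_wrapper(wrapper, page):
    """Extracts the text of every node whose tag path equals the wrapper."""
    return [text for path, text in text_paths(page) if path == wrapper]


# Hypothetical sample pages of the same type (two book detail pages).
page_a = "<html><body><div><h1>Book A</h1><span>12.50</span></div><p>Ad banner</p></body></html>"
page_b = "<html><body><div><h1>Book B</h1><span>30.00</span></div></body></html>"

shared = abstract_paths([page_a, page_b])
wrapper = generate_wrapper(page_a, "12.50", shared)
prices = apply_wrapper(wrapper, page_b)
```

In this toy case the page-specific advertisement path is dropped during abstraction, and the wrapper derived from one page extracts the price from the other. A practical system would also need to handle attributes, repeated sibling nodes, and malformed HTML, which this sketch ignores.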
Authors
ZHANG Jiajun (张佳俊)
WANG Yizhou (王一洲)
CHEN Xing (陈星)
ZHANG Ying (张颖)
College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian 350108, China
Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou, Fujian 350108, China
National Engineering Research Center of Software Engineering, Peking University, Beijing 100871, China
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2018, Issue A01, pp. 150-154, 182 (6 pages)
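Funding
National Key Research and Development Program of China (2017YFB1002000)
National Natural Science Foundation of China (61402111)
Project of the Haixi Government Affairs Big Data Application Collaborative Innovation Center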
Keywords
DOM
abstraction
information extraction
wrapper
automatic generation