Automatic Wrapper Generation Technique Based on DOM Tree Abstraction
Abstract: Traditional wrappers are defined by hand, and a separate wrapper has to be written for each type of Web page, so wrapper maintenance is costly: once the style of a page changes, its wrapper must be redefined. To address the problems that existing methods require wrappers to be defined and maintained manually and that their accuracy still needs improvement, this paper presents a feasible automatic wrapper generation technique based on DOM tree abstraction. The technique consists of two parts: first, DOM tree abstraction for pages of the target type; second, target node location and wrapper generation. It can generate wrappers automatically for many types of Web pages. Experiments were conducted on mainstream shopping websites (Jingdong, Amazon, Suning, Dangdang) and a mainstream book information website (Douban Books). The experimental results show that the average precision and recall of the method can reach 96% and 99%, respectively.
Authors: ZHANG Jiajun (张佳俊); WANG Yizhou (王一洲); CHEN Xing (陈星); ZHANG Ying (张颖) (College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian 350108, China; Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou, Fujian 350108, China; National Engineering Research Center of Software Engineering, Peking University, Beijing 100871, China)
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2018, Issue A01, pp. 150-154, 182 (6 pages)
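Funding: National Key Research and Development Program of China (2017YFB1002000); National Natural Science Foundation of China (61402111); Haixi Government Affairs Big Data Application Collaborative Innovation Center project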
Keywords: DOM; abstraction; information extraction; wrapper; automatic generation
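
The two parts named in the abstract (abstracting the DOM tree, then locating the target node and generating a wrapper) can be illustrated with a small Python sketch. This is not the paper's algorithm, which the abstract does not spell out. It assumes that "abstraction" can be approximated by reducing each element to its tag and `class` attribute, and that a wrapper can be represented as a path expression. The function names, the sample pages, and the labeled value are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample pages that share one template.
SAMPLE_PAGE = ("<html><body><div class='item'><h1 class='title'>Book A</h1>"
               "<span class='price'>35.00</span></div></body></html>")
OTHER_PAGE = ("<html><body><div class='item'><h1 class='title'>Book B</h1>"
              "<span class='price'>42.50</span></div></body></html>")


def find_trail(node, target_text, trail=()):
    """Depth-first search; return the elements below the root that lead to
    the first node whose text equals target_text, or None if not found."""
    if (node.text or "").strip() == target_text:
        return list(trail)
    for child in node:
        found = find_trail(child, target_text, trail + (child,))
        if found is not None:
            return found
    return None


def abstract_step(element):
    """Assumed abstraction: keep only the tag and class, drop text and position."""
    cls = element.get("class")
    return f"{element.tag}[@class='{cls}']" if cls else element.tag


def generate_wrapper(sample_page, labeled_value):
    """Locate the labeled target node and emit a path expression as the wrapper."""
    root = ET.fromstring(sample_page)
    trail = find_trail(root, labeled_value)
    if not trail:
        raise ValueError("labeled value not found below the root element")
    return "./" + "/".join(abstract_step(el) for el in trail)


def apply_wrapper(wrapper, page):
    """Extract the text of every node that the wrapper path matches."""
    root = ET.fromstring(page)
    return [(el.text or "").strip() for el in root.findall(wrapper)]


def precision_recall(extracted, expected):
    """Standard precision and recall over sets of extracted values."""
    extracted, expected = set(extracted), set(expected)
    hits = len(extracted & expected)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    wrapper = generate_wrapper(SAMPLE_PAGE, "35.00")
    print(wrapper)
    print(apply_wrapper(wrapper, OTHER_PAGE))
```

Because the generated path ignores text content, a wrapper built from one labeled page can be reused on other pages of the same template. Real pages are rarely well-formed XML, so a practical version would need a tolerant HTML parser.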
