Data Extraction Based on Vision and Tag Path (基于视觉信息和标签路径的数据抽取)

Abstract: This paper studies the extraction of semi-structured data from Deep Web query result pages, combining the visual information of a page with its DOM tree structure. The data region is located by the ratio of a visual block's area to the area of the whole page. A set of nodes containing data records is identified from visual features, such as data records being pairwise adjacent, and noise nodes are removed by comparing the similarity of the nodes' DOM tree structures. The data items of each record are then aligned according to their xpath attributes. A template is generated for the whole extraction process, which greatly improves extraction efficiency. Data extraction experiments on eight Deep Web sites show that the method is effective.
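As a rough illustration of two of the steps named in the abstract (removing noise nodes by DOM structure similarity, and aligning data items by path), the following Python sketch may help. It is not the authors' implementation: the function names, the Jaccard similarity measure, the 0.4 threshold, and the sample markup are all assumptions made for this example, and the vision-based steps (block area ratio, adjacency) are left out because they require a rendering engine.

```python
import xml.etree.ElementTree as ET


def tag_paths(node, prefix=""):
    """Yield (tag path, text) for node and every descendant that carries text."""
    path = f"{prefix}/{node.tag}"
    text = (node.text or "").strip()
    if text:
        yield path, text
    for child in node:
        yield from tag_paths(child, path)


def path_similarity(a, b):
    """Jaccard similarity of the tag-path sets of two nodes (assumed measure)."""
    pa = {p for p, _ in tag_paths(a)}
    pb = {p for p, _ in tag_paths(b)}
    if not pa and not pb:
        return 1.0
    return len(pa & pb) / len(pa | pb)


def filter_noise(candidates, threshold=0.4):
    """Keep nodes whose mean similarity to the other candidates reaches the
    threshold. The threshold value is an assumption, not taken from the paper."""
    kept = []
    for i, node in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            kept.append(node)
            continue
        mean = sum(path_similarity(node, o) for o in others) / len(others)
        if mean >= threshold:
            kept.append(node)
    return kept


def align_items(records):
    """Build one column per distinct tag path, in first-seen order."""
    columns = {}
    rows = []
    for rec in records:
        row = {}
        for path, text in tag_paths(rec):
            columns.setdefault(path, None)
            row.setdefault(path, text)  # keep the first item seen on a path
        rows.append(row)
    header = list(columns)
    return header, [[row.get(p, "") for p in header] for row in rows]


# Hypothetical data region: three records and one noise node.
SAMPLE = """
<div>
  <div><a>Book A</a><span>$10</span></div>
  <div><a>Book B</a><span>$12</span><em>new</em></div>
  <p>Sponsored links</p>
  <div><a>Book C</a><span>$9</span></div>
</div>
"""

region = ET.fromstring(SAMPLE.strip())
records = filter_noise(list(region))
header, table = align_items(records)
print(header)  # ['/div/a', '/div/span', '/div/em']
for row in table:
    print(row)
```

In this sample the `<p>` node shares no tag path with the record nodes, so its mean similarity is 0 and it is discarded, while the optional `<em>` item in the second record becomes a third column that is empty for the other two records.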

Authors: Zhang Wei (张巍), Zou Xiaoming (邹晓明), Tan Fengzhen (谈凤真)

Affiliation: College of Information Science and Engineering, Ocean University of China

Source: Periodical of Ocean University of China (《中国海洋大学学报（自然科学版）》), 2015, No. 5, pp. 114-119 (6 pages). Indexed in CAS, CSCD, and the Peking University core journal list.

Funding: Supported by the Natural Science Foundation of Shandong Province (ZR2012FM016)

Keywords: Deep Web; data extraction; visual feature; tag path

Classification number: TV149.2 [Hydraulic Engineering: Hydraulics and River Dynamics]
