摘要
结合网页的视觉信息和DOM树结构,研究从Deep Web查询结果页面中抽取半结构化数据的问题。通过视觉块与整个网页的面积比定位数据区域。根据数据记录两两相邻等视觉特征找到包含数据记录的一组节点,并通过比较各节点的DOM树结构的相似度去除噪音节点。根据xpath属性将各条数据记录的数据项对齐。对整个抽取过程生成模板,可以使抽取效率得到很大提高。对8个Deep Web网站进行了抽取数据实验,结果表明本文方法是有效的。
Semi-structured data extracted from Deep Web query results page is studied, based on the visual information and DOM tree structure of pages. The data region is determined by the ratio of visual block area to the entire page. A set of nodes with data records are identified according to visual features, such as adjacency. Noise nodes are eliminated by comparing the similarity of nodes' DOM tree struc- ture. According to xpath attributes, all data items are aligned. Template is generated for the process of extraction, which significantly improves the extraction efficiency. Experiments of data extraction were con- ducted with eight Deep Web websites, the results of which fully testify the effectiveness of our method.
出处
《中国海洋大学学报(自然科学版)》
CAS
CSCD
北大核心
2015年第5期114-119,共6页
Periodical of Ocean University of China
基金
山东省自然科学基金项目(ZR2012FM016)资助