Abstract
Div+CSS is widely used for Web page layout. Under this layout, many of a page's data records are grouped at a single level of the document as repeated structures. This paper proposes a method for Web data extraction based on attribute tags. It constructs a DOM tree annotated with attribute tags, mines repeated patterns by comparing the values of those attribute tags, applies three rules to eliminate interfering patterns and locate the data regions, and then extracts the data records from those regions.
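As a rough illustration of the pattern-mining step, the following minimal sketch builds a small DOM tree whose nodes keep their attributes, then reports parents that have several children with the same tag and attribute value. It is not the paper's algorithm. The `Node` and `TreeBuilder` classes, the function `mine_repeated_patterns`, the use of the `class` attribute as the compared value, and the `min_repeats` threshold are all assumptions made for illustration. The abstract does not state the three filtering rules, so they are not implemented here.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A DOM node that keeps its attributes (hypothetical helper)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []

    def signature(self):
        # Assumption: a node is characterised by its tag name and class value.
        return (self.tag, self.attrs.get("class") or "")


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def mine_repeated_patterns(html, min_repeats=3):
    """Return (parent signature, child signature, count) for repeated children."""
    builder = TreeBuilder()
    builder.feed(html)
    results = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        counts = Counter(child.signature() for child in node.children)
        for sig, n in counts.items():
            if n >= min_repeats:
                results.append((node.signature(), sig, n))
        stack.extend(node.children)
    return results


page = (
    '<div id="list">'
    '<div class="item"><a>A</a></div>'
    '<div class="item"><a>B</a></div>'
    '<div class="item"><a>C</a></div>'
    '<p class="ad">x</p>'
    '</div>'
)
print(mine_repeated_patterns(page))
```

In this sketch the outer `div` is a candidate data region because three of its children share the signature `("div", "item")`, while the single `p` element with class `ad` is not reported. A real implementation would still need rules to discard repeated structures such as navigation bars, which is the role the three rules play in the paper.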
Source
《计算机应用与软件》
CSCD
Peking University Core Journal
2012, No. 11, pp. 156-159 (4 pages)
Computer Applications and Software
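Funding
Supported by the Open Project of the Shanghai Key Laboratory of Integrated Administration Technologies for Information Security (AGK2009008)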