Abstract
The main way to obtain information from the Deep Web is to submit queries through the query interfaces it provides. Most current research reads the form structure from the <form></form> tags in a page and then judges whether the form is a Deep Web query interface. This paper proposes the concept of an interface block and designs a method for locating interface blocks based on page information and visual information. Deciding whether an interface block is a Deep Web interface is then treated as a pattern-recognition classification problem: appropriate structural features of the form are extracted, and a classification algorithm combining the C4.5 decision tree and SVM judges each interface block, yielding the Deep Web query interfaces contained in the page. Experiments on the TEL-8 dataset from UIUC show that the method reaches an accuracy of 97.30%, indicating good feasibility and practicality.
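As a rough illustration of the feature-extraction step mentioned in the abstract, the sketch below counts a few structural properties of each form in a page using only Python's standard `html.parser`. The abstract does not list the features the authors use, so the feature names here (`text_inputs`, `password_inputs`, `choice_inputs`, `selects`, `submits`, `method`), the class `FormFeatureExtractor`, and the helper `extract_form_features` are all assumptions made for illustration. The C4.5 and SVM classification stage and the visual interface-block location are not shown.

```python
from html.parser import HTMLParser


class FormFeatureExtractor(HTMLParser):
    """Collect simple structural counts for every <form> in a page.

    The chosen features are hypothetical; the paper's actual feature
    set is not given in the abstract.
    """

    def __init__(self):
        super().__init__()
        self.forms = []        # one feature dict per <form> encountered
        self._current = None   # feature dict of the form currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "method": (attrs.get("method") or "get").lower(),
                "text_inputs": 0,
                "password_inputs": 0,
                "choice_inputs": 0,   # checkboxes and radio buttons
                "selects": 0,
                "submits": 0,
            }
            self.forms.append(self._current)
        elif self._current is not None:
            if tag == "select":
                self._current["selects"] += 1
            elif tag == "input":
                # An <input> without a type attribute defaults to "text".
                kind = (attrs.get("type") or "text").lower()
                if kind in ("text", "search"):
                    self._current["text_inputs"] += 1
                elif kind == "password":
                    self._current["password_inputs"] += 1
                elif kind in ("checkbox", "radio"):
                    self._current["choice_inputs"] += 1
                elif kind in ("submit", "image"):
                    self._current["submits"] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def extract_form_features(html):
    """Return a list of feature dicts, one per form in the HTML string."""
    parser = FormFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.forms


SAMPLE_PAGE = """
<html><body>
  <form method="get" action="/search">
    <input type="text" name="title">
    <select name="category"><option>Books</option></select>
    <input type="submit" value="Search">
  </form>
  <form method="post" action="/login">
    <input type="text" name="user">
    <input type="password" name="pw">
    <input type="submit" value="Log in">
  </form>
</body></html>
"""

features = extract_form_features(SAMPLE_PAGE)
for form_features in features:
    print(form_features)
```

Feature vectors of this kind could then be passed to a classifier. A password field, for instance, is one plausible signal that a form is a login form and not a query interface, though whether the authors rely on that signal is not stated.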
Authors
YANG Yonghong (杨永红)
GAO Lei (高磊)
YU Hang (余航)
XU Xinchen (徐欣辰)
Affiliations: Exploration & Development Research Institute, Shengli Oilfield Branch Company, SINOPEC, Dongying, Shandong 257000, China; College of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal (北大核心)
2017, Issue 7, pp. 109-114 (6 pages)
Keywords
Deep Web interface
Document Object Model (DOM) tree
interface block
multi-class classification