A Method for Automatically Classifying and Discriminating Deep Web Data Sources Using Multiple Classifiers


Abstract: The discovery of Deep Web data sources and their domain relevance has attracted growing attention. To address the low extraction precision and the neglect of domain relevance when identifying query interfaces, this paper proposes a method that uses multiple classifiers to automatically classify and discriminate Deep Web data sources. The idea is as follows: first, a Naive Bayes classifier classifies the pages fetched by the crawler according to their domain relevance; second, an improved C4.5 decision tree classifier determines which pages in the specific domain are data sources. Experimental results show that the method performs better than a single decision tree classifier, improving both recall and precision.
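The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the features, the training data, or the improvements made to the decision tree, so everything below is an assumption for illustration: the toy training pages, the form feature names (`password_fields`, `text_fields`, `select_fields`), and the helper names are hypothetical. Stage 1 is a small multinomial Naive Bayes classifier with Laplace smoothing. Stage 2 is a hand-written decision rule standing in for the learned tree; it is not the paper's improved C4.5 classifier.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(doc):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def is_query_interface(form):
    """Hypothetical stand-in for the learned decision tree over form features."""
    if form["password_fields"] > 0:  # likely a login or registration form
        return False
    if form["text_fields"] == 0:  # nothing to type a query into
        return False
    return True


def is_domain_data_source(page_text, form, domain_clf, domain):
    """Stage 1: domain relevance filter. Stage 2: query interface check."""
    if domain_clf.predict(page_text) != domain:
        return False
    return is_query_interface(form)


# Toy training pages (assumed data, not from the paper).
train_docs = [
    "search books by author title isbn",
    "find novels publisher author",
    "cheap flights departure arrival airport",
    "book airline tickets departure date",
]
train_labels = ["books", "books", "flights", "flights"]
clf = NaiveBayes().fit(train_docs, train_labels)

search_form = {"password_fields": 0, "text_fields": 2, "select_fields": 1}
login_form = {"password_fields": 1, "text_fields": 1, "select_fields": 0}

print(is_domain_data_source("search by author and title", search_form, clf, "books"))
print(is_domain_data_source("search by author and title", login_form, clf, "books"))
```

Running the domain filter first means the form check only sees pages already judged relevant to the target domain, which is the ordering the abstract describes.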

Authors: Li Zhitao (李志涛), Liu Quan (刘全), Zhou Wenyun (周文云)

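Affiliations: School of Computer Science and Technology, Soochow University; State Key Laboratory for Novel Software Technology, Nanjing University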

Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2010, Issue 2, pp. 11-13, 70 (4 pages)

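Funding: National Natural Science Foundation of China (60673092, 60775046, 60873116); Key Project of Science and Technology Research of the Ministry of Education (207040); China Postdoctoral Science Foundation (20060390919); Natural Science Foundation of Jiangsu Province (BK2008161); Natural Science Foundation of the Higher Education Institutions of Jiangsu Province (06KJB520104)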

Keywords: Deep Web; HTML form; Naive Bayes classification; decision tree

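Classification codes: TP391.43 [Automation and Computer Technology: Computer Application Technology]; TP391 [Automation and Computer Technology: Computer Science and Technology]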



