A Method to Automatically Classify and Discriminate Deep Web Data Sources Using Multiple Classifiers
Abstract The discovery of Deep Web data sources and their domain relevance have attracted growing attention. To address the low extraction precision and the neglect of domain relevance when identifying query interfaces, this paper proposes a method that uses multiple classifiers to automatically classify and discriminate Deep Web data sources. The idea is as follows: first, a Naive Bayes classifier classifies the pages fetched by the crawler according to their domain relevance; second, an improved C4.5 decision tree classifier determines which pages in the specific domain are data sources. Experimental results show that the method performs better than a single decision tree classifier, with improvements in both recall and precision.
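The two-stage idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the Naive Bayes stage is a plain multinomial model with add-one smoothing, and the second stage is a hand-written rule tree standing in for the learned, improved C4.5 classifier, whose details are not given here. The form features (`text_fields`, `select_fields`, `password_fields`), the training sentences, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(docs, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def is_query_interface(form):
    """Hand-written stand-in for a learned decision tree (hypothetical features)."""
    if form["password_fields"] > 0:   # login or registration form
        return False
    if form["text_fields"] == 0:      # nothing to type a query into
        return False
    if form["text_fields"] == 1 and form["select_fields"] == 0:
        return False                  # likely a simple keyword site search
    return True


def classify_page(page, domain_model, target_domain):
    """Stage 1: domain relevance by Naive Bayes. Stage 2: form check by tree."""
    if domain_model.predict(page["text"]) != target_domain:
        return "off-domain"
    if any(is_query_interface(f) for f in page["forms"]):
        return "deep-web-source"
    return "no-query-interface"


# Toy training data, invented for illustration only.
train_docs = [
    "book author title isbn publisher",
    "novel author paperback isbn",
    "flight airline departure arrival ticket",
    "hotel flight booking airline fare",
]
train_labels = ["books", "books", "travel", "travel"]
model = NaiveBayes().fit(train_docs, train_labels)

page = {
    "text": "search books by author title or isbn",
    "forms": [{"text_fields": 3, "select_fields": 1, "password_fields": 0}],
}
result = classify_page(page, model, "books")
```

Running the domain filter first means the form classifier only ever sees pages from the target domain, which is the point the abstract makes about not ignoring domain relevance.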
Source Computer Applications and Software (《计算机应用与软件》), CSCD, 2010, No. 2, pp. 11-13, 70 (4 pages)
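Funding National Natural Science Foundation of China (60673092, 60775046, 60873116); Key Project of Science and Technology Research, Ministry of Education (207040); China Postdoctoral Science Foundation (20060390919); Natural Science Foundation of Jiangsu Province (BK2008161); Natural Science Foundation of Jiangsu Higher Education Institutions (06KJB520104)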
Keywords Deep Web; HTML form; Naive Bayes classification; decision tree