Keyword-Based Deep Web Database Selection (基于关键词的深度万维网数据库选择)

Cited by: 11


Abstract: This paper proposes a keyword-based Deep Web search method. Users submit queries as keywords, and the method selects, on the fly, the Web databases that capture the query intent and provide high-quality results. The approach avoids Deep Web crawling, which is expensive and difficult, and it supports keyword search over Deep Web databases in multiple domains, so it can be integrated smoothly with the existing search engine architecture. The paper focuses on keyword-based database selection and addresses the challenges of this problem in two ways: (1) it introduces a relevance model that measures the association between keywords and domain attributes, and designs a random-walk-based algorithm that discovers latent keyword-attribute associations from query logs; (2) it presents a new data sampling method and uses it in a sampling-based model of database-query relevance, which is then used to select relevant databases within the selected domains. Experiments on real data sets from the Chinese Deep Web show that the proposed method effectively selects databases relevant to keyword queries and provides high-quality results.
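The abstract names a random-walk algorithm over query logs but does not give its details. As a rough illustration only, the sketch below uses a generic random walk with restart on a keyword-attribute graph; it is not the authors' algorithm. The log entries, the `kw:`/`attr:` node prefixes, the edge weights, and the restart probability are all hypothetical choices made for this example.

```python
from collections import defaultdict

# Hypothetical query-log summary: (keyword, attribute, co-occurrence count).
LOG = [
    ("kw:hobbit", "attr:book.title", 2),
    ("kw:potter", "attr:book.title", 3),
    ("kw:potter", "attr:movie.title", 2),
    ("kw:rowling", "attr:book.author", 4),
    ("kw:rowling", "attr:book.title", 1),
]


def random_walk_with_restart(edges, seed, restart=0.15, iterations=50):
    """Score graph nodes by a walk that jumps back to `seed` with
    probability `restart` at every step (assumed parameters)."""
    # Build a symmetric weighted adjacency map: node -> {neighbour: weight}.
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = graph[u].get(v, 0.0) + w
        graph[v][u] = graph[v].get(u, 0.0) + w
    if seed not in graph:
        raise ValueError(f"seed {seed!r} does not appear in the log")

    # Start with all probability mass on the seed keyword.
    scores = {node: 0.0 for node in graph}
    scores[seed] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            total = sum(graph[node].values())
            # Spread the non-restart share of the mass along weighted edges.
            for nb, w in graph[node].items():
                nxt[nb] += (1.0 - restart) * mass * w / total
        # The restart share always returns to the seed.
        nxt[seed] += restart
        scores = nxt
    return scores


def rank_attributes(scores):
    """Return attribute nodes sorted by descending walk score."""
    attrs = [(n, s) for n, s in scores.items() if n.startswith("attr:")]
    return sorted(attrs, key=lambda pair: pair[1], reverse=True)


scores = random_walk_with_restart(LOG, seed="kw:hobbit")
ranking = rank_attributes(scores)
```

In this toy log the keyword `kw:hobbit` is never observed with `attr:movie.title`, yet the walk reaches that attribute through the shared `attr:book.title` node and the keyword `kw:potter`, so it receives a small nonzero score. That is the sense in which a walk over the log can surface associations that were not directly recorded.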

Authors: Fan Ju, Zhou Lizhu

Affiliation: Department of Computer Science and Technology, Tsinghua University

Source: Chinese Journal of Computers (《计算机学报》; indexed in EI, CSCD, and the Peking University Core Journals list), 2011, No. 10, pp. 1797-1804 (8 pages)

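Funding: Supported by the Key Program of the National Natural Science Foundation of China, "Basic Methods and Key Technologies for the Construction and Application of Infrastructure Supporting Chinese Web Research" (60833003)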

Keywords: deep Web; Web databases; keyword search; domain selection; database selection

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]



