
Research on a Topical Crawler Based on the Shark-Search and HITS Algorithms (cited by: 18)

Published English title: Research on Topical Crawler of Shark-Search Algorithm and Hits Algorithm
Abstract: Topical crawlers are the core technology behind vertical search engines. This paper introduces two important crawling algorithms for topical crawlers: the Shark-Search algorithm, which evaluates page content, and the HITS algorithm, which analyzes the link relationships between pages. It analyzes the strengths and weaknesses of each and proposes a new topical crawling strategy that combines content-based evaluation with link-based analysis to judge the quality of URLs waiting to be downloaded. A topical crawler using this strategy is implemented. The new strategy compensates for the weaknesses of each individual algorithm. In a comparison with topical crawlers built on Shark-Search alone and on HITS alone, the crawler built on the new algorithm achieves higher precision than either.
Source: Computer Technology and Development (《计算机技术与发展》), 2010, No. 11, pp. 76-79 (4 pages)
Funding: Natural Science Foundation of Hainan Province (609003); Hainan University Research Project (hd09xm84)
Keywords: topical crawler; crawling strategy; vertical search engine
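The abstract describes judging candidate URLs by combining a content-based score with a link-based score, but it does not give the combination rule. The following is a minimal sketch under stated assumptions, not the authors' implementation: content relevance is approximated by cosine similarity between anchor text and the topic (a simplification of Shark-Search, which also propagates inherited scores), link quality is the HITS authority score, and the two are blended with a hypothetical weight `alpha`. The names `cosine_relevance`, `hits`, `combined_score`, and `rank_frontier` are illustrative only.

```python
import math
from collections import Counter


def cosine_relevance(text, topic):
    """Cosine similarity between term-frequency vectors of text and topic."""
    a = Counter(text.lower().split())
    b = Counter(topic.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hits(graph, iterations=50):
    """Iterative HITS. graph maps a page to the list of pages it links to.

    Returns (authority, hub) dictionaries, each L2-normalized.
    """
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of pages linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for source, targets in graph.items():
            for target in targets:
                new_auth[target] += hub[source]
        norm = math.sqrt(sum(x * x for x in new_auth.values())) or 1.0
        auth = {n: x / norm for n, x in new_auth.items()}
        # Hub score of a page: sum of authority scores of pages it links to.
        new_hub = {n: sum(auth[t] for t in graph.get(n, [])) for n in nodes}
        norm = math.sqrt(sum(x * x for x in new_hub.values())) or 1.0
        hub = {n: x / norm for n, x in new_hub.items()}
    return auth, hub


def combined_score(content, authority, alpha=0.5):
    """Assumed linear blend; alpha is a hypothetical tuning weight."""
    return alpha * content + (1 - alpha) * authority


def rank_frontier(graph, anchor_text, topic, alpha=0.5):
    """Order candidate URLs from most to least promising.

    anchor_text maps each candidate URL to the text describing it.
    """
    auth, _ = hits(graph)
    max_auth = max(auth.values(), default=0.0) or 1.0
    scores = {
        url: combined_score(
            cosine_relevance(text, topic), auth.get(url, 0.0) / max_auth, alpha
        )
        for url, text in anchor_text.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

With `alpha` near 1 the ranking is driven mostly by the content score, and with `alpha` near 0 mostly by link structure. A value in between reflects the abstract's idea that each signal covers the other's weakness.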


