Research on structure-driven crawling paths for Web forums (cited by: 1)

Structure-driven based traversal strategy for Web forum crawling
Abstract: Web forums contain a large amount of information with practical and commercial value, and they are an important source of information for search engines and question answering systems. Crawling them is difficult because forum structure is complex, link types are numerous, and a crawler can easily fall into crawler traps. To address these problems, this paper proposes a structure-driven method for selecting crawling paths. First, starting from a small amount of type data annotated by users, the sampled pages are clustered by page structure using their DOM trees. Second, crawling paths are selected according to an evaluation of each node. Finally, page-flipping (pagination) links are identified and handled. Experiments show that the coverage and effectiveness of the method are clearly better than those of traditional algorithms, and the method has achieved good results when applied in the golaxy public opinion monitoring system of ICT (Institute of Computing Technology, Chinese Academy of Sciences).
Source: Application Research of Computers (《计算机应用研究》), indexed in CSCD and the Peking University Core Journal list, 2011, No. 9, pp. 3284-3287 (4 pages)
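Funding: National Natural Science Foundation of China (60873166); Key Science and Technology Research Project of the Ministry of Education of China (109028); Beijing Municipal Education Science Fund project (AHA09110)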
Keywords: information retrieval; forum crawling; structure-driven; clustering; traversal path selection
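
The abstract's first step, clustering sampled pages by DOM structure, can be illustrated with a minimal sketch. This is not the paper's algorithm, which also relies on user-annotated type data and a node evaluation that the abstract does not detail. It assumes, purely for illustration, that a page's structure is summarized by its set of root-to-node tag paths, that similarity is the Jaccard index of those sets, and that a greedy single pass with a hypothetical threshold of 0.7 is good enough. The names `TagPathCollector`, `tag_paths`, `jaccard`, and `cluster_by_structure` are invented for this example.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collect the set of root-to-node tag paths seen in one HTML page."""

    # Void elements never get a closing tag, so they must not stay on the stack.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # e.g. "html/body/table/tr/td/a"

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural signature of a page as a frozenset of tag paths."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_structure(pages, threshold=0.7):
    """Greedy single-pass clustering of {url: html} by tag-path similarity.

    Each page joins the first cluster whose representative (its first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    The threshold value is an assumption, not a figure from the paper.
    """
    clusters = []  # list of (representative_paths, [urls])
    for url, html in pages.items():
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(representative, paths) >= threshold:
                members.append(url)
                break
        else:
            clusters.append((paths, [url]))
    return [members for _, members in clusters]
```

Under these assumptions, two board-listing pages built from the same table template would land in one cluster, while a thread page built from nested `div` blocks would start another. A crawler could then treat each cluster as a page type when deciding which links to follow.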

