Research and Implementation of Web-Page-Based Intra-Site Information Gathering Technology (cited by: 1)

A Study and Implement of Intranet Gather Information Technology Based on Web Page
Abstract: A key step in implementing a site search engine is automatic information gathering. The intra-site gathering technique analyzes the HTML code of web pages to extract the hyperlinks within the site, then uses a breadth-first search algorithm and an incremental storage algorithm to analyze links, fetch files, and process and save data automatically and continuously. When the system runs again, it applies attribute comparison, which to some extent avoids re-analyzing and re-fetching pages and improves both the update speed and the recall rate.
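Source: Journal of Inner Mongolia University (Natural Science Edition); indexed in CAS, CSCD, and the Peking University Core list; 2009, No. 2, pp. 203-207 (5 pages)
Funding: Scientific Research Project of Inner Mongolia University of Technology (X200806)
Keywords: information gathering; breadth-first search algorithm; incremental storage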
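The pipeline described in the abstract (extract in-site hyperlinks from HTML, visit them breadth-first, and skip unchanged pages on a later run) might be sketched as follows. This is only an illustrative sketch, not the authors' implementation. The names `extract_site_links`, `crawl_site`, `fetch_attrs`, `fetch_body`, and the layout of `store` are hypothetical, and the abstract does not say which page attributes are compared, so any comparable value (for example a last-modified time or a file size) is assumed here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def extract_site_links(base_url, html):
    """Return absolute, fragment-free links that stay on base_url's host."""
    host = urlparse(base_url).netloc
    parser = LinkExtractor()
    parser.feed(html)
    links = []
    for href in parser.links:
        link, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(link).netloc == host and link not in links:
            links.append(link)
    return links


def crawl_site(start_url, fetch_attrs, fetch_body, store):
    """Breadth-first crawl of one site with attribute comparison.

    fetch_attrs(url) -> comparable attributes, or None if the page is missing
    fetch_body(url)  -> HTML text (hypothetical fetchers supplied by the caller)
    store            -> dict kept between runs: url -> {"attrs", "links"}
    Returns the URLs that were fetched and saved during this run.
    """
    queue = deque([start_url])
    seen = {start_url}
    saved = []
    while queue:
        url = queue.popleft()
        attrs = fetch_attrs(url)
        if attrs is None:
            continue
        record = store.get(url)
        if record is None or record["attrs"] != attrs:
            # New or changed page: fetch it, re-extract links, save the record.
            html = fetch_body(url)
            record = {"attrs": attrs, "links": extract_site_links(url, html)}
            store[url] = record
            saved.append(url)
        # Unchanged pages reuse their stored links, so the body is not re-fetched.
        for link in record["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return saved
```

Passing the same `store` to a second call makes the run incremental: only pages whose attributes differ from the stored ones are fetched and parsed again, while unchanged pages still contribute their previously extracted links to the breadth-first queue.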