Extracting the Content of Google Web Page with Regular Expressions

Cited by: 6
Abstract: Extracting the content of search result pages correctly and completely is a basic precondition for processing the retrieved information. This paper analyzes the structural characteristics of Google Web pages, presents a set of regular expressions for matching the content of these pages, and implements a content extractor in Visual C#. Practical application to many Google Web pages shows that the proposed regular-expression matching method can extract all of the main content of Google Web pages.
Authors: Zhang Jian (张健), Ou Hong (欧红)
Source: New Technology of Library and Information Service (《现代图书情报技术》), CSSCI, Peking University Core Journal, 2005, No. 9, pp. 50-53 (4 pages)
Keywords: regular expressions; extraction; Web page; Google
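
The abstract describes matching page content with a set of regular expressions. The record does not reproduce the paper's actual patterns or its Visual C# code, so the following is only a minimal sketch in Python of the general idea. The sample markup, the pattern, and the names `SAMPLE_HTML`, `RESULT_RE`, `clean`, and `extract_results` are all assumptions made for illustration; real Google pages use different markup that changes over time.

```python
import re
from html import unescape

# Hypothetical, simplified result markup (an assumption for this sketch,
# not Google's real page structure).
SAMPLE_HTML = """
<p class="g"><a href="http://example.com/a">First <b>result</b></a><br>
<font size="-1">Snippet for the &amp; first result.</font></p>
<p class="g"><a href="http://example.org/b">Second result</a><br>
<font size="-1">Snippet for the second result.</font></p>
"""

# One pattern per result entry: link target, link text, and snippet text.
RESULT_RE = re.compile(
    r'<p class="g">\s*<a href="(?P<url>[^"]+)">(?P<title>.*?)</a>'
    r'.*?<font size="-1">(?P<snippet>.*?)</font>',
    re.DOTALL,
)

# Matches any inline tag so it can be stripped from captured text.
TAG_RE = re.compile(r"<[^>]+>")


def clean(fragment):
    """Strip inline tags, decode HTML entities, and collapse whitespace."""
    return " ".join(unescape(TAG_RE.sub("", fragment)).split())


def extract_results(page):
    """Return one dict (url, title, snippet) per matched result entry."""
    return [
        {
            "url": m.group("url"),
            "title": clean(m.group("title")),
            "snippet": clean(m.group("snippet")),
        }
        for m in RESULT_RE.finditer(page)
    ]


results = extract_results(SAMPLE_HTML)
for item in results:
    print(item["title"], item["url"], item["snippet"], sep=" | ")
```

Named groups keep each extracted field tied to the part of the pattern that captured it, which makes the pattern easier to adjust when the page layout changes. Patterns of this kind are brittle against markup changes, so in practice they would need to be revised against the actual page source.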