Categorization and Extraction of Web Pages Based on Hierarchy

Cited by: 2
Abstract: Traditional web crawlers serve general-purpose search engines built on keyword retrieval. They cannot capture the category information of web pages, which hurts both the computational efficiency and the accuracy of text clustering and topic detection. To address this, the paper proposes a method for categorizing and extracting web pages based on site hierarchy. By building a virtual site hierarchy categorization tree and extracting the hierarchical structure of real sites, the authors design and implement a hierarchy-oriented crawler that categorizes a page at the time it is crawled. For sites that carry no categorization information, a page-title-based categorization technique is presented, which includes constructing a domain knowledge base and computing word-level semantic similarity based on HowNet. The experimental results show that the method achieves good categorization performance.
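Source: Computer Engineering & Science (《计算机工程与科学》), indexed in CSCD and the Peking University Core Journals list, 2012, No. 11, pp. 1-6 (6 pages)
Funding: Guangdong Province Science and Technology Plan Project (2010B010600017)
Keywords: web crawler; page categorization; domain knowledge base; HowNet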
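
The two-stage idea in the abstract (categorize a page from the site's own hierarchy when one exists, otherwise fall back to its title) can be illustrated with a minimal sketch. This is not the authors' implementation: the category tree, the knowledge-base terms, the threshold, and all function names are hypothetical, and the HowNet-based semantic similarity is replaced here by a simple character-bigram Dice coefficient purely as a stand-in.

```python
from urllib.parse import urlparse

# Hypothetical mapping from a real site's path segments to nodes of a
# virtual category tree (illustrative values, not from the paper).
SEGMENT_TO_CATEGORY = {
    "sports": "News/Sports",
    "finance": "News/Finance",
    "tech": "News/Technology",
}

# Hypothetical domain knowledge base: category -> representative terms.
KNOWLEDGE_BASE = {
    "News/Sports": ["football", "basketball", "match", "league"],
    "News/Finance": ["stock", "market", "bank", "fund"],
    "News/Technology": ["software", "internet", "chip", "phone"],
}


def word_similarity(a, b):
    """Stand-in for a HowNet-based similarity: Dice coefficient on character bigrams."""
    if a == b:
        return 1.0
    bigrams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    bigrams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not bigrams_a or not bigrams_b:
        return 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


def classify_by_url(url):
    """Return the category of the first URL path segment found in the virtual tree."""
    for segment in urlparse(url).path.lower().split("/"):
        if segment in SEGMENT_TO_CATEGORY:
            return SEGMENT_TO_CATEGORY[segment]
    return None


def classify_by_title(title, threshold=0.5):
    """Pick the category whose terms best match any title word, if above the threshold."""
    words = title.lower().split()
    best_category, best_score = None, 0.0
    for category, terms in KNOWLEDGE_BASE.items():
        score = max((word_similarity(w, t) for w in words for t in terms), default=0.0)
        if score > best_score:
            best_category, best_score = category, score
    return best_category if best_score >= threshold else None


def classify(url, title):
    """Prefer the site hierarchy; fall back to the title when it gives no category."""
    return classify_by_url(url) or classify_by_title(title)


if __name__ == "__main__":
    print(classify("http://example.com/sports/2012/a.html", "Season review"))
    print(classify("http://example.com/a/123.html", "Stock market falls sharply"))
```

A real system would segment Chinese titles into words before matching and would swap `word_similarity` for a sememe-based HowNet measure; the fallback structure stays the same.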