A WebKit-Based Web Crawler (cited by: 3)
Abstract: New web application technologies, typified by AJAX, have given JavaScript much richer functionality. As a result, more URLs now exist as data inside JavaScript code, which poses new challenges for URL extraction by web crawlers. To address this problem, this paper proposes a WebKit-based web crawler that uses WebKit as its front end to parse and execute JavaScript. First, it lets JavaScript carry out its modifications to the page DOM, so that URLs embedded in such code are converted into HTML form and can be extracted by static analysis. Second, it locates the JavaScript code that performs page navigation and hijacks the variables passed to the navigation methods and objects, extracting the URLs those variables contain. Together, these two methods substantially reduce the obstacles that client-side scripts pose to crawlers, so that the URLs in a web page can be extracted more completely.
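The two extraction steps described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes the browser engine has already executed the page's JavaScript and handed back a serialized DOM snapshot, and it models the navigation hijack with a plain Python stand-in object rather than real WebKit bindings. The names `LinkExtractor`, `extract_urls`, `NavigationRecorder`, and `URL_ATTRS`, as well as the example.com URLs, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes that commonly carry URLs (illustrative, not exhaustive).
URL_ATTRS = {"href", "src", "action"}


class LinkExtractor(HTMLParser):
    """Statically collects absolute URLs from a serialized DOM snapshot."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Skip empty values and javascript: pseudo-links.
            if name in URL_ATTRS and value and not value.startswith("javascript:"):
                absolute = urljoin(self.base_url, value)
                if absolute not in self.urls:
                    self.urls.append(absolute)


def extract_urls(rendered_html, base_url):
    """Step 1: static analysis of the DOM after scripts have modified it."""
    parser = LinkExtractor(base_url)
    parser.feed(rendered_html)
    parser.close()
    return parser.urls


class NavigationRecorder:
    """Step 2 (simulated): a stand-in for a hooked window object that
    records navigation targets instead of navigating."""

    def __init__(self, base_url):
        self.base_url = base_url
        self.captured = []

    def open(self, url, *args):
        # Models intercepting a window.open(url) call.
        self.captured.append(urljoin(self.base_url, url))

    @property
    def location(self):
        return self.base_url

    @location.setter
    def location(self, url):
        # Models intercepting an assignment to window.location.
        self.captured.append(urljoin(self.base_url, url))


if __name__ == "__main__":
    base = "http://example.com/index.html"
    snapshot = '<body><a href="/news?id=7">News</a><img src="logo.png"></body>'
    print(extract_urls(snapshot, base))

    window = NavigationRecorder(base)
    window.open("detail.html?id=3")
    window.location = "/list?page=2"
    print(window.captured)
```

In an actual crawler the recording hooks would live inside the engine's JavaScript bindings, so that page scripts call them unknowingly; the stand-in class above only shows what is captured, not how the hook is installed.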
Source: Modern Electronics Technique (《现代电子技术》), 2013, No. 18, pp. 62-64, 68 (4 pages)
Keywords: web crawler; browser engine; WebKit; JavaScript
