Abstract
The emergence of new Web application technologies, typified by AJAX, has given JavaScript much richer functionality. It has also caused more URLs to exist as data inside JavaScript code, which poses new challenges for URL extraction by Web crawlers. To address this problem, this paper proposes a WebKit-based Web crawler that uses WebKit as its front end to parse and execute JavaScript. First, the crawler lets JavaScript carry out its modifications to the page's DOM, so that URLs embedded in such code are converted into HTML form and can be extracted by static analysis. Second, it locates the JavaScript code that performs page navigation and intercepts the variables passed to the navigation methods and objects in order to extract the URLs they contain. Together, these two methods substantially reduce the obstacles that client-side scripts pose to crawlers, so that the URLs in a Web page can be extracted more completely.
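The two extraction steps described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation and does not use WebKit: it assumes the page's scripts have already run and the resulting DOM has been serialized to HTML, and it models a hooked navigation object with a hypothetical `RecordingLocation` class. The names `LinkExtractor`, `RecordingLocation`, `extract_urls`, and `URL_ATTRS`, as well as the example URLs, are assumptions made for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes that commonly carry URLs (an illustrative, non-exhaustive choice).
URL_ATTRS = {"href", "src", "action"}


class LinkExtractor(HTMLParser):
    """Collect URLs from the serialized DOM obtained after scripts have run."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when absent.
        for name, value in attrs:
            if name in URL_ATTRS and value:
                self.urls.append(urljoin(self.base_url, value))


class RecordingLocation:
    """Hypothetical stand-in for a hooked navigation object such as `location`.

    Assigning to `href` (or calling `assign`) records the target URL
    instead of navigating away from the page.
    """

    def __init__(self, base_url):
        self._base_url = base_url
        self.captured = []

    @property
    def href(self):
        return self._base_url

    @href.setter
    def href(self, value):
        self.captured.append(urljoin(self._base_url, value))

    def assign(self, value):
        self.href = value


def extract_urls(rendered_html, base_url):
    """Statically extract absolute URLs from already-rendered HTML."""
    parser = LinkExtractor(base_url)
    parser.feed(rendered_html)
    parser.close()
    return parser.urls
```

In a real crawler the rendered HTML would come from the browser engine after script execution, and the interception would be installed inside the engine's JavaScript environment; here both are simulated so that the idea can be run with the standard library alone.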
Source
Modern Electronics Technique (《现代电子技术》)
2013, No. 18, pp. 62-64, 68 (4 pages in total)