Abstract
By observing the regularities in how Web sites present their pages and the structural characteristics of the pages themselves, the authors propose an entry-page identification algorithm based on URL type and on how a page's outgoing links change over time. The crawler uses this algorithm to fetch entry pages first. In practical application, it achieved good refresh results.
Source
Journal of Zhengzhou University: Natural Science Edition (《郑州大学学报(理学版)》)
CAS
2007, No. 2, pp. 60-64 (5 pages)
Funding
Supported by the National Natural Science Foundation of China
Grant No. 90412015
Keywords
entry page
page refreshment
incremental crawler