Abstract
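To support efficient incremental crawling of the Chinese Web, experiments were designed to examine how web pages change over the short term, and the minimum crawling capacity required for incremental crawling was estimated. A general model of an incremental crawling system is proposed together with its performance criteria; the model clarifies the operating principles of incremental crawling. Drawing on experience from developing the Peking University Tianwang incremental crawling system, the model's performance bottlenecks are discussed and solutions are given. For the two kinds of targets of incremental crawling, changed pages and new pages, corresponding crawling strategies are explored. The implementation of the model and its performance are then described. This work provides a successful model for the design and implementation of incremental crawling systems.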
This paper addresses efficient incremental information collection from the Chinese web. Experiments were first designed and performed to examine how pages evolve over a short period. Based on the results, a general system model was established for incremental spiders. The latent performance bottlenecks in implementation were then analyzed in depth, and corresponding solutions were supplied. In addition, two approaches were put forward to efficiently collect updated or newly created pages in this mo...
Source
《清华大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core (Peking University core journal list)
2005, Issue S1, pp. 1882-1886 (5 pages)
Journal of Tsinghua University (Science and Technology)
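Funding
Key Program of the National Natural Science Foundation of China (60435020)
Doctoral Program Foundation of the Ministry of Education of China (20030001076)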
Keywords
incremental spider
web crawling
system model
the Chinese web
implementation strategies