Abstract
To address the bottleneck of centralized search engines, this paper proposes a distributed cooperative search engine system that keeps the advantages of a centralized search engine while removing its bottleneck. The design idea is to have search engines located in geographically different places cooperate on information gathering and updating. Three working modes of the information gathering program (Crawler) are discussed: firewall (closed) mode, cross-over mode, and exchange mode. Two methods, batch communication and replication of popular URLs, are proposed to reduce both the frequency and the volume of URL information transferred in exchange mode. Three schemes for partitioning the Web are discussed: URL-hash based, site-hash based, and hierarchical (classification based). Simulation experiments show that in firewall mode a relatively small number of Crawlers achieves good coverage; that the site-hash based scheme significantly reduces the number of external links, and with it the communication overhead, compared with the URL-hash based scheme; and that batch communication reduces the volume of URL information transferred in exchange mode.
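The difference between the URL-hash based and site-hash based partitioning schemes named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the choice of SHA-256 as the hash, and the sample URLs are all assumptions made for illustration. A link is counted as external when its source and target pages are assigned to different Crawlers.

```python
import hashlib
from urllib.parse import urlsplit


def _bucket(key: str, n_crawlers: int) -> int:
    """Map a string key to a crawler index in [0, n_crawlers).

    SHA-256 is an assumed choice; the abstract does not name a hash function.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_crawlers


def url_hash_partition(url: str, n_crawlers: int) -> int:
    """URL-hash scheme: hash the full URL, so pages of one site scatter."""
    return _bucket(url, n_crawlers)


def site_hash_partition(url: str, n_crawlers: int) -> int:
    """Site-hash scheme: hash only the host, so one site stays on one crawler."""
    return _bucket(urlsplit(url).netloc.lower(), n_crawlers)


def count_external_links(links, assign, n_crawlers: int) -> int:
    """Count links whose source and target are assigned to different crawlers."""
    return sum(
        1 for src, dst in links
        if assign(src, n_crawlers) != assign(dst, n_crawlers)
    )


# Hypothetical link graph in which every link stays within a single site.
SAMPLE_LINKS = [
    ("http://a.example/index.html", "http://a.example/news.html"),
    ("http://a.example/news.html", "http://a.example/about.html"),
    ("http://b.example/index.html", "http://b.example/docs/start.html"),
    ("http://b.example/docs/start.html", "http://b.example/index.html"),
]

if __name__ == "__main__":
    n = 4
    print("URL-hash external links: ",
          count_external_links(SAMPLE_LINKS, url_hash_partition, n))
    print("site-hash external links:",
          count_external_links(SAMPLE_LINKS, site_hash_partition, n))
```

Under the site-hash scheme every intra-site link is internal by construction, whereas the URL-hash scheme may scatter pages of one site across Crawlers. Since links within a site are common on the Web, this is one plausible reading of why the experiments report fewer external links for site hashing.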
Source
Journal of Fushun Petroleum Institute (《抚顺石油学院学报》)
2003, No. 4, pp. 57-60 (4 pages)
Keywords
Distributed cooperative
Search engine
Information gathering