Abstract
The Deep Web contains a large amount of high-quality content that current search engine technology cannot reach. Studying the size, quality, and distribution of the Deep Web will help in finding methods and technologies for searching it effectively. Using data collected by a web spider in October 2006 as a sample, and applying quantitative methods (statistics and probability) together with qualitative methods, this paper presents the first survey of the size, quality, and distribution of the Chinese-language Deep Web. The main findings are: ① the Deep Web is more than 240 times larger than the Surface Web; ② it contains a total of 50.7 billion files and 11,700 TB of stored data; ③ it includes more than 30,000 searchable databases; ④ its content is of relatively high quality; ⑤ its content is unevenly distributed across topics.
Source
《情报学报》
CSSCI
Peking University Core Journals (北大核心)
2008, Issue 2, pp. 256-260 (5 pages)
Journal of the China Society for Scientific and Technical Information