Abstract
Automatic information gathering is a key step in implementing a site search engine. The site information gathering technique analyzes the HTML code of web pages to extract the hyperlinks within the site, and uses a breadth-first search algorithm together with an incremental storage algorithm to automatically and continuously analyze links, fetch files, and process and store data. When the system is run again, it applies an attribute comparison technique that, to some extent, avoids re-analyzing and re-fetching pages, which improves the update speed and the recall of the gathered information.
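The gathering loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The function name `crawl`, the two caller-supplied callables `get_attrs` and `get_html`, and the shape of the `stored` dictionary are all assumptions made for this example. The attribute compared could be, for instance, a last-modified time or a file size; the abstract does not say which attributes the system uses.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, get_attrs, get_html, stored):
    """Breadth-first crawl restricted to the host of start_url.

    get_attrs(url) and get_html(url) are hypothetical callables supplied
    by the caller. `stored` maps url -> {"attrs": ..., "links": [...]}
    from a previous run and is updated in place (incremental storage).
    Returns the URLs that were fetched and re-analyzed in this run.
    """
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = {start_url}
    updated = []
    while queue:
        url = queue.popleft()  # FIFO order gives breadth-first traversal
        attrs = get_attrs(url)
        record = stored.get(url)
        if record is not None and record["attrs"] == attrs:
            # Attributes unchanged: skip the download, reuse stored links.
            links = record["links"]
        else:
            parser = LinkExtractor()
            parser.feed(get_html(url))
            links = []
            for href in parser.links:
                absolute = urldefrag(urljoin(url, href)).url
                if urlparse(absolute).netloc == site:  # stay on the site
                    links.append(absolute)
            stored[url] = {"attrs": attrs, "links": links}
            updated.append(url)
        for link in links:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return updated
```

On a second run with the same `stored` dictionary, only pages whose attributes differ from the stored ones are downloaded and parsed again. Unchanged pages still contribute their previously stored links, so the traversal continues to reach the rest of the site.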
Source
《内蒙古大学学报(自然科学版)》
CAS
CSCD
PKU Core Journals
2009, No. 2, pp. 203-207 (5 pages)
Journal of Inner Mongolia University (Natural Science Edition)
Funding
Inner Mongolia University of Technology Scientific Research Project (X200806)