Abstract
This paper designs and implements a vertical crawler for RSS that uses a breadth-first strategy to focus on RSS feeds and collect them automatically. Building on word segmentation, it improves the method for calculating word weights for RSS feeds, filters words, and classifies RSS content automatically using the vector space model (VSM). Experimental results show that, under relatively low system load, the system retrieves and classifies Chinese RSS information with high efficiency and accuracy, and thus supports effective aggregation and management of RSS information.
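The VSM classification step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's improved word-weight formula, so the code below substitutes plain smoothed TF-IDF as a stand-in, and it assumes the text has already been segmented into tokens. The function names (`idf_table`, `vectorize`, `cosine`, `train_centroids`, `classify`), the nearest-centroid decision rule, and the toy samples are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def idf_table(docs):
    """Smoothed inverse document frequency for every term in the training docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}


def vectorize(tokens, idf):
    """Map a token list to a sparse {term: tf * idf} vector; unknown terms are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    total = sum(tf.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in tf.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(samples):
    """samples: list of (tokens, label). Returns (idf, {label: summed class vector})."""
    idf = idf_table([tokens for tokens, _ in samples])
    centroids = defaultdict(Counter)
    for tokens, label in samples:
        for term, weight in vectorize(tokens, idf).items():
            centroids[label][term] += weight
    return idf, dict(centroids)


def classify(tokens, idf, centroids):
    """Return the label whose class vector is most similar to the document."""
    vec = vectorize(tokens, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy, pre-segmented training data (illustrative only, not from the paper).
SAMPLES = [
    (["股票", "市场", "上涨"], "finance"),
    (["银行", "利率", "市场"], "finance"),
    (["球队", "比赛", "进球"], "sports"),
    (["比赛", "冠军", "球员"], "sports"),
]
idf, centroids = train_centroids(SAMPLES)
print(classify(["比赛", "球员", "受伤"], idf, centroids))
```

In this sketch a document that shares no terms with the training data has zero similarity to every class, so `classify` simply falls back to the first label; a real system would presumably treat that case separately.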
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
PKU Core Journal (北大核心)
2011, Issue 6, pp. 79-81, 90 (4 pages)
Funding
Supported by the Tianjin Software Industry Development Special Fund (07FZRJFX01300)
Keywords
Really Simple Syndication (RSS)
information retrieval
crawler
Chinese text classification
vector space model (VSM)