Abstract
Manually collecting and republishing Web news from other sites is time-consuming and labor-intensive for a news site, and news items are easily collected twice or missed. To address this problem, a Web news automatic gathering and publishing system based on a Web crawler is designed and implemented, combining Web information gathering, Web page noise elimination, duplicate text document elimination, and automatic text classification techniques. The overall structure of the system is presented, followed by a detailed description of the function, design, and implementation of each module. Experiments show that the design is sound and that the system offers high gathering efficiency, accurate duplicate elimination, easy integration, and low operating cost, so it can be promoted as a gathering and editing tool for news sites.
Source
《计算机技术与发展》
2009, No. 9, pp. 250-252, F0003 (4 pages)
Computer Technology and Development
Funding
Hainan Provincial Natural Science Foundation project (80638)
Keywords
Web crawler
Web page noise elimination
duplicate document elimination
Web news publishing