A Focused Crawler Method of Configurable Theme Based Heritrix

Cited by: 1
Abstract: As the Internet has grown, the massive volume of information it generates has become an important asset, and users' expectations of information retrieval have risen with it. General-purpose search engines meet everyday search needs, but they cannot serve users in a targeted way when those users care only about a particular theme or field, and a handful of search keywords is often too imprecise to describe what the user actually needs. Approaching the problem from the perspective of data mining and machine learning, this paper proposes a focused crawler method with configurable themes, built on the open-source web crawler framework Heritrix. Starting from specified source sites, the crawler launches multi-threaded crawls under different crawl strategies, classifies and searches content according to preset keywords and site-section (column) information, and retrieves the information that best matches the configured conditions so that it can be selected, judged, analyzed, and used. To a certain extent, the method addresses these information-query needs, improving user experience and retrieval efficiency.
Authors: WANG Song (王松); LIU Hongji (刘洪基); YE Xiaobo (叶晓波) (School of Economics & Management, Chuxiong Normal University, Chuxiong, Yunnan Province 675000; Dept. of Management of State-owned Assets and Informationalization, Chuxiong Normal University, Chuxiong, Yunnan Province 675000)
Source: Journal of Chuxiong Normal University (《楚雄师范学院学报》), 2020, No. 6, pp. 124-131 (8 pages)
Keywords: focused crawler; configurable theme; Heritrix
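
The abstract describes keeping only pages that match preset keywords and site-section information, with the work spread across several threads. The following is a minimal sketch of that filtering idea, not the authors' implementation and not the Heritrix API: `ThemeConfig`, `in_section`, `keyword_hits`, `is_relevant`, and `filter_pages` are hypothetical names, sections are assumed to be URL path prefixes, and pages are supplied as an in-memory mapping instead of being fetched from the network.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ThemeConfig:
    """Hypothetical theme configuration: keywords plus allowed site sections."""
    keywords: tuple[str, ...]
    sections: tuple[str, ...]  # assumed to be URL path prefixes, e.g. "/news/"
    min_hits: int = 1          # distinct keywords that must appear in a page


def in_section(url: str, config: ThemeConfig) -> bool:
    """Return True if the URL path starts with one of the configured sections."""
    path = urlparse(url).path
    return any(path.startswith(prefix) for prefix in config.sections)


def keyword_hits(text: str, config: ThemeConfig) -> int:
    """Count how many configured keywords occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for kw in config.keywords if kw.lower() in lowered)


def is_relevant(url: str, text: str, config: ThemeConfig) -> bool:
    """A page is kept only if it is in a configured section and has enough hits."""
    return in_section(url, config) and keyword_hits(text, config) >= config.min_hits


def filter_pages(pages: dict[str, str], config: ThemeConfig, workers: int = 4) -> list[str]:
    """Check pages on a thread pool and return the relevant URLs in input order."""
    urls = list(pages)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(lambda u: is_relevant(u, pages[u], config), urls))
    return [u for u, keep in zip(urls, flags) if keep]
```

In a real crawler the relevance check would run on each fetched page before its outgoing links are queued; the substring match here is a deliberately simple stand-in for whatever classification the paper actually uses.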