
Research on a Topical Crawler Based on the T-Graph Algorithm (cited by: 5)
Abstract: To address the low efficiency of traditional focused crawlers when fetching web pages for a specific domain, this paper analyzes the T-Graph focused-crawling algorithm and proposes an improved version of it. The improved algorithm draws on Wikipedia knowledge, extracts feature terms with a feature extraction algorithm based on semantic analysis, computes text similarity at the semantic level of words, and takes into account the weights of text in different positions within a web page. Experiments comparing the algorithm before and after the improvement show that the improved algorithm achieves better topical crawling quality.
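Source: Computer Engineering and Design (《计算机工程与设计》), indexed in CSCD and the Peking University Core Journals list, 2014, No. 9, pp. 3014-3017, 3028 (5 pages)
Funding: Shandong Province Education Science Planning Key Research Project (ZK1037123C023)
Keywords: focused crawler; T-Graph; Wikipedia; similarity computing; weight value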
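
The abstract mentions two ideas that a small example can make concrete: scoring a page by its similarity to the topic, and weighting text by where it appears in the page. The sketch below is only an illustration under stated assumptions, not the paper's method. It substitutes plain term-frequency cosine similarity for the Wikipedia-based semantic similarity described in the abstract, and the position names and weight values in `POSITION_WEIGHTS` are hypothetical, since the record does not give the actual ones.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract does not state the real values.
POSITION_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def weighted_term_vector(sections):
    """Build a term vector in which each occurrence counts by its section weight.

    `sections` maps a position name (e.g. "title") to the text found there.
    Unknown positions fall back to a weight of 1.0.
    """
    vector = Counter()
    for position, text in sections.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for term in text.lower().split():
            vector[term] += weight
    return vector


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def page_relevance(sections, topic_terms):
    """Score a page against a list of topic terms (stand-in for semantic similarity)."""
    topic_vector = Counter(term.lower() for term in topic_terms)
    return cosine_similarity(weighted_term_vector(sections), topic_vector)


topic = ["focused", "crawler"]
page_a = {"title": "focused crawler", "body": "news sports weather"}
page_b = {"title": "news sports", "body": "focused crawler weather"}
score_a = page_relevance(page_a, topic)
score_b = page_relevance(page_b, topic)
```

With these assumed weights, `page_a` scores higher than `page_b` because the topic terms appear in its title rather than its body. A crawler could use such a score to decide which pages' outgoing links to follow first.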
