Abstract
To address the low efficiency of traditional focused crawlers in fetching web pages from a specific domain, this paper analyzes the T-Graph focused crawling algorithm and proposes an improved version of it. The improved algorithm draws on Wikipedia knowledge, extracts feature terms with a feature extraction algorithm based on semantic analysis, computes text similarity at the semantic level of words, and takes into account the weights of text in different positions of a web page. Experiments comparing the algorithm before and after the improvement show that the improved algorithm achieves better topical crawling quality.
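The position-weighting idea in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption: the position names, the weight values, and the function names are hypothetical, and plain term-frequency cosine similarity stands in for the paper's Wikipedia-based semantic similarity, which is not reproduced.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract does not state the actual values.
POSITION_WEIGHTS = {"title": 0.5, "anchor": 0.3, "body": 0.2}


def cosine_similarity(a, b):
    """Cosine similarity between two lists of feature terms (term-frequency vectors)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def page_relevance(page_terms, topic_terms, weights=POSITION_WEIGHTS):
    """Weighted sum of per-position similarities between a page and the topic.

    page_terms maps a position name to the feature terms found there;
    positions missing from the page contribute zero.
    """
    return sum(
        w * cosine_similarity(page_terms.get(pos, []), topic_terms)
        for pos, w in weights.items()
    )
```

In this sketch a term match in the title counts more than the same match in the body, which is the effect the abstract attributes to position weighting; a semantic similarity measure could replace `cosine_similarity` without changing `page_relevance`.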
Source
《计算机工程与设计》
CSCD
Peking University Core Journal (北大核心)
2014, No. 9, pp. 3014-3017, 3028 (5 pages)
Computer Engineering and Design
Funding
Shandong Province Education Science Planning Key Research Project (ZK1037123C023)
Keywords
focused crawler
T-Graph
Wikipedia
similarity computation
weight