
Web spam detection based on SMOTE and random forests (cited by: 11)
Abstract: Web spam refers to techniques that mislead search engines into ranking pages higher than they deserve, which seriously degrades the quality of search results. Because Web spam datasets are severely imbalanced, this study proposes first balancing the dataset with the SMOTE over-sampling method and then training a classifier with the random forests algorithm. Comparative experiments against common single classifiers and ensemble learning classifiers show that the SMOTE + RF method performs notably well. The method's important parameters are tuned according to the experimental results, and the reasons why the AUC value improves after applying SMOTE are analyzed. Experiments on the WEBSPAM UK2007 dataset show that the method markedly improves classification performance, with an AUC value exceeding the best result of Web Spam Challenge 2008.
Source: Journal of Shandong University (Engineering Science) [indexed in CAS and the Peking University Core Journals list], 2013, No. 1, pp. 22-27, 33 (7 pages)
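Funding: National Natural Science Foundation of China (61170145); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20113704110001); Natural Science Foundation of Shandong Province (ZR2010FM021)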
Keywords: ensemble learning; Web spam; random forests; SMOTE; search engine spamming
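The over-sampling step described in the abstract can be sketched roughly as follows. This is a minimal illustration of the general SMOTE idea (interpolating between a minority sample and one of its k nearest minority neighbours), not the authors' implementation. The function names `smote` and `balance`, the toy data, the random choice of base sample, and the value k=5 are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (the core SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Simplification (assumption): pick the base sample at random.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class of a binary dataset to equal class sizes."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_needed = (len(y) - len(minority)) - len(minority)
    new_points = smote(minority, max(n_needed, 0), k=k, seed=seed)
    return list(X) + new_points, list(y) + [minority_label] * len(new_points)


# Hypothetical toy data: 8 normal hosts (label 0) and 3 spam hosts (label 1).
X = [[0.1 * i, 0.2 * i] for i in range(8)] + [[5.0, 5.0], [5.5, 5.2], [6.0, 5.1]]
y = [0] * 8 + [1] * 3
X_bal, y_bal = balance(X, y)
print(y_bal.count(0), y_bal.count(1))  # prints: 8 8
```

The balanced `X_bal`, `y_bal` would then be passed to a random forest learner (for example scikit-learn's `RandomForestClassifier`) and evaluated by AUC on data that has not been over-sampled. The record does not say which tools the authors used.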


