
Web spam detection based on fitting the distribution of features

Original title: 基于拟合特征分布的垃圾网页检测方法
Abstract Web spam interferes with users' ability to obtain information. To detect spam pages effectively, the distributions of web page content features and link features are analyzed. The analysis shows that the feature distributions of normal pages follow regular patterns, whereas those of spam pages are scattered. Based on this difference, a distribution function is fitted to the feature distribution of normal pages, and the difference between the proportions of normal and spam pages and the fitted distribution function is computed. This difference is then used as the threshold when a C4.5 decision tree is built to detect spam pages. Experimental results show that the method effectively reduces the number of misclassified normal pages and improves accuracy.
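Authors Liu Yang (刘阳), Zhang Huaxiang (张化祥)
Source Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journal, 2013, No. 8, pp. 2651-2655 (5 pages)
Funding National Natural Science Foundation of China (61170145); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20113704110001); Natural Science Foundation of Shandong Province and Shandong Key Science and Technology Program (ZR2010FM021, 2008B0026, 2010G0020115)
Keywords web spam; content features; link features; distribution function; decision tree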
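
The pipeline outlined in the abstract (fit a distribution to normal-page features, measure how far the observed page proportions depart from the fit, then split on that difference) can be illustrated with a minimal sketch. Several points here are assumptions rather than details from the paper: the abstract does not name the distribution family, so a Gaussian is used purely for illustration; the feature binning is arbitrary; a single gain-ratio split stands in for a full C4.5 tree; and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def fitted_mass(lo, hi, mu, sigma):
    """Probability mass a Gaussian(mu, sigma) assigns to [lo, hi). Assumes sigma > 0."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return cdf(hi) - cdf(lo)


def bin_index(value, edges):
    """Index of the bin containing value; out-of-range values are clamped."""
    for i in range(len(edges) - 1):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2


def deviation_scores(all_values, normal_values, edges):
    """Per-page score: observed share of pages in the page's bin minus the
    share predicted by a Gaussian fitted to normal pages (assumed family)."""
    mu, sigma = mean(normal_values), pstdev(normal_values)
    counts = [0] * (len(edges) - 1)
    for v in all_values:
        counts[bin_index(v, edges)] += 1
    n = len(all_values)
    gaps = [counts[i] / n - fitted_mass(edges[i], edges[i + 1], mu, sigma)
            for i in range(len(counts))]
    return [gaps[bin_index(v, edges)] for v in all_values]


def entropy(labels):
    """Binary entropy in bits of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)


def best_threshold(scores, labels):
    """Score threshold with the highest gain ratio, the criterion C4.5 uses
    for a numeric split. Returns None if no split separates the scores."""
    base = entropy(labels)
    best_ratio, best_t = 0.0, None
    candidates = sorted(set(scores))
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2.0
        left = [y for s, y in zip(scores, labels) if s <= t]
        right = [y for s, y in zip(scores, labels) if s > t]
        w = len(left) / len(labels)
        gain = base - w * entropy(left) - (1.0 - w) * entropy(right)
        split_info = entropy([0] * len(left) + [1] * len(right))
        ratio = gain / split_info if split_info > 0 else 0.0
        if ratio > best_ratio:
            best_ratio, best_t = ratio, t
    return best_t


if __name__ == "__main__":
    # Toy feature values (illustrative only, not data from the paper).
    normal = [48, 50, 51, 52, 49, 50, 53, 47]
    spam = [5, 95, 20, 88]
    values = normal + spam
    labels = [0] * len(normal) + [1] * len(spam)
    scores = deviation_scores(values, normal, [0, 20, 40, 60, 80, 100])
    print(best_threshold(scores, labels))
```

In this sketch a positive score means a bin holds more pages than the normal-page fit predicts, which is where scattered spam values tend to land. A full reproduction would fit one distribution per content or link feature and pass all resulting differences to a complete decision-tree learner.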

