Research of identifying Splog based on multiple structure features
Original title: 基于多结构特征的垃圾博客识别研究
Cited by: 6
Abstract: To address the growing problem of spam blogs (Splogs), this paper studies the spamming techniques that generate Splogs and the corresponding identification techniques. Based on an analysis of a large number of Chinese Splogs and of the spammers' goals, a feature extraction method is proposed that draws on structural features of a blog: user name, posting time interval, post content, anchor text and link addresses, and category tags. Building on this feature extraction, an identification method based on multiple structural features is proposed and a corresponding system model is built. Experiments use support vector machines and a naive Bayes model as classifiers and compare the results with a classical content-based method. The results show that, on a small training set, the method based on multiple structural features reaches an accuracy above 90%, 6 percentage points higher than the content-based method, indicating that it can effectively distinguish Splogs from normal blogs.
Authors: 何苑 (He Yuan), 谭红叶 (Tan Hongye)
Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journal, 2010, No. 22, pp. 4932-4935 (4 pages)
Funding: National Natural Science Foundation of China (60775041)
Keywords: Chinese information processing; Splog (spam blog); multiple structure features; naive Bayes; support vector machine
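
The abstract lists the structural features only by name, so the following is a minimal sketch of what such a feature extractor might look like, not the authors' implementation. The blog record layout (`user`, `posts`, `time`, `text`, `anchors`, `tags`) and every concrete measure (digit ratio, interval spread, URL share, anchor repetition, tags per post) are assumptions chosen for illustration.

```python
import re
from statistics import pstdev

# Hypothetical URL pattern; the paper does not specify how links are detected.
URL_RE = re.compile(r"https?://\S+")

def username_digit_ratio(name):
    """Fraction of characters in the user name that are digits."""
    return sum(c.isdigit() for c in name) / len(name) if name else 0.0

def interval_stdev(timestamps):
    """Population std-dev of the gaps (in seconds) between consecutive posts."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return pstdev(gaps) if gaps else 0.0

def url_char_fraction(text):
    """Share of the post text that is made up of URL characters."""
    if not text:
        return 0.0
    return sum(len(u) for u in URL_RE.findall(text)) / len(text)

def anchor_repetition(anchors):
    """1 minus the ratio of distinct anchor texts to all anchor texts."""
    return 1.0 - len(set(anchors)) / len(anchors) if anchors else 0.0

def extract_features(blog):
    """Map one blog record (assumed layout) to a numeric feature vector."""
    posts = blog["posts"]
    body = " ".join(p["text"] for p in posts)
    anchors = [a for p in posts for a in p["anchors"]]
    tags = [t for p in posts for t in p["tags"]]
    return [
        username_digit_ratio(blog["user"]),
        interval_stdev([p["time"] for p in posts]),
        url_char_fraction(body),
        anchor_repetition(anchors),
        len(tags) / len(posts) if posts else 0.0,
    ]

# Made-up sample record, for illustration only.
sample = {
    "user": "buy123456",
    "posts": [
        {"time": 0, "text": "cheap http://a.example/x",
         "anchors": ["cheap"], "tags": ["deal", "sale"]},
        {"time": 60, "text": "cheap http://a.example/y",
         "anchors": ["cheap"], "tags": ["deal", "sale"]},
        {"time": 120, "text": "cheap http://a.example/z",
         "anchors": ["cheap"], "tags": ["deal", "sale"]},
    ],
}
features = extract_features(sample)
```

A vector like `features` would then be passed, together with labelled examples, to a classifier such as a support vector machine or naive Bayes model, the two classifier types named in the abstract.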