
Chinese Spam Message Filtering Based on Text Weighted KNN Algorithm (cited by: 19)
Abstract  In the K Nearest Neighbor (KNN) algorithm, the classification decision rule treats every training sample as equally important, which degrades classification performance. To address this, the paper proposes a text-weighted KNN text classification algorithm and applies it to the classification of spam short messages. After feature words are extracted, a first weighting formula is introduced to account for how the frequency of feature words in a text affects that text's importance. For the spam message dataset, an association rule algorithm is then used to mine co-occurring term sets that appear frequently in spam messages, and a second weighting formula is introduced on that basis. Finally, the two text-weight formulas are combined into a composite weight for each message text, so that training samples differ in their influence on the category decision, which improves the classification decision rule. Experimental results show that, compared with KNN without text weighting, the algorithm achieves substantial improvements in accuracy, recall, and F1 value for both spam and normal messages.
Authors  Huang Wenming (黄文明), Mo Yang (莫阳)
Source  Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2017, No. 3, pp. 193-199 (7 pages)
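Funding  Research Project of the Guangxi Key Laboratory of Trusted Software (kx201106); Graduate Education Innovation Program of Guilin University of Electronic Technology (2016YJCX64)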
Keywords  spam filtering; association rule; feature selection; K Nearest Neighbor (KNN) algorithm; Vector Space Model (VSM)
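
The decision rule described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the two weighting formulas, so `sample_weight` below uses placeholder formulas chosen only for illustration (a feature-word frequency term multiplied by a count of matched frequent term sets). The function names, the cosine similarity measure, and the multiplicative combination are all assumptions, not the paper's method. What the sketch shows is the general idea: each of the k nearest training samples contributes its similarity scaled by its own text weight, instead of every neighbor counting equally.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts (VSM vectors)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def sample_weight(tokens, feature_words, frequent_sets):
    """Hypothetical composite text weight (NOT the paper's formulas).

    tf_part: grows with the share of tokens that are feature words.
    co_part: grows with the number of frequent co-occurring term sets
             (e.g. mined by an association rule algorithm) found in the text.
    """
    token_set = set(tokens)
    tf_part = 1.0 + sum(t in feature_words for t in tokens) / max(len(tokens), 1)
    co_part = 1.0 + sum(1 for s in frequent_sets if s <= token_set)
    return tf_part * co_part


def weighted_knn(query, train, k=3):
    """Classify `query` from `train`, a list of (vector, label, weight) tuples.

    Each of the k most similar samples votes with similarity * weight.
    """
    scored = [(cosine(query, vec), label, weight) for vec, label, weight in train]
    neighbours = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    votes = defaultdict(float)
    for sim, label, weight in neighbours:
        votes[label] += weight * sim
    return max(votes, key=votes.get)


if __name__ == "__main__":
    spam_weight = sample_weight(
        ["free", "prize", "now"],
        feature_words={"free", "prize"},
        frequent_sets=[frozenset({"free", "prize"})],
    )
    train = [
        ({"free": 1.0, "prize": 1.0}, "spam", spam_weight),
        ({"free": 1.0, "lunch": 1.0}, "ham", 1.0),
        ({"free": 1.0, "meeting": 1.0}, "ham", 1.0),
    ]
    print(weighted_knn({"free": 1.0}, train, k=3))
```

With all weights set to 1.0 the function reduces to ordinary similarity-weighted KNN voting, which makes it easy to compare the weighted and unweighted decisions on the same neighbors.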