Abstract
To address the poor classification performance that results when the K Nearest Neighbor (KNN) decision rule treats every training sample as equally important, this paper proposes a text-weighted KNN text classification algorithm and applies it to the classification of spam short messages. After the feature words are extracted, a first weighting formula is introduced to account for the influence that the frequency of feature words within a text has on the importance of that text. For the spam message dataset, an association rule algorithm is then used to mine the co-occurring word sets that appear frequently in spam messages, and a second weighting formula is introduced on that basis. Finally, the two text-weight formulas are combined into a composite weight for each message text, which distinguishes how strongly each training sample influences the category decision and thereby improves the classification decision rule. Experimental results show that, compared with KNN without text weighting, the algorithm achieves considerably higher accuracy, recall, and F1 values for both spam and normal messages.
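The decision-rule change described in the abstract can be illustrated with a minimal sketch. The abstract does not give the two weighting formulas, so the per-sample `text_weight` values below are hypothetical placeholders standing in for the composite weight, and the names `cosine` and `weighted_knn` are illustrative rather than taken from the paper. The sketch only shows how a per-sample weight can enter the KNN vote over vector space model representations.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(v * b.get(term, 0.0) for term, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weighted_knn(query, train, k=3):
    """Classify `query` by a similarity vote scaled by each sample's weight.

    `train` is a list of (vector, label, text_weight) tuples. The
    text_weight is an assumed stand-in for the paper's composite weight.
    """
    # Pick the k training samples most similar to the query.
    neighbors = sorted(train, key=lambda s: cosine(query, s[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vector, label, text_weight in neighbors:
        # Plain KNN would add only the similarity; here it is scaled.
        scores[label] += text_weight * cosine(query, vector)
    return max(scores, key=scores.get)


# Toy data with made-up weights (not from the paper).
train = [
    ({"free": 1, "prize": 1}, "spam", 2.5),
    ({"free": 1, "lunch": 1}, "ham", 1.0),
    ({"prize": 1, "lunch": 1}, "ham", 1.0),
]
uniform = [(vec, label, 1.0) for vec, label, _ in train]
query = {"free": 1, "prize": 1, "lunch": 1}

weighted_label = weighted_knn(query, train, k=3)
uniform_label = weighted_knn(query, uniform, k=3)
```

In this toy setting all three neighbors are equally similar to the query, so the uniform vote follows the majority label, while the weighted vote lets the single heavily weighted sample decide the outcome.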
Source
《计算机工程》
CAS
CSCD
Peking University Core Journals (北大核心)
2017, Issue 3, pp. 193-199 (7 pages)
Computer Engineering
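Funding
Research Project of the Guangxi Key Laboratory of Trusted Software (kx201106)
Graduate Education Innovation Program of Guilin University of Electronic Technology (2016YJCX64)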
Keywords
spam filtering
association rule
feature selection
K Nearest Neighbor (KNN) algorithm
Vector Space Model (VSM)