Abstract
When existing Bayesian algorithms are applied to spam filtering, the Bernoulli model has low accuracy and cannot distinguish the importance of text features, while the multinomial model is computationally expensive, wastes time on irrelevant feature items, and is sensitive to feature items that occur only rarely. To address these shortcomings, a feature model named RSSI (remove similar and sensitive items) is proposed. By computing and comparing the occurrence frequencies of feature items, it removes irrelevant and sensitive items, which reduces the amount of computation, improves accuracy, and reduces overfitting. Matlab simulation results show that, compared with existing algorithms such as naive Bayes and the support vector machine (SVM), the RSSI algorithm significantly reduces classification time and lowers the probability of misclassifying legitimate emails.
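The abstract gives only the outline of RSSI: compare feature-item frequencies, drop irrelevant ("similar") and rare ("sensitive") items, then classify with a Bayesian model. The following Python sketch illustrates that general idea and is not the authors' implementation. The function names, the thresholds `min_count` and `min_ratio`, the ratio-based similarity test, and the toy data are all assumptions made for illustration; the paper's actual criteria are not stated in the abstract.

```python
import math
from collections import Counter


def select_features(spam_docs, ham_docs, min_count=2, min_ratio=2.0):
    """Keep tokens that are neither rare nor similarly frequent in both classes.

    Both thresholds are hypothetical; the abstract does not specify them.
    """
    spam_counts = Counter(tok for doc in spam_docs for tok in doc)
    ham_counts = Counter(tok for doc in ham_docs for tok in doc)
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    kept = set()
    for tok in set(spam_counts) | set(ham_counts):
        s, h = spam_counts[tok], ham_counts[tok]
        if s + h < min_count:
            continue  # assumed "sensitive" item: occurs too rarely
        # Smoothed relative frequency of the token in each class
        p_spam = (s + 1) / (spam_total + 2)
        p_ham = (h + 1) / (ham_total + 2)
        if max(p_spam, p_ham) / min(p_spam, p_ham) < min_ratio:
            continue  # assumed "similar" item: frequencies barely differ
        kept.add(tok)
    return kept


def train(spam_docs, ham_docs, vocab):
    """Multinomial naive Bayes with Laplace smoothing, restricted to vocab."""
    n_docs = len(spam_docs) + len(ham_docs)
    model = {}
    for label, docs in (("spam", spam_docs), ("ham", ham_docs)):
        counts = Counter(t for d in docs for t in d if t in vocab)
        total = sum(counts.values())
        model[label] = {
            "prior": math.log(len(docs) / n_docs),
            "loglik": {
                t: math.log((counts[t] + 1) / (total + len(vocab)))
                for t in vocab
            },
        }
    return model


def classify(model, doc):
    """Return the label with the highest log-posterior; unseen tokens are skipped."""
    scores = {
        label: m["prior"] + sum(m["loglik"][t] for t in doc if t in m["loglik"])
        for label, m in model.items()
    }
    return max(scores, key=scores.get)


# Toy, made-up data: each email is a list of tokens.
spam_docs = [["win", "money", "now", "the"], ["free", "money", "the", "offer"]]
ham_docs = [["meeting", "the", "schedule", "now"],
            ["project", "the", "schedule", "report"]]

vocab = select_features(spam_docs, ham_docs)
model = train(spam_docs, ham_docs, vocab)
print(sorted(vocab))                               # ['money', 'schedule']
print(classify(model, ["free", "money", "now"]))   # spam
```

In this toy run, "the" and "now" are dropped because they are equally frequent in both classes, and single-occurrence tokens are dropped as rare, so the classifier only evaluates two feature items. A smaller vocabulary of this kind is how the reduction in computation described in the abstract would arise.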
Source
Computer Engineering and Design (《计算机工程与设计》)
Peking University Core Journal (北大核心)
2015, No. 7, pp. 1790-1793 (4 pages)
Funding
Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education of China (20114101110005)
Keywords
mail classification
Bayesian classifier
feature extraction
multinomial event model
overfitting