Research on improved TF-IDF Chinese mail recognition algorithm
Cited by: 9
Abstract The traditional TF-IDF algorithm does not weight segmented terms well: words that are representative of a mail category and occur frequently receive comparatively small IDF values, and a small IDF value implies weak discriminative power, which contradicts the actual situation. To improve the accuracy of spam recognition, a Chinese spam recognition method based on an improved TF-IDF algorithm and class center vectors is proposed. The traditional TF-IDF calculation is modified by adding the chi-square statistic (CHI) and a position influence factor, which improves the weights of important terms. This weighting is combined with mail text segmentation by the reverse maximum matching algorithm and feature selection by the class center vector algorithm to classify spam. Experimental results show that, compared with the traditional TF-IDF algorithm, the proposed algorithm improves the accuracy of spam recognition by about 3.6%, which indicates a certain practical application value.
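The abstract does not give the exact weighting formula, so the following is only a minimal sketch of how a chi-square statistic and a position factor might be folded into TF-IDF. The multiplicative combination, the function names `chi_square` and `term_weights`, and the `subject_boost` parameter are all assumptions made for illustration, not the authors' published method.

```python
import math
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a term/class 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def term_weights(tokens, subject_terms, corpus, labels, target, subject_boost=2.0):
    """Weight each term of one mail as TF * IDF * CHI * position factor.

    ASSUMPTION: the product form and the subject-line boost are illustrative;
    the source only states that CHI and a position factor are added to TF-IDF.
    """
    n_docs = len(corpus)
    doc_sets = [set(doc) for doc in corpus]
    weights = {}
    for term, count in Counter(tokens).items():
        # Build the contingency counts for this term against the target class.
        a = sum(1 for s, y in zip(doc_sets, labels) if term in s and y == target)
        b = sum(1 for s, y in zip(doc_sets, labels) if term in s and y != target)
        c = sum(1 for s, y in zip(doc_sets, labels) if term not in s and y == target)
        d = n_docs - a - b - c
        df = a + b
        idf = math.log(n_docs / df) if df else 0.0
        # Hypothetical position factor: terms in the subject line count more.
        position = subject_boost if term in subject_terms else 1.0
        tf = count / len(tokens)
        weights[term] = tf * idf * chi_square(a, b, c, d) * position
    return weights


# Toy corpus of already segmented mails (hypothetical data).
corpus = [["发票", "优惠", "免费"], ["发票", "中奖"],
          ["会议", "通知", "免费"], ["会议", "项目"]]
labels = ["spam", "spam", "ham", "ham"]
weights = term_weights(["发票", "免费"], {"发票"}, corpus, labels, "spam")
```

In this toy corpus both terms have the same document frequency and therefore the same IDF, yet "免费" is spread evenly across the two classes, so its chi-square statistic is zero, while "发票" appears only in spam. The chi-square factor is what separates them, which is the kind of correction the abstract attributes to the improved weighting.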
Authors WU Xiaoqing (吴小晴); WAN Guojin (万国金); LI Chengwen (李程文); LIN Mengsi (林梦思); CAO Shuqiang (曹书强) (School of Information Engineering, Nanchang University, Nanchang 330031, China)
Source Modern Electronics Technique (《现代电子技术》), PKU Core Journal, 2020, No. 12, pp. 83-86 (4 pages)
Funding National Natural Science Foundation of China (61661030).
Keywords TF-IDF algorithm; mail recognition; chi-square statistic (CHI); weight allocation; mail classification; simulation analysis