Abstract
To reduce the influence of term frequency information and document length on classification results for massive text collections, a Bayesian classification model with normalized term frequency is proposed. Built on the MapReduce distributed computing framework, the method normalizes document length using a logarithmic average of high-frequency term features, which addresses the crude parameter estimation of the Naive Bayes model in text classification. Experimental results show that the method achieves higher classification accuracy than Naive Bayes and has good scalability and extensibility, so it can be applied to fast text classification on large-scale data.
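The abstract does not give the exact normalization formula, so the following is only a minimal sketch of the general idea under stated assumptions: raw term counts are damped with log(1 + tf) and then scaled so that each document's weights sum to 1, and the resulting weights replace raw counts in a multinomial Naive Bayes estimate. The map and reduce steps are simulated in a single process rather than run on a real MapReduce cluster, and all function names (`normalized_tf`, `map_doc`, `reduce_pairs`, `train`, `classify`) and the smoothing parameter `alpha` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def normalized_tf(tokens):
    """Assumed transform: damp counts with log(1 + tf), then scale to sum to 1."""
    counts = Counter(tokens)
    damped = {t: math.log(1.0 + c) for t, c in counts.items()}
    total = sum(damped.values())
    return {t: w / total for t, w in damped.items()} if total else {}


def map_doc(label, tokens):
    """Map step: emit one document count per label, then ((label, term), weight)."""
    yield (label, None), 1.0
    for term, weight in normalized_tf(tokens).items():
        yield (label, term), weight


def reduce_pairs(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return totals


def train(docs, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from (label, tokens) pairs."""
    totals = reduce_pairs(p for label, tokens in docs for p in map_doc(label, tokens))
    vocab = {term for (_, term) in totals if term is not None}
    labels = {label for (label, _) in totals}
    n_docs = sum(totals[(label, None)] for label in labels)
    model = {}
    for label in labels:
        mass = sum(totals.get((label, t), 0.0) for t in vocab)
        log_prior = math.log(totals[(label, None)] / n_docs)
        log_like = {
            t: math.log((totals.get((label, t), 0.0) + alpha) / (mass + alpha * len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_like)
    return model


def classify(model, tokens):
    """Return the label with the highest posterior score; unseen terms are ignored."""
    weights = normalized_tf(tokens)

    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(w * log_like[t] for t, w in weights.items() if t in log_like)

    return max(model, key=score)
```

Calling `train` on a list of `(label, tokens)` pairs returns a model that `classify` can apply to a new token list. Because every document contributes a total weight of 1 regardless of its length, long documents do not dominate the per-class estimates, which is the effect the abstract attributes to length normalization.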
Source
Computer Engineering and Design (《计算机工程与设计》)
Peking University Core Journal
2016, No. 3, pp. 799-802 (4 pages)
Keywords
text classification
Naive Bayes
parameter estimation
term frequency features
parallel computing