Abstract
Traditional random forest algorithms suffer from high computational complexity and poor classification performance on text data, which is high-dimensional and noisy. To address this, we propose an improved random forest algorithm based on the latent Dirichlet allocation (LDA) topic model. The algorithm uses LDA to model the original text and map it into a topic space, which preserves the main themes of the original text while greatly reducing the influence of textual noise on classification. In addition, to improve on the random feature selection used when building decision trees in a random forest, symmetric uncertainty is used during tree generation to measure the correlation between features, which reduces the correlation between different decision trees. Finally, the improved random forest algorithm classifies the text in the topic space. Experiments show that the proposed algorithm achieves superior performance on text classification.
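The abstract names symmetric uncertainty as the measure of correlation between features but does not give its formula or say how the scores drive feature selection. As a minimal sketch, assuming the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)) over discrete values, the measure could be computed as follows. The function names `entropy` and `symmetric_uncertainty` are illustrative, and the assumption that topic-space features are discretized before scoring is ours, not the paper's.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """Standard symmetric uncertainty: 2 * I(X; Y) / (H(X) + H(Y)), in [0, 1].

    Assumption: x and y are equal-length sequences of discrete (hashable)
    values, e.g. topic proportions that have already been binned.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        # Both variables are constant; treat them as uncorrelated by convention.
        return 0.0
    h_xy = entropy(list(zip(x, y)))          # joint entropy H(X, Y)
    mutual_information = h_x + h_y - h_xy    # I(X; Y)
    return 2.0 * mutual_information / (h_x + h_y)


# Hypothetical binned topic features for four documents.
topic_a = [0, 0, 1, 1]
topic_b = [0, 0, 1, 1]   # carries the same information as topic_a
topic_c = [0, 1, 0, 1]   # independent of topic_a in this sample

su_redundant = symmetric_uncertainty(topic_a, topic_b)
su_independent = symmetric_uncertainty(topic_a, topic_c)
```

A score near 1 marks a pair of features as redundant and a score near 0 marks them as nearly independent. How the paper turns these scores into a selection rule during tree generation is not stated in the abstract, so that step is left out here.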
Source
Computer Applications and Software (《计算机应用与软件》)
2017, No. 8, pp. 173-178, 212 (7 pages in total)
Funding
Jiangsu Province Industry-University-Research Cooperation Project (BY2015019-30)
Keywords
Latent Dirichlet allocation (LDA)
Topic model
Random forest
Feature evaluation
Text categorization