Abstract
To address the poor minority-class accuracy that classifiers achieve on imbalanced data, a Double Cost-Sensitive Random Forest (DCS-RF) algorithm is proposed. DCS-RF introduces cost-sensitive learning into both the feature selection stage and the ensemble voting stage of the random forest. In the feature selection stage, a method for generating the cost vector with lower time complexity is proposed, and the cost vector is incorporated into the computation of split attributes, so that strong features are favored without destroying the randomness of the random forest. In the ensemble stage, misclassification cost is introduced to select a set of decision trees that is more sensitive to minority-class data. Experimental results on UCI datasets show that the proposed algorithm achieves a higher overall recognition rate than the comparison methods, with an average improvement of 2.46%, and improves the minority-class recognition rate by more than 5% overall.
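The abstract does not give the formulas used in either stage, so the following is only an illustrative sketch of the two ideas it names, not the authors' DCS-RF implementation. It assumes (hypothetically) that the cost vector can be treated as per-feature weights for biased random sampling of split candidates, and that the ensemble stage keeps the trees with the lowest misclassification cost on held-out data. All function names, weights, and cost values are invented for illustration.

```python
import random


def weighted_feature_sample(weights, k, rng):
    """Draw k distinct feature indices, each with probability proportional
    to its weight (assumed stand-in for the paper's cost vector).
    Sampling stays random, but higher-weight features are favored."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        total = sum(weights[i] for i in remaining)
        r = rng.random() * total
        acc = 0.0
        for i in remaining:
            acc += weights[i]
            if r < acc:
                break
        # If rounding prevents a break, i is the last remaining index.
        chosen.append(i)
        remaining.remove(i)
    return chosen


def misclassification_cost(y_true, y_pred, cost):
    """Sum cost[(true, pred)] over all wrongly predicted samples."""
    return sum(cost.get((t, p), 0.0)
               for t, p in zip(y_true, y_pred) if t != p)


def select_trees(predictions, y_true, cost, keep):
    """Return the indices of the `keep` trees with the lowest total
    misclassification cost on a validation set."""
    order = sorted(
        range(len(predictions)),
        key=lambda j: misclassification_cost(y_true, predictions[j], cost),
    )
    return order[:keep]


# Stage 1: biased but still random choice of split candidates.
rng = random.Random(0)
weights = [0.1, 0.6, 0.2, 0.1]  # hypothetical cost vector
subset = weighted_feature_sample(weights, 2, rng)

# Stage 2: class 1 is the minority; missing it costs 5, a false alarm costs 1.
y_true = [1, 1, 0, 0, 0, 0]
cost = {(1, 0): 5.0, (0, 1): 1.0}  # hypothetical cost values
tree_preds = [
    [0, 0, 0, 0, 0, 0],  # misses both minority samples: cost 10
    [1, 1, 1, 1, 0, 0],  # two false alarms: cost 2
    [1, 0, 0, 0, 0, 0],  # misses one minority sample: cost 5
]
kept = select_trees(tree_preds, y_true, cost, keep=2)
print(subset, kept)
```

In this toy setup the tree that always predicts the majority class has the best plain accuracy among the three only marginally, yet it is the one discarded, which is the kind of minority-sensitive selection the abstract describes.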
Authors
周炎龙
孙广路
ZHOU Yan-long; SUN Guang-lu (School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China)
Source
《哈尔滨理工大学学报》
CAS
Peking University Core Journals (北大核心)
2021, No. 5, pp. 44-50 (7 pages)
Journal of Harbin University of Science and Technology
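Funding
National Natural Science Foundation of China (61702140)
Heilongjiang Provincial Science Foundation for Returned Overseas Scholars (LC2018030)
Fundamental Research Funds for Heilongjiang Provincial Universities (JMRH2018XM04)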
Keywords
random forest
imbalanced data
feature selection
cost-sensitive