Abstract
Classification of imbalanced data often suffers from severe class imbalance and low accuracy on minority-class samples, and as data volumes grow, classification efficiency also becomes a bottleneck. To address these problems, this paper exploits the efficient data processing capability of Spark and proposes an ensemble classification method for imbalanced data based on comprehensive weights in the Spark environment. First, the method samples the majority-class data according to a comprehensive weight derived from the weight of each class within the majority-class samples and the number of minority-class samples, and combines the sampled data with the minority-class samples to form a balanced training set. Second, a correlation-based feature selection method is used to select the optimal feature subset, and the random forest algorithm is improved and optimized and then used to obtain the sub-classifiers. Finally, the method is validated experimentally on UCI data sets in the Spark environment. The experimental results show that the proposed method improves both the overall classification accuracy and the classification efficiency.
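The sampling step in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not give the exact comprehensive-weight formula, so the quota rule below is an assumption: each majority class receives a share of the minority-class size proportional to its share of the majority pool. The function name `balanced_undersample` and its parameters are hypothetical and are not taken from the paper, which runs on Spark rather than plain Python.

```python
import random
from collections import defaultdict


def balanced_undersample(samples, minority_label, seed=0):
    """Undersample majority classes to roughly match the minority class size.

    samples: list of (features, label) pairs.
    ASSUMPTION (not from the paper): the quota for majority class k is
    round(n_minority * n_k / n_majority), i.e. its share of the majority
    pool scaled to the number of minority samples.
    """
    rng = random.Random(seed)

    # Group the samples by class label.
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append((features, label))

    # Separate the minority class; everything left is treated as majority.
    minority = by_label.pop(minority_label)
    n_minority = len(minority)
    n_majority = sum(len(rows) for rows in by_label.values())

    balanced = list(minority)
    for label, rows in by_label.items():
        weight = len(rows) / n_majority          # class share within the majority pool
        quota = min(len(rows), round(weight * n_minority))
        balanced.extend(rng.sample(rows, quota))  # sample without replacement
    return balanced
```

With 10 minority samples and two majority classes of 60 and 30 samples, this rule keeps 7 and 3 majority samples respectively, giving a 10-versus-10 training set. Repeating the call with different seeds would yield the differing balanced subsets on which an ensemble's sub-classifiers could be trained.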
Authors
丁家满
王思晨
贾连印
游进国
姜瑛
DING Jia-man; WANG Si-chen; JIA Lian-yin; YOU Jin-guo; JIANG Ying (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journals (北大核心)
2019, No. 2, pp. 255-259 (5 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (Grant Nos. 51467007, 61562054, 61462050)