Abstract
In classification analysis, certain special classes often carry the most important information. This paper proposes USCensemble, a classification algorithm for class-imbalanced data based on ensemble learning, undersampling and cost sensitivity, to address the difficulty traditional algorithms have in correctly identifying minority-class samples. The algorithm adopts the structure of EasyEnsemble. After each round of training, undersampling selects the majority-class samples with the largest weights from that round and combines them with the minority-class samples to form a temporary, balanced training set for the next round. During training, minority-class samples are also assigned a higher misclassification cost, which rapidly raises the weights of misclassified minority-class samples and lowers the weights of majority-class samples, so that the algorithm favours correct classification of the minority class. Comparative experiments on classification tasks generated from 10 UCI data sets show that the algorithm identifies minority-class samples better and is competitive on class-imbalanced classification.
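The two mechanisms named in the abstract, weight-based undersampling and cost-sensitive reweighting, can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption for illustration: the function names, the label coding (+1 for the minority class, -1 for the majority class), the AdaBoost-style exponential update, and the `minority_cost` parameter are hypothetical and are not taken from the paper.

```python
import math


def balanced_subset(weights, labels):
    """Return indices of all minority samples (+1) plus an equal number of
    majority samples (-1) chosen by largest current weight."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == -1]
    # Undersample the majority class: keep the highest-weight samples.
    majority.sort(key=lambda i: weights[i], reverse=True)
    return sorted(minority + majority[:len(minority)])


def reweight(weights, labels, preds, alpha, minority_cost=2.0):
    """Assumed AdaBoost-style update: misclassified samples gain weight,
    and misclassified minority samples gain it faster (cost > 1)."""
    updated = []
    for w, y, p in zip(weights, labels, preds):
        if y != p:
            cost = minority_cost if y == 1 else 1.0
            updated.append(w * math.exp(alpha * cost))
        else:
            updated.append(w * math.exp(-alpha))
    total = sum(updated)
    return [w / total for w in updated]  # normalise to sum to 1


# Toy data: two minority samples and four majority samples.
labels = [1, 1, -1, -1, -1, -1]
weights = [0.1, 0.1, 0.3, 0.05, 0.25, 0.2]
print(balanced_subset(weights, labels))  # → [0, 1, 2, 4]

# Sample 0 (minority) and sample 3 (majority) are misclassified.
preds = [-1, 1, -1, 1, -1, -1]
new_weights = reweight([1 / 6] * 6, labels, preds, alpha=0.5)
```

In this sketch the misclassified minority sample ends up with the largest weight after normalisation, which is the qualitative behaviour the abstract describes; the actual cost values and update rule used by USCensemble may differ.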
Author
贺指陈
HE Zhichen (School of Applied Mathematics, Guangdong University of Technology, Guangzhou, Guangdong 510520, China)
Source
Information Recording Materials (《信息记录材料》)
2022, No. 1, pp. 18-22 (5 pages)
Keywords
Class imbalance data
Classification
Ensemble learning
Undersampling
Cost-sensitiveness