Abstract
[Objective] This study aims to reduce the impact of class-imbalanced data on classification accuracy. [Methods] First, an adaptive k-means clustering algorithm clusters the majority class to find and remove outliers. Next, the weighted distances between the data points and the cluster centers are computed and sorted, and the majority class is sampled sequentially according to cluster density. Finally, the sampled data are merged with the minority class and fed to a classification algorithm for training. [Results] On 25 imbalanced datasets, the average maximum AUC reached 0.912, at least 0.014 higher than the other methods, with an average running time of only 1.377 s. The algorithm also performed well on two imbalanced big datasets. [Limitations] The method applies only to binary classification and is not suited to multi-class problems. [Conclusions] The algorithm can find the optimal k value, detect and remove outliers, address the class imbalance problem, and improve classification accuracy. It is fast and has low overhead, which makes it suitable for imbalanced big datasets.
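The three steps in [Methods] can be illustrated with a minimal Python sketch. The abstract does not specify the adaptive k selection, the distance weighting, or the exact sampling rule, so everything below is an assumption made for illustration: k is fixed by the caller, plain Euclidean distance stands in for the weighted distance, outliers are points farther than the cluster mean plus two standard deviations, and each cluster's quota is proportional to its size as a rough proxy for density. The names `kmeans`, `assign`, and `undersample_majority` are hypothetical and not taken from the paper.

```python
import math
import random


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def kmeans(points, centers, iters=50):
    """Plain Lloyd's k-means from given initial centers (not the adaptive variant)."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        labels = assign(points, centers)
        updated = []
        for c, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous center.
            updated.append(tuple(sum(x) / len(members) for x in zip(*members))
                           if members else old)
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)


def undersample_majority(majority, n_keep, k=2, init_centers=None, seed=0):
    """Cluster the majority class, drop outliers, and keep about n_keep points."""
    if init_centers is None:
        init_centers = random.Random(seed).sample(majority, k)
    centers, labels = kmeans(majority, init_centers)

    clusters = []
    for c, center in enumerate(centers):
        # Sort members by distance to their center (stand-in for weighted distance).
        dists = sorted((math.dist(p, center), p)
                       for p, lab in zip(majority, labels) if lab == c)
        if not dists:
            continue
        mean = sum(d for d, _ in dists) / len(dists)
        std = math.sqrt(sum((d - mean) ** 2 for d, _ in dists) / len(dists))
        # Assumed outlier rule: discard points beyond mean + 2 * std.
        clusters.append([p for d, p in dists if d <= mean + 2 * std])

    total = sum(len(members) for members in clusters)
    sampled = []
    for members in clusters:
        # Assumed quota: proportional to cluster size after outlier removal.
        quota = min(len(members), round(n_keep * len(members) / total))
        if quota == 0:
            continue
        # Walk the sorted list at an even stride so near and far points both appear.
        step = len(members) / quota
        sampled.extend(members[int(i * step)] for i in range(quota))
    return sampled
```

The returned points would then be concatenated with the minority class and passed to any binary classifier. Because of rounding, the number of points kept can differ slightly from `n_keep`.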
Authors
周倩
姚震
孙博
Zhou Qian; Yao Zhen; Sun Bo (College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China; Library of Shandong Agricultural University, Taian 271018, China)
Source
《数据分析与知识发现》
CSSCI
CSCD
Peking University Core Journals (北大核心)
2022, Issue 5, pp. 127-136 (10 pages)
Data Analysis and Knowledge Discovery
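Funding
This work is one of the research outcomes of the Shandong Provincial Natural Science Foundation Youth Fund Project (Grant No. ZR2018QF002)
and the Shandong Agricultural University Library and Information Research Project (Grant No. TQ201902).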
Keywords
Class Imbalance
Clustering
Weighted Distance
Undersampling