Abstract
Many classifiers struggle to identify minority-class samples in imbalanced data, which poses challenges for applications such as defect detection. Many state-of-the-art oversampling methods can effectively generate synthetic minority-class samples, but they risk increasing the overlap between the minority and majority classes. This study proposes an adaptive oversampling method based on Euclidean-distance clustering. The method first clusters the minority-class samples into sub-clusters using a constructive covering algorithm based on Euclidean distance. It then adaptively identifies the minority sub-clusters that are relatively safe and those that lie near the classification boundary, according to their distance to the boundary of the majority class. Finally, new synthetic minority-class samples are generated by applying the SMOTE sampling step within each selected sub-cluster. The method was evaluated on ten imbalanced datasets using the G-mean, F1-measure, and AUC metrics. The experimental results show that, compared with existing oversampling methods, the proposed method achieves the best G-mean, F1-measure, and AUC on most of the datasets. These results indicate that the proposed method effectively compensates for the shortcomings of existing classifiers and achieves good classification results.
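The three steps named in the abstract (cluster the minority class, select sub-clusters, interpolate within them) can be sketched roughly as follows. This is not the authors' implementation: the abstract does not give the covering radius, the rule for labelling sub-clusters as safe or borderline, or the SMOTE neighbour settings. The sketch therefore assumes a simplified greedy covering in which each sphere is centred on an uncovered minority sample with radius equal to the distance to its nearest majority sample, keeps every sub-cluster with at least two members, and replaces SMOTE's k-nearest-neighbour search with linear interpolation between two random members of the same sub-cluster. The function names cover_clusters and oversample_in_clusters are hypothetical.

```python
import numpy as np


def cover_clusters(X_min, X_maj):
    """Group minority samples with a simplified greedy covering (assumption).

    Each sphere is centred on the first still-uncovered minority sample; its
    radius is the Euclidean distance from that centre to the nearest majority
    sample, so no majority sample lies strictly inside the sphere.
    Returns a list of index arrays into X_min.
    """
    uncovered = np.ones(len(X_min), dtype=bool)
    clusters = []
    while uncovered.any():
        centre = np.flatnonzero(uncovered)[0]
        radius = np.linalg.norm(X_maj - X_min[centre], axis=1).min()
        dist = np.linalg.norm(X_min - X_min[centre], axis=1)
        inside = uncovered & (dist < radius)
        inside[centre] = True  # the centre always belongs to its own sphere
        clusters.append(np.flatnonzero(inside))
        uncovered &= ~inside
    return clusters


def oversample_in_clusters(X_min, clusters, n_new, rng):
    """Create n_new synthetic samples by interpolating inside sub-clusters.

    Assumed selection rule: only sub-clusters with two or more members are
    used, because interpolation needs a pair of distinct samples.
    """
    usable = [members for members in clusters if len(members) >= 2]
    if not usable:
        return np.empty((0, X_min.shape[1]))
    synthetic = []
    for _ in range(n_new):
        members = usable[rng.integers(len(usable))]
        i, j = rng.choice(members, size=2, replace=False)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because each new point lies on a segment between two members of one sphere, and a sphere is convex and contains no majority sample in its interior, synthetic samples in this sketch stay inside regions free of majority samples. That is one way to read the abstract's goal of limiting class overlap, under the assumptions stated above.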
Authors
董洪荣
付亚军
张帅
余亚强
陈军
谢德红
DONG Hong-rong; FU Li-jun; ZHANG Shuai; YU Ya-qiang; CHEN Jun; XIE De-hong (Jiangsu Jinjia New Style Packaging Material Co., Ltd., Huaian 223005, China; Hubei Qiangda Packaging Industry Co., Ltd., Hong’an 438400, China; College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China)
Source
《印刷与数字媒体技术研究》
CAS
Peking University Core Journals (北大核心)
2023, No. 5, pp. 26-41 (16 pages)
Printing and Digital Media Technology Study
Keywords
Imbalanced data
Classification
Euclidean distance
Clustering
Machine learning