Abstract
Imbalanced data, in which the minority class has far fewer samples than the majority class, is a common and widely studied problem in machine learning. Traditional methods based on minimizing the error rate tend to be biased toward the majority class, which lowers prediction accuracy on the minority class, and they often have high time complexity. To address these problems, this paper proposes pseudo-negative sampling, a data sampling method based on the spatial distribution of the samples. Pseudo-negative samples are samples labeled as negative (majority class) that are nevertheless strongly correlated with the positive (minority class) samples. The algorithm has three key steps: 1) compute the spatial center of the positive samples and the average distance from the positive samples to that center; 2) compute the distance from each negative sample to the spatial center using the same distance measure, and mark every negative sample whose distance is less than the average distance as a pseudo-negative sample; 3) remove the pseudo-negative samples from the negative sample set and add them to the positive sample set. Because the method does not change the number of samples in the original dataset, it neither introduces noise samples nor causes potential information loss; it improves accuracy on the minority class without reducing overall classification accuracy, and its time complexity is low. Experiments from multiple perspectives on thirteen datasets show that pseudo-negative sampling achieves high prediction accuracy.
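The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the distance measure or how the spatial center is defined, so this sketch assumes Euclidean distance and takes the center to be the mean of the positive samples. The function name `pseudo_negative_sampling` and the label convention (1 = positive, 0 = negative) are hypothetical choices made for illustration.

```python
import numpy as np

def pseudo_negative_sampling(X, y, pos_label=1, neg_label=0):
    """Relabel negatives that lie close to the positive-class center.

    Assumptions (not stated in the abstract): Euclidean distance, and
    the spatial center is the arithmetic mean of the positive samples.
    Returns the new label vector and a boolean mask of relabeled samples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).copy()
    pos_mask = y == pos_label
    neg_mask = y == neg_label
    if not pos_mask.any():
        raise ValueError("at least one positive sample is required")

    # Step 1: spatial center of the positives and their average distance to it.
    center = X[pos_mask].mean(axis=0)
    avg_dist = np.linalg.norm(X[pos_mask] - center, axis=1).mean()

    # Step 2: negatives closer to the center than the average distance
    # are marked as pseudo-negative samples.
    dist = np.linalg.norm(X - center, axis=1)
    pseudo_mask = neg_mask & (dist < avg_dist)

    # Step 3: move pseudo-negatives from the negative set to the positive set.
    y[pseudo_mask] = pos_label
    return y, pseudo_mask


if __name__ == "__main__":
    X = [[0, 0], [2, 0], [1, 0.5], [5, 5], [6, 5]]
    y = [1, 1, 0, 0, 0]
    new_y, pseudo = pseudo_negative_sampling(X, y)
    print(new_y.tolist())   # [1, 1, 1, 0, 0]
    print(pseudo.tolist())  # [False, False, True, False, False]
```

In this toy example the positive center is (1, 0) and the average positive distance is 1, so only the negative at (1, 0.5) is relabeled. The total number of samples is unchanged, and each step is a single pass over the data, which is consistent with the low time cost claimed in the abstract.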
Authors
ZHANG Yong-Qing (张永清)
LU Rong-Zhao (卢荣钊)
QIAO Shao-Jie (乔少杰)
HAN Nan (韩楠)
GUTIERREZ Louis Alberto
ZHOU Ji-Liu (周激流)
ZHANG Yong-Qing; LU Rong-Zhao; QIAO Shao-Jie; HAN Nan; GUTIERREZ Louis Alberto; ZHOU Ji-Liu (School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China; School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China; School of Management, Chengdu University of Information Technology, Chengdu 610103, China; Department of Computer Science, Rensselaer Polytechnic Institute, New York 12180, USA)
Source
《自动化学报》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2022, No. 10, pp. 2549-2563 (15 pages)
Acta Automatica Sinica
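Funding
National Natural Science Foundation of China (61702058, 61772091, 61802035, 61962006)
Sichuan Science and Technology Program (2021JDJQ0021, 22ZDYF2680, 2021YZD0009, 2021ZYD0033)
Chengdu Technological Innovation Research and Development Project (2021-YF05-00491-SN)
Chengdu Major Science and Technology Innovation Project (2021-YF08-00156-GX)
Chengdu "Open Competition" (揭榜挂帅) Science and Technology Project (2021-JB00-00025-GX)
Digital Media Art Key Laboratory of Sichuan Province, Sichuan Conservatory of Music (21DMAKL02)
Guangdong Basic and Applied Basic Research Foundation (2020B1515120028)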
Keywords
Imbalanced data (不平衡数据)
Spatial distribution (样本空间)
Machine learning (机器学习)
Sampling method (采样方法)
Spatial center (空间中心)