Abstract
To address the poor performance of classifiers on imbalanced datasets, an oversampling algorithm based on local density and centrality is proposed. First, for every minority-class sample in the dataset, a Gaussian kernel function and local gravitation are used to calculate the local density and the centrality, respectively. A first type of new sample is then synthesized specifically in regions of low local density, which addresses within-class imbalance. Next, the boundary of the minority class is identified from differences in centrality, and a second type of new sample is synthesized there to strengthen the boundary. New samples are generated adaptively, which resolves a weakness of most oversampling algorithms: they either leave the oversampling quantity undefined or blindly aim for equal sample counts in the two classes. Finally, experiments on 12 public imbalanced datasets show that the algorithm performs well on datasets with both low and high imbalance.
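The two per-sample quantities named in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption rather than the authors' method: the function names, the neighbourhood size `k`, the bandwidth `sigma`, and the form of the centrality are all hypothetical. The density sums a Gaussian kernel over the k nearest minority neighbours. The centrality is a simplified stand-in for local gravitation with unit masses and no distance weighting: the unit vectors toward the k nearest neighbours are averaged, so interior points (neighbours on all sides) score near 1 and boundary points (neighbours on one side) score lower.

```python
import numpy as np


def _knn(X, k):
    """Return (distances, difference vectors) to each point's k nearest neighbours."""
    diff = X[None, :, :] - X[:, None, :]        # diff[i, j] = X[j] - X[i]
    dist = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbour
    idx = np.argsort(dist, axis=1)[:, :k]       # indices of the k nearest neighbours
    rows = np.arange(len(X))[:, None]
    return dist[rows, idx], diff[rows, idx]


def local_density(X, k=5, sigma=1.0):
    """Assumed form: sum of Gaussian kernel values over the k nearest neighbours."""
    d, _ = _knn(X, k)
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2)).sum(axis=1)


def local_centrality(X, k=5):
    """Assumed form: 1 - |mean unit vector toward the k nearest neighbours|."""
    d, vec = _knn(X, k)
    unit = vec / np.maximum(d, 1e-12)[..., None]  # guard against duplicate points
    resultant = unit.mean(axis=1)                 # net "pull" on each point
    return 1.0 - np.linalg.norm(resultant, axis=1)


if __name__ == "__main__":
    # Toy minority class: a 5 x 5 grid, so index 12 is the centre and index 0 a corner.
    grid = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
    rho = local_density(grid, k=8)
    cen = local_centrality(grid, k=8)
    print("density   centre vs corner:", rho[12], rho[0])
    print("centrality centre vs corner:", cen[12], cen[0])
```

Under these assumptions, samples with low density would be candidates for the first type of synthetic sample and samples with low centrality for the second (boundary) type. How many samples to generate for each is decided adaptively in the paper and is not reproduced here.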
Authors
冀常鹏
尚佳奇
代巍
JI Changpeng; SHANG Jiaqi; DAI Wei (School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China; Graduate School, Liaoning Technical University, Huludao 125105, China)
Source
《智能系统学报》
CSCD
PKU Core Journal (北大核心)
2024, Issue 3, pp. 525-533 (9 pages)
CAAI Transactions on Intelligent Systems