Abstract
To address the pseudo-balancing problem that under-sampling techniques face on imbalanced data, a boosted equalization ensemble learning algorithm based on under-sampling is proposed. A new equalization sampling mechanism coordinates the prediction probabilities of the data through a binning operation to generate high-quality training subsets, which are then used to train the classifiers iteratively. During the iterations, each base classifier is adaptively assigned a weight based on its false-positive and false-negative rates on the original data, which prevents poorly performing classifiers from influencing the overall decision and improves the generalization ability of the ensemble model. The new algorithm improves the recognition of majority-class samples while eliminating pseudo-balancing, thereby reducing the impact of boundary ambiguity on the classification model. Comparative experiments on 18 small datasets and 2 large datasets show that the algorithm has advantages in imbalanced data classification.
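The two mechanisms named in the abstract, binning-based under-sampling and error-rate-based classifier weighting, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `binned_undersample` and `classifier_weight`, the equal-width bins, the even per-bin quota, and the AdaBoost-style weight on the mean of the false-positive and false-negative rates are all assumptions made for illustration, since the abstract does not give the exact formulas.

```python
import numpy as np

def binned_undersample(proba_majority, n_keep, n_bins=5, rng=None):
    """Pick n_keep majority-class indices spread evenly over probability bins.

    Assumption: equal-width bins on [0, 1] with an equal quota per bin;
    the paper's actual binning rule may differ.
    """
    rng = np.random.default_rng(rng)
    proba_majority = np.asarray(proba_majority, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run from 0 to n_bins - 1.
    bin_ids = np.digitize(proba_majority, edges[1:-1])
    per_bin = n_keep // n_bins
    parts = []
    for b in range(n_bins):
        members = np.flatnonzero(bin_ids == b)
        take = min(per_bin, members.size)
        if take > 0:
            parts.append(rng.choice(members, size=take, replace=False))
    chosen = np.concatenate(parts) if parts else np.array([], dtype=int)
    # Sparse bins can leave a shortfall; top up from the unused samples.
    shortfall = n_keep - chosen.size
    rest = np.setdiff1d(np.arange(proba_majority.size), chosen)
    if shortfall > 0 and rest.size > 0:
        extra = rng.choice(rest, size=min(shortfall, rest.size), replace=False)
        chosen = np.concatenate([chosen, extra])
    return np.sort(chosen)

def classifier_weight(y_true, y_pred, eps=1e-12):
    """Weight a base classifier from its FPR and FNR on the original data.

    Assumption: an AdaBoost-style log-odds weight on the mean of the two
    rates; labels are 0 (majority) and 1 (minority).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    neg = y_true == 0
    pos = y_true == 1
    fpr = np.sum(y_pred[neg] == 1) / max(neg.sum(), 1)
    fnr = np.sum(y_pred[pos] == 0) / max(pos.sum(), 1)
    err = 0.5 * (fpr + fnr)
    return 0.5 * np.log((1.0 - err + eps) / (err + eps))
```

Under these assumptions, a classifier that predicts only the majority class has a mean error rate of 0.5 and receives a weight close to zero, so it contributes almost nothing to the ensemble vote.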
Authors
白琳
俱通
王浩
雷明珠
潘晓英
BAI Lin; JU Tong; WANG Hao; LEI Mingzhu; PAN Xiaoying (School of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, Shaanxi, China; Shaanxi Province Key Laboratory of Network Data Analysis and Intelligent Processing, Xi'an 710121, Shaanxi, China)
Source
《山东大学学报(工学版)》
CAS
CSCD
Peking University Core Journals (北大核心)
2024, Issue 4, pp. 59-66 (8 pages)
Journal of Shandong University(Engineering Science)
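Funding
Key Research and Development Program of Shaanxi Province (2023-YBSF-476)
Innovation Fund of Xi'an University of Posts and Telecommunications (CXJJYL2022043)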
Keywords
under-sampling
class imbalance
imbalanced learning
ensemble learning
imbalanced data classification