Abstract
When handling imbalanced data, oversampling techniques can still cause data leakage even when the training and test sets are disjoint. To address this, this paper proposes stratified SMOTE cross-validation (SSCV): the training set is divided into K non-overlapping folds with each class distributed evenly across them, and the SMOTE algorithm is then applied independently within each fold, so that minority-class features are used only inside the fold that contains them. This guarantees full independence between training and validation data, preventing data leakage while still allowing the classifier to learn minority-class features effectively. Ensemble learning and parameter optimization techniques are further integrated to strengthen the model's classification and generalization ability. Experiments on UCI datasets show that SSCV matches or outperforms existing methods such as TS-SMOTE, and that the data-distribution differences induced by different values of K affect classifier performance. The proposed method improves the model's ability to handle imbalanced data and offers a useful reference for imbalanced learning problems.
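The core idea of the method can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a NumPy-only toy SMOTE (random interpolation toward a nearest minority neighbour) and a simple round-robin stratified split; the function names `smote_oversample` and `stratified_smote_folds` are invented for this sketch.

```python
import numpy as np
from collections import Counter

def smote_oversample(X_min, n_new, k=5, rng=None):
    # Minimal SMOTE: pick a minority point, pick one of its k nearest
    # minority neighbours, and interpolate between them at a random gap.
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]  # nearest neighbours, self excluded
    out = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = nn[i, rng.integers(k)]
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(out)

def stratified_smote_folds(X, y, K=5, seed=0):
    # Stratified split of the training set into K folds, then SMOTE applied
    # *inside each fold only*: synthetic samples in a fold are generated
    # exclusively from that fold's minority points, so no minority-class
    # information leaks between folds.
    rng = np.random.default_rng(seed)
    idx_by_class = {c: rng.permutation(np.flatnonzero(y == c))
                    for c in np.unique(y)}
    folds = []
    for f in range(K):
        # round-robin assignment keeps class proportions equal across folds
        fold_idx = np.concatenate([idx[f::K] for idx in idx_by_class.values()])
        Xf, yf = X[fold_idx], y[fold_idx]
        counts = Counter(yf)
        maj = max(counts.values())
        X_parts, y_parts = [Xf], [yf]
        for c, cnt in counts.items():
            if cnt < maj and cnt >= 2:  # need >= 2 points to interpolate
                X_new = smote_oversample(Xf[yf == c], maj - cnt, rng=rng)
                X_parts.append(X_new)
                y_parts.append(np.full(len(X_new), c))
        folds.append((np.vstack(X_parts), np.concatenate(y_parts)))
    return folds
```

Each returned fold is class-balanced and self-contained, so any fold can serve as the validation split while the others are used for training, without oversampled points crossing the boundary.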
Authors
LI Jiajing
LIN Shaocong
ZHENG Hanxiu
LI Jiajing; LIN Shaocong; ZHENG Hanxiu (School of Mathematics and Statistics, Fujian Normal University, Fuzhou, Fujian 350117, China; College of Computer and Cyber Security, Fujian Normal University, Fuzhou, Fujian 350117, China; College of Computer and Data Science, Minjiang University, Fuzhou, Fujian 350108, China)
Source
Journal of Minjiang University (《闽江学院学报》)
2024, No. 2, pp. 56-68 (13 pages)