Abstract
Classification problems studied in traditional machine learning usually assume that the classes are balanced, but in many settings the class prior probabilities differ greatly, and many applications require recognizing an important but rare minority class. This paper compares three AdaBoost-based ensemble learning methods and derives lower bounds on their geometric mean accuracy (GMA). The analysis shows that the more imbalanced the classes are, the harder it is for these three methods to raise GMA by improving the accuracy of the base classifiers. Based on this conclusion, a single-sided Bagging algorithm built on Bagging is proposed: it resamples only the majority class while retaining all minority examples, so the training set in each round is class-balanced. Its effectiveness is verified on UCI datasets.
Classification problems in traditional machine learning usually assume that the classes are well balanced, but in many domains one class is represented by many examples while the other is represented by only a few, and many applications require classifying an important but rare minority class. Lower bounds on the geometric mean accuracy (GMA) of three AdaBoost-based ensemble methods are derived. The analysis shows that the more imbalanced the classes are, the more difficult it is to increase GMA by improving the accuracy of the base classifiers. A Bagging-based ensemble method, called single-sided Bagging (SSBagging), is then proposed: it retains all minority examples and bootstraps majority examples from the training set to create class-balanced "bags". Experiments on UCI datasets show the effectiveness of SSBagging.
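The sampling scheme and metric described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the choice to make each bag exactly twice the minority size, and the binary-class GMA form are assumptions for clarity.

```python
import math
import random

def single_side_bags(majority, minority, n_bags, seed=0):
    """Single-sided Bagging sampling sketch: each bag keeps every
    minority example and bootstraps (samples with replacement) an
    equal number of majority examples, so bags are class-balanced."""
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        # bootstrap only the majority class; minority kept intact
        boot = [rng.choice(majority) for _ in range(len(minority))]
        bags.append(boot + list(minority))
    return bags

def gma(acc_minority, acc_majority):
    """Geometric mean accuracy over the two per-class accuracies."""
    return math.sqrt(acc_minority * acc_majority)
```

A base classifier would then be trained on each bag and the ensemble's vote used for prediction; because every bag is balanced, the base learners are not biased toward the majority class.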
Source
《南京航空航天大学学报》
Indexed in: EI, CAS, CSCD, Peking University Core (北大核心)
2009, No. 4, pp. 520-526 (7 pages)
Journal of Nanjing University of Aeronautics & Astronautics
Funding
Supported by the National Natural Science Foundation of China (Grant No. 60603029)