Abstract
To address the shift of the separating hyperplane and the low recognition rate of the minority class on imbalanced data, an imbalanced-data classification algorithm based on sample density is proposed. The algorithm first computes the density of each sample and the density of each class, and determines the number of clusters from the relationship between the class densities. It then clusters the majority-class samples with the K-means algorithm using that number of clusters, and the resulting cluster centers replace the original majority-class samples. Finally, a new training set is built from these cluster centers and the minority-class samples, and training on it yields the final decision function. Experiments on an artificial dataset and six groups of UCI datasets show that the algorithm improves the classification performance of SVM on imbalanced data, especially for the minority class.
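The under-sampling step described above can be sketched roughly as follows. The abstract does not give the density formula or the rule that maps class densities to a cluster count, so both are assumptions here: density is taken as the inverse of the mean distance to the k nearest same-class neighbours, and `choose_n_clusters` is a hypothetical rule that scales the minority size by the density ratio. All function names are illustrative. The sketch uses only the Python standard library and stops at the rebalanced training set; the SVM would then be trained on `X`, `y` with whichever library is preferred.

```python
import math
import random

def knn_density(points, k=3):
    # ASSUMPTION: density = 1 / mean distance to the k nearest neighbours
    # within the same class (needs at least 2 points per class).
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_d = sum(dists[:k]) / min(k, len(dists))
        densities.append(1.0 / (mean_d + 1e-12))
    return densities

def class_density(points, k=3):
    # ASSUMPTION: class density = mean of the per-sample densities.
    d = knn_density(points, k)
    return sum(d) / len(d)

def choose_n_clusters(majority, minority, k=3):
    # HYPOTHETICAL rule (not from the paper): scale the minority size by the
    # ratio of class densities, then clamp to [len(minority), len(majority)].
    ratio = class_density(majority, k) / class_density(minority, k)
    n = round(len(minority) * ratio)
    return max(len(minority), min(len(majority), n))

def kmeans_centers(points, n_clusters, n_iter=50, seed=0):
    # Plain Lloyd-style K-means; returns the cluster centers as tuples.
    rng = random.Random(seed)
    centers = list(rng.sample(points, n_clusters))
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(n_clusters),
                          key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Recompute each center; an empty cluster keeps its old center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def rebalance(majority, minority, k=3, seed=0):
    # Replace the majority class by its cluster centers (label 0) and keep
    # every minority sample (label 1).
    n_clusters = choose_n_clusters(majority, minority, k)
    centers = kmeans_centers(majority, n_clusters, seed=seed)
    X = centers + list(minority)
    y = [0] * len(centers) + [1] * len(minority)
    return X, y

# Synthetic 2-D data for illustration only (not the paper's datasets).
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(20)]
X, y = rebalance(majority, minority)
```

Clamping the cluster count between the two class sizes is a design choice of this sketch: it guarantees the rebalanced majority is never smaller than the minority class and never larger than the original majority class.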
Source
《西华大学学报(自然科学版)》
CAS
2015, No. 5, pp. 16-23, 74 (9 pages in total)
Journal of Xihua University: Natural Science Edition
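Funding
Natural Science Foundation of Shaanxi Province (2014JM2-6122)
Science and Technology Program of the Shaanxi Provincial Department of Education (12JK0748)
Science and Technology Research Project of Shangluo University (13sky024)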
Keywords
support vector machine
imbalanced dataset
sample density
under-sampling
K-nearest neighbor