Bi-directional Oversampling Method Based on Sample Stratification (cited by: 5)
Abstract: Resampling has become an important approach to classifying imbalanced data because it is simple and intuitive. When the data set is small, however, under-sampling may discard important information, so oversampling is the main research focus for imbalanced data classification. Existing oversampling methods effectively address between-class imbalance, but they can make already dense regions of the minority class denser and may even cause samples to overlap. In addition, because minority-class samples may contain noise, existing methods may generate new samples around noisy points, which makes the minority-class distribution more disordered. To address these problems, this paper proposes a bi-directional oversampling method based on sample stratification. The method first divides the minority-class samples into a dense area and a sparse area based on the highest-density point and the intra-class average distance, and then performs bi-directional oversampling on the boundary-region samples of the dense area and on the samples of the sparse area. To verify its effectiveness, the proposed algorithm is compared with other oversampling algorithms on 9 UCI data sets. The experimental results, together with the Friedman test and other statistical tests, show that the proposed algorithm has certain advantages for imbalanced data classification.
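The stratification step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so the following are assumptions made only for illustration: the density of a sample is taken as the number of other minority samples within the intra-class average distance, and the dense area is taken as every sample within that distance of the highest-density point. The function name `stratify_minority` is hypothetical, and the bi-directional oversampling step itself is not shown because the abstract does not specify it.

```python
import math
from itertools import combinations


def stratify_minority(samples):
    """Split minority samples into (dense, sparse) groups.

    Assumptions (not taken from the paper): density of a point is the
    number of other samples within the intra-class average distance, and
    the dense area is every sample within that distance of the
    highest-density point.
    """
    if len(samples) < 2:
        return list(samples), []

    # Intra-class average distance: mean over all unordered pairs.
    pairs = list(combinations(samples, 2))
    avg_dist = sum(math.dist(a, b) for a, b in pairs) / len(pairs)

    # Density: count of other samples no farther away than avg_dist.
    def density(p):
        return sum(
            1 for q in samples
            if q is not p and math.dist(p, q) <= avg_dist
        )

    # Highest-density point (first one wins on ties).
    center = max(samples, key=density)

    dense = [p for p in samples if math.dist(p, center) <= avg_dist]
    sparse = [p for p in samples if math.dist(p, center) > avg_dist]
    return dense, sparse


# Toy minority class: a tight cluster plus one far-away sample.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
dense, sparse = stratify_minority(minority)
```

On this toy input the four clustered points fall in the dense group and the isolated point falls in the sparse group. Under the method summarized above, new samples would then be generated for the dense-area boundary samples and for the sparse-area samples.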
Authors: ZHOU Xiao-min; CAO Fu-yuan; YU Li-qin (School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China; Key Laboratory of Computational Intelligence and Chinese Information Processing (Shanxi University), Ministry of Education, Taiyuan 030006, China)
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2019, No. 12, pp. 83-88 (6 pages)
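Funding: Supported by the National Natural Science Foundation of China (61573229), the Key Research and Development Program of Shanxi Province (201803D31022), the Shanxi Province Overseas Study Fund (2016-003), and the Shanxi Province Overseas Study Fund merit-based funding program (2016-001)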
Keywords: Imbalanced data; Classification; Bi-directional oversampling; Dense area; Sparse area