
Unsupervised Feature Selection Method for Imbalanced Data (不平衡数据的无监督特征选择方法)
Cited by: 8
Abstract: Most traditional feature selection methods assume data with a balanced class distribution and aim to optimize overall classification accuracy, so few of them learn well on imbalanced data sets. Based on the distribution characteristics of the data, this paper proposes a new feature selection method for imbalanced data sets. Working in an unsupervised setting, the method compensates for the imbalance by assigning different weights to the feature importance measure of the same feature in different clusters, according to how the cluster sizes vary. Experiments on several imbalanced UCI data sets show that, compared with several classic feature selection methods, the proposed method selects fewer features without reducing overall classification accuracy, and also improves the precision, recall, and F-Measure of the minority class on different classifiers.
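Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2013, Issue 1, pp. 63-67 (5 pages)
Funding: National Natural Science Foundation of China (61070061); Ministry of Education Humanities and Social Sciences Research Youth Project (11YJCZH086); Guangdong University of Foreign Studies Youth Project (11Q01); Guangdong Province High-Level Talent Project
Keywords: feature selection; imbalanced data sets; clustering; feature importance measure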
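
The abstract describes weighting a per-cluster feature importance measure by cluster size but does not give the measure or the weighting formula. The sketch below is one possible reading of that idea and not the paper's algorithm. It assumes cluster labels are already available, uses inverse cluster size as the weight, and scores a feature by how far each cluster mean sits from the global mean. The names `cluster_weights` and `weighted_feature_importance` are hypothetical.

```python
from statistics import fmean, pvariance


def cluster_weights(sizes):
    """Assumed weighting: inverse cluster size, normalized to sum to 1."""
    inverse = [1.0 / n for n in sizes]
    total = sum(inverse)
    return [v / total for v in inverse]


def weighted_feature_importance(rows, labels):
    """Score each feature with small clusters weighted up.

    Assumed per-cluster measure: squared distance between the cluster
    mean and the global mean, divided by the global variance.
    """
    clusters = sorted(set(labels))
    members = {c: [r for r, l in zip(rows, labels) if l == c] for c in clusters}
    weights = cluster_weights([len(members[c]) for c in clusters])

    scores = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        mean_all = fmean(column)
        var_all = pvariance(column) or 1.0  # avoid dividing by zero on constant features
        score = 0.0
        for w, c in zip(weights, clusters):
            mean_c = fmean([r[j] for r in members[c]])
            score += w * (mean_c - mean_all) ** 2 / var_all
        scores.append(score)
    return scores


# Toy data: six points in a large cluster, two in a small one.
# Feature 0 separates the clusters; feature 1 has the same mean in both.
rows = [(0.0, 1.0), (0.1, 2.0), (0.0, 3.0), (0.1, 2.0),
        (0.0, 1.0), (0.1, 3.0), (5.0, 2.0), (5.1, 2.0)]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

weights = cluster_weights([6, 2])
scores = weighted_feature_importance(rows, labels)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

With inverse-size weights, the two-point cluster contributes three times as much to each score as the six-point cluster, so a feature that only distinguishes the small cluster is not drowned out by the large one. Any other monotone decreasing function of cluster size could be substituted for the weighting.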

