
An Improved IG Feature Selection Algorithm for Unbalanced Data Sets
Abstract In text classification, unequal sample counts across categories are a common and widely studied problem. Approaching the problem through feature selection optimization, this paper analyzes how four factors affect information gain (IG) feature selection: the frequency with which a feature item appears within a class, its dispersion within the class, its concentration between classes, and the differences among documents in an unbalanced data set. It introduces three weighting factors (within-class term frequency, within-class term-frequency dispersion, and between-class term-frequency concentration) into the traditional information gain feature selection model and proposes an improved IG feature selection method. Classification experiments were run with two algorithms, the support vector machine (SVM) and K-nearest neighbor (KNN). The experimental results show that, on unbalanced data sets, the improved feature selection method achieves better classification performance.
Author LUO Kuiyong (骆魁永) (Information Engineering College, Xinyang Agriculture and Forestry University, Xinyang 464000, China)
Source Journal of Xinyang Agriculture and Forestry University (《信阳农林学院学报》), 2021, Issue 4, pp. 114-118 (5 pages)
Funding Youth Fund Project of Xinyang Agriculture and Forestry University (20200115).
Keywords unbalanced data set; IG; feature item; feature selection
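
For orientation, the traditional information gain score that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration of the standard document-presence formulation, IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)], and not the paper's improved method: the three weighting factors are not defined in this abstract, so they are omitted. The function names and the toy corpus are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Baseline IG of `term`, using presence/absence of the term per document.

    docs   : list of token sets, one per document (hypothetical input format)
    labels : list of class labels, aligned with docs
    """
    classes = sorted(set(labels))
    with_term = Counter()
    without_term = Counter()
    for tokens, label in zip(docs, labels):
        if term in tokens:
            with_term[label] += 1
        else:
            without_term[label] += 1

    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with

    # H(C): entropy of the class distribution over all documents
    h_class = entropy([with_term[c] + without_term[c] for c in classes])
    # Conditional entropy of the class given term presence / absence
    h_cond = (n_with / n) * entropy([with_term[c] for c in classes]) \
        + (n_without / n) * entropy([without_term[c] for c in classes])
    return h_class - h_cond


# Toy corpus (illustrative only, not data from the paper)
docs = [
    {"goal", "match"},
    {"goal", "team"},
    {"stock", "market"},
    {"market", "team"},
]
labels = ["sports", "sports", "finance", "finance"]

ig_goal = information_gain(docs, labels, "goal")
ig_team = information_gain(docs, labels, "team")
```

In this toy corpus "goal" separates the two classes perfectly while "team" appears equally often in both, so the first term scores higher than the second. Because the baseline score depends only on document presence, it ignores how often and how evenly a term occurs within a class, which is the gap the weighting factors named in the abstract are meant to address.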
