An Improved IG Feature Selection Algorithm for Unbalanced Data Sets

Abstract: In text classification, unequal numbers of samples across categories are a common and widely studied problem. Starting from feature selection optimization, this paper analyzes how information gain (IG) feature selection is affected by the frequency of a feature item within a class, its dispersion within the class, its concentration between classes, and the differences among documents in an unbalanced data set. It introduces three weighting factors (within-class term frequency, within-class term frequency dispersion, and between-class term frequency concentration) to improve the traditional information gain feature selection model, and proposes an improved IG feature selection method. Classification experiments were conducted with two classifiers, the support vector machine (SVM) and K-nearest neighbor (KNN). The experimental results show that, on unbalanced data sets, the improved feature selection method achieves better classification performance.
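For orientation, the traditional IG score that the abstract takes as its baseline can be sketched as below. This is a minimal illustration of the textbook definition, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), computed from document-level presence or absence of a term. It is not the paper's improved method: this record does not give the formulas for the three weighting factors, so they are not reproduced here. The function names `entropy` and `information_gain` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Textbook IG of a term, based only on whether each document contains it.

    docs   -- list of token sets, one per document (assumed input format)
    labels -- list of class labels, aligned with docs
    """
    n = len(docs)
    # Class counts among documents that contain / do not contain the term.
    with_term = Counter(y for d, y in zip(docs, labels) if term in d)
    without_term = Counter(y for d, y in zip(docs, labels) if term not in d)

    # Start from the class entropy H(C), then subtract the weighted
    # conditional entropies of the two partitions.
    ig = entropy(list(Counter(labels).values()))
    for part in (with_term, without_term):
        size = sum(part.values())
        if size:
            ig -= size / n * entropy(list(part.values()))
    return ig


# Toy corpus (hypothetical): "a" separates the classes perfectly, "b" does not.
docs = [{"a", "b"}, {"a"}, {"b"}, {"c"}]
labels = ["x", "x", "y", "y"]
print(information_gain(docs, labels, "a"))  # 1.0
print(information_gain(docs, labels, "b"))  # 0.0
```

Because this baseline looks only at document-level presence, it ignores how often a term occurs within a class and how it is spread across classes, which are the aspects the abstract says the proposed weighting factors address.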

Author: LUO Kuiyong (Information Engineering College, Xinyang Agriculture and Forestry University, Xinyang 464000, China)

Affiliation: Information Engineering College, Xinyang Agriculture and Forestry University

Source: Journal of Xinyang Agriculture and Forestry University, 2021, No. 4, pp. 114-118 (5 pages)

Funding: Youth Fund Project of Xinyang Agriculture and Forestry University (20200115).

Keywords: unbalanced data set; IG; feature item; feature selection

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
