Abstract
Unequal numbers of samples across categories are a common and widely studied problem in text classification. Approaching the problem through feature selection, this paper analyzes how a feature item's within-class frequency, within-class dispersion, and between-class concentration, together with differences among documents in unbalanced data sets, affect information gain (IG) feature selection. It introduces three weighting factors (within-class term frequency, within-class term-frequency dispersion, and between-class term-frequency concentration) into the traditional information gain model and proposes an improved IG feature selection method. Classification experiments were conducted with two algorithms, support vector machine (SVM) and K-nearest neighbor (KNN). The experimental results show that the improved feature selection method achieves better classification performance on unbalanced data sets.
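For orientation, the baseline that the paper modifies can be sketched as follows. This is a minimal illustration of the standard, unweighted information gain of a term, computed from document-level presence or absence; it is not the paper's improved method, since the abstract does not give the formulas for the three weighting factors. The function names `entropy` and `information_gain` and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Standard IG of `term`: H(C) minus the expected entropy of C
    after splitting documents by whether they contain the term.

    docs   -- list of sets of terms (hypothetical toy representation)
    labels -- list of class labels, one per document
    """
    all_counts = Counter(labels)
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    # Counter subtraction keeps only positive counts.
    without_term = all_counts - with_term

    n = len(labels)
    n_with = sum(with_term.values())
    n_without = n - n_with

    h_class = entropy(all_counts.values())
    h_conditional = (
        (n_with / n) * entropy(with_term.values())
        + (n_without / n) * entropy(without_term.values())
    )
    return h_class - h_conditional


# Toy unbalanced corpus: three "sport" documents, one "politics" document.
docs = [{"ball", "team"}, {"ball"}, {"ball", "vote"}, {"vote"}]
labels = ["sport", "sport", "sport", "politics"]

for term in ("ball", "vote", "team"):
    print(term, round(information_gain(docs, labels, term), 4))
```

Because this baseline looks only at whether a term occurs in a document, it ignores how often the term occurs within a class and how evenly it is spread across that class's documents, which are the aspects the abstract says the added weighting factors address.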
Author
LUO Kuiyong (骆魁永), Information Engineering College, Xinyang Agriculture and Forestry University, Xinyang 464000, China
Source
Journal of Xinyang Agriculture and Forestry University (《信阳农林学院学报》)
2021, No. 4, pp. 114-118 (5 pages)
Funding
Youth Fund Project of Xinyang Agriculture and Forestry University (20200115).
Keywords
unbalanced data set
IG
feature item
feature selection