Abstract
Feature selection algorithms strongly affect the precision of text classification systems. The traditional information gain feature selection algorithm tends to select features that occur rarely in the designated category but frequently in other categories. To overcome this shortcoming, and building on an in-depth analysis of the traditional algorithm and related improved algorithms, we introduce a feature distribution difference factor together with intra-category and inter-category weighting factors, and propose an improved information gain algorithm based on feature distribution weighting. We evaluate it with two classifiers, naive Bayes and the support vector machine. Experimental results show that the proposed algorithm outperforms the other improved algorithms.
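As a rough illustration of the shortcoming described in the abstract, the sketch below computes the textbook information gain of a term from per-category document counts. It is not the paper's improved algorithm, whose weighting factors are not given in this record. The function names and the document counts are hypothetical and chosen only for the example.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(present, absent):
    """Textbook information gain of a term.

    present[i]: number of documents in category i that contain the term
    absent[i]:  number of documents in category i that do not contain it
    """
    n_present = sum(present)
    n_absent = sum(absent)
    n = n_present + n_absent
    class_totals = [p + a for p, a in zip(present, absent)]
    # IG(t) = H(C) - P(t) * H(C | t) - P(not t) * H(C | not t)
    return (
        entropy(class_totals)
        - (n_present / n) * entropy(present)
        - (n_absent / n) * entropy(absent)
    )


# Hypothetical corpus: 100 documents in the designated category (index 0)
# and 100 documents in one other category (index 1).
# Term A is frequent in the designated category and rare elsewhere.
ig_a = information_gain(present=[90, 10], absent=[10, 90])
# Term B is rare in the designated category and frequent elsewhere.
ig_b = information_gain(present=[10, 90], absent=[90, 10])
# A term spread evenly over both categories carries no information.
ig_even = information_gain(present=[50, 50], absent=[50, 50])

print(ig_a, ig_b, ig_even)
```

In this toy setting the two terms receive the same score, because plain information gain measures only how unevenly a term is distributed across categories, not which category it favors. That symmetry is one way to see why a term that is rare in the designated category can still be selected.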
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
Peking University Core Journal (北大核心)
2013, Issue 8, pp. 139-142 (4 pages)
Funding
Basic and Frontier Technology Research Program of the Science and Technology Department of Henan Province (122300410281)
Keywords
Text classification
Feature selection
Information gain
Feature distribution weighting