Abstract
Feature quality directly affects classification accuracy in text categorization. Focusing on the feature extraction stage, this work implements an improved feature extraction method based on the Gini index, named Gini, and proposes a feature extraction method for Chinese text that fuses global and local feature extraction. When six feature extraction methods (MI, IG, CE, WET, Gini, and χ2) were used in SVM classification experiments, Gini showed strong global feature extraction ability, while χ2 was well suited to local feature extraction. When Gini and χ2 were fused for feature extraction, the combined method showed stronger feature extraction ability and clearly outperformed both global-only and local-only extraction.
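The abstract does not give the scoring formulas or the fusion rule, so the following is only a minimal sketch of how a global Gini score and a per-class χ2 score might be combined. It assumes a commonly used Gini-index form for text, Gini(t) = Σ_c P(t|c)² · P(c|t)², the standard 2×2 contingency χ2 statistic, and a simple union of the top-ranked terms from each scorer. The function names and the parameters `k_global` and `k_local` are hypothetical and are not taken from the paper.

```python
import numpy as np

def gini_scores(X, y, n_classes):
    """Global score per term, assumed form: sum_c P(t|c)^2 * P(c|t)^2."""
    X = (np.asarray(X) > 0).astype(float)  # document-level term presence
    y = np.asarray(y)
    # df[c, t] = number of class-c documents that contain term t
    df = np.vstack([X[y == c].sum(axis=0) for c in range(n_classes)])
    class_sizes = np.array([(y == c).sum() for c in range(n_classes)], dtype=float)
    p_t_given_c = df / np.maximum(class_sizes[:, None], 1.0)
    p_c_given_t = df / np.maximum(df.sum(axis=0, keepdims=True), 1.0)
    return (p_t_given_c ** 2 * p_c_given_t ** 2).sum(axis=0)

def chi2_scores(X, y, n_classes):
    """Local score chi2[c, t] from the 2x2 term/class contingency table."""
    X = (np.asarray(X) > 0).astype(float)
    y = np.asarray(y)
    n_docs = X.shape[0]
    scores = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        in_c = y == c
        a = X[in_c].sum(axis=0)      # in class c, contains term
        b = X[~in_c].sum(axis=0)     # outside class c, contains term
        c_miss = in_c.sum() - a      # in class c, lacks term
        d = (~in_c).sum() - b        # outside class c, lacks term
        denom = (a + c_miss) * (b + d) * (a + b) * (c_miss + d)
        stat = n_docs * (a * d - c_miss * b) ** 2 / np.maximum(denom, 1.0)
        scores[c] = np.where(denom > 0, stat, 0.0)
    return scores

def fused_features(X, y, n_classes, k_global, k_local):
    """Assumed fusion rule: union of top global terms and top terms per class."""
    g = gini_scores(X, y, n_classes)
    chi = chi2_scores(X, y, n_classes)
    selected = set(np.argsort(-g)[:k_global].tolist())
    for c in range(n_classes):
        selected.update(np.argsort(-chi[c])[:k_local].tolist())
    return sorted(selected)
```

The returned column indices could then be used to reduce the document-term matrix before training a classifier such as an SVM. Other fusion rules (for example, a weighted sum of normalized scores) are equally plausible readings of the abstract.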
Source
Journal of Hebei North University: Natural Science Edition (《河北北方学院学报(自然科学版)》)
2013, No. 3, pp. 35-38 (4 pages)
Keywords
Gini index
feature extraction
text categorization