Abstract
In a hierarchical classification setting, we first compare six commonly used feature selection algorithms experimentally: document frequency, information gain, expected cross entropy, the χ² statistic, weight of evidence for text, and mutual information. Mutual information yields the worst classification performance. We then analyze the reason and, on that basis, propose an improved mutual information algorithm. Experimental results show that the improved algorithm outperforms the others. Removing single-character words also improves classification performance, which suggests that word-level features express semantic information more completely.
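The abstract gives neither the scoring formulas nor the authors' analysis, so the following is only a rough illustration. It uses the textbook definitions of pointwise mutual information and the χ² statistic over a term/category contingency table. The toy corpus, the category labels, and all function names are assumptions made for this sketch and do not come from the paper. It shows one commonly cited weakness of mutual information, its tendency to reward rare terms, which may or may not be the cause the paper identifies. The improved algorithm is not reproduced here.

```python
import math


def term_category_counts(docs, labels, term, category):
    """Return the contingency counts (a, b, c, d) for a term and a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    return a, b, c, d


def mutual_information(a, b, c, d):
    """Textbook pointwise MI: log(P(t, c) / (P(t) * P(c)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log(a * n / ((a + b) * (a + c)))


def chi_square(a, b, c, d):
    """Textbook chi-square statistic for a 2x2 term/category table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"rare", "goal"}, {"match", "goal"}, {"match", "team"}, {"match", "team"},
    {"match", "stock"}, {"stock", "bank"}, {"bank", "rate"}, {"rate", "stock"},
]
labels = ["sports"] * 4 + ["finance"] * 4

for term in ("rare", "match"):
    counts = term_category_counts(docs, labels, term, "sports")
    print(term, counts, mutual_information(*counts), chi_square(*counts))
```

In this toy data, "rare" occurs in a single sports document while "match" occurs in three sports documents and one finance document. Mutual information scores "rare" higher, whereas χ² scores "match" higher, so the two criteria can rank the same terms differently.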
Source
《情报学报》
CSSCI
Peking University Core Journal
2006, Issue 6, pp. 651-656 (6 pages)
Journal of the China Society for Scientific and Technical Information
Keywords
hierarchical classification
feature selection
mutual information
improved mutual information