Abstract
Feature selection is the core problem in neural network (NN) based Chinese text categorization. It involves two questions: which features to select, and how large the dimension of the feature space should be. To address both, this paper proposes a feature selection method that combines Information Gain (IG) with Principal Component Analysis (PCA). Experiments compare the categorization performance of different feature selection methods and different feature dimensions. The results demonstrate the superiority of the proposed method for NN-based Chinese text categorization and show that the network performs best when the input feature dimension is around 200.
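The abstract does not say how IG and PCA are combined. As a rough illustration only, the sketch below assumes IG is used first to filter terms and PCA then compresses the surviving columns before they reach the network. The function names (`information_gain`, `select_by_ig`, `pca`), the binary term-presence encoding, the toy corpus, and the use of power iteration for PCA are all assumptions made for this example, not the paper's implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """IG of a binary term-presence feature: H(C) - H(C | term)."""
    n = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [y for x, y in zip(column, labels) if x == value]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

def select_by_ig(X, labels, k):
    """Indices of the k columns of X with the highest information gain."""
    columns = list(zip(*X))
    scores = [information_gain(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda j: -scores[j])[:k]

def pca(X, n_components, iters=200):
    """Project rows of X onto the top principal components.

    Uses power iteration with deflation on the sample covariance matrix
    (an illustrative choice; any eigensolver would do).
    """
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Z = [[row[j] - means[j] for j in range(d)] for row in X]
    cov = [[sum(z[a] * z[b] for z in Z) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for c in range(n_components):
        # Deterministic, normalized start vector.
        v = [1.0 / (j + 1 + c) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in this direction
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                  for a in range(d))
        components.append(v)
        # Deflate: remove the found component from the covariance matrix.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return [[sum(z[j] * comp[j] for j in range(d)) for comp in components]
            for z in Z]

# Toy corpus (hypothetical): rows are documents, columns are term presence.
X = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
]
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]

selected = select_by_ig(X, labels, k=2)               # step 1: IG filter
reduced = [[row[j] for j in selected] for row in X]
projected = pca(reduced, n_components=1)              # step 2: PCA compression
```

In a setting like the one the abstract describes, `k` would be far larger and `n_components` would be the network's input dimension (the abstract reports the best results at around 200); the tiny values here only keep the example readable.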
Source
《计算机应用研究》
CSCD
Peking University Core Journal (北大核心)
2006, No. 7, pp. 161-164 (4 pages)
Application Research of Computers
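Funding
National "863" Program funded project (2002AA117010-10)
2005 Ministry of Education Science and Technology Infrastructure Platform Construction Project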
Keywords
Text Categorization
Neural Network (NN)
Principal Component Analysis (PCA)
Feature Selection