Abstract
This paper proposes a hybrid dimensionality reduction method for text features that combines feature selection and feature extraction. After analyzing the respective characteristics of selection-based and extraction-based reduction, the method first makes a preliminary selection from the feature set using information about how differently each term is distributed across categories. A new feature extraction method based on Principal Component Analysis (PCA) is then applied to the remaining features as a second reduction stage, achieving effective dimensionality reduction of text features while minimizing information loss. Text classification experiments show that the method yields good classification performance.
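The two-stage idea in the abstract (select terms by category distribution, then apply PCA to what remains) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract names neither the scoring function nor the details of the new PCA variant, so the sketch assumes a simple score (variance of a term's mean frequency across categories) and standard PCA computed by SVD. The function names `class_distribution_scores` and `select_then_extract` and the toy matrix are hypothetical.

```python
import numpy as np


def class_distribution_scores(X, y):
    """Score each term by how unevenly it is spread across categories.

    Assumption: the score is the variance of the term's mean frequency
    across categories. The abstract does not specify its scoring function.
    """
    y = np.asarray(y)
    classes = np.unique(y)
    # One row per category, one column per term.
    class_means = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return class_means.var(axis=0)


def select_then_extract(X, y, n_select, n_components):
    """Stage 1: keep the n_select highest-scoring terms.
    Stage 2: project the kept terms onto n_components principal directions.

    Assumption: stage 2 is plain PCA via SVD, standing in for the
    unspecified PCA-based method described in the abstract.
    """
    scores = class_distribution_scores(X, y)
    keep = np.sort(np.argsort(scores)[::-1][:n_select])
    X_sel = X[:, keep]

    centered = X_sel - X_sel.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, keep, components


# Toy document-term matrix: 6 documents, 5 terms, 2 categories.
X = np.array([
    [3, 0, 1, 1, 2],
    [4, 0, 1, 0, 2],
    [5, 1, 1, 1, 2],
    [0, 3, 1, 0, 2],
    [0, 4, 1, 1, 2],
    [1, 5, 1, 1, 2],
], dtype=float)
y = [0, 0, 0, 1, 1, 1]

Z, keep, components = select_then_extract(X, y, n_select=2, n_components=1)
print(keep, Z.shape)
```

In this toy example only the first two terms differ between categories, so stage 1 keeps them and stage 2 compresses them into a single component that separates the two categories. Any other category-sensitive score, such as chi-square or information gain, could be substituted in stage 1.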
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
Peking University Core Journals (北大核心)
2009, No. 2, pp. 194-196 (3 pages)
Funding
National Natural Science Foundation of China (Grant No. 70571087)
Keywords
text classification
feature selection
feature extraction
Principal Component Analysis (PCA)