Abstract
Reducing the dimensionality of high-dimensional feature sets is an essential step in text categorization. After a survey of existing dimension reduction techniques and a brief analysis of several commonly used feature selection methods, a new feature selection method, named CDF, is proposed that combines inter-class concentration, intra-class dispersion, and intra-class average frequency. Experiments use the K-Nearest Neighbor (KNN) classification algorithm to evaluate the effectiveness of CDF. The results show that the method is simple and effective and achieves better dimension reduction performance than conventional feature selection methods.
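The abstract names the three statistics that CDF combines (concentration among classes, dispersion within a class, and average frequency within a class) but does not give their formulas. The following is only a minimal sketch of how such a score could be computed. The three definitions, their combination by multiplication, the use of the maximum over classes, and the function names `cdf_like_scores` and `select_top_k` are all assumptions made for illustration, not the paper's actual method.

```python
from collections import Counter, defaultdict


def cdf_like_scores(docs, labels):
    """Score terms with an ASSUMED combination of three per-class statistics.

    docs   : list of token lists, one per document
    labels : list of class labels, aligned with docs
    """
    n_docs_in_class = Counter(labels)
    df = defaultdict(Counter)  # df[term][cls]: documents of cls containing term
    tf = defaultdict(Counter)  # tf[term][cls]: total occurrences of term in cls

    for tokens, cls in zip(docs, labels):
        for term, n in Counter(tokens).items():
            df[term][cls] += 1
            tf[term][cls] += n

    scores = {}
    for term in df:
        total_df = sum(df[term].values())
        best = 0.0
        for cls, d in df[term].items():
            # Assumed definitions (not taken from the paper):
            concentration = d / total_df                     # share of the term's documents in cls
            dispersion = d / n_docs_in_class[cls]            # share of cls documents containing the term
            avg_freq = tf[term][cls] / n_docs_in_class[cls]  # mean occurrences per cls document
            best = max(best, concentration * dispersion * avg_freq)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus, invented for illustration only.
docs = [
    ["goal", "match", "team"],
    ["goal", "goal", "team"],
    ["stock", "market", "team"],
    ["stock", "stock", "market"],
]
labels = ["sports", "sports", "finance", "finance"]

scores = cdf_like_scores(docs, labels)
print(select_top_k(scores, 3))  # ['goal', 'stock', 'market']
```

Under these assumed definitions, a term such as "team", which appears in both classes, scores lower than class-specific terms such as "goal" or "stock". The retained terms would then form the reduced feature space passed to a classifier such as KNN.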
Source
《计算机应用》
CSCD
Peking University Core Journal (北大核心)
2009, Issue 7, pp. 1755-1757 (3 pages)
Journal of Computer Applications
Funding
China Postdoctoral Science Foundation (20070420711)
Natural Science Foundation Program of the Chongqing Science and Technology Commission (2007BB2372)
Keywords
text categorization
dimension reduction
feature selection
K-Nearest Neighbor (KNN) algorithm
evaluation function