Abstract
A key issue in text categorization is how to reduce the huge feature dimensionality of text while maintaining, or even improving, classification accuracy. To address this problem, a feature re-extraction method based on information theory is proposed; it aims to eliminate sparsely distributed features and retain the features useful for categorization. Combined with feature selection methods, it can reduce the feature dimensionality even further. The method was tested on benchmark text classification problems, and the results show that it can not only reduce the feature dimensionality to a few hundred but also improve classifier performance.
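The abstract names several information-theoretic measures (entropy, mutual information, information gain, the chi-square statistic) but does not spell out the re-extraction procedure itself. As a minimal sketch of one measure from that family, the Python snippet below ranks the terms of a bag-of-words matrix by information gain; the function names and the toy corpus are illustrative assumptions, not the paper's method.

import numpy as np

def entropy(p):
    # Shannon entropy (in bits) of a probability vector; zero entries are skipped.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(X, y):
    # Score each term t by IG(t) = H(C) - [P(t)H(C|t) + P(not t)H(C|not t)],
    # where X is a (documents x terms) count matrix and y holds class labels.
    X = np.asarray(X) > 0                  # binarize: does term t occur in the document?
    y = np.asarray(y)
    classes = np.unique(y)
    h_class = entropy(np.array([(y == c).mean() for c in classes]))  # H(C)
    scores = np.empty(X.shape[1])
    for t in range(X.shape[1]):
        present = X[:, t]
        cond = 0.0
        for mask in (present, ~present):   # documents with / without term t
            p = mask.mean()
            if p > 0:
                dist = np.array([(y[mask] == c).mean() for c in classes])
                cond += p * entropy(dist)  # weighted conditional entropy
        scores[t] = h_class - cond
    return scores

# Toy usage (hypothetical data): rank 5 terms over 4 documents, keep the top 2.
X = np.array([[2, 0, 1, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 3, 0, 1, 0],
              [0, 1, 0, 2, 0]])
y = np.array([0, 0, 1, 1])
top_k = np.argsort(information_gain(X, y))[::-1][:2]
print("indices of the two most informative terms:", top_k)

In practice such a score would be computed for the full vocabulary and only the highest-scoring few hundred terms kept, matching the dimensionality reported in the abstract.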
Source
Journal of Shandong University (Engineering Science) (《山东大学学报(工学版)》)
Indexed in: CAS; Peking University Core Journals (北大核心)
2010, No. 4, pp. 8-11, 18 (5 pages)
Funding
Natural Science Foundation of Shandong Province (Q2008G06)
Scientific Research Foundation for Returned Overseas Chinese Scholars, Ministry of Education
Independent Innovation Foundation of Shandong University (2009TS033)
Keywords
text categorization
feature selection
entropy
mutual information
information gain
chi-square statistics