A new feature selection method for text categorization (cited by: 3)
Abstract: A central problem in text categorization is how to reduce the very high dimensionality of the text feature space while maintaining, or even improving, classification accuracy. To address this problem, a feature re-extraction method based on information theory is proposed. It aims to eliminate sparsely distributed features and retain features that are useful for categorization. Used together with existing feature selection methods, it can further reduce the feature dimension. The method was tested on benchmark text classification problems. The results show that it can reduce the feature dimension to a few hundred while also improving classifier performance.
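Source: Journal of Shandong University (Engineering Science), 2010, No. 4, pp. 8-11, 18 (5 pages in total). Indexed in CAS and the Peking University Core Journals list.
Funding: Shandong Provincial Natural Science Foundation (Q2008G06); Scientific Research Foundation for Returned Overseas Scholars, Ministry of Education; Independent Innovation Foundation of Shandong University (2009TS033)
Keywords: text categorization; feature selection; entropy; mutual information; information gain; chi-square statistics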
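
The record does not give the formulas of the proposed re-extraction method, so the following is only a minimal sketch of one of the standard criteria named in the keywords, information gain, combined with a simple document-frequency cutoff. The cutoff is a stand-in assumption for "eliminating sparsely distributed features", not the paper's information-theoretic criterion, and the function names and the `min_df` parameter are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)].

    docs is a list of token sets, labels the matching class labels.
    """
    n = len(docs)
    with_t = Counter(l for d, l in zip(docs, labels) if term in d)
    without_t = Counter(l for d, l in zip(docs, labels) if term not in d)
    n_t = sum(with_t.values())
    h_c = entropy(list(Counter(labels).values()))
    h_cond = (n_t / n) * entropy(list(with_t.values())) \
        + ((n - n_t) / n) * entropy(list(without_t.values()))
    return h_c - h_cond


def select_features(docs, labels, k, min_df=2):
    """Drop terms seen in fewer than min_df documents (assumed sparsity
    filter), then keep the k terms with the highest information gain."""
    df = Counter(t for d in docs for t in d)
    candidates = [t for t, c in df.items() if c >= min_df]
    # Sort by descending gain; ties are broken alphabetically.
    ranked = sorted(candidates,
                    key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]
```

Calling `select_features(docs, labels, k)` on a tokenized corpus returns the retained vocabulary, which would then be used to build the reduced document vectors passed to a classifier.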