
New feature selection approach (CDF) for text categorization

Cited by: 11
Abstract: Reducing the high dimensionality of feature sets is an essential step in text categorization. After reviewing existing dimension reduction techniques and briefly analyzing several commonly used feature selection methods, the paper proposes a new feature selection approach, named CDF, that combines concentration among classes, distribution within a class, and average frequency within a class. The K-Nearest Neighbor (KNN) classifier is used to evaluate the effectiveness of CDF. The experimental results indicate that CDF is simple and effective, and that it achieves better dimension reduction performance than conventional feature selection methods.
Source: Journal of Computer Applications (《计算机应用》), CSCD, PKU Core (北大核心), 2009, No. 7, pp. 1755-1757 (3 pages)
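Funding: China Postdoctoral Science Foundation (20070420711); Natural Science Foundation Program of the Chongqing Science and Technology Commission (2007BB2372)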
Keywords: text categorization; dimension reduction; feature selection; K-Nearest Neighbor (KNN) algorithm; evaluation function
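Illustrative sketch: this record does not reproduce the paper's CDF evaluation function, so the code below is only one hypothetical way to combine the three quantities named in the abstract. The function names (`cdf_like_scores`, `select_top_k`), the definition of each factor, and the choice to multiply them and take the maximum over classes are all assumptions made for illustration, not the authors' formula.

```python
from collections import Counter, defaultdict


def cdf_like_scores(docs, labels):
    """Score terms by a product of three per-class statistics (illustrative only)."""
    n_docs_in_class = Counter(labels)
    df = Counter()                      # documents containing the term, all classes
    df_in_class = defaultdict(Counter)  # class -> term -> documents containing the term
    tf_in_class = defaultdict(Counter)  # class -> term -> total occurrences

    for tokens, label in zip(docs, labels):
        for term, freq in Counter(tokens).items():
            df[term] += 1
            df_in_class[label][term] += 1
            tf_in_class[label][term] += freq

    scores = {}
    for term in df:
        best = 0.0
        for label, n_c in n_docs_in_class.items():
            d = df_in_class[label][term]
            if d == 0:
                continue
            # Assumed definitions; the paper's own formulas are not given here.
            concentration = d / df[term]               # share of the term's documents in this class
            dispersion = d / n_c                       # share of this class's documents with the term
            avg_freq = tf_in_class[label][term] / n_c  # occurrences per document in this class
            best = max(best, concentration * dispersion * avg_freq)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (made up for this example).
docs = [
    ["goal", "match", "goal", "team"],
    ["team", "match", "coach"],
    ["stock", "market", "team"],
    ["stock", "price", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]

scores = cdf_like_scores(docs, labels)
top_terms = select_top_k(scores, 3)
print(top_terms)
```

In this toy corpus, a term such as "team" that appears in both classes is penalized by the concentration factor, while a term confined to one class and frequent within it scores higher. The selected terms could then serve as the reduced feature set for a KNN classifier, which is the evaluation setup the abstract describes.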


