Research on Feature Selection Methods for Text Categorization on Skewed Datasets (cited by: 4)

Feature Selection for Skewed Text Categorization
Abstract: On skewed text datasets, where the number of samples differs greatly between categories, traditional feature selection methods draw the vast majority of their selected features from the majority classes. The reason is that these methods select features without considering the relative distribution of each class. A classifier trained on such features favors the majority classes and neglects the minority classes, which directly degrades classification performance. This paper analyzes the characteristics of skewed text datasets and identifies two important factors that influence feature selection on such data: category distribution and category difference. The category distribution factor reflects the difference in a term's category frequency across the whole dataset, and the category difference factor reflects the difference in a term's relative document frequency between categories. Based on these two factors, the paper constructs a new feature selection function, Relative Category Difference (RCD), that is particularly suited to skewed text categorization. Comparative experiments against traditional feature selection methods show that RCD gives better results on skewed text categorization.
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2014, Issue 2, pp. 116-121 (6 pages)
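Funding: National 242 Information Security Program (2010A007); National 863 Program (2011AA01A203); National Natural Science Foundation of China (60903047, 61272361); Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030200)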
Keywords: text categorization; skewed dataset; feature selection; category difference
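
The abstract does not state the RCD formula, so the following Python sketch does not implement it. It only illustrates the underlying observation, under assumptions of our own: on a skewed corpus, ranking terms by raw document frequency favors majority-class terms, while comparing each term's relative (per-class) document frequency lets a minority-class term rank just as high. The function names, the toy corpus, and the max-minus-min score are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def raw_df_score(docs):
    """Raw document frequency: number of documents containing each term."""
    df = Counter()
    for _, terms in docs:
        df.update(set(terms))
    return dict(df)


def relative_doc_freq(docs):
    """For each term, the fraction of each class's documents that contain it."""
    class_sizes = Counter(label for label, _ in docs)
    counts = defaultdict(Counter)
    for label, terms in docs:
        for term in set(terms):
            counts[term][label] += 1
    return {
        term: {c: counts[term][c] / class_sizes[c] for c in class_sizes}
        for term in counts
    }


def relative_difference_score(docs):
    """Illustrative score (NOT the paper's RCD): spread of the per-class
    relative document frequencies of each term."""
    rel = relative_doc_freq(docs)
    return {t: max(r.values()) - min(r.values()) for t, r in rel.items()}


# Toy skewed corpus (hypothetical): 8 majority-class docs, 2 minority-class docs.
docs = (
    [("major", {"market", "report"})] * 6
    + [("major", {"market"})] * 2
    + [("minor", {"virus", "report"})] * 2
)

raw = raw_df_score(docs)
rel_diff = relative_difference_score(docs)
for term in sorted(raw):
    print(term, raw[term], rel_diff[term])
```

In this toy corpus the minority-only term "virus" has the lowest raw document frequency, yet its relative-difference score equals that of the majority-only term "market", while "report", which appears in both classes, scores lowest. This is only the intuition behind the category difference factor; the paper's function also incorporates the category distribution factor.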
