
Feature selection based on class distribution difference and term entropy for Uyghur text

Cited by: 5
Abstract: Text feature selection is the most important stage in automatic text categorization. To better address the high dimensionality of the feature space and the sparseness of document vectors in Uyghur text categorization, this paper proposes a Uyghur text feature selection method based on the class distribution difference and the information entropy of terms. The method considers not only how a term is distributed across classes (inter-class distribution) but also how it is distributed within each class (intra-class distribution). Categorization experiments were conducted on a Uyghur text corpus with the proposed method and compared against several traditional feature selection methods. The results show that, with fewer selected features, the proposed method reaches a MacroF1 value of 85.3%, higher than the other methods, and exceeds the traditional IG and CHI methods in MacroF1 by 4.3% and 6.1%, respectively.
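The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea it describes, not the authors' method. It assumes a hypothetical score that multiplies an inter-class term (how much more often a term occurs in its strongest class than in the other classes) by an intra-class term (the normalized entropy of the term's counts over the documents of that class, so that terms spread evenly within the class score higher). The function name `term_scores` and both component definitions are illustrative assumptions.

```python
import math
from collections import defaultdict


def term_scores(docs, labels):
    """Score terms by inter-class difference times intra-class evenness.

    docs: list of token lists; labels: class label for each document.
    Hypothetical scoring rule for illustration only; it is NOT the
    formula from the paper, which the abstract does not state.
    """
    by_class = defaultdict(list)
    for tokens, label in zip(docs, labels):
        by_class[label].append(tokens)

    vocab = {t for tokens in docs for t in tokens}
    scores = {}
    for term in vocab:
        # Inter-class part: document-frequency rate of the term per class,
        # then the gap between the strongest class and the mean of the rest.
        rates = {c: sum(term in d for d in ds) / len(ds)
                 for c, ds in by_class.items()}
        top = max(rates, key=rates.get)
        others = [r for c, r in rates.items() if c != top]
        inter = rates[top] - (sum(others) / len(others) if others else 0.0)

        # Intra-class part: normalized entropy of the term's counts over
        # the documents of the strongest class (1.0 = perfectly even).
        counts = [d.count(term) for d in by_class[top]]
        total = sum(counts)  # > 0 because the term occurs in class `top`
        n = len(counts)
        if n > 1:
            h = -sum((k / total) * math.log(k / total) for k in counts if k)
            intra = h / math.log(n)
        else:
            intra = 1.0

        scores[term] = inter * intra
    return scores
```

Under these assumptions, keeping the top-k terms by score would give the reduced feature set; a term that appears equally often in every class gets an inter-class gap of zero and is therefore ranked last.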
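Source: Application Research of Computers (《计算机应用研究》), indexed in CSCD and the Peking University Core Journals list, 2013, No. 10, pp. 2958-2961 (4 pages)
Funding: National Natural Science Foundation of China (grants 61063026, 61063043, 61163028, 61262060)
Keywords: feature selection; text categorization; term entropy; support vector machine (SVM); Uyghur language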
