
Feature Selection Method Based on Category Weighting and Variance Statistics (cited by: 11)
Abstract: To improve the accuracy and stability of text classification on unbalanced datasets, a joint feature selection method based on category weighting and variance statistics is proposed. First, because the number of documents in a category influences feature selection, a category-weighting strategy assigns larger weights to rare categories, so that the features characterizing those categories are strengthened and performance on them improves. Second, building on an examination of how well features discriminate between categories, a category variance statistics strategy is designed to highlight features that carry rich category information. Finally, the two strategies are merged into a new feature selection algorithm combined with Information Gain (IG) and the χ²-statistic (CHI). Experiments on two unbalanced corpora, Reuters-21578 and the Fudan University corpus, show that the algorithm is effective: it achieves better MicroF1 and MacroF1 than the widely used general-purpose methods IG, CHI and DFICF, and the improvement is especially large on small categories.
Source: Journal of Beijing University of Technology (《北京工业大学学报》), indexed in CAS, CSCD and the Peking University Core Journals list; 2014, No. 10, pp. 1593-1602 (10 pages)
Funding: National Natural Science Foundation of China (61375059)
Keywords: text categorization; unbalanced datasets; feature selection method; category weighting; variance statistics
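
The abstract names two ingredients, a weight that favors small categories and a variance statistic over categories, but does not give their formulas. The following Python sketch shows one way such a score could be assembled. Everything in it is an assumption made for illustration: the function name `score_features`, the IDF-like weight `log(1 + N / N_c)`, and the weighted squared deviation of per-category document rates. It is not the paper's algorithm and omits the combination with IG and CHI.

```python
import math
from collections import Counter, defaultdict


def score_features(docs, labels):
    """Score terms by a category-weighted variance of per-category document rates.

    docs   -- list of token lists, one per document
    labels -- list of category labels, aligned with docs
    Hypothetical illustration only; the formulas are assumptions, not the paper's.
    """
    n_docs = len(docs)
    cat_sizes = Counter(labels)

    # Assumed category weight: the fewer documents a category has, the larger its weight.
    cat_weight = {c: math.log(1 + n_docs / size) for c, size in cat_sizes.items()}

    # doc_freq[term][category] = number of documents of that category containing the term.
    doc_freq = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term][label] += 1

    scores = {}
    for term, per_cat in doc_freq.items():
        # Fraction of each category's documents that contain the term.
        rates = {c: per_cat[c] / cat_sizes[c] for c in cat_sizes}
        mean_rate = sum(rates.values()) / len(rates)
        # Squared deviations from the mean rate, each scaled by its category weight.
        # A term spread evenly over all categories scores 0.
        scores[term] = sum(
            cat_weight[c] * (rates[c] - mean_rate) ** 2 for c in cat_sizes
        )
    return scores
```

On a toy corpus with one large and one small category, a term that appears in every document scores zero under this sketch, while a term confined to the small category scores higher than one confined to part of the large category. Selecting the top-scoring terms would then serve as the feature set.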
