
A Study on Constraints for Feature Selection in Text Categorization

Cited by: 26
Abstract

Feature selection plays an important role in text categorization. Feature selection methods such as document frequency (DF), information gain (IG) and mutual information (MI) are widely used in text categorization. Existing experimental results show that IG is one of the most effective feature selection algorithms, that DF is slightly worse, and that MI performs relatively poorly. In text categorization, the performance of existing feature selection functions has been evaluated only through experimental validation, that is, on a purely empirical basis. This paper therefore proposes a method for qualitatively evaluating the performance of feature selection functions and defines a set of basic constraints related to category information. Analysis and experiments show that IG fully satisfies these constraints, DF does not fully satisfy them, and MI conflicts with them. In other words, the experimental performance of a feature selection algorithm is closely related to whether it satisfies these constraints.

Text categorization (TC) is the process of grouping texts into one or more predefined categories based on their content. Due to the increased availability of documents in digital form and the rapid growth of online information, TC has become a key technique for handling and organizing text data. One of the most important issues in TC is feature selection (FS). Many FS methods have been put forward and widely used in the TC field, such as information gain (IG), document frequency thresholding (DF) and mutual information (MI). Empirical studies show that some of these (e.g. IG, DF) produce better categorization performance than others (e.g. MI). A basic research question is why these FS methods perform differently. Many existing works seek to answer this question through empirical studies. In this paper, a theoretical performance evaluation function for FS methods in text categorization is put forward. Some basic desirable constraints that any reasonable FS function should satisfy are defined, and several popular FS methods, including IG, DF and MI, are then checked against these constraints. It is found that IG satisfies these constraints, that there are strong statistical correlations between DF and the constraints, and that MI does not satisfy the constraints. Experimental results on the Reuters 21578 and OHSUMED corpora show that the empirical performance of a feature selection method is tightly related to how well it satisfies these constraints.
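As a rough illustration of the three scoring functions named in the abstract, the sketch below computes DF, IG and MI for one term and one category from a 2x2 document-count table. It uses the common textbook definitions (IG as the entropy reduction of the category given term presence, MI as pointwise mutual information between term presence and category); these are assumptions, not necessarily the exact formulations used in the paper, and the paper's constraints themselves are not reproduced here. The function name `feature_scores` and all counts are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (in nats) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(k / total * math.log(k / total) for k in counts if k > 0)


def feature_scores(a, b, c, d):
    """Score one term against one category from document counts.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    Returns (df, ig, mi).
    """
    n = a + b + c + d
    # Document frequency: how many documents contain the term.
    df = a + b
    # Information gain: H(C) - H(C | term present / absent).
    h_c = entropy([a + c, b + d])
    h_c_given_t = (a + b) / n * entropy([a, b]) + (c + d) / n * entropy([c, d])
    ig = h_c - h_c_given_t
    # Pointwise mutual information: log P(t, c) / (P(t) P(c)).
    mi = math.log(a * n / ((a + b) * (a + c))) if a > 0 else float("-inf")
    return df, ig, mi


# Hypothetical toy corpus: 100 documents, 50 in the category.
# A common, fairly discriminative term versus a term seen in a single document.
df_common, ig_common, mi_common = feature_scores(40, 10, 10, 40)
df_rare, ig_rare, mi_rare = feature_scores(1, 0, 49, 50)

print(f"common term: DF={df_common}, IG={ig_common:.4f}, MI={mi_common:.4f}")
print(f"rare term:   DF={df_rare}, IG={ig_rare:.4f}, MI={mi_rare:.4f}")
```

With these made-up counts, DF and IG both rank the common term above the rare one, while pointwise MI ranks the single-document term higher. This sensitivity of MI to rare terms is a widely reported property and may be one reason it behaves differently from IG and DF in practice.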
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD and the Peking University Core Journals list, 2008, No. 4, pp. 596-602 (7 pages)
Funding: National Natural Science Foundation of China (60473002, 60603094); Beijing Natural Science Foundation (4051004)
Keywords: feature selection; text categorization; information retrieval; information gain; mutual information

