A New Text Classification Algorithm Based on the Labeled-LDA Model (cited by: 103)

Text Classification Based on Labeled-LDA Model
Abstract: Latent Dirichlet Allocation (LDA) is a recently proposed unsupervised learning model that extracts latent topics from text. This paper proposes Labeled-LDA, which extends the traditional LDA model by incorporating class label information. With Labeled-LDA, the number of latent topics allocated to each class is computed jointly across all classes, which avoids the forced allocation of latent topics that the traditional LDA model imposes when it is used as a component of a classification framework. Experiments comparing the new algorithm with the traditional LDA model show that it improves text classification performance: micro-F1 increases by about 5.7% on the Fudan University Chinese corpus and by about 3% on the comp subset of the English 20newsgroup corpus.
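Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2008, Issue 4, pp. 620-627 (8 pages)
Funding: Supported by the National Natural Science Foundation of China (60773027), the Key Program of the National Natural Science Foundation of China (60736044), and the Key Project of the National High Technology Research and Development Program of China ("863" Program) (2006AA010108).
Keywords: text classification; graphical model; Latent Dirichlet Allocation (LDA); variational inference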
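
The abstract reports its gains in micro-F1, so a short illustration of that measure may help. The following is a minimal sketch, assuming single-label classification (each document has exactly one gold class and one predicted class); the function name `micro_f1` and the sample labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def micro_f1(gold, predicted):
    """Micro-averaged F1 for single-label classification.

    Counts are pooled over all classes before precision and recall
    are computed, so frequent classes weigh more than rare ones.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1      # correct prediction for class g
        else:
            fp[p] += 1      # class p was predicted but is wrong
            fn[g] += 1      # class g was missed

    tp_sum = sum(tp.values())
    fp_sum = sum(fp.values())
    fn_sum = sum(fn.values())

    # F1 = 2TP / (2TP + FP + FN), using the pooled counts.
    denom = 2 * tp_sum + fp_sum + fn_sum
    return 2 * tp_sum / denom if denom else 0.0


# Illustrative data only (not taken from the paper's experiments).
gold = ["comp.graphics", "comp.graphics", "comp.windows.x", "comp.sys.mac.hardware"]
pred = ["comp.graphics", "comp.windows.x", "comp.windows.x", "comp.sys.mac.hardware"]
score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.2f}")
```

In the single-label setting every misclassified document contributes one false positive and one false negative, so micro-F1 coincides with accuracy; the two diverge only when documents may carry several labels.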