
Text Clustering Based on LDA Feature Selection (cited by: 3)

A Feature Selection Algorithm Based on LDA for Texts Clustering
Abstract: Feature selection plays a crucial role in text clustering. This paper introduces the generative model Latent Dirichlet Allocation (LDA) into K-means-based text clustering and selects features by extracting the relationship between features and latent topics. Experiments on the corpus of the 2nd Chinese Opinion Analysis Evaluation (COAE2009) show that, when 2% of the features are selected, purity and F-measure improve by 0.15 and 0.16, respectively, over the Term Contribution (TC) feature selection method, and by 0.14 and 0.13, respectively, over clustering directly on the document-topic relations produced by LDA.
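The abstract does not state the scoring rule or the exact metric definitions, so the following is only a minimal sketch under stated assumptions. It assumes a topic-word weight matrix is already available from a fitted LDA model, ranks each word by its maximum weight over topics (a hypothetical score, not necessarily the paper's), and evaluates a clustering with purity and the commonly used class-weighted clustering F-measure. The function names `select_features`, `purity`, and `clustering_f_measure` are illustrative.

```python
from collections import Counter
import math


def select_features(topic_word, fraction):
    """Return the indices of the top `fraction` of features.

    topic_word[k][w] is an assumed weight of word w under topic k, for
    example a row-normalised topic-word matrix from a fitted LDA model.
    The score used here (maximum weight of a word over all topics) is an
    illustrative assumption, not the paper's scoring rule.
    """
    n_words = len(topic_word[0])
    scores = [max(row[w] for row in topic_word) for w in range(n_words)]
    keep = max(1, math.ceil(fraction * n_words))
    ranked = sorted(range(n_words), key=lambda w: scores[w], reverse=True)
    return sorted(ranked[:keep])


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, Counter())[t] += 1
    majority = sum(c.most_common(1)[0][1] for c in clusters.values())
    return majority / len(labels_true)


def clustering_f_measure(labels_true, labels_pred):
    """Class-size-weighted best F1 of each class over all clusters."""
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = joint[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += n_cls / n * best
    return total


if __name__ == "__main__":
    # Toy topic-word matrix: 2 topics, 4 words (made-up numbers).
    phi = [[0.7, 0.1, 0.1, 0.1],
           [0.05, 0.05, 0.6, 0.3]]
    print(select_features(phi, 0.5))

    # Toy gold classes and cluster assignments (made-up data).
    gold = ["a", "a", "a", "b", "b", "c"]
    pred = [0, 0, 1, 1, 1, 2]
    print(purity(gold, pred), clustering_f_measure(gold, pred))
```

In a full pipeline, the selected indices would be used to restrict the document vectors before running K-means; that step is omitted here because it depends on the vectorisation and clustering library in use.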
Source: Computer Development & Applications (《电脑开发与应用》), 2012, No. 1, pp. 1-5 (5 pages)
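Funding: National Natural Science Foundation of China (60875040, 60970014); Doctoral Program Foundation of Higher Education Institutions, Ministry of Education (200801080006); Natural Science Foundation of Shanxi Province (2010011021-1); Shanxi Province Science and Technology Key Project (20110321027-02); Taiyuan Science and Technology Bureau Star Special Project (09121001)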
Keywords: text clustering; feature selection; Latent Dirichlet Allocation

