Online LDA on Dynamic Vocabulary

Abstract: Most current online latent Dirichlet allocation (LDA) algorithms assume a fixed vocabulary. In practice the vocabulary often fails to match the corpus being processed, which limits the model's practical usefulness. To address this problem, we let the topic-word distributions follow a Dirichlet process (DP) and re-derive the update formulas within the belief propagation (BP) framework, so that the vocabulary is empty before the model runs and newly discovered words are added to it continually during processing. Experiments show that the new dynamic-vocabulary algorithm keeps the vocabulary better matched to the corpus and achieves better perplexity and PMI than fixed-vocabulary online LDA algorithms.
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2016, No. 12, pp. 120-124, 134 (6 pages)
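Funding: Supported by the National Natural Science Foundation of China (61373092, 61572339, 61272449) and the Key Project of the Jiangsu Province Science and Technology Support Program (BE2014005).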
Keywords: latent Dirichlet allocation; dynamic vocabulary; Dirichlet process; stream processing
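
The abstract describes a vocabulary that is empty before the model runs and grows as new words appear in the stream. The bookkeeping side of that idea might look roughly like the Python sketch below. It is an illustration only: the class `DynamicVocabulary` and the function `encode_stream` are hypothetical names not taken from the paper, and the sketch does not implement the Dirichlet-process prior or the belief-propagation updates.

```python
from typing import Dict, Iterable, Iterator, List


class DynamicVocabulary:
    """Hypothetical helper: a vocabulary that starts empty and grows on demand."""

    def __init__(self) -> None:
        self.word_to_id: Dict[str, int] = {}
        self.id_to_word: List[str] = []

    def lookup_or_add(self, word: str) -> int:
        """Return the id of `word`, assigning the next free id if it is unseen."""
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.id_to_word)
            self.id_to_word.append(word)
        return self.word_to_id[word]

    def __len__(self) -> int:
        return len(self.id_to_word)


def encode_stream(
    batches: Iterable[List[List[str]]], vocab: DynamicVocabulary
) -> Iterator[List[List[int]]]:
    """Encode mini-batches of tokenized documents, growing `vocab` as words arrive."""
    for batch in batches:
        yield [[vocab.lookup_or_add(w) for w in doc] for doc in batch]


# Toy stream of two mini-batches (made-up tokens, for illustration only).
vocab = DynamicVocabulary()
batches = [
    [["topic", "model", "topic"]],
    [["online", "model"], ["stream", "topic"]],
]
encoded = list(encode_stream(batches, vocab))
```

In a full online LDA implementation, each newly assigned id would also need a corresponding entry in the topic-word statistics. How that entry is initialized is exactly what the paper's DP-based derivation addresses, and it is not reproduced here.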