Abstract
Most existing online latent Dirichlet allocation (LDA) algorithms rely on a fixed vocabulary. In practice, the vocabulary often fails to match the corpus being processed, which limits the practical usefulness of the model. To address this problem, we let the topic-word distribution follow a Dirichlet process (DP) and re-derive the update equations within the belief propagation (BP) framework, so that the vocabulary is empty before the model runs and newly discovered words are added to it continually during processing. Experiments show that the new dynamic-vocabulary algorithm not only fits the vocabulary to the corpus more closely, but also outperforms fixed-vocabulary online LDA algorithms on both perplexity and pointwise mutual information (PMI).
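The vocabulary behaviour described in the abstract (empty at the start, extended whenever a new word appears in the stream) can be illustrated with a minimal sketch. This covers only the bookkeeping side: it does not implement the Dirichlet process prior or the BP updates of the paper. The names `DynamicVocabulary`, `lookup_or_add`, and `encode_stream` are hypothetical and chosen for illustration only.

```python
from typing import Dict, Iterable, List


class DynamicVocabulary:
    """Hypothetical helper: a vocabulary that starts empty and grows on demand."""

    def __init__(self) -> None:
        self.word_to_id: Dict[str, int] = {}
        self.id_to_word: List[str] = []

    def lookup_or_add(self, word: str) -> int:
        """Return the id of `word`, assigning the next free id if it is new."""
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.id_to_word)
            self.id_to_word.append(word)
        return self.word_to_id[word]

    def __len__(self) -> int:
        return len(self.id_to_word)


def encode_stream(docs: Iterable[List[str]], vocab: DynamicVocabulary) -> List[List[int]]:
    """Map each tokenized document to word ids, extending the vocabulary on the fly."""
    return [[vocab.lookup_or_add(w) for w in doc] for doc in docs]


# Two mini-batches arriving one after the other; the vocabulary is empty at first.
vocab = DynamicVocabulary()
batch1 = encode_stream([["topic", "model", "topic"]], vocab)
batch2 = encode_stream([["model", "stream", "word"]], vocab)
```

In a full online model, each newly assigned id would also need a corresponding entry in the topic-word statistics; how that entry is initialized is exactly what the Dirichlet process formulation in the paper addresses, and it is left out here.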
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal
2016, Issue 12, pp. 120-124, 134 (6 pages in total)
Funding
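Supported by the National Natural Science Foundation of China (61373092, 61572339, 61272449) and the Key Project of the Jiangsu Province Science and Technology Support Program (BE2014005).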
Keywords
Latent Dirichlet allocation; dynamic vocabulary; Dirichlet process; stream processing