
The Development of Topic Models in Natural Language Processing (cited by 233)
Abstract: Topic models are receiving increasing attention in natural language processing. In this field, a topic is regarded as a probability distribution over terms. Topic models extract semantically coherent topics from document-level term co-occurrence information, and transform documents from the term space to the topic space, yielding a low-dimensional representation of each document. This paper starts from Latent Semantic Indexing (LSI), the origin of topic models, and introduces and analyzes probabilistic Latent Semantic Indexing (pLSI), LDA, and the other milestone works in the development of topic models, with a focus on the relationships among them. As a probabilistic generative model, LDA is easily extended into other probabilistic models. The paper gives a rough categorization of the models derived from LDA and briefly introduces a representative model from each category. The two most important groups of parameters in a topic model are the per-topic term distributions and the per-document topic distributions; the paper analyzes the use of the expectation-maximization (EM) algorithm in estimating these parameters, which helps in understanding the connections among the works in the development of topic models.
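The two parameter sets the abstract highlights (per-topic term distributions and per-document topic distributions) can be estimated by collapsed Gibbs sampling, one of the inference methods named in the keywords. The following is a minimal illustrative sketch in Python/NumPy, not code from the paper; the toy corpus, topic count, and hyperparameter values are assumptions chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids over a vocabulary of size V.
docs = [[0, 1, 2, 0, 1], [2, 3, 4, 3, 4], [0, 1, 4, 2, 3]]
V = 5                    # vocabulary size
K = 2                    # number of topics (assumed)
alpha, beta = 0.1, 0.01  # symmetric Dirichlet hyperparameters (assumed)

# Count tables: n_dk[d, k] = tokens in doc d assigned to topic k,
# n_kw[k, w] = times word w is assigned to topic k, n_k[k] = total per topic.
D = len(docs)
n_dk = np.zeros((D, K))
n_kw = np.zeros((K, V))
n_k = np.zeros(K)

# Random initial topic assignment for every token.
z = []
for d, doc in enumerate(docs):
    zd = []
    for w in doc:
        k = rng.integers(K)
        zd.append(k)
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    z.append(zd)

# Gibbs sweeps: resample each token's topic from its full conditional,
# p(z=k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta).
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

# Posterior mean estimates of the two key parameter sets:
theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)  # doc-topic
phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)      # topic-word
print(theta.round(2))
```

Each row of `theta` is a document's distribution over topics and each row of `phi` is a topic's distribution over terms, i.e. exactly the low-dimensional document representation and the topic-as-term-distribution view described in the abstract.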
Authors: Xu Ge, Wang Houfeng
Source: Chinese Journal of Computers (EI, CSCD, PKU Core), 2011, No. 8, pp. 1423-1436 (14 pages)
Funding: Supported by the National Natural Science Foundation of China (91024009, 60973053, 90920011)
Keywords: natural language processing; topic model; latent semantic indexing; latent Dirichlet allocation (LDA); expectation maximization algorithm; Gibbs sampling

References (62 in total; first 10 listed)

  • 1 Deerwester S C, Dumais S T, Landauer T K, et al. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990.
  • 2 Hofmann T. Probabilistic latent semantic indexing//Proceedings of the 22nd Annual International SIGIR Conference. New York: ACM Press, 1999: 50-57.
  • 3 Blei D, Ng A, Jordan M. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, 3: 993-1022.
  • 4 Griffiths T L, Steyvers M. Finding scientific topics. Proceedings of the National Academy of Sciences, 2004, 101: 5228-5235.
  • 5 Steyvers M, Griffiths T. Probabilistic topic models//Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum, 2006.
  • 6 Cao Juan, Zhang Yongdong, Li Jintao, Tang Sheng. A density-based method for adaptive LDA model selection. Chinese Journal of Computers, 2008, 31(10): 1780-1787. (in Chinese)
  • 7 Teh Y W, Jordan M I, Beal M J, Blei D M. Hierarchical Dirichlet processes. Technical Report 653. UC Berkeley Statistics, 2004.
  • 8 Shi Jing, Hu Ming, Shi Xin, Dai Guozhong. Text segmentation based on the LDA model. Chinese Journal of Computers, 2008, 31(10): 1865-1873. (in Chinese)
  • 9 Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977, 39(1): 1-38.
  • 10 Bishop C M. Pattern Recognition and Machine Learning. New York, USA: Springer, 2006.
