Topic Mining from Microblogs Based on MB-HDP Model

Cited by: 31
Abstract: Topic models are important tools for mining latent topics from microblogs. However, most existing topic models are derived from Latent Dirichlet Allocation (LDA) and require the number of topics to be specified in advance. To mine microblog topics automatically, the authors propose MicroBlog-Hierarchical Dirichlet Process (MB-HDP), a nonparametric Bayesian model based on the Hierarchical Dirichlet Process (HDP). First, the model assumes that messages are non-exchangeable, which suits the microblog setting. Second, to address the data sparsity caused by short tweets, it uses temporal information, user interests, and hashtags to aggregate topic-related tweets into longer pseudo-documents. Third, it extends the Chinese Restaurant Franchise (CRF) to model topics in microblog data. Finally, a corresponding Markov Chain Monte Carlo (MCMC) sampling method is designed for posterior inference of the MB-HDP distribution parameters. Experiments show that MB-HDP clearly outperforms both LDA and HDP in the quality of the generated topics, the perplexity of held-out content, and model complexity.
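The aggregation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure, which the abstract does not specify; it only shows one plausible way to pool short tweets into pseudo-documents. The function name `pool_tweets`, the 24-hour window, the Twitter-style `#tag` pattern, and the fallback to the posting user are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Assumption: Twitter-style hashtags of the form "#tag".
HASHTAG = re.compile(r"#(\w+)")


def pool_tweets(tweets, window=timedelta(hours=24)):
    """Pool (user, timestamp, text) tweets into pseudo-documents.

    Hypothetical pooling rule: a tweet joins one pool per hashtag it
    contains, keyed by (hashtag, time bucket); a tweet without hashtags
    joins the pool of its author for that time bucket.
    """
    pools = defaultdict(list)
    seconds = window.total_seconds()
    for user, ts, text in tweets:
        # Integer index of the time window the tweet falls into.
        bucket = int(ts.timestamp() // seconds)
        tags = HASHTAG.findall(text.lower())
        keys = [("tag", t, bucket) for t in tags] or [("user", user, bucket)]
        for key in keys:
            pools[key].append(text)
    # Each pseudo-document is the concatenation of its pooled tweets.
    return {key: " ".join(texts) for key, texts in pools.items()}
```

Pooling by hashtag and time window lengthens the documents a topic model sees, which is the motivation the abstract gives for aggregation. Timestamps should be timezone-aware so that the window boundaries do not depend on the local clock.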
Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the PKU Core list; 2015, Issue 7, pp. 1408-1419 (12 pages)
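Funding: Supported by the National Natural Science Foundation of China (61033010, 61272065, 61472453, U1401256), the Natural Science Foundation of Guangdong Province (S2011020001182, S2012010009311), and the Science and Technology Planning Project of Guangdong Province (2011B040200007, 2011B031700004, 2012A010701013).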
Keywords: topic mining; microblog; hierarchical Dirichlet process; MB-HDP