Abstract
As microblogging grows more popular, services such as Twitter have become web-scale publishers of information. Early work on microblogs focused on user relationships and community structure and paid little attention to the content itself, so research needs to shift from analyzing user relationships alone to mining microblog content. Although topic mining for traditional text has been studied extensively, microblog posts carry structured social-network information alongside plain text, and traditional text mining algorithms do not model them well. This paper proposes MB-LDA, a probabilistic generative model for microblogs based on LDA, which takes both the contact relations and the document relations among posts into account to assist topic mining. Inference uses Gibbs sampling, and the results reveal not only the topics of the microblog posts but also the topics that contacts focus on. The model can also be extended to other text with social-network properties, such as e-mails and forum posts. Experiments on a real dataset show that MB-LDA mines microblog topics effectively.
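For readers unfamiliar with the inference technique named in the abstract, the following is a minimal sketch of collapsed Gibbs sampling for plain LDA, the base model that MB-LDA builds on. It is not the MB-LDA model itself: the contact and document relations described above are omitted. The function name `lda_gibbs`, its parameters, and the default hyperparameter values are illustrative assumptions and do not come from the paper.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for standard LDA (illustrative, not MB-LDA).

    docs is a list of documents, each a list of integer word ids in
    range(vocab_size). Returns (theta, phi): per-document topic
    distributions and per-topic word distributions.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                  # n[d][k]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]    # n[k][w]
    topic_total = [0] * num_topics                                # n[k]
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized full conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled topic.
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Point estimates from the final counts, smoothed by the priors.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    return theta, phi
```

Calling `lda_gibbs([[0, 1, 0, 1], [2, 3, 2, 3]], num_topics=2, vocab_size=4)` returns one topic distribution per document and one word distribution per topic. A model in the spirit of MB-LDA would add further latent variables for the social-network relations, which this sketch does not attempt.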
Source
《计算机研究与发展》
EI
CSCD
Peking University Core Journals (北大核心)
2011, Issue 10, pp. 1795-1802 (8 pages)
Journal of Computer Research and Development
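Funding
National Science and Technology Major Project "Core Electronic Devices, High-end Generic Chips and Basic Software" (核高基) (2010ZX01042-002-003)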
Keywords
microblog
topic mining
LDA
probabilistic generative model
social network