Abstract
In hot-topic discovery on microblogs, traditional topic detection methods no longer apply, because microblog posts are short, contain few words, and are highly time-sensitive. To address these characteristics, this paper proposes a topic discovery method based on microblog text and metadata. First, metadata such as posting time, user information, and reposts and comments are used to construct a composite weight describing the energy of each microblog term, from which the topic terms are extracted. Topic-term clusters are then built from contextual relationships. Finally, a second round of clustering is applied to the microblog texts, yielding the implicit topics in the microblogs and the posts related to them. Experiments on real microblog data show that the method effectively discovers hot topics and improves the accuracy and recall of topic detection.
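The abstract gives no formulas, so the following is only a minimal sketch of the three stages it names (metadata-weighted term energy, context-based term clusters, and a second pass that groups posts). Every concrete choice here is an assumption made for illustration and is not taken from the paper: the weight formula in `post_weight`, the 24-hour half-life, co-occurrence within a post as the "context" relation, union-find grouping, and the toy data in `POSTS`.

```python
import math
from collections import defaultdict
from itertools import combinations


def post_weight(post, now, half_life_hours=24.0):
    """Hypothetical composite weight built from metadata (not the paper's formula)."""
    interaction = 1.0 + math.log1p(post["reposts"] + post["comments"])
    influence = 1.0 + math.log1p(post["followers"])
    age_hours = max(0.0, (now - post["time"]) / 3600.0)
    recency = 0.5 ** (age_hours / half_life_hours)  # exponential time decay
    return interaction * influence * recency


def topic_words(posts, now, top_k):
    """Sum post weights per term ("energy") and keep the top_k terms."""
    energy = defaultdict(float)
    for post in posts:
        weight = post_weight(post, now)
        for word in set(post["words"]):
            energy[word] += weight
    ranked = sorted(energy, key=lambda word: (-energy[word], word))
    return ranked[:top_k]


def word_clusters(posts, words, min_cooc=2):
    """Group topic terms that co-occur in at least min_cooc posts (union-find)."""
    word_set = set(words)
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    cooc = defaultdict(int)
    for post in posts:
        present = sorted(set(post["words"]) & word_set)
        for a, b in combinations(present, 2):
            cooc[(a, b)] += 1
    for (a, b), count in cooc.items():
        if count >= min_cooc:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for w in words:
        groups[find(w)].add(w)
    return sorted(groups.values(), key=sorted)


def assign_posts(posts, clusters):
    """Second pass: attach each post to the term cluster it overlaps most."""
    result = defaultdict(list)
    for index, post in enumerate(posts):
        overlaps = [len(set(post["words"]) & cluster) for cluster in clusters]
        if overlaps and max(overlaps) > 0:
            result[overlaps.index(max(overlaps))].append(index)
    return dict(result)


# Toy, already-tokenized posts (invented data; times are Unix seconds).
NOW = 1_000_000
POSTS = [
    {"words": ["earthquake", "rescue"], "time": NOW - 3600,
     "reposts": 50, "comments": 20, "followers": 1000},
    {"words": ["earthquake", "rescue", "donate"], "time": NOW - 7200,
     "reposts": 30, "comments": 10, "followers": 500},
    {"words": ["concert", "ticket"], "time": NOW - 3600,
     "reposts": 40, "comments": 5, "followers": 800},
    {"words": ["concert", "ticket", "lunch"], "time": NOW - 1800,
     "reposts": 10, "comments": 2, "followers": 200},
    {"words": ["lunch"], "time": NOW - 3 * 86400,
     "reposts": 0, "comments": 0, "followers": 10},
]

if __name__ == "__main__":
    words = topic_words(POSTS, NOW, top_k=4)
    clusters = word_clusters(POSTS, words, min_cooc=2)
    print(words)
    print(clusters)
    print(assign_posts(POSTS, clusters))
```

In this sketch a post that shares no term with any cluster is left unassigned, which is one simple way to keep low-energy chatter out of the detected topics. Real Chinese microblog text would also need word segmentation before the `words` lists could be built.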
Source
《计算机应用与软件》
CSCD
2016, Issue 3, pp. 67-70, 86 (5 pages in total)
Computer Applications and Software
Funding
National Natural Science Foundation of China (61103046)
Keywords
Microblog
Metadata
Clustering
Topic detection