Abstract
To address the low efficiency of extracting hot topics from massive micro-blog data, a new hot topic detection method based on user role classification is proposed. First, user roles are determined from the degree of attention each user receives, and the noise data of some users are filtered out. Second, feature weights are calculated with a Term Frequency-Inverse Document Frequency (TF-IDF) function combined with semantic similarity, which reduces the error caused by differing forms of semantic expression. Then, an improved Single-Pass clustering algorithm is used to cluster the posts and extract micro-blog topics. Finally, topic heat is evaluated and ranked according to indicators such as the number of reposts and comments, so that hot topics are identified. Experiments show that the proposed method reduces the missing rate and the false detection rate by 12.09% and 2.37% on average, respectively, which effectively improves topic detection accuracy and confirms the feasibility of the method.
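As a rough illustration of the clustering and ranking stages, the following Python sketch may help. It is not the authors' implementation: it uses plain smoothed TF-IDF without the semantic-similarity term, a basic Single-Pass pass rather than the improved variant, and an assumed heat score equal to a weighted sum of reposts and comments. The function names, the threshold, and the weights `w_repost` and `w_comment` are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Plain smoothed TF-IDF; the semantic-similarity adjustment is NOT modeled.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        if not doc:
            vectors.append({})
            continue
        tf = Counter(doc)
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(vectors, threshold=0.3):
    """Basic Single-Pass clustering: each post joins the most similar
    existing cluster if similarity >= threshold, else starts a new one."""
    clusters = []  # each: {"centroid": summed vector, "members": [indices]}
    for i, vec in enumerate(vectors):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            for t, w in vec.items():
                best["centroid"][t] = best["centroid"].get(t, 0.0) + w
        else:
            clusters.append({"centroid": dict(vec), "members": [i]})
    return clusters


def topic_heat(cluster, posts, w_repost=1.0, w_comment=1.0):
    """Assumed heat score: weighted sum of reposts and comments in a topic."""
    return sum(
        w_repost * posts[i]["reposts"] + w_comment * posts[i]["comments"]
        for i in cluster["members"]
    )


def rank_topics(posts, threshold=0.3):
    """Cluster posts into topics and return them sorted by descending heat."""
    vectors = tfidf_vectors([p["tokens"] for p in posts])
    clusters = single_pass(vectors, threshold)
    return sorted(clusters, key=lambda c: topic_heat(c, posts), reverse=True)
```

Each post here is assumed to be a dict with pre-tokenized text under `tokens` plus `reposts` and `comments` counts. The user-role filtering step would run before `rank_topics` and is omitted.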
Source
《计算机应用》
CSCD
Peking University Core Journals (北大核心)
2013, Issue 11, pp. 3076-3079 (4 pages)
Journal of Computer Applications