Abstract
Microblogs are an important platform for information dissemination in contemporary life, and mining hot topics from them has become an important research direction. Traditional hot topic discovery methods perform poorly on microblog text: their text representations lack semantic information, and the hot topics they mine are of low quality. To address these problems, this paper proposes a text dual representation model based on frequent word sets and BERT semantics (FWS-BERT). The model is used to compute a weighted text similarity, with which the microblog texts are grouped by spectral clustering; topics are then mined with an affinity propagation (AP) clustering algorithm that uses an improved similarity measure. Finally, a topic heat evaluation method is proposed by introducing the H-index from bibliometrics. Experiments show that the proposed method achieves higher silhouette coefficient and Calinski-Harabasz (CH) index values than both the single text representation method based on frequent word sets and the K-means method, and that it can accurately represent topics and evaluate their heat on microblog data.
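The abstract states that topic heat is evaluated by borrowing the H-index from bibliometrics, but it does not give the exact formulation. The following is a minimal sketch of the standard H-index only, assuming each post in a topic has a single interaction count (for example reposts plus comments plus likes). The function name `h_index`, the sample counts, and the choice of interaction count are illustrative assumptions, not the authors' definition.

```python
def h_index(counts):
    """Return the largest h such that at least h items have a count >= h.

    This is the standard bibliometric H-index; applying it to per-post
    interaction counts is an assumption made for illustration.
    """
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, value in enumerate(ranked, start=1):
        if value >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical per-post interaction counts for two mined topics.
topic_a = [120, 45, 30, 8, 5, 3, 1]   # many posts with moderate interaction
topic_b = [900, 2, 1, 1]              # one viral post, little else

heat = {"topic_a": h_index(topic_a), "topic_b": h_index(topic_b)}
print(heat)
```

Under this reading, a topic scores highly only when many of its posts each attract substantial interaction, so a single viral post does not dominate the ranking the way a plain sum of counts would.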
Authors
刘梦颖
王勇
LIU Meng-ying; WANG Yong (Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China)
Source
Computer and Modernization (《计算机与现代化》)
2021, No. 12, pp. 110-115, 122 (7 pages)
Keywords
microblog
frequent word sets
BERT
clustering
hot topics