Abstract
[Purpose/Significance] To address the high-dimensional sparsity and missing contextual semantics of microblog short texts, this paper proposes a text feature representation that fuses a topic model with word embeddings, with the aim of improving microblog topic clustering. [Method/Process] Using Sina Weibo as the data source, the paper combines LDA document-topic distribution features with weighted Word2Vec word embedding features to construct fused features for microblog short texts, and performs topic clustering with the K-means algorithm. The results are compared with those of single-feature clustering and the standard LDA topic model, and the topic clustering methods are evaluated by F1 score. [Result/Conclusion] The fused-feature topic clustering model performs best among the compared methods, reaching an F1 score of 83.7%. The experiments indicate that the fused features describe the semantic information of the text more comprehensively and accurately, and represent microblog text more effectively.
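The pipeline named in the abstract (LDA document-topic features, weighted Word2Vec features, fusion, K-means) can be illustrated with a minimal sketch. The abstract does not state the weighting scheme or the fusion operator, so this sketch assumes per-word weights (TF-IDF would be one candidate) and simple concatenation of L2-normalised vectors. The toy topic distributions and word vectors are hypothetical stand-ins for the outputs of trained LDA and Word2Vec models, and all function names are illustrative.

```python
import math

def weighted_doc_vector(tokens, word_vectors, weights):
    """Weighted average of the word vectors of one document (assumed scheme)."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok in word_vectors:
            w = weights.get(tok, 1.0)
            total = [t + w * x for t, x in zip(total, word_vectors[tok])]
            weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(topic_dist, doc_vec):
    """Assumed fusion: concatenate the two normalised feature vectors."""
    return l2_normalize(topic_dist) + l2_normalize(doc_vec)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy inputs standing in for trained Word2Vec and LDA outputs.
word_vectors = {"match": [1.0, 0.0], "goal": [0.9, 0.1],
                "stock": [0.0, 1.0], "market": [0.1, 0.9]}
weights = {"match": 1.2, "goal": 1.0, "stock": 1.5, "market": 1.0}
docs = [["match", "goal"], ["stock", "market"],
        ["goal", "match", "match"], ["market", "stock"]]
topic_dists = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.2], [0.1, 0.9]]

fused = [fuse(t, weighted_doc_vector(d, word_vectors, weights))
         for t, d in zip(topic_dists, docs)]
labels = kmeans(fused, k=2)
print(labels)
```

In practice the topic distributions and word vectors would come from models trained on the microblog corpus, and the relative scaling of the two feature blocks is a design choice that affects how much each contributes to the K-means distance.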
Authors
颜端武
梅喜瑞
杨雄飞
朱鹏
Yan Duanwu; Mei Xirui; Yang Xiongfei; Zhu Peng (School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094, China)
Source
《现代情报》
CSSCI
2021, No. 10, pp. 67-74 (8 pages)
Journal of Modern Information
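Funding
National Natural Science Foundation of China, General Program, "Research on the Interaction Mechanism between Individual Regulatory Focus and the Evolution of Information Cascades" (Grant No. 71874082)
Jiangsu Province 2011 Collaborative Innovation Center for Social Public Safety
Jiangsu Province Postgraduate Research and Practice Innovation Program, "Research on Emerging Technology Monitoring Based on Patent Technology Roadmaps and Social Media Mining" (Grant No. KYCX20_0403)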