Abstract
The rapid growth of online social network platforms (WeChat, Weibo, etc.) and apps (NetEase, Xuexi Qiangguo) has produced massive volumes of short text. Traditional text clustering methods perform poorly on such short texts. This paper proposes a short text clustering method that combines TF-IDF with word embeddings. First, TF-IDF is used to extract the top-N keywords by TF-IDF value from each short text as its feature word set. Second, with the Word2Vec tool, a Skip-gram model is trained on a large-scale corpus to obtain vector representations of the feature words. Finally, the Word Mover's Distance (WMD) is used to compute the similarity between short texts. The method was applied to four datasets, and the experimental results show that it outperforms traditional text clustering algorithms.
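The three steps in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation and rests on several assumptions: the toy documents are invented, the 2-D vectors in `toy_vectors` are hand-written stand-ins for trained Skip-gram embeddings, the TF-IDF weighting (relative term frequency times log inverse document frequency) is one common variant, and the distance is the relaxed WMD, a cheaper lower bound on the exact WMD, rather than the full optimal-transport computation. The function names `top_n_keywords` and `relaxed_wmd` are hypothetical.

```python
import math
from collections import Counter


def top_n_keywords(docs, n):
    """Return, for each tokenized document, its n highest-scoring TF-IDF terms."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    total = len(docs)
    result = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(total / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:n])
    return result


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(words_a, words_b, vectors):
    """Relaxed WMD with uniform word weights (a lower bound on the exact WMD).

    Each word sends all of its mass to the nearest word in the other text;
    the larger of the two one-directional costs is returned.
    """
    def one_way(src, dst):
        nearest = (min(euclidean(vectors[s], vectors[d]) for d in dst) for s in src)
        return sum(nearest) / len(src)

    return max(one_way(words_a, words_b), one_way(words_b, words_a))


# Invented toy corpus of already-tokenized short texts.
docs = [
    ["power", "grid", "outage", "report"],
    ["grid", "blackout", "report"],
    ["football", "match", "report"],
]

# Hand-written stand-ins for Skip-gram vectors (assumption, not trained).
toy_vectors = {
    "power": (1.0, 0.0), "outage": (0.9, 0.1), "blackout": (0.9, 0.2),
    "grid": (1.0, 0.1), "football": (0.0, 1.0), "match": (0.1, 0.9),
}

keywords = top_n_keywords(docs, n=2)
d_near = relaxed_wmd(keywords[0], keywords[1], toy_vectors)
d_far = relaxed_wmd(keywords[0], keywords[2], toy_vectors)
print(keywords)
print(d_near, d_far)
```

In this toy setup the two power-related texts end up much closer to each other than either is to the sports text, so a pairwise distance matrix built this way could be passed to any distance-based clustering algorithm. The abstract does not say which clustering algorithm the authors used.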
Authors
赵晓平
黄祖源
黄世锋
王永和
ZHAO Xiaoping; HUANG Zuyuan; HUANG Shifeng; WANG Yonghe (Information Center, Yunnan Power Grid Co., Ltd., Kunming 650011, China; Yunnan Yundian Tongfang Technology Co., Ltd., Kunming 650220, China)
Source
《电子设计工程》
2020, No. 21, pp. 5-9 (5 pages)
Electronic Design Engineering
Funding
Young Scientists Fund of the National Natural Science Foundation of China (61702442).