A Short Text Clustering Algorithm Combining the TF-IDF Method and Word Embeddings (cited by: 12)

Short text clustering based on TF-IDF and word embedding
Abstract: The rapid growth of online social network platforms (WeChat, Weibo, etc.) and apps (NetEase, Xuexi Qiangguo) has produced massive volumes of short text, on which traditional text clustering methods perform poorly. This paper proposes a short text clustering method that combines TF-IDF with word embeddings. First, TF-IDF is used to extract the top-N keywords by TF-IDF value from each short text as its feature word set. Second, with the Word2Vec tool, a Skip-gram model is trained on a large corpus to obtain vector representations of the feature words. Finally, the Word Mover's Distance (WMD) is used to compute the similarity between short texts. The method is applied to four datasets, and the experimental results show that it outperforms traditional text clustering algorithms.
Authors: ZHAO Xiaoping (赵晓平); HUANG Zuyuan (黄祖源); HUANG Shifeng (黄世锋); WANG Yonghe (王永和) (Information Center, Yunnan Power Grid Co., Ltd., Kunming 650011, China; Yunnan Yundian Tongfang Technology Co., Ltd., Kunming 650220, China)
Source: Electronic Design Engineering (《电子设计工程》), 2020, No. 21, pp. 5-9 (5 pages)
Funding: National Natural Science Foundation of China, Young Scientists Fund (61702442).
Keywords: text clustering; short text; TF-IDF; word embedding; natural language processing
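
The three steps summarized in the abstract (TF-IDF keyword selection, word vectors, a WMD-style distance between keyword sets) can be illustrated with a small sketch. This is not the authors' implementation. It makes several assumptions: texts are already tokenized, the word vectors are hand-made 2-D toy values standing in for Skip-gram embeddings trained with Word2Vec, and the distance is the relaxed Word Mover's Distance (a cheaper lower bound on exact WMD) with uniform word weights. The function names `tfidf_top_n` and `relaxed_wmd` are hypothetical.

```python
import math
from collections import Counter


def tfidf_top_n(docs, n):
    """Return, for each tokenized document, its n highest-scoring TF-IDF words."""
    num_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            w: (count / len(doc)) * math.log(num_docs / df[w])
            for w, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords.append(ranked[:n])
    return keywords


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(words_a, words_b, vectors):
    """Relaxed WMD between two non-empty keyword lists, uniform word weights.

    Each word sends all of its mass to the nearest word on the other side;
    the larger of the two one-sided costs is a lower bound on exact WMD.
    """
    def one_sided(src, dst):
        total = sum(
            min(euclidean(vectors[s], vectors[d]) for d in dst) for s in src
        )
        return total / len(src)

    return max(one_sided(words_a, words_b), one_sided(words_b, words_a))


# Toy tokenized short texts (assumption: segmentation already done).
docs = [
    ["power", "grid", "outage", "repair"],
    ["electricity", "network", "outage", "fix"],
    ["football", "match", "goal", "score"],
]

# Toy 2-D vectors standing in for trained Skip-gram embeddings.
vectors = {
    "power": (0.9, 0.1), "grid": (1.0, 0.0), "outage": (0.85, 0.15),
    "repair": (0.8, 0.25), "electricity": (0.95, 0.05),
    "network": (0.9, 0.2), "fix": (0.8, 0.2), "football": (0.0, 1.0),
    "match": (0.05, 0.95), "goal": (0.1, 0.9), "score": (0.15, 0.85),
}

keywords = tfidf_top_n(docs, n=2)
d01 = relaxed_wmd(keywords[0], keywords[1], vectors)
d02 = relaxed_wmd(keywords[0], keywords[2], vectors)
print(keywords)
print(d01 < d02)  # the two power-related texts should be the closer pair
```

In a fuller pipeline, the pairwise distances would feed a clustering algorithm that accepts a precomputed distance matrix; the abstract does not say which one the authors used.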