Research on Clustering Method of Ancient Chinese Short Texts Using Iterative Training
Abstract: Traditional short text clustering suffers from sparse feature keywords, high feature dimensionality, and neglect of textual semantics. Using a data set of short text entries extracted from the ancient works Complete Book Collection in Four Sections and Imperial Readings of the Taiping Era, this paper proposes a fusion model combining BERT (Bidirectional Encoder Representations from Transformers), K-means, and iterative training to cluster the short text data set. A pretrained BERT model is used to obtain vector representations of the short text entries; these vectors are fed to the K-means algorithm to obtain an initial clustering result; an outlier detection algorithm then splits the clustering result into an outlier set and a non-outlier set; a classifier trained on the non-outliers reassigns the outliers, and the procedure is repeated until a stopping criterion is reached. The BERT representation is compared experimentally with TF-IDF (Term Frequency-Inverse Document Frequency) and Word2vec word vectors; the results show that the pretrained BERT model yields a significant improvement over both, and the experiments also confirm the effectiveness of iterative training on the ancient Chinese short text data set used in this paper.
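A minimal sketch of the pipeline described in the abstract is given below. It assumes the bert-base-chinese checkpoint for the encoder, a distance-to-centroid rule for outlier detection, a logistic-regression classifier for reassigning outliers, and stable cluster assignments as the stopping criterion; none of these specific choices is stated in the abstract, so they are illustrative placeholders rather than the authors' exact configuration.

```python
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def bert_embed(texts, model_name="bert-base-chinese", batch_size=32):
    """Mean-pooled BERT sentence vectors for a list of short texts."""
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    model.eval()
    chunks = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True,
                              truncation=True, max_length=64, return_tensors="pt")
            hidden = model(**batch).last_hidden_state             # (B, L, H)
            mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, L, 1)
            chunks.append(((hidden * mask).sum(1) / mask.sum(1)).numpy())
    return np.vstack(chunks)

def iterative_cluster(texts, n_clusters=10, outlier_quantile=0.9, max_rounds=10):
    """BERT + K-means initial clustering, then iterative reassignment of outliers."""
    X = bert_embed(texts)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    labels = km.labels_.copy()
    centers = km.cluster_centers_.copy()
    for _ in range(max_rounds):
        # Outlier rule (assumed): points far from their own cluster centroid.
        dist = np.linalg.norm(X - centers[labels], axis=1)
        outliers = dist > np.quantile(dist, outlier_quantile)
        if not outliers.any():
            break
        # Train a classifier on the non-outliers, then relabel the outliers.
        clf = LogisticRegression(max_iter=1000).fit(X[~outliers], labels[~outliers])
        new_labels = labels.copy()
        new_labels[outliers] = clf.predict(X[outliers])
        # Stopping criterion (assumed): cluster assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels
```

In practice, `texts` would be the list of entry headings extracted from the two source works, and `n_clusters` the desired number of subject categories.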
Authors: Li Xiaolu; Zhao Qingcong; Qi Lin (School of Information Management, Beijing Information Science and Technology University, Beijing 100192; School of Economics and Management, Beijing Information Science and Technology University, Beijing 100192; Beijing Key Laboratory of Big Data Decision-making for Green Development, Beijing 100192; Beijing World Urban Circular Economy System (Industry) Collaborative Innovation Center, Beijing 100192)
Source: Modern Computer (《现代计算机》), 2022, No. 2, pp. 37-43 (7 pages)
Fund: National Key Research and Development Program of China (2017YFB1400400).
Keywords: ancient Chinese texts; short text clustering; BERT model; K-means clustering; iterative training