Abstract
Aims: To fuse the multiple semantic senses of words, a multi-semantic composite text representation model and a text clustering algorithm based on it are proposed. Methods: First, a Gaussian mixture model is used to construct the multi-semantic space of words and to compute each word's probability weights over its different senses. Second, the word embeddings weighted by all of these semantic probabilities are composed into a text vector. Finally, the multi-semantic structure of the text vectors is used to identify outliers in the text data, and removing those outliers improves the clustering performance of the K-means algorithm. Results: The multi-semantic composite text vectors effectively remove redundancy and highlight the semantic structure of texts; experiments show that, compared with other text clustering algorithms, the proposed algorithm improves clustering performance by about 3.57% to 44.88%. Conclusions: Experimental results on two datasets show that the outlier-removing text clustering algorithm based on the multi-semantic composite representation model performs better.
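The weighting and composition steps described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the weighted embeddings are composed, so the code below assumes one plausible reading: each word embedding is scaled by its posterior probability under each mixture component, and the per-component averages are concatenated. The spherical covariances, the function names `responsibilities` and `composite_text_vector`, and the toy data are all hypothetical and are not taken from the paper.

```python
import math


def responsibilities(vec, means, variances, weights):
    """Posterior probability of each mixture component given one embedding.

    Assumes spherical Gaussians (one variance per component); the paper's
    actual covariance structure is not given in the abstract.
    """
    dim = len(vec)
    log_terms = []
    for mu, var, w in zip(means, variances, weights):
        sq_dist = sum((x - m) ** 2 for x, m in zip(vec, mu))
        log_terms.append(
            math.log(w)
            - 0.5 * dim * math.log(2.0 * math.pi * var)
            - sq_dist / (2.0 * var)
        )
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_terms)
    exps = [math.exp(t - top) for t in log_terms]
    total = sum(exps)
    return [e / total for e in exps]


def composite_text_vector(tokens, embeddings, means, variances, weights):
    """Concatenate, per component, the probability-weighted mean embedding.

    Concatenation is an assumption of this sketch, not a detail confirmed
    by the abstract. Tokens without an embedding are skipped.
    """
    k, dim = len(means), len(means[0])
    out = [0.0] * (k * dim)
    known = [t for t in tokens if t in embeddings]
    for tok in known:
        vec = embeddings[tok]
        resp = responsibilities(vec, means, variances, weights)
        for j, r in enumerate(resp):
            for i, x in enumerate(vec):
                out[j * dim + i] += r * x
    n = max(len(known), 1)
    return [v / n for v in out]


# Toy example: 2-D embeddings and a two-component mixture (made-up values).
toy_means = [[0.0, 0.0], [10.0, 10.0]]
toy_variances = [1.0, 1.0]
toy_weights = [0.5, 0.5]
toy_embeddings = {"bank": [0.0, 1.0], "river": [10.0, 9.0]}

text_vec = composite_text_vector(
    ["bank", "river", "unknown"], toy_embeddings,
    toy_means, toy_variances, toy_weights,
)
print(len(text_vec), text_vec)
```

In practice the mixture parameters would be fitted to the word embeddings rather than fixed by hand, and the resulting text vectors would then be passed to the outlier-removal and K-means steps, which this sketch does not cover.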
Authors
顾永春
武娇
金世举
顾兴全
尹雪婷
刘雅萱
GU Yongchun; WU Jiao; JIN Shiju; GU Xingquan; YIN Xueting; LIU Yaxuan (College of Sciences, China Jiliang University, Hangzhou 310018, China; College of Standardization, China Jiliang University, Hangzhou 310018, China)
Source
《中国计量大学学报》
2021, No. 3, pp. 414-420, 438 (8 pages in total)
Journal of China University of Metrology
Funding
National Natural Science Foundation of China (No. 61302190)
Natural Science Foundation of Zhejiang Province (No. Y201738417).
Keywords
word embedding
text representation
text clustering
K-means
outliers