Research on clustering with outlier removal based on a multi-semantic composite text representation
Abstract Aims: To fuse the multiple senses of a word, this paper proposes a multi-semantic composite text representation model and a text clustering algorithm built on it. Methods: First, a Gaussian mixture model is used to construct the multi-semantic space of words and to compute each word's probability weight for every sense. Second, the word embeddings, weighted by all of these sense probabilities, are composed into a text vector. Finally, the multi-semantic structure of the text vectors is used to identify outliers in the text data, and removing these outliers improves the clustering performance of the K-means algorithm. Results: The multi-semantic composite text vectors effectively remove redundancy and highlight the semantic structure of the texts. In experiments, the proposed algorithm improved clustering performance by about 3.57% to 44.88% compared with other text clustering algorithms. Conclusions: Experimental results on two datasets show that the outlier-removing text clustering algorithm based on the multi-semantic composite representation model performs better.
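The first two steps of the method can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the Gaussian mixture has already been fitted, simplifies each component to a spherical Gaussian, and assumes that "composing" means concatenating one posterior-weighted sum of word embeddings per component. The function names, the toy embeddings, and the mixture parameters are all hypothetical. Outlier identification and the K-means step are not shown, because the abstract does not specify the outlier criterion.

```python
import math


def gaussian_posteriors(vec, means, variances, weights):
    """Return p(component | vec) under a mixture of spherical Gaussians."""
    d = len(vec)
    log_terms = []
    for mean, var, weight in zip(means, variances, weights):
        sq_dist = sum((x - m) ** 2 for x, m in zip(vec, mean))
        log_pdf = -0.5 * (d * math.log(2 * math.pi * var) + sq_dist / var)
        log_terms.append(math.log(weight) + log_pdf)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_terms)
    exps = [math.exp(t - top) for t in log_terms]
    total = sum(exps)
    return [e / total for e in exps]


def composite_vector(tokens, embeddings, means, variances, weights):
    """Concatenate, per component, the posterior-weighted sum of word embeddings.

    Assumption: concatenation is one plausible reading of "composite";
    the abstract does not give the exact formula.
    """
    k, d = len(means), len(means[0])
    doc = [0.0] * (k * d)
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:  # skip out-of-vocabulary words
            continue
        post = gaussian_posteriors(vec, means, variances, weights)
        for c, p in enumerate(post):
            for j, x in enumerate(vec):
                doc[c * d + j] += p * x
    return doc


# Hypothetical toy data: 2-dimensional embeddings and a 2-component mixture.
embeddings = {"river": [1.0, 0.0], "money": [-1.0, 0.0], "bank": [0.0, 0.0]}
means = [[1.0, 0.0], [-1.0, 0.0]]
variances = [1.0, 1.0]
weights = [0.5, 0.5]

bank_post = gaussian_posteriors(embeddings["bank"], means, variances, weights)
doc_vec = composite_vector(["river", "bank", "money"], embeddings,
                           means, variances, weights)
```

In this toy setup the ambiguous word "bank" lies midway between the two component means, so its weight is split evenly between the two senses, while "river" and "money" each contribute mostly to one block of the document vector. The resulting vector has length (number of components) x (embedding dimension).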
Authors GU Yongchun (顾永春); WU Jiao (武娇); JIN Shiju (金世举); GU Xingquan (顾兴全); YIN Xueting (尹雪婷); LIU Yaxuan (刘雅萱) (College of Sciences, China Jiliang University, Hangzhou 310018, China; College of Standardization, China Jiliang University, Hangzhou 310018, China)
Source Journal of China University of Metrology (《中国计量大学学报》), 2021, No. 3, pp. 414-420, 438 (8 pages in total)
Funding National Natural Science Foundation of China (No. 61302190); Natural Science Foundation of Zhejiang Province (No. Y201738417).
Keywords word embedding; text representation; text clustering; K-means clustering; outliers