
A Text Similarity Computing Algorithm Based on Feature Weighting (cited by: 4)
Abstract Text similarity computation is a foundation of text mining. The traditional method based on the vector space model (VSM) maps each text to a term vector and computes similarity with the cosine formula, so the resulting vectors are very high-dimensional and carry little semantic information. To address these problems, feature words are filtered by part of speech and by weight, which shrinks the feature set and, to some extent, mitigates high-dimensional sparsity. Latent topic features from the LDA (Latent Dirichlet Allocation) model are then introduced to add semantic context to the text representation. Text similarity is computed by linearly weighting the feature-word features of the VSM with the topic features of the LDA model. Experiments show that the weighted features give better similarity results than either the VSM or the LDA model used alone.
Authors 邱先标 陈笑蓉 (QIU Xianbiao; CHEN Xiaorong), College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
Source Journal of Guizhou University: Natural Sciences (《贵州大学学报(自然科学版)》), 2018, No. 1, pp. 63-68 (6 pages)
Funding National Natural Science Foundation of China (61363028)
Keywords text similarity; vector space model (VSM); LDA model; feature weighting; text mining
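
The abstract names the ingredients of the method but not its exact formulas, so the following is only a minimal sketch under stated assumptions rather than the authors' implementation. It assumes that tokens have already been filtered by part of speech, that TF-IDF is the term weight, that each document's LDA topic distribution comes from a separately trained model, and that the linear weighting is applied to the two cosine scores with a hypothetical parameter `lam`. The function names `tfidf_vectors` and `combined_similarity` are illustrative.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tfidf_vectors(docs, min_weight=0.0):
    """Build TF-IDF vectors for tokenized documents.

    Assumption: `docs` holds tokens that already passed a part-of-speech
    filter. Terms whose weight falls below `min_weight` are dropped, which
    stands in for the weight-based filtering mentioned in the abstract.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        vec = {}
        for term, count in Counter(doc).items():
            idf = math.log((1 + n) / (1 + df[term])) + 1.0  # smoothed IDF
            weight = (count / len(doc)) * idf
            if weight >= min_weight:
                vec[term] = weight
        vectors.append(vec)
    return vectors


def combined_similarity(words_a, words_b, topics_a, topics_b, lam=0.5):
    """Linearly weight word-level and topic-level cosine similarity.

    Assumption: `topics_a` and `topics_b` are topic distributions (lists of
    probabilities) produced by an LDA model trained elsewhere, and `lam` in
    [0, 1] is a hypothetical mixing weight.
    """
    word_sim = cosine(words_a, words_b)
    topic_sim = cosine(dict(enumerate(topics_a)), dict(enumerate(topics_b)))
    return lam * word_sim + (1.0 - lam) * topic_sim


# Toy usage with made-up tokens and made-up topic distributions.
docs = [
    ["text", "similarity", "model"],
    ["text", "similarity", "topic"],
    ["soccer", "match", "goal"],
]
vecs = tfidf_vectors(docs)
topics = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print(combined_similarity(vecs[0], vecs[1], topics[0], topics[1], lam=0.5))
print(combined_similarity(vecs[0], vecs[2], topics[0], topics[2], lam=0.5))
```

In this sketch, setting `lam` to 1 reduces the score to the plain VSM cosine similarity and setting it to 0 uses only the topic features, so the two baselines mentioned in the abstract correspond to the endpoints of the same function.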
