Abstract
Text similarity computation is fundamental to text mining. Traditional methods based on the vector space model (VSM) map each text to a word vector and compute similarity with the cosine formula, which leads to excessively high-dimensional text vectors and poor semantic sensitivity. To address these problems, feature words are filtered by part of speech and by weight, which shrinks the feature set and reduces high-dimensional sparsity to some extent. The latent topic features of the LDA (Latent Dirichlet Allocation) model are introduced to enrich the semantic background of the text representation. Text similarity is then computed by linearly weighting the feature-word features of the VSM model and the topic features of the LDA model. Experiments show that the weighted features yield better text similarity results than using either the VSM model or the LDA model alone.
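The linear weighting of VSM and LDA features can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes that the two representations are compared separately with cosine similarity and the two scores are then blended with a weight `lam`. The function names, the parameter `lam`, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(vsm_a, vsm_b, topics_a, topics_b, lam=0.5):
    """Assumed blend: lam * VSM similarity + (1 - lam) * topic similarity."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    vsm_sim = cosine(vsm_a, vsm_b)          # feature-word (e.g. TF-IDF) vectors
    topic_sim = cosine(topics_a, topics_b)  # per-document topic distributions
    return lam * vsm_sim + (1.0 - lam) * topic_sim


# Toy inputs, made up for illustration only (not data from the paper).
vsm_a = [1.0, 0.0, 2.0]
vsm_b = [1.0, 0.0, 2.0]
topics_a = [1.0, 0.0]
topics_b = [0.0, 1.0]
score = combined_similarity(vsm_a, vsm_b, topics_a, topics_b, lam=0.5)
```

In this sketch, `lam = 1.0` reduces to the VSM-only baseline and `lam = 0.0` to the LDA-only baseline, which matches the comparison the abstract describes. How the weight is chosen in the paper is not stated here.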
Authors
邱先标
陈笑蓉
QIU Xianbiao; CHEN Xiaorong (College of Computer Science and Technology, Guizhou University, Guiyang 550025, China)
Source
《贵州大学学报(自然科学版)》
2018, No. 1, pp. 63-68 (6 pages)
Journal of Guizhou University: Natural Sciences
Funding
Supported by the National Natural Science Foundation of China (61363028)