Abstract
To improve the accuracy of text similarity detection, this paper proposes a method that combines Latent Dirichlet Allocation (LDA) with the Doc2Vec model; the resulting model is named HybridDL. The algorithm first trains Doc2Vec on the documents to obtain document vectors. It then uses the LDA model to obtain each document's topics and the probability of each feature word under each topic, computes the probability-weighted sum over the topics and feature words in the document, and maps the result onto the Doc2Vec document vector. Experimental results show that the new model is more sensitive to similar texts than the traditional Doc2Vec model and achieves higher accuracy in text similarity detection.
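The abstract does not spell out how the probability-weighted sum is mapped onto the document vector, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the Doc2Vec document vector, the LDA document-topic distribution `theta`, the per-topic word probabilities `phi`, and a table of word vectors have already been produced by trained models and are passed in as plain lists and dicts. The function names and the choice to combine by simple addition are hypothetical.

```python
import math


def topic_word_weights(theta, phi):
    """Return weight[w] = sum_k theta[k] * phi[k][w] for every feature word.

    theta: list of topic probabilities for one document (from LDA).
    phi:   list of dicts, phi[k][word] = P(word | topic k) (from LDA).
    """
    weights = {}
    for k, p_topic in enumerate(theta):
        for word, p_word in phi[k].items():
            weights[word] = weights.get(word, 0.0) + p_topic * p_word
    return weights


def hybrid_vector(doc_vec, theta, phi, word_vecs):
    """Add the probability-weighted sum of word vectors to the document vector.

    ASSUMPTION: combining by addition is this sketch's choice; the abstract
    only says the weighted sum is "mapped" onto the Doc2Vec vector.
    """
    out = list(doc_vec)
    for word, weight in topic_word_weights(theta, phi).items():
        vec = word_vecs.get(word)
        if vec is None:  # skip feature words that have no embedding
            continue
        for i, x in enumerate(vec):
            out[i] += weight * x
    return out


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Two documents would then be compared by calling `cosine` on their hybrid vectors. A concatenation of the Doc2Vec vector and the topic-weighted vector would be an equally plausible reading of "mapping" and would only change `hybrid_vector`.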
Authors
肖晗
毛雪松
朱泽德
Xiao Han; Mao Xuesong; Zhu Zede (School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan 430081, China; Institute of Technology Innovation, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China)
Source
《电子技术应用》
2020, No. 6, pp. 28-31, 35 (5 pages in total)
Application of Electronic Technique
Funding
National Natural Science Foundation of China (61806187).