Abstract
Document similarity algorithms based on the vector space model (VSM) assume that feature terms are mutually orthogonal. When two documents describe the same content using different terms with similar meanings, VSM therefore cannot accurately reflect their similarity. To address this, the paper exploits synonymy between words: it defines the similarity between a term and a term group and between two term groups, gives algorithms for computing both, and on that basis proposes a document similarity measure based on word similarity relations. Experiments use scientific literature and news reports as test collections and compare the classification performance of the new method with that of VSM; the results show that the new method can improve document classification accuracy.
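The orthogonality problem described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the synonym table `SYNONYMS`, the binary scoring in `term_sim`, the best-match rule in `term_to_group_sim`, and the symmetric averaging in `group_sim` are hypothetical stand-ins, not the authors' definitions.

```python
import math
from collections import Counter

# Hypothetical synonym table; the paper's actual lexical resource is not
# named in the abstract.
SYNONYMS = {
    frozenset(["car", "automobile"]),
    frozenset(["fast", "quick"]),
}

def term_sim(a, b):
    """Assumed scoring: 1.0 for identical terms or listed synonyms, else 0.0."""
    if a == b or frozenset([a, b]) in SYNONYMS:
        return 1.0
    return 0.0

def term_to_group_sim(term, group):
    """Assumed rule: a term's similarity to a term group is its best match."""
    return max((term_sim(term, g) for g in group), default=0.0)

def group_sim(g1, g2):
    """Assumed rule: symmetric average of best-match scores in both directions."""
    if not g1 or not g2:
        return 0.0
    s12 = sum(term_to_group_sim(t, g2) for t in g1) / len(g1)
    s21 = sum(term_to_group_sim(t, g1) for t in g2) / len(g2)
    return (s12 + s21) / 2

def cosine_sim(d1, d2):
    """Plain VSM cosine over raw term counts; distinct terms never overlap."""
    c1, c2 = Counter(d1), Counter(d2)
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

doc_a = ["fast", "car"]
doc_b = ["quick", "automobile"]

print(cosine_sim(doc_a, doc_b))  # the two documents share no literal term
print(group_sim(doc_a, doc_b))   # every term has a synonym on the other side
```

Under these assumptions the cosine score treats the two documents as unrelated because they share no literal term, while the synonym-aware score treats them as equivalent. A graded term similarity in place of the binary one would be a natural variation.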
Source
Journal of Hebei University (Natural Science Edition) (《河北大学学报(自然科学版)》)
Indexed in: CAS; Peking University Core Journals (北大核心)
2017, No. 1, pp. 108-112 (5 pages)
Funding
Natural Science Foundation of Hebei Province (F2015201142)
Social Science Foundation of Hebei Province (HB15SH064)
Keywords
synonyms
term similarity
document similarity