Abstract
Computing the similarity between scientific and technical documents can help readers uncover additional scientific knowledge. However, the complex synonym relationships in such literature markedly reduce the accuracy of similarity measures. The problem is especially pronounced for biomedical documents, where domain-specific terminology often lowers accuracy. This paper therefore proposes a text similarity calculation method that combines TF-IDF with biomedical synonyms. The method first identifies biomedical terms and their synonym relationships and builds a synonym database. It then applies synonym weighting rules to adjust the TF-IDF weights so that they better reflect the features of the text. Finally, it computes the similarity between texts. Experiments show that the method improves the stability and accuracy of biomedical text similarity calculation and is more effective than traditional TF-IDF.
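The abstract does not spell out the synonym weight rule, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the simplest possible rule: every synonym is mapped to one canonical term before TF-IDF weights are computed, so that synonymous terms pool their weight. The synonym table, the whitespace tokenizer, the smoothed IDF formula, and all function names are hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical synonym table (an assumption, not taken from the paper):
# each variant maps to a single canonical term.
SYNONYMS = {
    "tumour": "tumor",
    "neoplasm": "tumor",
    "cardiac": "heart",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def normalize(tokens, synonyms):
    """Replace each token with its canonical synonym, if one is listed."""
    return [synonyms.get(tok, tok) for tok in tokens]


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document.

    Uses a smoothed IDF, log((1 + n) / (1 + df)) + 1, so that terms
    occurring in every document still receive a positive weight.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(text_a, text_b, synonyms=None):
    """TF-IDF cosine similarity, optionally after synonym normalization."""
    docs = [tokenize(text_a), tokenize(text_b)]
    if synonyms:
        docs = [normalize(doc, synonyms) for doc in docs]
    vec_a, vec_b = tfidf_vectors(docs)
    return cosine(vec_a, vec_b)


doc_a = "tumour growth in cardiac tissue"
doc_b = "neoplasm growth in heart tissue"
plain_score = similarity(doc_a, doc_b)
synonym_score = similarity(doc_a, doc_b, SYNONYMS)
print(plain_score, synonym_score)
```

In this toy pair the two sentences differ only in synonymous terms, so the synonym-aware score is higher than the plain TF-IDF score. A real system would draw the synonym table from a curated biomedical vocabulary and would likely use a more refined weighting rule than plain canonicalization.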
Authors
HAO Miao (郝淼)
TAN Hong (谭红)
ZHANG Chengmei (张成梅)
YU Jie (于杰)
HUANG Wei (黄伟)
Guizhou Academy of Testing and Analysis, Guiyang 550000, China
Source
Guizhou Science (《贵州科学》)
2019, No. 6, pp. 91-96 (6 pages)