Abstract
To address similarity calculation for texts that contain repeated characters, a new method is proposed for computing the similarity between two texts. First, the number of repeated characters is counted from character-by-character comparisons. Second, the interference of repeated characters is removed by analyzing the overall comparison results. The correct text similarity is then computed from a formula, and the method is extended to texts that mix single-byte and multi-byte characters. Finally, the algorithm is implemented in code for simulation analysis; several groups of test results show that the text similarity computed by this method agrees with the theoretical values.
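The abstract does not state the formula itself, so the following is only an illustrative sketch of the general idea, not the author's method. It assumes a Dice coefficient over character multisets as a stand-in: each character can match at most as many times as it occurs in both texts, which keeps repeated characters from inflating the score. The function name `char_similarity` is hypothetical.

```python
from collections import Counter


def char_similarity(a: str, b: str) -> float:
    """Dice coefficient over character multisets.

    Illustrative stand-in only; this is NOT the formula from the paper.
    """
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    count_a, count_b = Counter(a), Counter(b)
    # Counter & Counter keeps the minimum count per character, so a character
    # repeated in one text matches only as often as it occurs in the other.
    shared = sum((count_a & count_b).values())
    return 2 * shared / (len(a) + len(b))


if __name__ == "__main__":
    print(char_similarity("aab", "abb"))
    print(char_similarity("文本aa", "文本a"))
```

Python strings are sequences of Unicode code points, so a multi-byte character such as 文 counts as one character here, just like a single-byte letter. This sketch ignores character order; a method that compares characters position by position would give different scores.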
Author
Wang Yadong (汪亚东), School of Instrument and Electronics, North University of China, Taiyuan, Shanxi 030051, China
Source
Computer Era (《计算机时代》)
2023, No. 6, pp. 87-91 (5 pages)
Keywords
natural language processing
text similarity
repeated character
computing algorithm