Abstract
In the information age, document similarity detection has been widely and successfully applied in digital libraries, search engines, plagiarism checking for academic papers, and many other fields. However, detection methods based on word-frequency statistics have low accuracy, and methods based on string comparison cannot handle complex application scenarios. To address these problems, a large number of detection techniques based on similarity estimation have been developed in recent years. Among them, the shingle algorithm and the minwise hashing algorithm are relatively mature and stable. Starting from the shortcomings of the word-frequency and string-comparison methods, this paper summarizes the advantages of similarity-estimation methods, describes in detail the ideas, advantages, and subsequent developments of the shingle and minwise hashing algorithms, and highlights the open problems and future research directions of document similarity detection, including those of minwise hashing.
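To make the two algorithms named in the abstract concrete, the following is a minimal sketch of word-level shingling and a minwise-hashing estimate of the resemblance (Jaccard similarity) between two documents. It is not the authors' implementation. The function names, the shingle width `w=3`, the signature length `num_hashes=128`, and the use of seeded BLAKE2b hashes in place of true random permutations are all assumptions made for illustration.

```python
import hashlib
import re


def shingles(text, w=3):
    """Return the set of contiguous w-word shingles of a text (w is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a, b):
    """Exact resemblance |A & B| / |A | B| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(shingle, seed):
    """Seeded 64-bit hash; an assumed stand-in for a random permutation."""
    data = f"{seed}:{' '.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=128):
    """Keep the minimum hash value of the set under each seeded hash function."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [min(_seeded_hash(s, seed) for s in shingle_set)
            for seed in range(num_hashes)]


def estimate_resemblance(sig_a, sig_b):
    """Estimate resemblance as the fraction of signature positions that agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    doc_a = "document similarity detection is widely used in search engines"
    doc_b = "document similarity detection is widely used in digital libraries"
    set_a, set_b = shingles(doc_a), shingles(doc_b)
    print("exact resemblance:    ", jaccard(set_a, set_b))
    print("estimated resemblance:", estimate_resemblance(
        minhash_signature(set_a), minhash_signature(set_b)))
```

The estimate approaches the exact resemblance as `num_hashes` grows, while each document is represented by a fixed-length signature rather than its full shingle set, which is the property that makes similarity estimation attractive for large collections.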
Authors
Wang Yuning; Liu Xiaoxia; Zhou Shaojun (Department of Information Engineering, Sichuan Water Conservancy Vocational College, Chongzhou, Sichuan 611231)
Source
Electronic Test (《电子测试》)
2021, No. 14, pp. 40-42 (3 pages)
Funding
Supported by the Scientific Research Project of Sichuan Water Conservancy Vocational College (KY2020-30).
Keywords
Repetition Rate
Similarity
Estimation
Detection Algorithm