Abstract
[Purpose/Significance]: To detect similar texts in massive text collections more efficiently and accurately. [Method]: This paper studies and improves similar-document identification based on the Simhash algorithm, modifying how the Simhash signature value is computed. The ICTCLAS word segmentation system is used in the segmentation stage, and the weights of text feature words are computed with TF-IDF, with each feature word's part of speech, word length, status as a marker word, and presence in the title also taken into account as weighting factors. Finally, Hamming distance is used to compare document signature values so that similar documents can be accurately identified within massive document collections. [Conclusion]: With the improved TF-IDF weighting, the improved Simhash algorithm outperforms other algorithms in similar-document identification accuracy.
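The pipeline summarized in the abstract (weighted feature words, a Simhash signature, a Hamming-distance comparison) can be illustrated with a minimal Python sketch. It is not the paper's implementation, and everything specific in it is an assumption: tokens are taken as already segmented (ICTCLAS is not called), the 64-bit token hash is derived from MD5, the IDF uses a common smoothed form, and `title_boost` is a hypothetical stand-in for the paper's extra weighting factors (part of speech, word length, marker words, title membership), whose actual formula the abstract does not give.

```python
import hashlib
import math
from collections import Counter


def token_hash(token, bits=64):
    """Map a token to a `bits`-wide integer (MD5-based; an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def tfidf_weights(tokens, doc_freq, n_docs,
                  title_tokens=frozenset(), title_boost=2.0):
    """Smoothed TF-IDF per token; title words get a hypothetical boost."""
    counts = Counter(tokens)
    total = len(tokens)
    weights = {}
    for tok, count in counts.items():
        tf = count / total
        # Smoothed IDF, always positive when doc_freq <= n_docs.
        idf = math.log((n_docs + 1) / (doc_freq.get(tok, 0) + 1)) + 1.0
        weight = tf * idf
        if tok in title_tokens:
            weight *= title_boost  # hypothetical factor, not from the paper
        weights[tok] = weight
    return weights


def simhash(weights, bits=64):
    """Build a signature: add +w for each 1 bit, -w for each 0 bit, keep signs."""
    acc = [0.0] * bits
    for tok, w in weights.items():
        h = token_hash(tok, bits)
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    signature = 0
    for i, value in enumerate(acc):
        if value > 0:
            signature |= 1 << i
    return signature


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as similar when `hamming(simhash(w1), simhash(w2))` falls at or below a chosen threshold; the threshold value is a tuning choice that the abstract does not state.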
Source
Computer Science and Application (《计算机科学与应用》)
2020, No. 2, pp. 371-378 (8 pages)
Funding
National Natural Science Foundation of China (61272044, 61602019, 61801008); Beijing Natural Science Foundation (3182028).