期刊文献+

基于改进的Simhash算法的相似文档识别技术 被引量:3

Similar Document Recognition Technology Based on the Improved Simhash Algorithm
下载PDF
导出
摘要 [目的/意义]:为了实现在海量文本中更加高效准确检测出相似文本。[方法]:本文对基于Simhash算法的相似文档识别技术进行研究改进,对Simhash签名值的计算方法作出改进,分词阶段使用ICTCLAS分词系统,文本特征词的权重计算方法采用TF-IDF技术,同时将特征词的词性、词长、是否为标志词与是否被包含在标题中几大方面作为权重计算的考虑因素。最后使用汉明距离对文档签名值进行比较,从海量文档中精确地找出相似文档。[结论]:通过改进TF-IDF权重,使得改进的Simhash算法在相似文档识别准确率上优于其他算法。 [Purpose/Significance]: In order to achieve more efficient in mass text accurately detect the similar text. [Method]: This paper based on Simhash algorithm similar document identification technology improvement, research on Simhash signature value calculation method to make improvements, participle stage using ICTCLAS segmentation system, the text of key method to calculate the weights of the TF-IDF technology, at the same time, the key parts of speech, word length, whether marked word and are included in the title of several major aspects as weighting factor. Finally, the hamming distance is used to compare the document signature value, and the similar documents can be accurately found from the mass documents. [Conclusion]: By improving the TF-IDF weight, the improved Simhash algorithm is better than other algorithms in the recognition accuracy of similar documents.
机构地区 北京工业大学
出处 《计算机科学与应用》 2020年第2期371-378,共8页 Computer Science and Application
基金 国家自然科学基金(61272044,61602019,61801008),北京市自然科学基金(3182028).
  • 相关文献

参考文献11

二级参考文献103

  • 1苏金树,张博锋,徐昕.基于机器学习的文本分类技术研究进展[J].软件学报,2006,17(9):1848-1859. 被引量:388
  • 2熊文新,宋柔.信息检索用户查询语句的停用词过滤[J].计算机工程,2007,33(6):195-197. 被引量:16
  • 3Fung B C M,Wang K,Ester M.Hierarchical document clustering//Wang John ed.The Encyclopedia of Data Warehousing and Mining,idea Group.2005:970-975. 被引量:1
  • 4Salton G.The SMART Retrieval System-Experiments in Automatic Document Processing.Englewood Cliffs,New Jersey:Prentice Hall Inc,1971. 被引量:1
  • 5Wang Y,Julia H.Document clustering with semantic analysis//Proceedings of the 39th Hawaii International Conferences on System Sciences.Hawaii,US,2006:54-63. 被引量:1
  • 6Hotho A,Staab S,Stumme G.Wordnet improves text document clustering//Proceedings of the Semantic Web Workshop at SIGIR-2003,26th Annual International ACM SIGIR Conference.Toronto,Canada,2003:541-550. 被引量:1
  • 7Hall P,Dowling G.Approximate string matching.Computing Survey,1980,12(4):381-402. 被引量:1
  • 8Coelho T,Calado P,Souza L,Ribeiro-Neto B,Muntz R.Image retrieval using multiple evidence ranking.IEEETransactions on Knowledge and Data Engineering,2004,16(4):408-417. 被引量:1
  • 9Ko Y,Park J,Seo J.Improving text categorization using the importance of sentences.lnformation Processing and Management,2004,40(1):65-79. 被引量:1
  • 10Erkan G,Radev D.Lexrank:Graph-based lexical centrality as salience in text summarization.Journal of Artificial Intelligence Research,2004,22(7):457-479. 被引量:1

共引文献331

同被引文献13

引证文献3

二级引证文献2

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部