
基于语义指纹的中文文本快速去重 (Cited by: 5)

Fast Duplicate Detection for Chinese Texts Based on Semantic Fingerprint
Abstract: Oriented to Chinese texts, content features are first extracted and combined with the Simhash algorithm to generate a semantic fingerprint for each text; the Hamming distance between fingerprints is then used to judge the degree of similarity between texts. The Single-Pass clustering algorithm is integrated to cluster the semantic fingerprints rapidly, and the resulting fingerprint clusters constitute the final deduplication result, yielding a fast duplicate-detection pipeline for Chinese text. In the experiments, comparison with the Shingle algorithm shows the method's advantages in precision and robustness, while its speed also supports deduplication over large volumes of text.
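The pipeline the abstract describes (weighted features → Simhash fingerprint → Hamming-distance comparison → Single-Pass clustering) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the paper's feature extraction and weighting scheme are not given here, so the sketch assumes pre-extracted `(token, weight)` pairs, uses MD5 as the per-token hash, and uses a Hamming threshold of 3 bits (a common choice for 64-bit Simhash), all of which are assumptions.

```python
import hashlib


def simhash(features, bits=64):
    """Build a Simhash fingerprint from (token, weight) pairs.

    Each token is hashed to a `bits`-bit integer; for every bit position,
    the token's weight is added if the bit is 1, subtracted if it is 0.
    Positive sums become 1-bits in the final fingerprint.
    """
    v = [0] * bits
    mask = (1 << bits) - 1
    for token, weight in features:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & mask
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Hamming distance: number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def single_pass_cluster(fingerprints, threshold=3):
    """Single-Pass clustering over fingerprints.

    Each fingerprint joins the first existing cluster whose representative
    (its first member) lies within the Hamming threshold; otherwise it
    starts a new cluster. Each resulting cluster is one deduplicated group.
    """
    clusters = []  # list of (representative_fp, member_list)
    for fp in fingerprints:
        for rep, members in clusters:
            if hamming(rep, fp) <= threshold:
                members.append(fp)
                break
        else:
            clusters.append((fp, [fp]))
    return clusters
```

For example, two texts with identical weighted features produce identical fingerprints (Hamming distance 0) and fall into one cluster, while fingerprints differing in more than the threshold number of bits start separate clusters.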
Source: New Technology of Library and Information Service (《现代图书情报技术》, CSSCI, Peking University Core), 2013, No. 9, pp. 41-47 (7 pages).
Funding: National Natural Science Foundation of China project "Research on the Dynamic Evolution Patterns of Research Teams" (Grant No. 71273196).
Keywords: Semantic fingerprint; Simhash; Single-Pass; Duplicate detection
Related literature: References: 22 · Secondary references: 55 · Co-citing documents: 58 · Co-cited documents: 70 · Citing documents: 5 · Secondary citing documents: 20
