摘要
针对当前海量数据的相似重复问题,提出了MapReduce下通过SimHash算法检测相似文档的方法:即首先将存储在分布式文件系统的海量文档集进行分类,然后进行特征提取,由SimHash算法生成SimHash指纹和生成Sequence File;最后,计算相似度产生检测结果;通过实验测试可知,提出的检测方法和设计的相似度算法能很好适应海量数据相似检测,并能有效地提高工作效率。
For the question of similar duplication of big data,this paper offers an approach to find similar document by using SimHash algorithm and MapReduce.The approach consists of several steps.First,massive documents which stored in the DFS(Distribute File System) are classified; then,the characteristics of data are extracted and Simhash fingerprint and Sequence file are produced by SimHash algorithm; finally,detection result is generated through computing similarity.The experiments prove that the approach presented and similarity designed well suit near-duplicate detection for big data,can improve work efficiency greatly.
出处
《实验室研究与探索》
CAS
北大核心
2014年第9期132-136,共5页
Research and Exploration In Laboratory
基金
河南省科技攻关计划项目(132102210123)
河南省高等学校矿山信息化重点学科开放实验室项目
河南理工大学博士基金(B2009-21)