一种基于Q-sample的局部相似连接并行算法被引量：1

Q-sample-based Local Similarity Join Parallel Algorithm

下载PDF

导出

摘要局部相似连接能快速找出数据集间的局部相似记录对,是基因序列比对、剽窃检测和数据清洗等研究领域的基本操作。文中主要研究基于MapReduce框架的并行相似连接技术,提出了一种基于Q-sample的局部相似连接算法,解决了局部相似连接的定位问题。该算法采用了过滤验证二阶段模式:在过滤阶段,所提算法使用Q-sample分割方案拆分字符串集,在不丢失任何匹配的基础上生成了高质量的子串,抛弃了大量的无关字符串对;在验证阶段,所提算法优化了LS-Join算法的双向扩展验证方法,通过去除冗余匹配、合并连续匹配和合并非连续匹配等技术提高了算法的验证效率。通过实验对比了不同数据集和编辑距离参数下算法的性能表现,结果显示所提算法在大数据集上的局部相似连接速度快于当前的优秀算法LS-Join。理论分析和实验结果证明,所提算法的相关技术提高了局部相似的连接性能。 Local similarity join can finds all local similar pairs from sets quickly,which is a basic operation in many areas,such as gene sequence alignment,near duplicate detection,data cleaning and so on.This paper focused on designing similarity join parallel algorithm with MapReduce,and proposed a Q-sample-based algorithm to solve the locating problem of local similarity join.The proposed algorithm employs a filter-verify based framework.In filter stage,a Q-sample partition scheme is adopted to generate high-quality signatures without losing any true pairs,and then more dissimilar string pairs are discarded.In verify stage,the LS-Join’s backward-forward verification method is improved with the technique of removing redundant match,combining consecutive match and combining non-consecutive match.In the experiments,the performances of the proposed algorithm with different size of datasets or different value of edit distances are scaled.Experimental results show that the proposed algorithm outperforms the current excellent algorithm LS-Join on big dataset.Theoretical analysis and experimental result demonstrate that the performance of local similarity join is improved by using the techniques of the proposed algorithm.

作者王晓霞孙德才 WANG Xiao-xia;SUN De-cai(College of Information Science and Technology,Bohai University,Jinzhou,Liaoning 121013,China)

机构地区渤海大学信息科学与技术学院

出处《计算机科学》 CSCD 北大核心 2019年第12期38-44,共7页 Computer Science

基金教育部人文社会科学研究青年基金项目(15YJC870021) 国家自然科学基金青年基金项目(61602056) 国家社会科学基金项目(19BTQ028) 辽宁省自然科学基金(20170540015) 辽宁省社会科学基金(L18AXW001) 辽宁省教育厅科学研究项目(L2015010)资助

关键词相似连接 Q-sample MAPREDUCE 数据清洗大数据 Similarity join Q-sample MapReduce Data cleaning Big data

分类号 TP391 [自动化与计算机技术—计算机应用技术]

引文网络
相关文献

参考文献2

1Minghe YU,Guoliang LI,Dong DENG,Jianhua FENG.String similarity search and join： a survey[J].Frontiers of Computer Science,2016,10(3):399-417. 被引量：4
2庞俊,于戈,许嘉,谷峪.基于MapReduce框架的海量数据相似性连接研究进展[J].计算机科学,2015,42(1):1-5. 被引量：16

二级参考文献134

1李国杰.大数据研究的科学价值[J].中国计算机学会通讯,2012,8(9):8-15. 被引量：122
2CIO时代网.浅析大数据的特点[EB/OL]. (2012-05-08). [2013-5-20] .http://www.ciotimes.com /baike/62989.html. 被引量：1
3Viktor M S, Kenneth C. Big data: A Revolution that will trans-formhowwelive,workandthink[M].盛杨燕,周涛,译.杭州:浙江人民出版社,2013:9-10. 被引量：1
4Miller R. 'Digital universe' nears a zettabyte[EB/OL]. [2012- 11-24]. http://www, datacenterknowledge, corn/archives/ 2010/05/04/digital-universe-nears-a-zettabyte/. 被引量：1
5Broder A Z, Glassman S C, Manasse M S, et al. Syntactic cluste- ring of the web[J]. Computer Networks and ISCN Systems, 1997,29(8-13) : 1157-1166. 被引量：1
6Hoad T C, Zobel J. Methods for identifying versioned and plagi- arized doeuments[J]. Journal of the American Society for Infor- mation Science and Technology, 2003,54(3) :203-215. 被引量：1
7Kolb L,Thor A, Rahm E. Load balancing for MapReduce-based entity resolution [C]//Proc of ICDE. Washington: IEEE, 2012: 618-629. 被引量：1
8Cho J, Shivakumar N, Garcia-Molina H. Finding replicated web collections [C]//Proc of SIGMOD. New York: ACM,2000:355- 366. 被引量：1
9相似图片搜索引擎TinEyeEEB/OL].[2012-11-24].http://www.ipc.me/tineye.html. 被引量：1
10Dean J, Ghemawat S. MapReduce large clusters[C]//Proc of OSDI. simplified data processing on New York: ACM, 2004: 137- 150. 被引量：1