摘要
局部相似连接能快速找出数据集间的局部相似记录对,是基因序列比对、剽窃检测和数据清洗等研究领域的基本操作。文中主要研究基于MapReduce框架的并行相似连接技术,提出了一种基于Q-sample的局部相似连接算法,解决了局部相似连接的定位问题。该算法采用了过滤验证二阶段模式:在过滤阶段,所提算法使用Q-sample分割方案拆分字符串集,在不丢失任何匹配的基础上生成了高质量的子串,抛弃了大量的无关字符串对;在验证阶段,所提算法优化了LS-Join算法的双向扩展验证方法,通过去除冗余匹配、合并连续匹配和合并非连续匹配等技术提高了算法的验证效率。通过实验对比了不同数据集和编辑距离参数下算法的性能表现,结果显示所提算法在大数据集上的局部相似连接速度快于当前的优秀算法LS-Join。理论分析和实验结果证明,所提算法的相关技术提高了局部相似的连接性能。
Local similarity join can finds all local similar pairs from sets quickly,which is a basic operation in many areas,such as gene sequence alignment,near duplicate detection,data cleaning and so on.This paper focused on designing similarity join parallel algorithm with MapReduce,and proposed a Q-sample-based algorithm to solve the locating problem of local similarity join.The proposed algorithm employs a filter-verify based framework.In filter stage,a Q-sample partition scheme is adopted to generate high-quality signatures without losing any true pairs,and then more dissimilar string pairs are discarded.In verify stage,the LS-Join’s backward-forward verification method is improved with the technique of removing redundant match,combining consecutive match and combining non-consecutive match.In the experiments,the performances of the proposed algorithm with different size of datasets or different value of edit distances are scaled.Experimental results show that the proposed algorithm outperforms the current excellent algorithm LS-Join on big dataset.Theoretical analysis and experimental result demonstrate that the performance of local similarity join is improved by using the techniques of the proposed algorithm.
作者
王晓霞
孙德才
WANG Xiao-xia;SUN De-cai(College of Information Science and Technology,Bohai University,Jinzhou,Liaoning 121013,China)
出处
《计算机科学》
CSCD
北大核心
2019年第12期38-44,共7页
Computer Science
基金
教育部人文社会科学研究青年基金项目(15YJC870021)
国家自然科学基金青年基金项目(61602056)
国家社会科学基金项目(19BTQ028)
辽宁省自然科学基金(20170540015)
辽宁省社会科学基金(L18AXW001)
辽宁省教育厅科学研究项目(L2015010)资助