Double Layer Deduplication Algorithm Based on Fingerprint Extremum

Abstract: To improve the deduplication ratio, reduce the forced (hard) chunk boundaries produced by content-defined chunking (CDC), and balance deduplication ratio against performance, a double layer deduplication algorithm based on fingerprint extremum (DDFE) is proposed. The first-layer deduplication model uses a larger chunk size to keep the deduplication operation fast. The non-duplicate data left after the first layer is then fed into a second-layer deduplication model with a smaller chunk size to ensure deduplication accuracy. During chunking, a fingerprint-extremum chunking algorithm is applied within a tolerable range, which reduces the impact of forced chunking on deduplication. Experimental results over a variety of chunk-size combinations show that, compared with any traditional single-layer deduplication algorithm, DDFE better prevents forced chunking, balances performance and time, and eliminates more duplicate data among large numbers of small data blocks and frequently changing data.
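The abstract gives no implementation details, so the following Python sketch only illustrates the general idea. It rests on assumptions that are not taken from the paper: the chunk sizes, the CRC32 window fingerprint, the interpretation of "fingerprint extremum" as "cut at the position with the largest fingerprint when no regular boundary is found before the maximum chunk size", and the hypothetical names `chunk_extremum`, `dedup_layer`, and `two_layer_dedup`.

```python
import hashlib
import zlib


def chunk_extremum(data, avg, window=16):
    """Content-defined chunking with an extremum fallback (illustrative).

    Assumption: avg is a power of two; min/max sizes are avg/2 and avg*2.
    """
    min_size, max_size = avg // 2, avg * 2
    mask = avg - 1
    chunks, start = [], 0
    while start < len(data):
        limit = min(start + max_size, len(data))
        cut = limit          # default: end of data (short tail)
        best_fp = -1
        for pos in range(start + min_size, limit):
            # Fingerprint of the bytes just before the candidate cut point.
            fp = zlib.crc32(data[max(0, pos - window):pos])
            if (fp & mask) == mask:
                cut = pos    # regular content-defined boundary
                break
            # Only a full-length chunk would need a forced cut; remember
            # the position with the largest fingerprint as its replacement.
            if limit - start == max_size and fp > best_fp:
                best_fp, cut = fp, pos
        chunks.append(data[start:cut])
        start = cut
    return chunks


def dedup_layer(data, avg, index):
    """Return the chunks of data whose hashes are not yet in index."""
    unique = []
    for chunk in chunk_extremum(data, avg):
        key = hashlib.sha1(chunk).digest()
        if key not in index:
            index.add(key)
            unique.append(chunk)
    return unique


def two_layer_dedup(data, coarse_index, fine_index,
                    coarse_avg=4096, fine_avg=512):
    """Layer 1 filters with large chunks; layer 2 re-chunks the survivors."""
    stored = []
    for big in dedup_layer(data, coarse_avg, coarse_index):
        stored.extend(dedup_layer(big, fine_avg, fine_index))
    return stored
```

In this sketch, calling `two_layer_dedup` twice on the same input with the same two index sets stores nothing the second time, because every coarse chunk is already known. A slightly edited input passes only its changed coarse chunks to the finer second layer, which is the speed and precision trade-off the abstract describes.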

Authors: WANG Qing-song (王青松); GE Hui (葛慧) (College of Information, Liaoning University, Shenyang 110036, China)

Affiliation: College of Information, Liaoning University

Source: Journal of Liaoning University: Natural Sciences Edition (《辽宁大学学报（自然科学版）》), CAS, 2018, No. 3, pp. 201-207 (7 pages)

Funding: National Natural Science Foundation of China (61502215)

Keywords: deduplication; fingerprint extremum; backup system; Hadoop; data storage

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]

