Abstract
Data deduplication plays an important role in file incremental synchronization, cloud storage, and disaster recovery backup, and can greatly improve disk storage efficiency. Combining the advantages of existing file-level and block-level deduplication algorithms, and addressing the problem that content-defined chunking (CDC) tends to produce oversized blocks and therefore a large block size variance, this paper proposes DMix, a deduplication algorithm that integrates file-level and content-defined chunking. DMix adopts a two-stage duplicate detection and deletion method that works at both the file level and the block level. Building on the fast double-extremum chunking algorithm RDE, it further proposes RDEL, a content-defined chunking algorithm with a maximum block size threshold, which preserves RDE's good handling of low-entropy strings and its resistance to byte shifting while further reducing block size variance. Algorithm analysis and experimental results show that DMix and RDEL effectively improve deduplication efficiency and effectively reduce the block size variance of CDC algorithms.
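To make the abstract's two ideas more concrete, the following is a minimal Python sketch, not the paper's implementation: a content-defined chunker with a hard maximum chunk size (the cap that limits block size variance) and a two-stage check that skips duplicate files before looking for duplicate chunks. The boundary test is a simple rolling-hash mask check rather than the RDE/RDEL double-extremum rule, and all names and parameters (chunks_with_cap, dedup, MIN_CHUNK, MAX_CHUNK, MASK) are assumptions made for illustration only.

```python
# Hypothetical sketch of (1) two-stage deduplication and (2) content-defined
# chunking with a hard maximum chunk size; not the authors' DMix/RDEL code.
import hashlib

MIN_CHUNK = 2 * 1024    # assumed lower bound on chunk size
MAX_CHUNK = 16 * 1024   # assumed hard cap (the "maximum block threshold")
MASK = 0x1FFF           # assumed boundary mask (~8 KiB average chunks)

def chunks_with_cap(data: bytes):
    """Yield content-defined chunks, forcing a cut once MAX_CHUNK is reached."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF          # toy rolling hash, illustration only
        length = i - start + 1
        at_boundary = length >= MIN_CHUNK and (h & MASK) == 0
        if at_boundary or length >= MAX_CHUNK:      # the cap prevents oversized chunks
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]                          # remaining tail

def dedup(files, file_index=None, chunk_index=None):
    """Two-stage detection: skip identical files, then store only new chunks."""
    file_index = file_index if file_index is not None else set()
    chunk_index = chunk_index if chunk_index is not None else set()
    stored = 0
    for data in files:
        fid = hashlib.sha256(data).hexdigest()
        if fid in file_index:                       # stage 1: whole-file duplicate
            continue
        file_index.add(fid)
        for chunk in chunks_with_cap(data):         # stage 2: chunk-level duplicates
            cid = hashlib.sha256(chunk).hexdigest()
            if cid not in chunk_index:
                chunk_index.add(cid)
                stored += len(chunk)
    return stored

if __name__ == "__main__":
    # The second file is skipped at the file level; the third shares a long
    # prefix with the first, so most of its prefix chunks are deduplicated
    # and far fewer than len(a) + len(b) bytes end up stored.
    a = bytes(range(256)) * 200
    b = a[:30000] + b"tail" * 1000
    print(dedup([a, a, b]))
```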
Authors
ZHU Jianping, HUANG Heng, ZHOU Ji, CHEN Haimao, HUANG Lijun (Guangdong Changying Technology Co., Ltd., Maoming, Guangdong 525000)
Source
Software (《软件》), 2023, No. 12, pp. 53-59, 86 (8 pages)