Funding: Supported by the National Basic Research Program (973) of China (No. 2004CB318201), the National High-Tech Research and Development Program (863) of China (No. 2008AA01A402), and the National Natural Science Foundation of China (Nos. 60703046 and 60873028).
Abstract: Apart from high space efficiency, other demanding requirements for enterprise de-duplication backup are high performance, high scalability, and availability for large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead caused by constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches rely mainly on duplicate locality to avoid this disk bottleneck, and therefore degrade under workloads with poor duplicate locality. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve capacity, throughput, and scalability. Chunkfarm performs de-duplication backup using the hash join algorithm, which turns the notoriously random and small disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, achieving high write throughput regardless of workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm allows a cluster of servers to perform de-duplication backup in parallel; it is therefore well suited to distributed implementation and applicable to large-scale distributed storage systems.
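To make the hash-join idea concrete, the following is a minimal Python sketch of a generic partitioned hash join over chunk fingerprints. It is not Chunkfarm's actual code: the names (fingerprint, partition, hash_join, NUM_PARTITIONS) are illustrative assumptions, and a real system would spill each partition to disk as a sequential file rather than hold Python lists in memory.

# Sketch of batch fingerprint matching via a partitioned hash join.
# Assumption: fingerprints are bucketed by their first byte; each index
# partition is then scanned once, so lookup cost amortizes over the whole
# batch instead of paying one random disk seek per chunk.
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 16  # assumed small constant for the sketch

def fingerprint(chunk: bytes) -> bytes:
    """Content fingerprint of a chunk (SHA-1 is typical in dedup systems)."""
    return hashlib.sha1(chunk).digest()

def partition(fps):
    """Scatter fingerprints into buckets keyed by their first byte."""
    parts = defaultdict(list)
    for fp in fps:
        parts[fp[0] % NUM_PARTITIONS].append(fp)
    return parts

def hash_join(new_fps, index_fps):
    """Return the set of new fingerprints already present in the index.

    Each index partition is read exactly once (sequential I/O on disk),
    rather than issuing one random read per fingerprint lookup.
    """
    duplicates = set()
    new_parts, idx_parts = partition(new_fps), partition(index_fps)
    for p in range(NUM_PARTITIONS):
        build = set(idx_parts.get(p, []))    # build side held in memory
        for fp in new_parts.get(p, []):      # probe side streams through
            if fp in build:
                duplicates.add(fp)
    return duplicates

# Example: one duplicate chunk, one fresh chunk.
index = [fingerprint(b"chunk-%d" % i) for i in range(1000)]
batch = [fingerprint(b"chunk-7"), fingerprint(b"brand-new chunk")]
assert hash_join(batch, index) == {fingerprint(b"chunk-7")}

Because both sides of the join are processed partition by partition, the disk access pattern is a small number of large sequential reads and writes, which is the property the abstract credits for throughput that does not depend on duplicate locality.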