Abstract: Deduplication technology is increasingly used to reduce storage costs. Although it has been applied successfully to backup and archival systems, existing techniques are difficult to deploy in primary storage systems because of the latency cost of detecting duplicate data: every unit must be checked against a very large fingerprint index before it is written. In this paper we introduce Leach, a self-learning in-memory fingerprint cache for inline primary storage that reduces the write cost in deduplication systems. Leach is motivated by a characteristic of real-world I/O workloads: the access patterns of duplicate data are highly skewed. Leach adopts a splay tree to organize the on-disk fingerprint index, automatically learns the access patterns, and maintains hot working sets in cache memory, with the goal of servicing the majority of duplicate-detection requests. Leveraging the working set property, Leach provides optimizations that reduce the cost of splay operations on the fingerprint index and of cache updates. In comprehensive experiments on several real-world datasets, Leach outperforms the conventional LRU (least recently used) cache policy by reducing the number of cache misses, and significantly improves write performance without great impact on cache hits.
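The idea of organizing fingerprints in a splay tree so that frequently accessed entries drift toward the root can be illustrated with a minimal sketch. This is not Leach's actual implementation: the tree here lives entirely in memory rather than on disk, SHA-1 is an assumed fingerprint function, and the names `FingerprintIndex`, `lookup_or_insert`, and `hot_set` (which treats the top few tree levels as the cached working set) are hypothetical and chosen only for illustration.

```python
import hashlib


class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None


def rotate_right(x):
    y = x.left
    x.left, y.right = y.right, x
    return y


def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    return y


def splay(root, key):
    """Move `key` (or the last node on its search path) to the root."""
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:        # zig-zig
            root.left.left = splay(root.left.left, key)
            root = rotate_right(root)
        elif key > root.left.key:      # zig-zag
            root.left.right = splay(root.left.right, key)
            if root.left.right is not None:
                root.left = rotate_left(root.left)
        return root if root.left is None else rotate_right(root)
    if root.right is None:
        return root
    if key > root.right.key:           # zig-zig (mirror)
        root.right.right = splay(root.right.right, key)
        root = rotate_left(root)
    elif key < root.right.key:         # zig-zag (mirror)
        root.right.left = splay(root.right.left, key)
        if root.right.left is not None:
            root.right = rotate_right(root.right)
    return root if root.right is None else rotate_left(root)


def fingerprint(block: bytes) -> str:
    # Assumption: SHA-1 hex digest as the block fingerprint.
    return hashlib.sha1(block).hexdigest()


class FingerprintIndex:
    """Hypothetical in-memory stand-in for a splay-organized index."""

    def __init__(self):
        self.root = None

    def lookup_or_insert(self, fp: str) -> bool:
        """Return True if `fp` was already present (a duplicate write)."""
        self.root = splay(self.root, fp)
        if self.root is not None and self.root.key == fp:
            return True
        node = Node(fp)
        if self.root is not None:
            # Split the splayed tree around the new key.
            if fp < self.root.key:
                node.right, node.left = self.root, self.root.left
                self.root.left = None
            else:
                node.left, node.right = self.root, self.root.right
                self.root.right = None
        self.root = node
        return False

    def hot_set(self, depth: int) -> set:
        """Fingerprints in the top `depth` levels (assumed cache contents)."""
        out, level = set(), [self.root] if self.root else []
        for _ in range(depth):
            out.update(n.key for n in level)
            level = [c for n in level for c in (n.left, n.right) if c]
        return out


index = FingerprintIndex()
blocks = [b"block-%d" % i for i in range(50)]
first_pass = [index.lookup_or_insert(fingerprint(b)) for b in blocks]
hot = fingerprint(b"block-7")
hot_is_duplicate = index.lookup_or_insert(hot)
```

After the lookup of `hot`, that fingerprint sits at the root, so a cache holding only the top levels of the tree would serve a repeated access without consulting the rest of the index. Under a skewed workload, the same mechanism keeps the frequently duplicated fingerprints near the top.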