Fatman: Building Reliable Archival Storage Based on Low-Cost Volunteer Resources


Abstract: We present Fatman, an enterprise-scale archival storage system built on volunteer-contributed resources from underutilized web servers, usually deployed on thousands of nodes with spare storage capacity. Fatman is designed to improve the utilization of existing storage resources and to cut hardware purchase costs. The two main design concerns are maximizing the resource utilization of volunteer nodes without violating service level objectives (SLOs), and minimizing cost without reducing the availability of the archival system. Fatman has been widely deployed on tens of thousands of server nodes across several datacenters, providing more than 100 PB of storage capacity and serving dozens of internal mass-data applications. The system consolidates storage quota efficiently through strong isolation and budget limitation, supporting as much resource contribution as possible without degrading host-level SLOs. It also improves data reliability through fault-aware data management, which applies disk failure prediction to reduce failure recovery cost; on a real-life production workload, this reduces the mean time to repair (MTTR) by 76.3% and decreases the file crash ratio by 35%.

Authors: Qin An (覃安), Hu Dianming (胡殿明), Liu Jun (刘俊), Yang Wenjun (杨文君), Tan Dai (谭待)

Affiliation: Baidu Inc.

Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), 2015, Issue 2, pp. 273-282 (10 pages). Indexed in SCIE, EI, CSCD.

Keywords: volunteer storage, failure prediction, failure recovery, reliability, archival storage

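Classification: TP333 [Automation and Computer Technology: Computer System Architecture]; G276 [Automation and Computer Technology: Computer Science and Technology]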


