Funding: Supported by the National Natural Science Foundation of China (No. 60803017), the National High-Tech Research and Development (863) Program of China (Nos. 2006AA10Z237, 2007AA01Z179, and 2008AA01Z118), the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, and the FIT Foundation of Tsinghua University.
Abstract: Data streaming applications, usually composed of sequential/parallel data processing tasks organized as a workflow, bring new challenges to workflow scheduling and resource allocation in grid environments. Due to the high volumes of data and the relatively limited storage capacity, resource allocation and data streaming have to be storage aware. Also, to improve system performance, data streaming and processing have to be concurrent. This study uses a genetic algorithm (GA) for workflow scheduling, with on-line measurements and predictions based on a gray model (GM). On-demand data streaming, governed by repertory strategies, is used to avoid data overflow. Tests show that tasks with on-demand data streaming must be balanced to improve overall performance, avoid system bottlenecks and backlogs of intermediate data, and increase data throughput for the data processing workflows as a whole.
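The gray model the abstract refers to is commonly the GM(1,1) construction: accumulate the measured series, fit a first-order differential model by least squares, and difference the fitted response back to get forecasts. The sketch below is a minimal GM(1,1) predictor under that standard formulation; the function name and the sample transfer-rate series are illustrative, not taken from the paper.

```python
# Minimal GM(1,1) gray-model predictor, of the kind usable for on-line
# bandwidth/throughput prediction. Illustrative only.
import numpy as np

def gm11_predict(x0, steps=1):
    """Fit GM(1,1) to the series x0 and forecast `steps` future values."""
    x0 = np.asarray(x0, dtype=float)
    n = len(x0)
    x1 = np.cumsum(x0)                     # accumulated generating operation (AGO)
    z1 = 0.5 * (x1[1:] + x1[:-1])          # mean sequence of consecutive x1 terms
    B = np.column_stack((-z1, np.ones(n - 1)))
    Y = x0[1:]
    a, b = np.linalg.lstsq(B, Y, rcond=None)[0]  # least-squares estimate of [a, b]

    # Time response of dx1/dt + a*x1 = b, then difference back to the x0 scale.
    def x1_hat(k):                         # k is a 0-based index into the series
        return (x0[0] - b / a) * np.exp(-a * k) + b / a

    return [x1_hat(n + i) - x1_hat(n + i - 1) for i in range(steps)]

if __name__ == "__main__":
    # e.g., recent measured transfer rates (MB/s); predict the next two intervals
    measured = [42.0, 45.1, 47.9, 51.2, 54.8]
    print(gm11_predict(measured, steps=2))
```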
Abstract: As a typical erasure coding choice, Reed-Solomon (RS) codes pay for their high reliability and storage efficiency with a high repair cost, which makes them unsuitable for geo-distributed storage systems. In this paper we present a novel family of concurrent regeneration codes with local reconstruction (CRL). The CRL codes enjoy three benefits. First, they minimize the network bandwidth needed for node repair. Second, they reduce the number of accessed nodes by calculating parities from a subset of data chunks and using an implied parity chunk. Third, they reconstruct data faster than existing erasure codes in geo-distributed storage systems. In addition, we demonstrate how the CRL codes overcome the limitations of Reed-Solomon codes, and we show analytically that they achieve an excellent trade-off between chunk locality and minimum distance. Furthermore, we present theoretical latency and reliability analyses for the CRL codes. Through quantitative comparisons, we show that the data reconstruction time of CRL(6, 2, 2) (six data chunks, two global parities, and two local parities) is only 0.657x that of Azure LRC(6, 2, 2), and that of CRL(10, 4, 2) (10 data chunks, four local parities, and two global parities) is only 0.656x that of HDFS-Xorbas(10, 4, 2). Our performance evaluations in two kinds of environments show that: 1) in memory, CRL's encoding and decoding throughputs are at least 57.25% and 66.85% higher than those of its competitors, and 2) in JBOD (Just a Bunch Of Disks), its encoding and decoding throughputs are at least 1.46x and 1.21x those of its competitors. We also show that in a geo-distributed environment, CRL's encoding and decoding throughputs are 28.79% and 30.19% higher than LRC's.
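The reduction in accessed nodes comes from the local-reconstruction idea: a lost chunk is rebuilt from its small parity group rather than from all k data chunks. The sketch below illustrates that principle with plain XOR parities over chunk groups; it is a toy stand-in for intuition only, and the actual CRL construction (implied parity chunk, regeneration properties) is more involved than shown here.

```python
# Toy local-reconstruction layout: one XOR parity per group of data chunks,
# so repairing a chunk reads group_size chunks instead of all k.
from functools import reduce

CHUNK = 4  # bytes per chunk in this toy example

def xor(chunks):
    """Column-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def encode_lrc(data, group_size):
    """Split data chunks into groups and add one XOR parity per group."""
    groups = [data[i:i + group_size] for i in range(0, len(data), group_size)]
    return [(g, xor(g)) for g in groups]

def repair(group, parity, lost_index):
    """Rebuild one lost chunk from its own group only."""
    survivors = [c for i, c in enumerate(group) if i != lost_index]
    return xor(survivors + [parity])

if __name__ == "__main__":
    data = [bytes([i] * CHUNK) for i in range(6)]   # six data chunks
    stripes = encode_lrc(data, group_size=3)        # two local groups
    group, parity = stripes[0]
    assert repair(group, parity, lost_index=1) == data[1]
    print("recovered chunk 1 from 3 reads instead of 6")
```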
Abstract: Persistent memory (PMEM) combines the low-latency byte addressability of memory with the persistence of disks, and will bring revolutionary changes and far-reaching effects to existing software architectures. Distributed storage is widely used in cloud computing and data centers, yet existing backend storage engines, typified by Ceph BlueStore, were designed for traditional mechanical disks and solid state disks (SSDs), and their original optimization mechanisms cannot exploit the strengths of PMEM. This paper proposes MixStore, a backend storage engine based on persistent memory and SSDs. Using volatile-section marking and pending-deletion list techniques, MixStore implements a concurrent skiplist suited to persistent memory, which replaces RocksDB for metadata management; while guaranteeing transactional consistency, it eliminates the performance jitter caused by BlueStore's compaction and improves concurrent metadata access performance. Through a data-object storage design integrated with this metadata management mechanism, MixStore places small unaligned data objects in PMEM and stores large aligned data objects on the SSD, making full use of PMEM's byte addressability and persistence as well as the SSD's large capacity and low cost; combined with deferred-write and copy-on-write (CoW) techniques, it optimizes the data update strategy, eliminates the write amplification caused by BlueStore's WAL, and improves small-write performance. Test results show that on the same hardware, compared with BlueStore, MixStore improves write throughput by 59% and reduces write latency by 37%, effectively improving system performance.
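The core placement rule the abstract describes (small unaligned data to PMEM, aligned bulk data to SSD) can be sketched as below. The block size, class name, and the dictionaries standing in for PMEM and SSD are assumptions made for illustration; MixStore's real engine operates at Ceph's BlueStore object interface, which is not reproduced here.

```python
# Sketch of a size/alignment-based placement policy, assuming a 4 KiB block.
BLOCK = 4096  # assumed SSD alignment unit (bytes)

class MixStoreSketch:
    def __init__(self):
        self.pmem = {}   # stand-in for byte-addressable persistent memory
        self.ssd = {}    # stand-in for block-oriented SSD storage

    def put(self, key, data: bytes):
        tail = len(data) % BLOCK               # unaligned remainder
        aligned = len(data) - tail
        if aligned:                            # aligned bulk -> SSD blocks
            self.ssd[key] = data[:aligned]
        if tail:                               # small unaligned part -> PMEM,
            self.pmem[key] = data[aligned:]    # written once, no WAL double-write
        # A CoW update would write new blocks first, then atomically flip the
        # metadata pointer, instead of journaling the data twice.

    def get(self, key) -> bytes:
        return self.ssd.get(key, b"") + self.pmem.get(key, b"")

if __name__ == "__main__":
    store = MixStoreSketch()
    store.put("obj1", b"x" * (2 * BLOCK + 100))  # 2 aligned blocks + 100-byte tail
    assert len(store.ssd["obj1"]) == 2 * BLOCK and len(store.pmem["obj1"]) == 100
    print("aligned bytes on SSD, unaligned tail in PMEM")
```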