Abstract
Existing research on load-balancing scheduling for MapReduce considers neither the distribution of intermediate data nor the cost of network transfer, which leads to additional network traffic and reduced system efficiency. To address this problem, this paper proposes a data-locality-aware load-balancing strategy. Taking advantage of the new resource-management features introduced by YARN, the strategy collects statistics during the Map phase, while buffered in-memory data are spilled to local disk, to obtain the intermediate data distribution. It then schedules reduce tasks according to this distribution and the computing capacity of each node, reducing network overhead while keeping the load across nodes as balanced as possible. In addition, fine-grained partitioning and adaptive partition splitting are introduced to further improve the scheduling strategy under data skew. Comparative experiments show that the proposed strategy effectively improves performance and also reduces total network traffic overhead.
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal
2015, No. 10, pp. 50-56 (7 pages)
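Funding
National Natural Science Foundation of China (61373015, 61300052)
Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education of China (20103218110017)
Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD)
Fundamental Research Funds for the Central Universities (NP2013307, NZ2013306)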
Keywords
MapReduce
Data locality
Data skew
Load balancing