Abstract
MapReduce is one of the core models of Hadoop and is widely used in big data processing. It divides a computation into two stages, Map and Reduce. Because of its own partitioning mechanism, however, the Reduce stage can suffer from data skew, in which the load is unevenly distributed across Reduce tasks. To address data skew, this paper proposes using a discrete particle swarm optimization algorithm to balance the data load in the Reduce stage. Combining the data partitioning strategy with particle swarm optimization improves the stability of the system. An objective function that drives the partitions toward balance is defined, and the discrete particle swarm algorithm is used to solve it. Experimental results show that, for every number of Reduce tasks tested, the discrete particle swarm partitioning method had the shortest running time; it effectively resolves unbalanced data partitioning and greatly improves the computational efficiency of the system.
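The idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not state the paper's objective function or particle update rule, so everything below is an assumption made for illustration: the objective (largest reducer load minus the mean load), the position-copying update, and all names and parameters (`imbalance`, `discrete_pso_partition`, `c_personal`, `c_global`, `p_mutate`) are hypothetical and are not taken from the paper or from any Hadoop API.

```python
import random


def imbalance(assign, sizes, n_reducers):
    """Assumed objective: largest reducer load minus the mean load."""
    loads = [0] * n_reducers
    for key, reducer in enumerate(assign):
        loads[reducer] += sizes[key]
    return max(loads) - sum(sizes) / n_reducers


def discrete_pso_partition(sizes, n_reducers, n_particles=30, iters=200,
                           c_personal=0.3, c_global=0.4, p_mutate=0.05,
                           seed=0):
    """Assign each key group (with a known size) to a reducer index.

    A particle is a list where position k holds the reducer chosen for
    key group k. Instead of a real-valued velocity, each position is
    copied from the global best, copied from the particle's own best,
    randomly reassigned, or left unchanged (an assumed discrete update).
    """
    rng = random.Random(seed)
    n = len(sizes)
    swarm = [[rng.randrange(n_reducers) for _ in range(n)]
             for _ in range(n_particles)]
    pbest = [p[:] for p in swarm]
    pbest_cost = [imbalance(p, sizes, n_reducers) for p in swarm]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]

    for _ in range(iters):
        for i, p in enumerate(swarm):
            for k in range(n):
                u = rng.random()
                if u < c_global:
                    p[k] = gbest[k]                    # follow the swarm's best
                elif u < c_global + c_personal:
                    p[k] = pbest[i][k]                 # follow this particle's best
                elif u < c_global + c_personal + p_mutate:
                    p[k] = rng.randrange(n_reducers)   # random exploration
                # otherwise keep the current assignment
            cost = imbalance(p, sizes, n_reducers)
            if cost < pbest_cost[i]:
                pbest[i], pbest_cost[i] = p[:], cost
                if cost < gbest_cost:
                    gbest, gbest_cost = p[:], cost
    return gbest, gbest_cost


if __name__ == "__main__":
    # Hypothetical key-group sizes (for example, record counts per key).
    sizes = [50, 30, 20, 20, 10, 10, 5, 5]
    assign, cost = discrete_pso_partition(sizes, n_reducers=3)
    print(assign, cost)
```

In a real job the resulting key-to-reducer mapping would have to be supplied to the framework through a custom partitioner, and the key-group sizes would need to be estimated beforehand, for example by sampling the Map output; neither step is shown here.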
Authors
LI Anying (李安颖)
CHEN Qun (陈群)
SONG He (宋荷)
School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China
Source
Process Automation Instrumentation (《自动化仪表》)
CAS
2018, Issue 12, pp. 56-59 (4 pages)
Keywords
Distributed computing
Discrete particle swarm optimization algorithm
Data skew
Data balance
Partition