Abstract
To address the low efficiency and high cost of running applications on Hadoop clusters, this paper first analyzes Hadoop's distributed storage mechanism and parallel programming model, and finds that the main factors affecting application performance are whether a dataset is organized as a single file or as multiple files, and the size of the data blocks into which it is split. Experiments are then designed to examine how strongly the two kinds of datasets and different data block sizes affect application performance on clusters of different scales. The results show that, for a fixed data volume, as the data block grows, the change in the number of map tasks makes the large-file dataset increasingly more efficient to process than the small-file dataset. Moreover, for both kinds of datasets, the execution efficiency on a small cluster (1 Slave) is roughly twice that on a large cluster (10 Slaves); that is, the execution time with one Slave node is about half of that with 10 Slave nodes. Therefore, to improve application performance on a Hadoop cluster, one should reduce the number of map tasks, for example by increasing the data block size, rather than blindly enlarging the cluster. This conclusion can serve as a reference for optimizing application efficiency on Hadoop clusters.
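The relation between block size, file organization, and the number of map tasks can be illustrated with a minimal sketch. It assumes a simplified model in which each file is split independently and each split is one block, so a file yields ceil(size / block size) map tasks and every non-empty file yields at least one. The function name `estimate_map_tasks`, the 1024 MB total volume, and the 1 MB small-file size are hypothetical values chosen for illustration; they are not the paper's experimental settings, and the model ignores finer details of how Hadoop actually computes input splits.

```python
import math


def estimate_map_tasks(file_sizes_mb, block_size_mb):
    """Estimate map tasks under a simplified model (an assumption):
    splits never span files, and each split is one block."""
    return sum(
        math.ceil(size / block_size_mb)
        for size in file_sizes_mb
        if size > 0
    )


TOTAL_MB = 1024                  # hypothetical fixed data volume
single_file = [TOTAL_MB]         # one large file
many_files = [1] * TOTAL_MB      # 1024 small files of 1 MB each

for block_mb in (64, 128, 256):
    large = estimate_map_tasks(single_file, block_mb)
    small = estimate_map_tasks(many_files, block_mb)
    print(f"block={block_mb} MB: single file -> {large} tasks, "
          f"many files -> {small} tasks")
```

Under this model the single-file dataset drops from 16 to 8 to 4 map tasks as the block grows, while the many-small-files dataset stays at 1024 tasks, which is in line with the abstract's observation that larger blocks increasingly favor the large-file dataset.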
Authors
马生俊
陈旺虎
郭宏乐
乔保民
李新田
MA Sheng-jun, CHEN Wang-hu, GUO Hong-le, QIAO Bao-min, LI Xin-tian (College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journals
2018, No. 4, pp. 719-724 (6 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (Grant No. 61462076)