Abstract
With the advent of the "big data" era, big data processing platforms such as Hadoop have emerged. However, their storage layer, the Hadoop Distributed File System (HDFS), has a significant weakness when storing massive numbers of small files: doing so raises the load of the entire cluster and degrades its operating efficiency. The usual remedy for this small-file storage defect is to merge small files into large ones and store those instead, but previous approaches do not exploit the distribution of file sizes and therefore fail to further improve the merging effect. This paper proposes a small-file merging algorithm based on data block balancing, which optimizes the size distribution of the merged files and effectively reduces the number of HDFS data blocks, thereby cutting the memory consumption and load of the cluster's master node (the NameNode) and allowing data processing to run more efficiently.
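The paper states the algorithm only in prose here; the following is a minimal sketch of what a block-balancing merge step could look like, assuming it behaves like first-fit-decreasing bin packing of small files against the HDFS block size. The class and member names (BlockBalancedMerge, SmallFile, BLOCK_SIZE) and the 128 MB default block size are illustrative assumptions, not taken from the paper.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Illustrative sketch (not the paper's code): group small files into merge
 * bins whose total size stays within one HDFS block, so each merged file
 * occupies as few blocks as possible.
 */
public class BlockBalancedMerge {
    // Assumed default HDFS block size (128 MB); the paper may use another value.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    record SmallFile(String name, long size) {}

    /**
     * First-fit-decreasing packing: sort files by size descending, then place
     * each file into the first bin that still has enough remaining capacity.
     */
    static List<List<SmallFile>> pack(List<SmallFile> files) {
        List<List<SmallFile>> bins = new ArrayList<>();
        List<Long> remaining = new ArrayList<>();   // free capacity per bin
        files.sort(Comparator.comparingLong(SmallFile::size).reversed());
        for (SmallFile f : files) {
            int i = 0;
            while (i < bins.size() && remaining.get(i) < f.size()) i++;
            if (i == bins.size()) {                 // no bin fits: open a new one
                bins.add(new ArrayList<>());
                remaining.add(BLOCK_SIZE);
            }
            bins.get(i).add(f);
            remaining.set(i, remaining.get(i) - f.size());
        }
        return bins;
    }

    public static void main(String[] args) {
        List<SmallFile> files = new ArrayList<>(List.of(
                new SmallFile("a.log", 90L * 1024 * 1024),
                new SmallFile("b.log", 60L * 1024 * 1024),
                new SmallFile("c.log", 30L * 1024 * 1024)));
        // Prints two bins: [a.log, c.log] (120 MB) and [b.log] (60 MB).
        pack(files).forEach(System.out::println);
    }
}
```

Packing merged files up to, but not past, the block boundary keeps each merged file from straddling blocks unnecessarily, so the NameNode tracks one block per merged file rather than one metadata object per small file.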
Source
《智能计算机与应用》 (Intelligent Computer and Applications)
2015, No. 3, pp. 28-32, 36 (6 pages)