Abstract
The Hadoop Distributed File System (HDFS) is an open-source system that is widely used in storage services and offers high fault tolerance, easy scalability, and low-cost storage. However, HDFS relies on a single server, the NameNode, to manage metadata. When it handles massive numbers of small files, NameNode memory is consumed excessively and both storage and read performance suffer, which makes the NameNode the system bottleneck. This paper proposes an optimization mechanism based on Hadoop Archive (HAR) that improves the NameNode's memory utilization for metadata and the efficiency of accessing small files. The strategy also extends HAR so that additional files can be appended to an existing archive, and it adopts index prefetching to improve access efficiency. Experimental results show that the strategy improves on the ability of existing HAR to handle small files and on the efficiency of accessing massive numbers of small files.
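The two extensions named in the abstract, appending files to an existing archive and preloading the index, can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: it does not use the Hadoop API or the real HAR on-disk layout, and the names `SmallFileArchive`, `PreloadingReader`, `index_snapshot`, and `read_at` are hypothetical, not taken from the paper.

```python
import io


class SmallFileArchive:
    """Toy archive: one data blob plus an index of (offset, length) per file.

    Hypothetical stand-in for an archive's data file and index file;
    it is not the real HAR format.
    """

    def __init__(self):
        self._data = io.BytesIO()  # stands in for the archive's data file
        self._index = {}           # name -> (offset, length)

    def append(self, name, payload):
        """Add one small file to the existing archive without rebuilding it."""
        if name in self._index:
            raise KeyError(f"{name} is already archived")
        offset = self._data.seek(0, io.SEEK_END)  # write at the current end
        self._data.write(payload)
        self._index[name] = (offset, len(payload))

    def index_snapshot(self):
        """Return a copy of the whole index, as a client would when preloading."""
        return dict(self._index)

    def read_at(self, offset, length):
        """Read a byte range from the data blob."""
        self._data.seek(offset)
        return self._data.read(length)


class PreloadingReader:
    """Client that fetches the index once, then resolves every read locally."""

    def __init__(self, archive):
        self._archive = archive
        self._index = archive.index_snapshot()  # single index fetch up front
        self.index_fetches = 1

    def read(self, name):
        offset, length = self._index[name]  # no further index lookup needed
        return self._archive.read_at(offset, length)


archive = SmallFileArchive()
archive.append("a.txt", b"hello")
archive.append("b.txt", b"world!")

reader = PreloadingReader(archive)
print(reader.read("a.txt"), reader.read("b.txt"), reader.index_fetches)
```

In this model the many small files share one data blob, so only the archive itself would need a metadata entry, and a reader pays for the index once rather than on every access. A reader created before a later `append` would have to refresh its snapshot to see the new file.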
Source
《软件》
2015, Issue 2, pp. 107-111 (5 pages)
Software