Research on Small Files Optimized Storage Strategy in Hadoop System (cited by: 5)

Abstract: With the arrival of the big data era, big data processing platforms such as Hadoop have emerged. However, Hadoop's storage layer, the Hadoop Distributed File System (HDFS), has a serious weakness when storing massive numbers of small files: doing so raises the load on the whole cluster and lowers its operating efficiency. The usual remedy is to merge small files and store the resulting large files instead, but previous methods do not exploit the distribution of file sizes and therefore fail to improve the merging result further. This paper proposes a small-file merging algorithm based on data block balance. It optimizes the size distribution of the merged large files and effectively reduces the number of HDFS data blocks, thereby reducing memory consumption on the cluster's master node, lowering its load, and allowing data processing to run more efficiently.
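The abstract does not spell out the merging algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It substitutes a standard first-fit decreasing bin-packing heuristic to group small files into merged files that each fill, but do not exceed, one HDFS block. The names `MergedFile`, `pack_first_fit_decreasing`, and `blocks_needed`, and the 128 MB block size, are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumed block size for illustration; real clusters set this via dfs.blocksize.
BLOCK_SIZE = 128 * 1024 * 1024


@dataclass
class MergedFile:
    """Hypothetical container for the small files packed into one merged file."""
    capacity: int
    members: list = field(default_factory=list)
    used: int = 0

    def fits(self, size: int) -> bool:
        return self.used + size <= self.capacity

    def add(self, name: str, size: int) -> None:
        self.members.append(name)
        self.used += size


def pack_first_fit_decreasing(sizes: dict, block_size: int = BLOCK_SIZE) -> list:
    """Group small files so that each merged file fits within one block.

    First-fit decreasing is a stand-in heuristic, not the paper's algorithm.
    """
    merged = []
    # Place the largest files first; each goes into the first merged file with room.
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        if size <= 0 or size > block_size:
            raise ValueError(f"{name}: size must be in (0, block_size]")
        for candidate in merged:
            if candidate.fits(size):
                candidate.add(name, size)
                break
        else:
            new_file = MergedFile(capacity=block_size)
            new_file.add(name, size)
            merged.append(new_file)
    return merged


def blocks_needed(file_sizes, block_size: int = BLOCK_SIZE) -> int:
    """Count blocks if every file is stored separately (ceiling division per file)."""
    return sum(-(-size // block_size) for size in file_sizes)


if __name__ == "__main__":
    # Toy sizes with a block size of 100 units to keep the arithmetic readable.
    small_files = {"a": 60, "b": 50, "c": 40, "d": 30, "e": 20}
    result = pack_first_fit_decreasing(small_files, block_size=100)
    print("blocks before merging:", blocks_needed(small_files.values(), 100))
    print("blocks after merging:", blocks_needed([m.used for m in result], 100))
```

In this toy case, five separately stored files occupy five blocks, while the two merged files occupy two. Because the master node keeps metadata for every file and block in memory, fewer blocks means less memory pressure, which is the effect the abstract describes.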

Authors: DU Zhonghui (杜忠晖), HE Hui (何慧), WANG Xing (王星)

Affiliation: School of Computer Science and Technology, Harbin Institute of Technology

Source: Intelligent Computer and Applications (《智能计算机与应用》), 2015, No. 3, pp. 28-32, 36 (6 pages in total)

Keywords: HDFS; Storage of Small Files; Small Files Merge Algorithm

Classification: TP391.41 [Automation and Computer Technology: Computer Application Technology]


