Research on Optimizing Multiway Joins Based on MapReduce (cited by: 5)
Abstract: MapReduce is a parallel distributed computing model developed by Google that is widely used for searching and processing massive data sets. However, because of its one-to-one partitioning (shuffling) scheme, MapReduce must split a multiway join into a chain of sequential subtasks, which repeatedly checkpoints and shuffles intermediate results and incurs a large disk I/O overhead. To address this problem, the paper proposes a new scheme, the one-to-many partitioning strategy. Implementing it requires modifying the partition function interface of the MapReduce framework. The advantage of the improved strategy is that a single MapReduce job can complete a join over multiple data sets, which saves I/O overhead. Finally, the original and improved methods are compared on a purpose-built Hadoop platform. The experimental results show that the one-phase join is, in certain cases, markedly more efficient than the multiphase join used by standard MapReduce, so the scheme is feasible.
Authors: 王晓军 (Wang Xiaojun), 孙惠 (Sun Hui)
Source: Computer Technology and Development (《计算机技术与发展》), 2013, No. 6, pp. 59-62, 66 (5 pages)
Funding: National Key Technology R&D Program of China (国家科技支撑计划, 2007BAH17B04)
Keywords: MapReduce technology; multiway joins; shuffling (partitioning) strategy; Hadoop
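
The one-to-many partitioning idea summarized in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not give the partition function itself, so the code below uses a generic grid-replication scheme for a three-way join R(a, b) ⋈ S(b, c) ⋈ T(c, d). The names `one_to_many_partition` and `single_round_join`, the `rows` × `cols` reducer grid, and the relation schemas are all hypothetical, and plain Python stands in for Hadoop.

```python
from collections import defaultdict


def one_to_many_partition(tag, record, rows, cols):
    """Return every reducer cell (row, col) that must receive this record.

    A standard partitioner returns exactly one reducer per record.
    Here a record may be replicated to a whole row or column of the grid.
    """
    if tag == "R":  # R(a, b): b fixes the row, replicate across all columns
        _, b = record
        return [(hash(b) % rows, col) for col in range(cols)]
    if tag == "S":  # S(b, c): both join keys are known, so one cell suffices
        b, c = record
        return [(hash(b) % rows, hash(c) % cols)]
    if tag == "T":  # T(c, d): c fixes the column, replicate across all rows
        c, _ = record
        return [(row, hash(c) % cols) for row in range(rows)]
    raise ValueError(f"unknown relation tag: {tag}")


def single_round_join(R, S, T, rows=2, cols=2):
    """Simulate one map/shuffle/reduce round that joins R, S and T."""
    # Map + shuffle: route each tagged record to its reducer cell(s).
    buckets = defaultdict(lambda: {"R": [], "S": [], "T": []})
    for tag, relation in (("R", R), ("S", S), ("T", T)):
        for record in relation:
            for cell in one_to_many_partition(tag, record, rows, cols):
                buckets[cell][tag].append(record)

    # Reduce: each cell joins only the records it received.
    results = []
    for cell in buckets.values():
        s_by_b = defaultdict(list)
        for b, c in cell["S"]:
            s_by_b[b].append(c)
        t_by_c = defaultdict(list)
        for c, d in cell["T"]:
            t_by_c[c].append(d)
        for a, b in cell["R"]:
            for c in s_by_b.get(b, []):
                for d in t_by_c.get(c, []):
                    results.append((a, b, c, d))
    return sorted(results)


if __name__ == "__main__":
    R = [(1, 10), (2, 20), (3, 10)]
    S = [(10, 100), (20, 200), (30, 300)]
    T = [(100, "x"), (100, "y"), (200, "z")]
    for row in single_round_join(R, S, T):
        print(row)
```

Because every S record lands in exactly one cell, each joined tuple is produced by exactly one reducer, so no deduplication step is needed. The cost is that R and T records are copied to several reducers, which trades extra shuffle volume for the elimination of the intermediate join round.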
