Abstract
MapReduce is a parallel distributed computing model developed by Google that is widely used for search and large-scale data processing. However, its one-to-one partitioning (shuffling) scheme forces a multi-way join to be split into a chain of subtasks. Intermediate results are checkpointed and reshuffled between those subtasks, which incurs heavy disk I/O overhead. To address this problem, the paper proposes a new partitioning strategy, one-to-many partitioning; implementing it requires modifying the partition function interface of the MapReduce framework. The advantage of the improved strategy is that a single MapReduce job suffices to complete a join over multiple data sets, which saves I/O overhead. Finally, the original and the improved methods are compared on a purpose-built Hadoop platform. The experimental results show that the one-phase join is, in certain cases, clearly more efficient than the multi-phase join used by standard MapReduce, so the scheme is feasible.
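The idea of one-to-many partitioning can be illustrated with a small in-memory simulation. The sketch below is not the paper's implementation and does not use the Hadoop API: the function names `one_to_many_partition` and `single_round_join`, the reducer grid, and the three-relation join R(a, b), S(b, c), T(c, d) are all assumptions chosen for illustration. It only shows how a partition function that returns several reducers per record lets one shuffle-and-reduce round complete a multi-way join.

```python
from collections import defaultdict
from itertools import product


def one_to_many_partition(tag, record, rows, cols):
    """Return every reducer (row, col) that should receive this record.

    Hypothetical partition function: reducers form a rows x cols grid,
    the join key b selects the row and the join key c selects the column.
    """
    if tag == "R":  # R(a, b): b fixes the row; replicate across all columns
        _, b = record
        return [(hash(b) % rows, col) for col in range(cols)]
    if tag == "S":  # S(b, c): both keys known, so exactly one reducer
        b, c = record
        return [(hash(b) % rows, hash(c) % cols)]
    if tag == "T":  # T(c, d): c fixes the column; replicate across all rows
        c, _ = record
        return [(row, hash(c) % cols) for row in range(rows)]
    raise ValueError(f"unknown relation tag: {tag}")


def single_round_join(R, S, T, rows=2, cols=2):
    """Simulate one shuffle plus one reduce step for R join S join T."""
    # Shuffle: a record may be delivered to several reducers.
    buckets = defaultdict(lambda: {"R": [], "S": [], "T": []})
    for tag, relation in (("R", R), ("S", S), ("T", T)):
        for record in relation:
            for reducer in one_to_many_partition(tag, record, rows, cols):
                buckets[reducer][tag].append(record)

    # Reduce: each reducer joins only the records it received.
    joined = []
    for parts in buckets.values():
        for (a, b), (b2, c), (c2, d) in product(parts["R"], parts["S"], parts["T"]):
            if b == b2 and c == c2:
                joined.append((a, b, c, d))
    return sorted(joined)


R = [(1, 10), (2, 20)]
S = [(10, 100), (20, 200), (30, 300)]
T = [(100, "x"), (200, "y"), (200, "z")]
result = single_round_join(R, S, T)
```

Because each S record lands on exactly one reducer, every output tuple is produced once, and no intermediate result of R join S has to be written out and reshuffled. The price in this sketch is that R and T records are replicated across a row or column of reducers, so the benefit depends on the data sizes and the grid shape.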
Source
Computer Technology and Development (《计算机技术与发展》)
2013, No. 6, pp. 59-62, 66 (5 pages)
Funding
National Key Technology R&D Program of China (2007BAH17B04)