Abstract
With the rapid growth of large-scale data and the demand for high reliability, migrating local data to distributed databases has become necessary. To address this, the paper proposes a MapReduce-based "fast parallel import" technique that fully exploits the parallel computing capacity of the cluster and writes data directly into HFile, the underlying storage file format of HBase. This avoids the time spent in the upper-layer import path and reduces resource overhead, effectively addressing the limited functionality and low efficiency of importing data from a single-machine database into the HBase distributed database. A fast parallel import tool designed and implemented on top of this technique supports fast import of text data with multiple column families. Experimental results show that, compared with the conventional approach of importing data through the API, it more than doubles the import speed.
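The data flow behind writing store files directly can be illustrated with a minimal, dependency-free Python sketch. It is not the paper's tool and does not use HBase or MapReduce. It only mimics the three steps such a job would need: map each text line to cells, partition cells by region key range, and sort them per column family, since a store file holds sorted cells for one family of one region. The column mapping `COLUMNS`, the function names, and the sample split key are all hypothetical, and row keys are assumed to be ASCII so that Python string order matches byte order.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical mapping: field 0 is the row key, later fields map to family:qualifier.
COLUMNS = ["info:name", "info:age", "stats:score"]


def map_line(line):
    """Map step: turn one tab-separated line into (row, family, qualifier, value) cells."""
    fields = line.rstrip("\n").split("\t")
    row, values = fields[0], fields[1:]
    for column, value in zip(COLUMNS, values):
        family, qualifier = column.split(":", 1)
        yield (row, family, qualifier, value)


def region_of(row, split_keys):
    """Partition step: index of the region [start, end) that contains the row key."""
    return bisect_right(split_keys, row)


def plan_store_files(lines, split_keys):
    """Group cells per (region, family) and sort each group by row, then qualifier."""
    files = defaultdict(list)
    for line in lines:
        for row, family, qualifier, value in map_line(line):
            files[(region_of(row, split_keys), family)].append((row, qualifier, value))
    return {key: sorted(cells) for key, cells in files.items()}


if __name__ == "__main__":
    sample = ["r2\tBob\t31\t7.5", "a1\tAnn\t28\t9.0", "r1\tRaj\t40\t6.1"]
    for key, cells in sorted(plan_store_files(sample, ["m"]).items()):
        print(key, cells)
```

In a real cluster each (region, family) group would be produced by a separate reducer and written as a file that the region server adopts, which is where the parallelism and the savings over per-row API calls would come from.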
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
2015, No. 9, pp. 26-30 (5 pages)
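Funding
Key Science and Technology Research Project of the Education Department of Henan Province (12B520025)
Zhengzhou Science and Technology Research Project (20120473)
University-level research project (KYZR201006)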