Abstract
To address the difficulty and slowness of querying data in big data environments, where data volume keeps accumulating over time, this paper proposes a HiveSQL-based query optimization method that combines increased task parallelism with intermediate tables. The work uses the Hive database in the Hadoop ecosystem: SQL statements are written to extract the data needed for the experiments from Hive data at the petabyte (PB) scale. When the volume of queried data is very large, the number of query indicators is high, and the SQL statements are long, query time becomes excessive and query efficiency drops. The proposed method, which increases the parallelism of SQL tasks and builds intermediate tables, is designed to solve this problem. Experimental results show that the method reduces big data query time to 25% of the original and improves cluster utilization.
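The intermediate-table idea can be illustrated with a minimal sketch. It is not the authors' implementation: it uses Python's built-in `sqlite3` as a stand-in for Hive, and the table names (`events`, `tmp_user_daily`), columns, and indicators are hypothetical. The expensive aggregation over the large table runs once and is stored, and several indicators are then computed from the much smaller intermediate table. The task-parallelism half of the method is Hive-specific and is not shown; Hive does expose settings such as `hive.exec.parallel` and `hive.exec.parallel.thread.number`, but the abstract does not say which settings the paper uses.

```python
import sqlite3


def build_report(conn):
    """Compute several indicators from one shared intermediate table."""
    cur = conn.cursor()

    # Intermediate table (hypothetical name): aggregate the large fact
    # table once instead of repeating the aggregation in every query.
    cur.execute("DROP TABLE IF EXISTS tmp_user_daily")
    cur.execute(
        """
        CREATE TABLE tmp_user_daily AS
        SELECT user_id, day, COUNT(*) AS events, SUM(amount) AS total
        FROM events
        GROUP BY user_id, day
        """
    )

    # Each indicator reads the small intermediate table, not the raw data.
    active_users = cur.execute(
        "SELECT COUNT(DISTINCT user_id) FROM tmp_user_daily"
    ).fetchone()[0]
    revenue = cur.execute(
        "SELECT SUM(total) FROM tmp_user_daily"
    ).fetchone()[0]
    peak_day = cur.execute(
        """
        SELECT day FROM tmp_user_daily
        GROUP BY day
        ORDER BY SUM(events) DESC, day
        LIMIT 1
        """
    ).fetchone()[0]

    return {"active_users": active_users, "revenue": revenue, "peak_day": peak_day}


# Tiny in-memory data set standing in for the raw table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "2021-01-01", 10.0),
        (1, "2021-01-01", 5.0),
        (2, "2021-01-01", 7.5),
        (2, "2021-01-02", 2.5),
    ],
)
report = build_report(conn)
print(report)
```

In Hive the same pattern would typically be written as a `CREATE TABLE ... AS SELECT` statement followed by the indicator queries, so that a long multi-indicator statement is split into shorter ones that share a single scan of the raw data.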
Authors
Zheng Lingyi (郑灵逸)
Li Qing (李擎)
Department of Automation, Beijing Information Science and Technology University, Beijing 100192; Beijing Key Laboratory of High Dynamic Navigation Technology, Beijing 100192
Source
Modern Computer (《现代计算机》)
2021, No. 36, pp. 55-59 (5 pages)