Abstract
The existing CIMISS (China Integrated Meteorological Information Sharing System) provides seriously insufficient support for large-volume queries spanning long time series, multiple stations, and multiple meteorological elements. Using the monthly surface meteorological record reports of Guangxi stations, covering each station's full history since its establishment, together with the physical resources of an existing Hadoop cluster, this study redesigns the data ETL process, builds a Parquet-format dataset, and converts and stores it on HDFS. Spark broadcast variables are embedded and the Spark cluster's execution parameters are tuned, which improves the cluster's processing parallelism and the efficiency of Spark SQL join queries. The results show that the maximum compression ratio of the Parquet-format dataset exceeds 95%, that single large-volume queries are 1 to 5 times more efficient than before, and that high-concurrency access is supported, providing effective technical support for related forecasting and prediction services.
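The abstract does not show how the broadcast variable is used, so the following is only a minimal plain-Python sketch of the general idea behind a broadcast (map-side) join: a small lookup table is copied to every partition of the large table, and each partition is joined locally without redistributing the large table. The station identifiers, station names, sample values, and function names are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical small lookup table: station id -> station name.
STATIONS: Dict[str, str] = {"S001": "Station A", "S002": "Station B"}

# Hypothetical partitions of the large observation table:
# (station id, month, monthly mean temperature in degrees Celsius).
Row = Tuple[str, str, float]
PARTITIONS: List[List[Row]] = [
    [("S001", "2020-01", 13.2), ("S002", "2020-01", 8.1)],
    [("S001", "2020-02", 15.0), ("S999", "2020-02", 10.0)],
]


def join_partition(rows: List[Row], stations: Dict[str, str]) -> List[Tuple[str, str, float]]:
    """Inner-join one partition against the local copy of the lookup table."""
    return [
        (stations[station_id], month, temp)
        for station_id, month, temp in rows
        if station_id in stations  # rows with unknown stations are dropped
    ]


def broadcast_join(partitions: List[List[Row]], small_table: Dict[str, str]) -> List[Tuple[str, str, float]]:
    """Join every partition locally; the large table is never reshuffled."""
    # Stand-in for broadcasting: in a cluster, one read-only copy of the
    # small table would be shipped to each worker.
    broadcast_copy = dict(small_table)
    joined: List[Tuple[str, str, float]] = []
    for rows in partitions:
        joined.extend(join_partition(rows, broadcast_copy))
    return joined


JOINED = broadcast_join(PARTITIONS, STATIONS)
```

The design point is that only the small table is copied, while the large observation table stays where it is. This is why such a join can be cheaper than a join that repartitions both sides.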
Authors
黄志
苏传程
苏晓红
HUANG Zhi; SU Chuancheng; SU Xiaohong (Guangxi Meteorological Information Center, Nanning 530022)
Source
Meteorological Science and Technology (《气象科技》)
2022, No. 1, pp. 51-58 (8 pages)
Funding
Supported by the 2021 Guangxi Meteorological Scientific Research Program Directive Project (桂气科2021ZL02).