Abstract
Big data components are numerous and varied, and choosing suitable components and ways of invoking them can greatly reduce the cost of using a big data platform. A big data processing platform based on SQL templates lets business users choose among the platform's underlying computing engines and complete their data analysis knowing only SQL. The platform adopts Hive, SparkSQL, and Presto, big data components that all parse SQL with ANTLR, as the underlying computing engines for batch processing and ad hoc queries. ANTLR is also used to perform secondary parsing and customized development on SQL statements, which solves the problem of data permissions for business users. From top to bottom, the platform architecture consists of a data ingestion (ETL) layer, a SQL parsing and routing layer, and an underlying computing engine and distributed storage layer. Airflow is used for job scheduling, and SQL statement templates are used to carry out data ingestion, data quality monitoring, and business-side data analysis. The platform greatly lowers the technical cost for business users and reduces the complexity of building a big data platform and of secondary development.
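The flow the abstract describes (fill in a SQL template, check data permissions, route the statement to an engine) might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the template text, the permission map, and the routing rule are all hypothetical, and a simple regular expression stands in for the ANTLR-based secondary parsing used by the platform.

```python
import re
from string import Template

# Hypothetical SQL template; table and column names are illustrative only.
DAILY_COUNT = Template("SELECT COUNT(*) FROM $table WHERE dt = '$dt'")

# Hypothetical permission map: user -> tables that user may read.
ALLOWED_TABLES = {"analyst": {"sales.orders"}}

# Crude stand-in for an ANTLR parse: capture the name after FROM or JOIN.
# A real grammar-based parser would also handle subqueries, CTEs, quoting, etc.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def referenced_tables(sql: str) -> set:
    """Return the lower-cased table names found after FROM/JOIN."""
    return {name.lower() for name in TABLE_RE.findall(sql)}


def check_permission(user: str, sql: str) -> bool:
    """True if every referenced table is in the user's allowed set."""
    return referenced_tables(sql) <= ALLOWED_TABLES.get(user, set())


def route(ad_hoc: bool, batch_engine: str = "hive") -> str:
    """Pick an engine name. Assumed rule: ad hoc queries go to Presto,
    everything else goes to the configured batch engine."""
    return "presto" if ad_hoc else batch_engine


sql = DAILY_COUNT.substitute(table="sales.orders", dt="2022-01-01")
if check_permission("analyst", sql):
    print(route(ad_hoc=True), "<-", sql)
```

Keeping the permission check separate from routing means the same check applies whichever engine eventually runs the statement, which is one plausible reason to parse the SQL once before dispatch.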
Authors
曾姣艳
高宋俤
曾美艳
ZENG Jiao-yan; GAO Song-di; ZENG Mei-yan (School of Technology, Fuzhou University of International Studies and Trade, Fuzhou 350003, Fujian Province; Fuzhou Wulimiao Information Technology Co., Ltd., Fuzhou 350003, Fujian Province; Chenzhou Vocational Technical College, Chenzhou 423000, Hunan Province)
Source
《沈阳工程学院学报(自然科学版)》
2022, Issue 2, pp. 90-96 (7 pages)
Journal of Shenyang Institute of Engineering: Natural Science