Abstract
Among the many power grid data processing applications, data quality monitoring is one of the most important. As the scale of power grid data and the number and complexity of data quality checking rules keep growing, existing data quality checking systems built on traditional RDBMSs and computing platforms have hit a serious processing bottleneck: they cannot complete data quality monitoring and checking in time, they are hard to scale, and they increasingly fail to meet the needs of day-to-day production management and operational decision making. Big data technology offers suitable technical means and supporting platforms for power grid big data processing. In this paper, we therefore propose a big-data-based solution for power grid data quality checking. We study and design Hadoop-based techniques for distributed data storage management and parallel execution of checking rules. Taking typical batch and incremental data quality checking scenarios as the basis for a verification study, we design and implement an index storage mechanism for data checking that builds a fast index over the attributes referenced by the checking rules, and we further design and implement parallel algorithms for executing checking rules based on HBase and MapReduce, which significantly improve checking performance. Building on these techniques, we implement a prototype system using experimental data sets and checking rules. The experimental results indicate that the proposed techniques effectively improve data quality checking performance, can meet real-time and near-real-time power grid data quality checking requirements, and provide a system solution with good scalability.
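The abstract names two ideas: an index over the attributes that checking rules read, and parallel execution of the rules in a map/reduce style. The sketch below illustrates that pattern in plain Python only; it is not the paper's HBase or MapReduce implementation. The record layout, the rule names, and the functions `build_attribute_index`, `map_rule`, `reduce_violations`, and `check` are all hypothetical, and a thread pool stands in for cluster-level parallelism.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical records: row key -> attribute map (names are illustrative only).
RECORDS = {
    "m001": {"voltage": 220.0, "meter_id": "A1"},
    "m002": {"voltage": -5.0, "meter_id": "A2"},
    "m003": {"voltage": 230.0, "meter_id": ""},
    "m004": {"meter_id": "A4"},
}

# Hypothetical rules: rule name -> (attribute it reads, predicate that must hold).
RULES = {
    "voltage_non_negative": ("voltage", lambda v: v >= 0),
    "meter_id_present": ("meter_id", lambda v: v != ""),
}


def build_attribute_index(records, attributes):
    """Index only rule-related attributes: attribute -> [(row key, value)]."""
    index = defaultdict(list)
    for key, row in records.items():
        for attr in attributes:
            if attr in row:
                index[attr].append((key, row[attr]))
    return dict(index)


def map_rule(rule_name, rule, index):
    """Map step: emit (row key, rule name) for each indexed value that fails."""
    attr, predicate = rule
    return [(key, rule_name) for key, value in index.get(attr, []) if not predicate(value)]


def reduce_violations(pairs):
    """Reduce step: group the violated rule names by row key."""
    grouped = defaultdict(list)
    for key, rule_name in pairs:
        grouped[key].append(rule_name)
    return {key: sorted(names) for key, names in grouped.items()}


def check(records, rules, workers=2):
    """Build the index once, run every rule in parallel, then merge the results."""
    index = build_attribute_index(records, {attr for attr, _ in rules.values()})
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(lambda item: map_rule(item[0], item[1], index), rules.items())
        pairs = [pair for part in mapped for pair in part]
    return reduce_violations(pairs)


if __name__ == "__main__":
    print(check(RECORDS, RULES))
```

Each rule scans only the index entries for the attribute it reads rather than whole records, which is the intuition behind indexing rule-related attributes. Rows that lack an attribute are simply absent from that attribute's index entry, so a missing-value check would need its own rule in this sketch.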
Source
Journal of Computer Research and Development (《计算机研究与发展》)
EI
CSCD
PKU Core (北大核心)
2014, Issue S2, pp. 134-144 (11 pages)
Keywords
power grid big data
data quality
checking rules
indexing
parallel algorithm