Abstract
Functional dependency discovery is widely used in distributed big data analysis and is an important tool for data cleaning, quality assessment, and semantic analysis. Existing functional dependency discovery algorithms mainly target centralized data and are not suitable for cloud computing data distributed across different nodes. Gathering the distributed data onto a single node is time-consuming, while applying a traditional centralized method to the data on each node separately can produce incorrect results. Existing distributed algorithms also suffer from excessive memory consumption. This paper therefore proposes a fast, low-memory distributed functional dependency discovery algorithm built on Spark, a cloud computing data processing platform. The algorithm introduces several distributed task allocation strategies and a deduplication strategy for the elements of maximal equivalence classes based on identifier-set consistency. While preserving correctness, it reduces the number of set intersection operations and speeds up processing. Experimental results show that, compared with the traditional centralized algorithm, the proposed distributed algorithm reduces average execution time by about 50% in the authors' experimental environment, and the deduplication strategy reduces execution time by a further 30% or so. Compared with an existing distributed functional dependency discovery algorithm, it saves about 75% of memory on some instances.
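The claim that processing each node separately can give wrong results can be illustrated with a minimal sketch. This is not the paper's algorithm: the helper `holds`, the attribute names, and the two toy partitions are hypothetical, chosen only to show that a dependency can hold on every node locally and still fail on the combined data.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds on rows."""
    seen = {}  # maps each LHS value combination to the first RHS value observed
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False  # same LHS values, different RHS value: FD violated
    return True

# Hypothetical data split across two nodes (assumption, not from the paper).
node_1 = [{"zip": "210094", "city": "Nanjing"}, {"zip": "100000", "city": "Beijing"}]
node_2 = [{"zip": "210094", "city": "Nanking"}, {"zip": "200000", "city": "Shanghai"}]

# Checking zip -> city on each node in isolation, then on the union.
local_results = [holds(node_1, ["zip"], "city"), holds(node_2, ["zip"], "city")]
global_result = holds(node_1 + node_2, ["zip"], "city")

print(local_results, global_result)  # [True, True] False
```

Both nodes accept `zip -> city` locally, yet the two rows for zip 210094 disagree once the data is combined, which is why per-node results cannot simply be merged without cross-node coordination.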
Authors
朱星宇
蔡志成
刘段
徐建
李小平
ZHU Xing-yu; CAI Zhi-cheng; LIU Duan; XU Jian; LI Xiao-ping (School of Computer, Nanjing University of Science and Technology, Nanjing 210094, China; School of Computer, Southeast University, Nanjing 211102, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journal
2020, Issue 8, pp. 1569-1575 (7 pages)
Journal of Chinese Computer Systems
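Funding
Supported by the National Natural Science Foundation of China (61602243, 61972202, 61872186)
Supported by the Natural Science Foundation of Jiangsu Province (BK20160846)
Supported by the Fundamental Research Funds for the Central Universities (30919011235, 30920120180101).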