Abstract
As data volume grows, the advantage of the FP-Growth algorithm's data-compression approach becomes apparent. The PFP-Growth algorithm, built on the MapReduce framework, parallelizes FP-Growth on the Hadoop platform. However, MapReduce writes the intermediate results of every job to disk, which reduces the algorithm's efficiency. To improve the efficiency of association mining, the algorithm is improved on the Spark platform using the idea of balanced grouping; in addition, when a prefix is very long, the shared prefix is split. The resulting IPFP-Growth algorithm is implemented on Spark in four steps. Experimental results show that the optimized algorithm on Spark outperforms the PFP-Growth algorithm.
Source
Modern Electronics Technique (《现代电子技术》)
Peking University Core Journal (北大核心)
2016, Issue 8, pp. 9-13 (5 pages)
Funding
Jiangsu Province 973 Program (BK2011022)
Key Program of the National Natural Science Foundation of China (612724420)