Abstract
Traditional parallel association rule algorithms based on frequent pattern growth must compress the entire updated dataset into a frequent pattern tree whenever the dataset changes, which consumes considerable time and storage space. They also fail to account for the unequal load across groups that arises when the header table is partitioned. To address the incremental updating problem caused by rapidly growing data in practical association rule mining, we propose an improved incremental updating algorithm for parallel association rules, built on the parallel frequent pattern growth (PFP-tree) algorithm and the Spark distributed parallel processing framework. During an incremental update, the algorithm reuses existing mining results to construct frequent pattern trees for the newly added data, which reduces mining time and storage space. An improved header-table grouping strategy balances the load across the parallel mining nodes. Experimental analysis shows that, compared with traditional incremental updating algorithms for association rules, the proposed algorithm is feasible, offers high mining efficiency and scalability, and is suitable for dynamically growing big data environments.
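The abstract does not spell out how the header table is regrouped, so the following is only a minimal sketch of one common way to balance group load, not the authors' strategy. It assumes each header-table item comes with an estimated mining cost, and assigns items greedily, heaviest first, to whichever group is currently lightest. The names `balanced_groups`, `group_loads`, and `item_loads`, and the cost values in the usage note, are hypothetical.

```python
import heapq


def balanced_groups(item_loads, num_groups):
    """Greedily assign items to groups so estimated loads stay roughly equal.

    item_loads: dict mapping item -> estimated mining cost (an assumption;
    the source does not say how load is estimated).
    """
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    # Min-heap of (current_load, group_index): the lightest group is on top.
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]
    # Heaviest items first; ties broken by item name so results are deterministic.
    for item, load in sorted(item_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        current, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (current + load, g))
    return groups


def group_loads(groups, item_loads):
    """Return the total estimated load of each group."""
    return [sum(item_loads[item] for item in group) for group in groups]
```

For instance, calling `balanced_groups({"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}, 2)` splits five items across two groups, and `group_loads` can then be used to check how far apart the two totals are. In a Spark setting, each resulting group would correspond to one set of conditional transactions mined by a single worker.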
Authors
WANG Cheng (王诚)
ZHAO Shen-yi (赵申屹)
School of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
Source
Computer Technology and Development (《计算机技术与发展》)
2018, No. 7, pp. 48-52 (5 pages)
Funding
Natural Science Foundation of Jiangsu Province, Youth Fund (BK20150861)