Abstract
To address the low efficiency and limited scalability of frequent itemset mining (FIM) on big data, a new distributed FIM algorithm that runs on the MapReduce framework is proposed. First, the algorithm uses a prefix sequence tree to construct the candidate subsets, which avoids costly scans of the transaction database. Next, it generates frequent itemsets in a breadth-wise, support-based manner: each MapReduce iteration prunes the infrequent itemsets, which significantly reduces memory consumption and the iteration time of each MapReduce job. Finally, the algorithm is compared experimentally with other algorithms under different transaction scales and support thresholds. The results show that the proposed sequence-growth algorithm achieves good efficiency and scalability, especially on large datasets and long itemsets.
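The abstract gives no pseudocode, so the following is only a minimal single-process sketch of the general idea it describes: level-wise growth of sorted prefixes, with a map step that emits extended candidates and a reduce step that counts support and prunes infrequent ones. It is not the authors' implementation. The function names (map_extend, reduce_count, mine) and the use of an absolute support count as the threshold are assumptions made for illustration, and the MapReduce phases are simulated in memory rather than run on Hadoop.

```python
from collections import Counter
from itertools import chain


def map_extend(transaction, prefixes):
    """Map step (simulated): for each frequent prefix contained in the
    transaction, emit the prefix extended by one item that sorts after
    the prefix's last item."""
    items = sorted(set(transaction))
    item_set = set(items)
    for prefix in prefixes:
        if not set(prefix) <= item_set:
            continue  # this transaction does not contain the prefix
        last = prefix[-1] if prefix else None
        for item in items:
            if last is None or item > last:
                yield prefix + (item,)


def reduce_count(candidates, min_support):
    """Reduce step (simulated): count each candidate and keep only those
    whose support reaches min_support; the rest are pruned."""
    counts = Counter(candidates)
    return {c: n for c, n in counts.items() if n >= min_support}


def mine(transactions, min_support):
    """Run one simulated map/reduce round per itemset length until no
    frequent itemset survives. min_support is an absolute count."""
    frequent = {}
    prefixes = [()]  # the empty prefix seeds the 1-itemsets
    while prefixes:
        emitted = chain.from_iterable(
            map_extend(t, prefixes) for t in transactions
        )
        level = reduce_count(emitted, min_support)
        frequent.update(level)
        prefixes = sorted(level)  # only frequent itemsets are grown further
    return frequent


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
    for itemset, support in sorted(mine(data, min_support=3).items()):
        print(itemset, support)
```

Because each itemset is generated only from its own sorted prefix, a transaction emits a given candidate at most once, and pruning a prefix removes every longer itemset that would have grown from it.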
Authors
黄彩娟
刘卓华
所辉
杨滨
HUANG Cai-juan; LIU Zhuo-hua; SUO Hui; YANG Bin (School of Computer and Design, Guangdong Mechanical & Electrical Polytechnic, Guangzhou 510515, China; School of Design, Jiangnan University, Wuxi 214122, China)
Source
《控制工程》
CSCD
PKU Core Journal (北大核心)
2019, No. 11, pp. 2136-2140 (5 pages)
Control Engineering of China
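Funding
Supported by the Guangdong Province Training Program for Outstanding Young Teachers in Higher Education Institutions (Yq2013171)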