期刊文献+

FSMBUS:一种基于Spark的大规模频繁子图挖掘算法 被引量:21

FSMBUS:A Frequent Subgraph Mining Algorithm in Single Large-Scale Graph Using Spark
下载PDF
导出
摘要 随着社交网络用户数的快速增加,大规模单图上频繁子图挖掘的需求越来越强烈.单机算法对大规模图的运行效率较低,难以支撑支持度较低的频繁子图的挖掘;现有的分布式环境下单图的频繁子图挖掘算法不支持子图增长模式的挖掘,它们所使用的Hadoop框架也不适合运行迭代式算法.提出了一种基于Spark的大规模单图频繁子图挖掘算法FSMBUS,通过次优树构建并行计算的候选子图,在给定最小支持度时挖掘出所有的频繁子图,并利用非频繁检测和搜索顺序选择实现优化,还设计了一种名为Sorted-Greedy的轻量级数据划分方法.实验结果表明,FSMBUS的效率要比现有单图上最新的算法快一个数量级,并支持更低最小支持度阈值以及更大规模图数据的挖掘,同时FSMBUS比其Hadoop的移植版要快2~4倍. Mining frequent subgraphs in a single large-scale graph is of huge demand with the rapid growth of the social networking. However, it is inefficient for the serial algorithms to mine frequent subgraphs in low support when mining for a single large-scale graph. Meanwhile, few existing distributed algorithms can't support the growth pattern mining, and the Hadoop framework they worked is not suitable for iterative running. In this paper, a distributed algorithm named FSMBUS for mining frequent subgraph in a single large-scale graph under Spark framework is proposed. It constructs the parallel computing candidate subgraphs by suboptimal CAM Tree, which returns all the frequent subgraphs for given user-defined minimum support. Additionally, infrequent patterns' test and searching order chosen is introduced to optimize the algorithm. Sorted-Greedy method is designed for data partition to balance the workload. Our experiments show that FSMBUS runs faster and more effective than the existing algorithms with real datasets, and even can run with the lower support threshold and the larger graph datasets as well. At the same time, FSMBUS runs 2~4 times faster on Spark framework than that on Hadoop framework.
出处 《计算机研究与发展》 EI CSCD 北大核心 2015年第8期1768-1783,共16页 Journal of Computer Research and Development
基金 国家自然科学基金项目(61170006 61202007) 宁波市自然科学基金项目(2013A610063 2013A610110)
关键词 频繁子图 大规模单图 分布式挖掘 SPARK 负载均衡 frequent subgraph single large-scale graph distribute mining Spark workload balance
  • 相关文献

参考文献28

  • 1Borgelt C, Berthold M R, Patterson D E. Molecular fragment mining for drug discovery [G] //Symbolic and Quantitative Approaches to Reasoning with Uncertainty. Berlin: Springer, 2005 : 1002-1013. 被引量:1
  • 2王桂娟,印鉴,詹卫许.一种新的基于嵌入集的图分类方法[J].计算机研究与发展,2012,49(11):2311-2319. 被引量:5
  • 3Guralnik V, Karypis G. A scalable algorithm for clustering sequential data [C] //Proc of the 1st IEEE Int Conf on Data Mining. Piscataway, NJ: IEEE, 2001:179-186. 被引量:1
  • 4Yan X, Yu P S, Han J. Graph indexing: A frequent structure-based approach [C] //Proc of the 17th ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2004: 335-346. 被引量:1
  • 5Liu Y, Jiang X, Chen H, et al. Mapreduce-based pattern finding algorithm applied in motif detection for prescription compatibility network [G] //Advanced Parallel Processing Technologies. Berlin: Springer, 2009: 341-355. 被引量:1
  • 6Shahrivari S, Jalili S. Distributed discovery of /requent subgraphs of a network using MapReduce [OL]. [2015-03- 25]. http://link, springer, corn/article/10. 1007/s00607-015 0446 9. 被引量:1
  • 7Elseidy M, Abdelhamid E, Skiadopoulos S, et al. GRAMI: Frequent subgraph and pattern mining in a single large graph [C] //Proc of the 40th Int Conf on Very Large Data Bases. Berlin: Springer, 2014:517-528. 被引量:1
  • 8Bhuiyan M A, A1 Hasan M. An iterative MapReduce based frequent subgraph mining algorithm [J]. IEEE Trans on Knowledge and Data Engineering, 2013, 27(3): 608-620. 被引量:1
  • 9Lu W, Chen G, Tung A K H, et al. Efficiently extracting frequent subgraphs using mapreduce [C] //Proc of the 1st IEEE Int Conf on Big Data. Piscataway, NJ: IEEE, 2013: 639-647. 被引量:1
  • 10Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters [J]. Communications of the ACM, 2008, 51(1): 107-113. 被引量:1

二级参考文献45

  • 1Rakesh Agrawal, Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. VLDB1994, Santiago,Chile, 1994. 被引量:1
  • 2Heikki Mannila, et al. Search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery,1997, 1(3): 241~258. 被引量:1
  • 3Jong Soo Park, et al. An effective Hash based algorithm for mining association rules. SIGMOD1995, San Jose, USA, 1995. 被引量:1
  • 4Sergey Brin, et al. Dynamic itemset counting and implication rules for market basket data. SIGMOD1997, Tucson, USA,1997. 被引量:1
  • 5Ramesh C. Agarwal, et al. Depth first generation of long patterns, KDD 2000, Boston, USA, 2000. 被引量:1
  • 6Ramesh C. Agarwal, et al. A tree projection algorithm for generation of frequent itemsets. J. of Parallel and Distributed Computing, 2001, 61(3): 350~371. 被引量:1
  • 7Jiawei Han, Jian Pei, Yiwen Yin. Mining frequent patterns without candidate generation. SIGMOD2000, Dallas, USA, 2000. 被引量:1
  • 8J. Pei, et al.. H-Mine: Hyper-structure mining of frequent patterns in large databases. ICDM'01, San Jose, CA, 2001. 被引量:1
  • 9Mike Perkowitz, Oren Etzioni. Adaptive sites: Automatically learning from user access patterns. WWW' 97, Santa Clara, 1997. 被引量:1
  • 10J. Pei, et al.. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, Heidelberg, 2001. 被引量:1

共引文献49

同被引文献100

  • 1刘正伟,文中领,张海涛.云计算和云数据管理技术[J].计算机研究与发展,2012,49(S1):26-31. 被引量:170
  • 2李洪波,吴凤鸽,孙增圻,孙富春.网络控制系统仿真平台的设计与实现[J].系统仿真学报,2006,18(6):1700-1704. 被引量:21
  • 3谢莹,吴建国,李炜,许荣斌.基于gSpan算法的未知化合物毒性预测[J].合肥工业大学学报(自然科学版),2007,30(10):1278-1280. 被引量:4
  • 4刘玉艳,沈明玉.LVS负载均衡技术在网络服务中的应用[J].合肥工业大学学报(自然科学版),2007,30(12):1592-1595. 被引量:12
  • 5IDC. The Digital Universe of Opportunities:Rich Data and the Incdreasing Value of the Internet of Things [EB/OL]. [2014-04]. http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm. 被引量:1
  • 6FERRERIA C R L , Traina J C, MACHADO T A J, et al. Clustering Very Large Multi-Dimensional Datasets with Mapreduce [C]. 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011 ACM. San Diego: ACM Press, 2011: 690-698. 被引量:1
  • 7YU Y, HUANG C, LEE Y. An Intelligent Touring System Based on Mobile Social Network and Cloud Computing for Travel Recom- mendation[C]. 28th International Conference on Advanced Information Networking and Applications Workshops(AINA), 2014 IEEE. Victoria, Canada: IEEE Press, 2014:19-24. 被引量:1
  • 8WALUNJ S G, SADAFALE K. An Online Recommendation System for E-commerce Based on Apache Mahout Framework[C]. 2013 Annual Conference on Computers and People Research, 2013 ACM. Cincinnati: ACM Press,2013: 153-158. 被引量:1
  • 9ZAHARIA M, CHOWDHURY M, FRANKLIN M J, et al. Spark: Cluster Computing with Working Sets[C]. Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing , 2010:10-10. 被引量:1
  • 10ZAHARIA M, CHOWDHURY M, DAS T, et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for in-Memory Cluster Computing[C]. Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012:2-2. 被引量:1

引证文献21

二级引证文献72

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部