Study of ELM Algorithm Parallelization Based on Spark (基于Spark的极限学习机算法并行化研究)

Cited by: 6
Abstract: Although the extreme learning machine (ELM) trains quickly, it involves a large number of matrix operations, so it remains inefficient on large volumes of data. After a thorough study of the parallel computing mechanism of Spark resilient distributed datasets (RDDs), we designed a parallel computation scheme for matrix multiplication, the core step of the algorithm, and designed and implemented a Spark-based parallel ELM algorithm. To allow a performance comparison, we also implemented a parallel ELM algorithm based on Hadoop MapReduce. Experimental results show that the Spark-based parallel ELM algorithm runs in markedly less time than the Hadoop MapReduce version, and that the larger the data set, the more pronounced Spark's efficiency advantage becomes.
Authors: LIU Peng (刘鹏); WANG Xue-kui (王学奎); HUANG Yi-hua (黄宜华); MENG Lei (孟磊); DING En-jie (丁恩杰) (Internet of Things Perception Mine Research Centre, China University of Mining and Technology, Xuzhou 221008, China; National and Local Joint Engineering Laboratory of Internet Application Technology on Mine, Xuzhou 221008, China; School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China; PASA Big-data Laboratory, Department of Computer Science, Nanjing University, Nanjing 210023, China)
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2017, No. 12, pp. 33-37 (5 pages)
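Funding: Supported by the National Key Research and Development Program of China, "Research and Development of Key Technologies and Equipment for the Internet of Things in Mine Safety Production" (2017YFC0804400, 2017YFC0804401), and the National Natural Science Foundation of China (61471361, 41302203).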
Keywords: ELM (extreme learning machine); Parallelization; Spark; RDD; Hadoop; MapReduce
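
The abstract names parallel matrix multiplication as the core step but does not describe the scheme itself. As a rough illustration only, the sketch below shows one common way such a product can be decomposed, assuming a row-block partitioning: in the commonly used regularized ELM formulation, the output weights depend on products such as HᵀH and HᵀT, and each of these equals the sum of the same product taken over disjoint row blocks. Plain Python `map` and `reduce` stand in here for the corresponding RDD operations; all function names are hypothetical and this is not the authors' implementation.

```python
from functools import reduce


def transpose_times(block_a, block_b):
    """Return block_a^T * block_b for two row blocks with equal row counts."""
    cols_a, cols_b = len(block_a[0]), len(block_b[0])
    out = [[0.0] * cols_b for _ in range(cols_a)]
    for row_a, row_b in zip(block_a, block_b):
        for i, a in enumerate(row_a):
            for j, b in enumerate(row_b):
                out[i][j] += a * b
    return out


def add_matrices(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def split_rows(matrix, num_partitions):
    """Split a non-empty matrix into consecutive row blocks (assumed layout)."""
    size = -(-len(matrix) // num_partitions)  # ceiling division
    return [matrix[i:i + size] for i in range(0, len(matrix), size)]


def partitioned_transpose_times(h, t, num_partitions):
    """Compute h^T * t as a sum of per-block partial products."""
    h_blocks = split_rows(h, num_partitions)
    t_blocks = split_rows(t, num_partitions)
    # "Map" stage: every block pair yields a small partial product.
    partials = map(transpose_times, h_blocks, t_blocks)
    # "Reduce" stage: the partial products are added together.
    return reduce(add_matrices, partials)


# Toy hidden-layer output matrix H (5 samples, 2 hidden nodes) and targets T.
H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
T = [[1.0], [0.0], [1.0], [0.0], [1.0]]

gram = partitioned_transpose_times(H, H, num_partitions=2)
hty = partitioned_transpose_times(H, T, num_partitions=3)
print(gram)  # [[165.0, 190.0], [190.0, 220.0]]
print(hty)   # [[15.0], [18.0]]
```

Because each partial product has only as many rows and columns as there are hidden nodes or outputs, the reduce stage exchanges small matrices regardless of how many samples each block holds, which is one reason a decomposition of this kind suits a map-and-reduce style of execution.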