Improvement and parallel implementation of the K-means clustering algorithm based on the Spark platform (cited by: 11)
Abstract: To address the randomness of initial-value selection in the K-means algorithm during data clustering, the algorithm is improved based on the principle of non-uniform sampling. To meet the need for parallel clustering, the improved algorithm is also implemented in parallel on the Spark platform. Single-machine serial experiments and cluster parallel experiments show that the improved algorithm achieves higher accuracy and stability when processing massive datasets, and that its parallel implementation on Spark has good speedup and scalability. This indicates that the algorithm can run efficiently in practical massive-data processing.
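Source: China Internet (《互联网天地》), 2016, No. 1, pp. 44-50 (7 pages)
Funding: Zhejiang Provincial Natural Science Foundation (No. LY13F010011); Major Special Project of the Zhejiang Provincial Department of Science and Technology (No. 2014NM002)
Keywords: K-means; clustering; Spark; parallelization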
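
The abstract does not state which non-uniform sampling rule the authors use. As an illustration only, the sketch below assumes distance-squared weighting in the style of k-means++: each new initial centre is drawn with probability proportional to its squared distance from the nearest centre already chosen. The function names `squared_distance` and `nonuniform_seed` are hypothetical, and the code runs on a single machine in plain Python rather than on Spark. In a Spark version, the distance update would presumably be a map over the distributed dataset, but that is an assumption, not something the record confirms.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nonuniform_seed(points, k, rng=None):
    """Choose k initial centres by distance-squared weighted sampling.

    Assumed stand-in for the paper's non-uniform sampling step
    (k-means++ style); not taken from the paper itself.
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # The first centre is drawn uniformly at random.
    centres = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest centre.
    d2 = [squared_distance(p, centres[0]) for p in points]

    while len(centres) < k:
        if sum(d2) == 0:
            # Every point coincides with a centre; fall back to uniform choice.
            centres.append(rng.choice(points))
            continue
        # Points far from all current centres are more likely to be picked;
        # points that are already centres have weight 0.
        idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centres.append(points[idx])
        d2 = [min(old, squared_distance(p, points[idx]))
              for old, p in zip(d2, points)]
    return centres
```

A call such as `nonuniform_seed(data, k=3, rng=random.Random(42))` returns three points from `data` to use as starting centres for the usual K-means iterations. Passing a seeded generator makes the choice reproducible, which helps when comparing stability across runs.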
