摘要
k-means算法是一种最常用的基于划分的聚类算法。传统的集中式k-means算法已不能适应当前呈爆炸式增长的数据规模,设计分布式k-means算法成为了目前亟需解决的问题。现有分布式k-means算法基于MapReduce计算框架且没有考虑初始聚类中心的影响。由于每个MapReduce任务均需要读写分布式文件系统,导致MapReduce不能有效表达多个任务之间的依赖关系,因此提出了一种基于数据流的计算框架,该框架建立在MapReduce之上,将数据处理过程按照数据流图建模。在该框架的基础上,提出了一种高效的k-means算法,它采用基于多次采样的初始聚类中心选取方法来实现负载均衡及减少迭代次数。实验结果表明,该算法的可扩展性较好,且效率比现有算法高。
k-means algorithm is one of the most commonly used clustering algorithm. Now data scale is exploding, and traditional centralized algorithm can not meet the requirements, so it is an urgent problem to design distributed k-means clustering algorithm currently. Existing distributed k-means algorithms are based on MapReduce framework and don't consider the clustering center. Since each MapReduce job reads and writes data from distributed file system, it is ineffi- cient to express dependencies between jobs. Then this paper proposed a framework based on data stream. Based on Ma- pReduce framework, this framework models according to the data flow diagram. And it proposed an efficient k-means al- gorithm on the framework. It uses an improved algorithm based on sampling to confirm clustering center for load ba- lance and reducing iterations. Experimental results demonstrate that the algorithm can efficiently resolve the large scale k-means cluster.
出处
《计算机科学》
CSCD
北大核心
2015年第11期235-239,265,共6页
Computer Science
基金
国家自然科学基金项目(61373015
61300052)
国家教育部高等学校博士学科点专项科研基金资助项目(20103218110017)
江苏高校优势学科建设工程资助项目(PAPD)
中央高校基本科研业务费专项项目(NP2013307)
云计算-南航-大数据处理引擎技术研究项目资助