Abstract
The K-Means algorithm is a partition-based clustering algorithm that is simple to implement and relatively efficient. However, it depends strongly on the choice of initial centers, the number of clusters K is not always known in advance, and its frequent iterations incur a high resource cost. To address these problems, the original K-Means algorithm is improved by introducing the Canopy algorithm and the minimum-maximum distance algorithm and, to cope with big data, the improved algorithm is implemented on the Spark parallel computing framework. Experimental results show that the improved clustering algorithm achieves better classification stability, accuracy, and convergence speed, and offers a considerable performance advantage when processing large-scale data.
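The abstract does not give the details of the improved algorithm, so the following is only a minimal single-machine sketch of one of its ingredients, under stated assumptions: initial centers are chosen by a minimum-maximum distance rule (each new center is the point farthest from its nearest existing center), and selection stops once that distance drops below a fraction `theta` of the distance between the first two centers, which also fixes K. The function names, the `theta` parameter, and the stopping rule are illustrative assumptions, not the authors' implementation. The Canopy step and the Spark parallelization are not shown.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_seeds(points, theta=0.5):
    """Choose initial centers by a minimum-maximum distance rule.

    theta is a hypothetical threshold: a candidate becomes a center only if
    its distance to the nearest existing center exceeds theta times the
    distance between the first two centers.
    """
    centers = [points[0]]
    # Second center: the point farthest from the first one.
    second = max(points, key=lambda p: dist(p, centers[0]))
    base = dist(second, centers[0])
    if base == 0:
        return centers  # all points coincide
    centers.append(second)
    while True:
        # Candidate: the point whose nearest center is farthest away.
        cand = max(points, key=lambda p: min(dist(p, c) for c in centers))
        gap = min(dist(cand, c) for c in centers)
        if gap <= theta * base:
            break
        centers.append(cand)
    return centers


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda i: dist(p, centers[i]))


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Recompute each center as the mean of its cluster (keep it if empty).
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels
```

A typical call would be `centers, labels = kmeans(points, max_min_seeds(points))`, where `points` is a list of equal-length tuples. Because the seeds are spread out deterministically, repeated runs on the same data give the same result, which is the kind of stability the abstract refers to.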
Source
Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) (《南京邮电大学学报(自然科学版)》)
Peking University Core Journal (北大核心)
2017, No. 4, pp. 113-118 (6 pages)