Abstract
To address three limitations of the k-means algorithm, namely its excessive dependence on the initial cluster centers, its slow convergence, and insufficient memory when processing massive amounts of data, we propose super-k-means, a new hybrid clustering algorithm for large data sets. The algorithm combines k-means with an improved super-network-based clustering algorithm for high-dimensional data, and is parallelized with MapReduce and deployed on a Hadoop cluster. Experimental results show that the algorithm improves both convergence and clustering accuracy, and that its speedup and scalability are also substantially improved.
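As an illustration of how a k-means iteration decomposes into map and reduce steps, the following is a minimal sketch in plain Python. It is not the super-k-means algorithm from the paper: the super-network-based stage is not shown, no Hadoop API is used, and the function names (`map_point`, `reduce_cluster`, `kmeans_iteration`, `kmeans`) are hypothetical names chosen for this example.

```python
from collections import defaultdict
from math import dist


def map_point(point, centers):
    """Map step: emit (index of the nearest center, (point, 1))."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step: average all points that share one center index."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(x / count for x in total)


def kmeans_iteration(points, centers):
    """One iteration: map every point, group by key, reduce each group."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_point(p, centers)
        groups[key].append(value)
    # Assumption: a center with no assigned points keeps its old position.
    return [
        reduce_cluster(groups[i]) if i in groups else tuple(centers[i])
        for i in range(len(centers))
    ]


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Repeat iterations until no center moves by more than tol."""
    for _ in range(max_iter):
        new_centers = kmeans_iteration(points, centers)
        if all(dist(a, b) <= tol for a, b in zip(new_centers, centers)):
            return new_centers
        centers = new_centers
    return centers
```

In a real MapReduce job the grouping by key would be done by the framework's shuffle phase, and each iteration would typically be a separate job that reads the centers produced by the previous one. The result still depends on the initial `centers` argument, which is the sensitivity the abstract says the hybrid algorithm is meant to reduce.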
Source
《计算机工程与科学》
CSCD
PKU Core Journal
2015, No. 9, pp. 1621-1626 (6 pages)
Computer Engineering & Science
Funding
National Natural Science Foundation of China (61373149, 61472233)
Shandong Province Science and Technology Program (2012GGX10118, 2014GGX101026)