Abstract
To address the K-means algorithm's dependence on initial value selection, slow convergence, low clustering accuracy, and memory bottleneck when processing massive data, an efficient parallel K-means algorithm based on MapReduce is proposed. Built on the MapReduce framework, the algorithm uses a K-selection sorting algorithm for parallel sampling to improve sampling efficiency, obtains the initial center points through a sample preprocessing strategy, and updates the iterative centers with a weight replacement strategy. In addition, the Hadoop cluster is tuned to further improve the running efficiency of the algorithm. Experimental results show that the proposed algorithm achieves good convergence, accuracy, and speedup, and that its overall performance is further improved.
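The abstract does not spell out the sampling, initialization, or weight replacement steps, so the block below is only a minimal sketch of one generic K-means iteration written in map/reduce style. It is plain Python rather than Hadoop code, and it is not the authors' implementation. The names `map_phase`, `reduce_phase`, and `kmeans_iteration` are hypothetical, and the count-weighted mean in the reducer is the standard K-means update, assumed here for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Partial = Dict[int, Tuple[Point, int]]  # cluster index -> (coordinate sum, point count)


def nearest(point: Point, centers: Sequence[Point]) -> int:
    """Return the index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def map_phase(split: Sequence[Point], centers: Sequence[Point]) -> Partial:
    """Mapper with a local combiner: per-cluster partial sums and counts for one split."""
    partial: Partial = {}
    for point in split:
        k = nearest(point, centers)
        total, count = partial.get(k, ((0.0,) * len(point), 0))
        partial[k] = (tuple(t + p for t, p in zip(total, point)), count + 1)
    return partial


def reduce_phase(partials: Sequence[Partial], centers: Sequence[Point]) -> List[Point]:
    """Reducer: merge the partial sums and return the count-weighted mean per cluster."""
    merged: Partial = {}
    for partial in partials:
        for k, (total, count) in partial.items():
            acc_total, acc_count = merged.get(k, ((0.0,) * len(total), 0))
            merged[k] = (
                tuple(a + t for a, t in zip(acc_total, total)),
                acc_count + count,
            )
    new_centers: List[Point] = []
    for k, old_center in enumerate(centers):
        if k in merged:
            total, count = merged[k]
            new_centers.append(tuple(t / count for t in total))
        else:
            # Assumption: a cluster that received no points keeps its old center.
            new_centers.append(old_center)
    return new_centers


def kmeans_iteration(splits: Sequence[Sequence[Point]], centers: Sequence[Point]) -> List[Point]:
    """Run one iteration: map every split independently, then reduce the results."""
    return reduce_phase([map_phase(split, centers) for split in splits], centers)
```

Because each mapper emits only one sum and one count per cluster, the data shuffled to the reducer grows with the number of clusters rather than the number of points, which is the usual reason for placing a combiner in a MapReduce K-means job.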
Source
Journal of Liaoning Technical University (Natural Science)
CAS
PKU Core Journal (北大核心)
2017, No. 11, pp. 1204-1211 (8 pages)
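Funding
National Natural Science Foundation of China (61404069)
Doctoral Start-up Fund of the Science and Technology Department of Liaoning Province (20141140)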