Abstract
To address the randomness of initial-value selection in the K-means algorithm during data clustering, the algorithm is improved based on the principle of non-uniform sampling. To meet the need for parallel clustering, the improved algorithm is also implemented in parallel on the Spark platform. Single-machine serial experiments and cluster-based parallel experiments show that the improved algorithm achieves higher accuracy and stability when processing massive data sets, and that its parallel implementation on Spark has good speedup and scalability. These results indicate that the algorithm can run efficiently in practical massive-data processing.
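The abstract does not state the exact sampling rule, so the following is only a minimal sketch of what non-uniform seeding can look like. It assumes the distance-squared weighting known from k-means++, which may differ from the paper's method. The names `sq_dist` and `nonuniform_seeds` are hypothetical, and the sketch is serial Python rather than a Spark implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nonuniform_seeds(points, k, rng=None):
    """Pick k initial centers with distance-squared weighting.

    Assumption: this is the k-means++ style rule, used here only to
    illustrate non-uniform sampling; it is not taken from the paper.
    """
    rng = rng or random.Random(0)
    # The first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center: fall back to uniform.
            nxt = rng.choice(points)
        else:
            # Points far from all current centers are more likely to be drawn;
            # points with weight 0 (already chosen) are never drawn.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        # Update nearest-center distances with the new center.
        d2 = [min(d, sq_dist(p, nxt)) for d, p in zip(d2, points)]
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
    print(nonuniform_seeds(data, 3))
```

Compared with uniform random initialization, this weighting tends to spread the initial centers across the data, which is one common way to reduce run-to-run variation in the final clustering.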
Source
《互联网天地》
2016, No. 1, pp. 44-50 (7 pages)
China Internet
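Funding
Zhejiang Provincial Natural Science Foundation (No. LY13F010011)
Major Special Project of the Zhejiang Provincial Department of Science and Technology (No. 2014NM002)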