Abstract
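To address the large scale and computational complexity of big data, a two-stage progressive clustering approach based on the MapReduce framework is adopted, and a big data clustering method using improved, parallelized K-means computation is proposed. In the first stage, the algorithm uses the Canopy algorithm to initialize the partition of cluster centers, quickly obtaining coarse-precision cluster center points. In the second stage, a parallel computation scheme based on the MapReduce framework is proposed, in which each data point undergoes refined clustering or merging around its neighboring Canopy center, so that big data can be clustered quickly and accurately. The algorithm is validated on the MapReduce parallel framework, and the experimental results show that the proposed algorithm effectively improves parallel computing efficiency, reduces computation time, and improves the clustering accuracy for big data.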
To address the large scale and computational complexity of big data, this paper adopts a two-stage progressive clustering approach and proposes a parallel algorithm for big data clustering based on MapReduce. In the first stage, the method uses the Canopy algorithm to initialize the cluster centers, quickly obtaining coarse-precision cluster center points. In the second stage, it presents a parallel computation scheme based on the MapReduce framework, in which each data point is clustered or merged around its adjacent Canopy center. In this way, the algorithm clusters big data quickly and accurately. Experiments run on the MapReduce framework show that the algorithm effectively improves parallel computing efficiency, reduces computing time, and improves the clustering accuracy for big data.
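The two-stage idea in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names (`canopy_centers`, `map_assign`, `reduce_mean`, `cluster`), the threshold `t2`, and the stopping parameters are all hypothetical, and the map and reduce steps are simulated with plain Python functions rather than run on a MapReduce cluster. Stage 1 is also simplified to a single tight threshold T2, since only the coarse centers are needed here; the full Canopy algorithm also uses a looser threshold T1 for canopy membership.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def canopy_centers(points, t2):
    """Stage 1 (simplified Canopy): pick coarse centers.

    Each pass takes the first remaining point as a center and drops
    every point within distance t2 of it (assumed threshold, t2 >= 0).
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        remaining = [p for p in remaining if dist(p, center) > t2]
    return centers


def map_assign(point, centers):
    """Map step: emit (index of nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return idx, point


def reduce_mean(assigned):
    """Reduce step: the new center is the mean of the points assigned to it."""
    n = len(assigned)
    return tuple(sum(coord) / n for coord in zip(*assigned))


def kmeans_round(points, centers):
    """One simulated map/shuffle/reduce round of K-means."""
    groups = defaultdict(list)
    for p in points:
        idx, pt = map_assign(p, centers)
        groups[idx].append(pt)
    # A center with no assigned points keeps its previous position.
    return [reduce_mean(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))]


def cluster(points, t2, max_iter=20, tol=1e-9):
    """Stage 2: refine the Canopy centers with K-means rounds."""
    centers = canopy_centers(points, t2)
    for _ in range(max_iter):
        updated = kmeans_round(points, centers)
        shift = max(dist(a, b) for a, b in zip(centers, updated))
        centers = updated
        if shift <= tol:
            break
    return centers
```

In a real deployment the map step would run in parallel over data splits and the reduce step would aggregate per-center sums and counts; the sketch only shows how Canopy seeding removes the need to choose the number of clusters and the initial centers by hand.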
Authors
Zhang Wenjie (张文杰)
Jiang Liehui (蒋烈辉)
Zhang Wenjie; Jiang Liehui (Faculty of Cyberspace Security, PLA Information Engineering University, Zhengzhou 450001, China; State Key Laboratory Mathematical Engineering & Advanced Computing, Zhengzhou 450001, China)
Source
《计算机应用研究》
CSCD
PKU Core Journal (北大核心)
2020, Issue 1, pp. 53-56 (4 pages)
Application Research of Computers
Funding
Henan Province Basic and Frontier Research Project (142300410090)
Henan Province Science and Technology Research Program (162102210035)