
Parallel K-means algorithm based on MapReduce and MSSA (cited by: 4)
Abstract: In big data environments, parallel K-means algorithms cluster high-dimensional data poorly, partition data unevenly, and are sensitive to the initial centroids. To address these problems, this paper proposes MR-MSKCA, a parallel K-means algorithm based on MapReduce and MSSA. First, a dimensionality reduction strategy based on the Kendall correlation coefficient and a deep sparse autoencoder (DSAE), called DRKCAE, weights and extracts features from high-dimensional data, which addresses the poor clustering caused by irrelevant features and sparse structure. Second, a uniform partition strategy based on two-stage mapping (UPS), a generalized hyperplane partitioning scheme, divides the dataset into uniform partitions. Finally, a non-uniform mutation sparrow search algorithm (MSSA) obtains the cluster centroids for parallel K-means, which addresses the sensitivity to initial centroids. In experiments on UCI datasets, MR-MSKCA reduced running time by 45.1%, 49.1%, and 59.8% relative to MR-KNMF, MR-PGDLSH, and MR-GAPKCA, respectively, and improved clustering quality by 19.2%, 22.8%, and 24%, respectively. These results indicate that MR-MSKCA performs well when clustering big data and is suitable for big data cluster analysis in different scenarios.
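For orientation, the following is a minimal sketch of one iteration of plain K-means written in a MapReduce style: the map step assigns each point to its nearest centroid, and the reduce step averages the points per centroid. It is not the paper's MR-MSKCA and includes none of DRKCAE, UPS, or MSSA. The function names, the in-memory partitions, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def map_phase(partition, centroids):
    """Map: emit (centroid_index, (point, 1)) for every point in one partition."""
    for point in partition:
        yield nearest(point, centroids), (point, 1)

def reduce_phase(pairs, centroids):
    """Reduce: sum the points per centroid index, then average them."""
    sums = {}
    counts = defaultdict(int)
    for idx, (point, n) in pairs:
        if idx not in sums:
            sums[idx] = list(point)
        else:
            sums[idx] = [s + p for s, p in zip(sums[idx], point)]
        counts[idx] += n
    # A centroid that received no points keeps its previous position.
    return [
        [s / counts[i] for s in sums[i]] if i in sums else list(centroids[i])
        for i in range(len(centroids))
    ]

def kmeans_iteration(partitions, centroids):
    """Run map over every partition, then a single reduce (sequential stand-in)."""
    pairs = [kv for part in partitions for kv in map_phase(part, centroids)]
    return reduce_phase(pairs, centroids)

# Hypothetical toy data: two partitions, two initial centroids.
partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_iteration(partitions, centroids)
```

In a real MapReduce job the partitions would be processed on separate workers. How evenly they are sized and how the initial centroids are chosen are the two points that, according to the abstract, UPS and MSSA are meant to address.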
Authors: Liu Weiming; Cui Yu; Mao Yimin; Liu Wei (School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China; School of Information Engineering, Gannan University of Science & Technology, Ganzhou, Jiangxi 341000, China)
Source: Application Research of Computers (《计算机应用研究》), indexed in CSCD and the Peking University Core Journals list, 2022, Issue 11, pp. 3244-3251, 3257 (9 pages)
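Funding: 2020 Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0109605); National Natural Science Foundation of China (41562019).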
Keywords: MapReduce framework; DRKCAE; UPS; parallel clustering; MSSA