Abstract
The traditional k-means algorithm tends to fall into local optima when clustering data, and its clustering efficiency per unit time is low. This paper improves the k-means algorithm to address these shortcomings. The k-means clustering algorithm is parallelized under the MapReduce framework: a divide-and-conquer strategy splits the large data set into data blocks, and the merging of spill files is reduced to lower the computational output of the Map nodes. The center points of the k-means algorithm are selected based on a density parameter, and the sum of squared errors is used to determine the number of clusters, which keeps the clustering from falling into local optima. Experimental results show that the method has advantages in both clustering accuracy and efficiency, and that it has strong practical value for data clustering applications.
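The pipeline summarized in the abstract can be sketched roughly as follows. This is a minimal single-process illustration, not the authors' implementation. The abstract does not define the density parameter, so the sketch assumes it is the number of neighbors within a fixed `radius`, and it assumes that later seeds are chosen by density times distance to the nearest existing seed. All function names are hypothetical. The map step emits one (sum, count) pair per center per data block, which only imitates the goal of reducing Map-side output; actual spill-file behavior is a Hadoop configuration matter and is not modeled here.

```python
from collections import defaultdict
from math import dist  # available in Python 3.8+


def density_seeds(points, k, radius):
    """Assumed seeding rule: densest point first, then the point that
    maximizes density * distance to the nearest seed already chosen."""
    density = [sum(1 for q in points if dist(p, q) <= radius) for p in points]
    seeds = [points[max(range(len(points)), key=density.__getitem__)]]
    while len(seeds) < k:
        def score(i):
            return density[i] * min(dist(points[i], s) for s in seeds)
        seeds.append(points[max(range(len(points)), key=score)])
    return seeds


def map_block(block, centers):
    """Map step with local combining: one [sum, count] pair per center
    per block instead of one record per point."""
    partial = defaultdict(lambda: [[0.0] * len(centers[0]), 0])
    for p in block:
        j = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
        acc = partial[j]
        acc[0] = [a + b for a, b in zip(acc[0], p)]
        acc[1] += 1
    return partial


def reduce_centers(partials, centers):
    """Reduce step: merge per-block sums and recompute each center.
    A center that received no points keeps its old position."""
    updated = []
    for j, old in enumerate(centers):
        total, count = [0.0] * len(old), 0
        for part in partials:
            if j in part:
                total = [a + b for a, b in zip(total, part[j][0])]
                count += part[j][1]
        updated.append(tuple(t / count for t in total) if count else old)
    return updated


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(dist(p, c) for c in centers) ** 2 for p in points)


def kmeans(points, k, radius, n_blocks=2, max_iter=50):
    """Divide the data into blocks, then iterate map and reduce steps."""
    centers = density_seeds(points, k, radius)
    blocks = [points[i::n_blocks] for i in range(n_blocks)]
    for _ in range(max_iter):
        partials = [map_block(b, centers) for b in blocks]
        updated = reduce_centers(partials, centers)
        if updated == centers:  # centers stopped moving
            break
        centers = updated
    return centers
```

To choose the number of clusters, one would run `kmeans` for several values of `k` and compare `sse(points, centers)`, picking the value after which the error stops dropping sharply. The exact selection rule used in the paper is not stated in the abstract.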
Authors
LI Ying-jie
WANG Rui
SHANG Ying
(Fuyang Preschool Teachers College, Fuyang 236015, Anhui Province, China)
Source
Journal of JingDeZhen University
2022, No. 6, pp. 28-30 (3 pages)
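Funding
Anhui Provincial Higher Education Quality Engineering Teaching Team Project (2020jxtd194)
Key Natural Science Research Project of Anhui Universities (KJ2021A1573)
Anhui Provincial Higher Education Quality Engineering Offline Course Project (2020kfkc387)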