Abstract
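With the explosive growth of location big data, traditional serial algorithms can no longer cluster it efficiently, so parallel clustering algorithms based on the MapReduce framework have become a research focus. Clustering quality is usually hard to guarantee once a clustering algorithm is parallelized, which makes a method for reducing the parallel clustering results essential. This paper first proposes a grid-based, improved, parallelized DBSCAN clustering algorithm, which yields the clustering result of each data subset. Then, after analyzing the relationship between grids and clusters and defining grid clusters together with their connectivity and strong connectivity, it computes the connectivity weight matrix between grid clusters and reduces the grid clusters that are strongly connected, forming a MapReduce-based strongly connected grid clustering algorithm. The algorithm clusters large location datasets efficiently. Experimental analysis shows that it achieves high efficiency and high clustering quality on location big data.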
With the explosive growth of large location data, most traditional serial clustering algorithms cannot process such data efficiently, and parallel clustering algorithms have therefore attracted growing research interest. Because the clustering quality of a parallel clustering algorithm is difficult to guarantee, it is important to study how to reduce the results of parallel clustering. A grid clustering algorithm based on strongly connected fusion is therefore proposed. First, the clustering result of each data subset is obtained with an improved DBSCAN algorithm based on MapReduce. Next, the relationship between grids and clusters is analyzed, and the concepts of Grid-cluster, connectivity, and strong connectivity of Grid-clusters are defined. Then the connectivity weight matrix between Grid-clusters is calculated. Finally, whether to reduce two Grid-clusters is decided according to their connectivity weight. The experimental results show that the proposed algorithm achieves high efficiency and high clustering quality in processing large location data.
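The reduction step described above can be illustrated with a minimal sketch. The abstract does not give the definition of the connectivity weight or the strong-connectivity criterion, so everything specific here is an assumption made for illustration: each Grid-cluster is modeled as a set of grid-cell coordinates, the connectivity weight of two Grid-clusters is taken to be the number of grid cells they share, and two Grid-clusters are treated as strongly connected when that weight reaches a hypothetical `threshold`. The function names are likewise hypothetical and are not taken from the paper.

```python
from itertools import combinations


def connectivity_matrix(grid_clusters):
    """Build a symmetric weight matrix between Grid-clusters.

    ASSUMPTION: the weight is the number of grid cells two clusters share.
    The paper's actual definition is not stated in the abstract.
    """
    n = len(grid_clusters)
    weights = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        shared = len(grid_clusters[i] & grid_clusters[j])
        weights[i][j] = weights[j][i] = shared
    return weights


def merge_strongly_connected(grid_clusters, threshold):
    """Merge Grid-clusters whose pairwise weight is >= threshold.

    ASSUMPTION: "strongly connected" is modeled as weight >= threshold.
    Merging is transitive, implemented with a small union-find.
    """
    weights = connectivity_matrix(grid_clusters)
    parent = list(range(len(grid_clusters)))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(grid_clusters)), 2):
        if weights[i][j] >= threshold:
            parent[find(j)] = find(i)

    # Collect the cells of every cluster under its root representative.
    merged = {}
    for idx, cells in enumerate(grid_clusters):
        merged.setdefault(find(idx), set()).update(cells)
    return list(merged.values())


# Hypothetical local results from three data subsets, as sets of grid cells.
local_clusters = [
    {(0, 0), (0, 1), (1, 1)},
    {(0, 1), (1, 1), (2, 1)},
    {(5, 5)},
    {(2, 1), (3, 1)},
]
result = merge_strongly_connected(local_clusters, threshold=2)
```

With `threshold=2`, only the first two clusters are merged because they share two cells; the last cluster shares a single cell with the second and stays separate. In a MapReduce setting, the local clustering would run in the map phase and a merge of this kind in the reduce phase, but that mapping is also an assumption here.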
Authors
胡赢双
陆亿红
HU Ying-shuang; LU Yi-hong (College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310014, China)
Source
《计算机科学》
CSCD
Peking University Core Journals (北大核心)
2019, Issue S11, pp. 204-207, 215 (5 pages in total)
Computer Science
Funding
Supported by the Zhejiang Provincial Basic Public Welfare Research Program (GG19E090005)