Abstract
Shared nearest neighbor (SNN) similarity can effectively overcome the cluster validity problems caused by factors such as high dimensionality and multiple densities, but computing the SNN similarity matrix is expensive. Based on a divide-and-conquer strategy, an improved shared nearest neighbor clustering algorithm (DC-SNN) is proposed to address this issue. A soft partitioning strategy divides the dataset into several small subsets, which reduces the number of data points that must be searched when computing the SNN similarity matrix and also avoids the adverse effect of subset partition boundaries on the K nearest neighbors of data points. Furthermore, based on two notions defined within each subset, core data points and extended data points, a method for computing the SNN similarity matrix in each subset and a strategy for merging these matrices are given, ensuring that the subset SNN similarity matrices validly represent the SNN similarity matrix of the whole dataset. Experimental results show that DC-SNN significantly improves the efficiency of shared nearest neighbor clustering while keeping clustering accuracy unchanged.
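For orientation, the quantity the abstract refers to can be sketched as follows. This is a minimal illustration that assumes the common definition of SNN similarity as the number of K nearest neighbors two points have in common; the paper's exact definition, and its soft partitioning into core and extended data points, are not given in the abstract and are not reproduced here. The function names `k_nearest` and `snn_similarity` are hypothetical.

```python
from math import dist


def k_nearest(points, k):
    """For each point, return the set of indices of its k nearest neighbors.

    Brute-force search over all other points (assumed Euclidean distance).
    """
    neighbors = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        order = sorted(others, key=lambda j: dist(p, points[j]))
        neighbors.append(set(order[:k]))
    return neighbors


def snn_similarity(points, k):
    """Return an n x n matrix whose (i, j) entry counts shared k nearest neighbors.

    Assumption: plain shared-neighbor count; some SNN variants additionally
    require i and j to appear in each other's neighbor lists.
    """
    knn = k_nearest(points, k)
    n = len(points)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(knn[i] & knn[j])
            sim[i][j] = sim[j][i] = shared
    return sim
```

The brute-force neighbor search above compares every point with every other point, which is the kind of cost that grows quickly with dataset size and that partitioning the data into smaller subsets is meant to reduce.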
Source
《小型微型计算机系统》
CSCD
Peking University Core Journal (北大核心)
2014, Issue 1, pp. 50-54 (5 pages)
Journal of Chinese Computer Systems
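Funding
Supported by the Guangdong Province and Ministry of Education Industry-University-Research Cooperation Project (2011B090400466)
Supported by the Guangdong Provincial Education Science Planning Project (2010tjk119)
Supported by a Guangdong University of Finance institutional research project (11XJ04-03)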
Keywords
shared nearest neighbor
divide and conquer
large dataset
clustering analysis