Abstract
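Cluster analysis is an important research topic in data mining, and clustering large datasets is particularly difficult because of their data volume and the large amount of noise they contain. For vertically partitioned data, this paper introduces concepts such as the connected point set and the local noise point set. Based on an analysis of the relationship between local and global noise point sets, and between local and global connected point sets, global noise points are filtered effectively. A closed triangle list structure is then designed to store the intermediate clustering results of each node, and a density-based distributed clustering algorithm, DDBSCAN, is proposed. Theoretical analysis and experimental results show that the algorithm effectively solves the clustering problem for vertically partitioned large datasets and is both effective and feasible.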
Clustering is an important research topic in data mining. Clustering massive datasets is especially challenging because of their large scale and the many noise points they contain. Distributed clustering is an effective way to address these problems, but most existing distributed clustering research targets horizontally partitioned datasets. This paper considers vertically partitioned distributed datasets. Based on an analysis of the relationship between the local noise sets and the corresponding global noise set, global noise is filtered efficiently, which removes the negative effect of noise points and reduces the size of the dataset that the center node must process. Furthermore, an effective storage structure, the CTL (closed triangle list), is designed to store the intermediate clustering results of each node. It reduces communication costs among distributed nodes during clustering and makes it convenient to generate the global clustering model, with a high space utilization ratio and complete clustering information. On this basis, a distributed density-based clustering algorithm, DDBSCAN, is proposed. Theoretical analysis and experimental results show that DDBSCAN solves the problem of clustering massive vertically partitioned datasets effectively and efficiently.
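The noise-filtering idea can be illustrated with a minimal sketch. It is not the DDBSCAN algorithm itself, whose details are not given in this abstract. It assumes the standard DBSCAN definitions of core and noise points, Euclidean distance, and the same eps and min_pts at every site. Under those assumptions, a point's neighborhood on a subset of attributes contains its neighborhood on all attributes, so a point that is noise at any one site is also noise globally. The function names and the toy data are hypothetical.

```python
import math


def neighbors(points, i, eps, dims):
    """Indices of points within eps of point i, using only the attributes in dims."""
    p = [points[i][d] for d in dims]
    return [
        j
        for j in range(len(points))
        if math.dist(p, [points[j][d] for d in dims]) <= eps
    ]


def noise_points(points, eps, min_pts, dims):
    """Standard DBSCAN noise: neither a core point nor within eps of a core point."""
    nbrs = [neighbors(points, i, eps, dims) for i in range(len(points))]
    core = {i for i, n in enumerate(nbrs) if len(n) >= min_pts}
    return {
        i
        for i, n in enumerate(nbrs)
        if i not in core and not core.intersection(n)
    }


# Toy data (hypothetical): seven records with two attributes each.
points = [
    (0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.1, -0.1),
    (5.0, 0.1), (0.1, 5.0), (9.0, 9.0),
]
eps, min_pts = 0.5, 3

# Vertical partition: site A holds attribute 0, site B holds attribute 1.
local_noise_a = noise_points(points, eps, min_pts, dims=[0])
local_noise_b = noise_points(points, eps, min_pts, dims=[1])

# Reference result computed on all attributes, as a centralized run would.
global_noise = noise_points(points, eps, min_pts, dims=[0, 1])

# Every locally detected noise point can be filtered before global clustering.
filtered = local_noise_a | local_noise_b
print(sorted(local_noise_a), sorted(local_noise_b), sorted(global_noise))
```

In this toy dataset the union of the local noise sets happens to equal the global noise set; in general this argument guarantees only that the union is a subset of it.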
Source
《计算机研究与发展》
EI
CSCD
PKU Core (Peking University Core Journals)
2007, No. 9, pp. 1612-1617 (6 pages)
Journal of Computer Research and Development
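Funding
Natural Science Foundation of Jiangsu Province (BK2006095)
Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20040286009)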