An Efficient Density-Based Clustering Algorithm for Vertically Partitioned Distributed Datasets (cited by: 8)
Abstract Cluster analysis is an important research topic in data mining, and clustering massive datasets is especially difficult because of their scale and the large number of noise points. Distributed clustering is an effective way to address these problems, but most existing work on distributed clustering targets horizontally partitioned datasets. This paper considers vertically partitioned distributed datasets and introduces the concepts of connected point sets and local noise sets. Based on an analysis of the relations between local and global noise sets, and between local and global connected point sets, global noise points are filtered out efficiently, which removes the negative effect of noise data and reduces the amount of data the center node must process. A storage structure, the closed triangle list (CTL), is further designed to hold each node's intermediate clustering results; it reduces communication cost among the distributed nodes during clustering and makes it convenient to generate the global clustering model, with high space utilization and complete clustering information. On this basis, a distributed density-based clustering algorithm, DDBSCAN, is proposed. Theoretical analysis and experimental results indicate that DDBSCAN can effectively cluster massive vertically partitioned datasets and that the algorithm is both effective and efficient.
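The relation between local and global noise that the abstract relies on can be illustrated with a small sketch. This is not the paper's DDBSCAN and does not model the CTL structure; it assumes Euclidean distance and the same `eps` and `min_pts` on every node, and all function names (`eps_neighbors`, `noise_points`, `vertical_split`) are hypothetical. Under those assumptions, the distance between two points on any attribute subset never exceeds their full distance, so a point that is noise on one node's projection cannot be a core or border point globally and can be filtered before anything is sent to the center node.

```python
from math import dist  # Python 3.8+

def eps_neighbors(points, eps):
    """For each point index, the set of indices within eps (the point itself included)."""
    n = len(points)
    return [{j for j in range(n) if dist(points[i], points[j]) <= eps}
            for i in range(n)]

def noise_points(points, eps, min_pts):
    """DBSCAN-style noise: neither a core point nor within eps of any core point."""
    nbrs = eps_neighbors(points, eps)
    core = {i for i, nb in enumerate(nbrs) if len(nb) >= min_pts}
    return {i for i, nb in enumerate(nbrs) if i not in core and not (nb & core)}

def vertical_split(points, column_groups):
    """Project every point onto each node's attribute subset (vertical partitioning)."""
    return [[tuple(p[c] for c in cols) for p in points] for cols in column_groups]

# Toy dataset (assumed for illustration): two dense groups plus three outliers.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (9.0, 0.05),   # isolated in attribute 0 only
    (0.05, 9.0),   # isolated in attribute 1 only
    (0.05, 5.05),  # unremarkable in each attribute alone, isolated in 2-D
]
EPS, MIN_PTS = 0.5, 3

# Node 0 holds attribute 0, node 1 holds attribute 1.
local_views = vertical_split(points, [(0,), (1,)])
local_noise_union = set()
for view in local_views:
    local_noise_union |= noise_points(view, EPS, MIN_PTS)

# Reference result computed on the full (unpartitioned) data.
global_noise = noise_points(points, EPS, MIN_PTS)

print("filtered locally:", sorted(local_noise_union))
print("global noise:    ", sorted(global_noise))
```

In this toy case the local filter is safe but not complete: every locally detected noise point is also global noise, while the last point looks dense on each node separately and is only recognized as noise once the attributes are combined. That remaining work is what the center node would still have to do.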
Source Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the PKU Core Journals list, 2007, No. 9, pp. 1612-1617 (6 pages)
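Funding Natural Science Foundation of Jiangsu Province (BK2006095); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20040286009)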
Keywords distributed data mining; vertically partitioned data; connected set; local noise set; closed triangle list