A Distributed Density-Based Clustering Algorithm Based on Vertically Partitioned Data (一种基于数据垂直划分的分布式密度聚类算法). Cited by: 8

An Efficient Density-Based Clustering Algorithm for Vertically Partitioned Distributed Datasets

Abstract: Clustering is an important research topic in data mining. Clustering massive datasets is especially challenging because of their scale and the large amount of noise they contain. Distributed clustering is an effective way to address these problems, but most existing distributed clustering research targets horizontally partitioned datasets. This paper considers vertically partitioned distributed datasets and introduces the concepts of connected point sets and local noise point sets. Based on an analysis of the relations between local and global noise point sets, and between local and global connected point sets, global noise points are filtered out efficiently, which removes the negative effect of noise data and reduces the scale of the dataset to be processed on the center node. Furthermore, an efficient storage structure, the closed triangle list (CTL), is designed to store the intermediate clustering results of each node; it reduces communication costs among the distributed nodes during clustering and helps generate the global clustering model conveniently, with high space utilization and complete clustering information. On this basis, a distributed density-based clustering algorithm, DDBSCAN, is proposed. Theoretical analysis and experimental results show that DDBSCAN can effectively cluster massive vertically partitioned datasets and that the algorithm is effective and efficient.
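The abstract does not spell out the filtering rule, so the following is only a minimal sketch of one relation between local and global noise that holds under stated assumptions, not the paper's DDBSCAN procedure. It assumes Euclidean distance, the standard DBSCAN notion of noise (a point that is neither a core point nor within Eps of one), and the same `eps` and `min_pts` at every site and globally. Under those assumptions, projecting onto a subset of attributes can only shrink distances, so a point that is noise in any one site's subspace is also noise in the full space. The function names `local_noise`, `project`, and `candidate_global_noise` are hypothetical.

```python
from math import dist


def local_noise(points, eps, min_pts):
    """Indices of DBSCAN noise points among `points` (brute force, O(n^2))."""
    n = len(points)
    # Eps-neighborhood of every point; a point counts as its own neighbor.
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [len(nb) >= min_pts for nb in nbrs]
    # Noise: not a core point and no core point inside its neighborhood.
    return {i for i in range(n)
            if not core[i] and not any(core[j] for j in nbrs[i])}


def project(data, cols):
    """Restrict each record to the attribute columns held by one site."""
    return [tuple(row[c] for c in cols) for row in data]


def candidate_global_noise(data, site_columns, eps, min_pts):
    """Union of the local noise sets over all sites.

    Assumption (not taken from the paper): Euclidean distance and identical
    eps / min_pts everywhere, so every index returned is also global noise.
    """
    flagged = set()
    for cols in site_columns:
        flagged |= local_noise(project(data, cols), eps, min_pts)
    return flagged
```

In a real vertically partitioned setting each site would run `local_noise` on its own columns and send only the resulting index set to the center node; the single-process loop here stands in for that exchange. The union is a subset of the global noise set, not necessarily all of it, so remaining points still have to be clustered globally.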

Authors: Ni Weiwei (倪巍伟), Chen Geng (陈耿), Sun Zhihui (孙志挥)

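Affiliations: School of Computer Science and Engineering, Southeast University (东南大学); Audit Information Engineering Laboratory, Nanjing Audit University (南京审计学院)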

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the PKU Core list; 2007, No. 9, pp. 1612-1617 (6 pages)

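Funding: Natural Science Foundation of Jiangsu Province (BK2006095); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20040286009)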

Keywords: distributed data mining; vertically partitioned data; connected set; local noise set; closed triangle list

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
