期刊文献+

面向大规模数据的DBSCAN加速算法综述 被引量:4

Survey on DBSCAN Acceleration Algorithms for Large Scale Data
下载PDF
导出
摘要 DBSCAN(density-based spatial clustering of applications with noise)是应用最广的密度聚类算法之一.然而,它时间复杂度过高(O(n^(2))),无法处理大规模数据.因而,对它进行加速成为一个研究热点,众多富有成效的工作不断涌现.从加速目标上看,这些工作大体上可分为减少冗余计算和并行化两大类;就具体加速手段而言,可分为6个主要类别:基于分布式、基于采样化、基于近似模糊、基于快速近邻、基于空间划分以及基于GPU加速技术.根据该分类,对现有工作进行了深入梳理与交叉比较,发现采用多重技术的融合加速算法优于单一加速技术;近似模糊化、并行化与分布式是当前最有效的手段;高维数据仍然难以应对.此外,对快速化DBSCAN算法在多个领域中的应用进行了跟踪报告.最后,对本领域未来的方向进行了展望. DBSCAN(density-based spatial clustering of applications with noise)is one of the most widely used and studied density clustering algorithms for its simplicity and easy implementation.However,the high time complexity(O(n^(2)))yields large-scale data that it is unable to deal with,due to that DBSCAN has great number of redundant distance computations in the process of calculating density.Therefore,accelerating it,which aims to make it suitable for big data environment,has become a research hotspot,and much fruitful work has emerged.From the perspective of acceleration goals,these efforts can be broadly divided into two categories:reducing redundant computations and parallelization.In terms of specific acceleration means,they can be divided into six main categories:distributed technique,sampling,approximation,fast neighbor,space division and GPU acceleration.According to this classification,the existing work is thoroughly combed and cross compared.It is found that the fusion acceleration algorithms of multiple technologies are better than those that only use single acceleration technology;approximate fuzziness,parallelism and distribution are the most effective methods to accelerate DBSCAN at present;high-dimensional data are still difficult to deal with.In addition,the applications of fast DBSCAN in many fields are tracked and reported.Finally,the future direction of rapid DBSCAN is prospected.
作者 陈叶旺 曹海露 陈谊 康昭 雷震 杜吉祥 Chen Yewang;Cao Hailu;Chen Yi;Kang Zhao;Lei Zhen;Du Jixiang(College of Computer Science and Technology,Huaqiao University,Xiamen,Fujian 361021;Beijing Key Laboratory of Big Data Technology for Food Safety(Beijing Technology and Business University),Beijing 100048;School of Computer Science and Engineering,University of Electronic Science and Technology of China,Chengdu 611731;State Key Laboratory of Pattern Recognition(Institute of Automation,Chinese Academy of Sciences),Beijing 100190;Xiamen Key Laboratory of Data Security and Blockchain Technology(Huaqiao University),Xiamen,Fujian 361021;Fujian Key Laboratory of Big Data Intelligence and Security(Huaqiao University),Xiamen,Fujian 361021;Jiangsu Provincial Key Laboratory for Computer Information Processing Technology(Soochow University),Suzhou,Jiangsu 215006)
出处 《计算机研究与发展》 EI CSCD 北大核心 2023年第9期2028-2047,共20页 Journal of Computer Research and Development
基金 国家自然科学基金项目(61673186,71771094,61876068,61972010) 福建省科技计划引导性项目(2021H0019) 福建省自然科学基金项目(2020J05059,2021J01317)。
关键词 快速化DBSCAN 密度聚类 聚类算法 大数据 数据挖掘 fast DBSCAN density clustering clustering algorithm big data data mining
  • 相关文献

参考文献6

二级参考文献31

  • 1姚和平.电力系统计算机网络通信协议-ICCP[J].电力系统自动化,1996,20(2):49-53. 被引量:1
  • 2Ester M, Kriegel H P, Sander J, Xu X. A densitybased algorithm for dis?covering clusters in large spatial databases. Data Mining and Knowl?edge Discovery, 1996,96: 226-231. 被引量:1
  • 3MacQueen J B. Some methods for classification and analysis of multi?variate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967,281-297. 被引量:1
  • 4Zhang T, Ramakrishnan R, Livny M. Birch: an efficient data cluster?ing method for very large databases. In: Proceedings of 1996 the ACM SIGMOD Conference on Managemnet of Data. 1996, lO3-114. 被引量:1
  • 5Dempster A P, Laird N M, Rubin D B. Maximum likelihood from in?complete data via the EM algorithm. Journal of the Royal Statisticai Societ, 1977,39(1): 1-38. 被引量:1
  • 6Wang W, Yang J, Muntz R R. Sting: A statistical information grid ap?proach to spatial data mining. In: Proceedings of the 23rd International Conference on Very Large Data Bases, 1997, 186-195. 被引量:1
  • 7Microsoft Academic Search. Top publications in data mining. http://academic.research.microsoft.com/CSDirectory/papeccategory_ 7.html. 2013. 被引量:1
  • 8Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. 2008, lO7-113. 被引量:1
  • 9White T. Hadoop: The Definitive Guide, 1st edition. O'Reilly Media, Inc., 2009. 被引量:1
  • 10Berger M, Bokhari S. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Transactions on Computers, 1987,36: 570- 580. 被引量:1

共引文献39

同被引文献37

引证文献4

二级引证文献1

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部