K-means++ Clustering Method Supporting Differential Privacy Protection in the Spark Framework
Abstract To address the tension between privacy and utility that differentially private clustering algorithms face when handling massive data, this paper proposes a K-means++ clustering algorithm that supports differential privacy in a distributed environment. The algorithm uses the in-memory computing engine Spark to create resilient distributed datasets (RDDs) and performs its computations with transformation and action operators. When selecting the initial centroids and when iteratively updating them, it combines the exponential mechanism and the Laplace mechanism to address both the sensitivity of the clustering to its initial centers and the risk of privacy leakage, while reducing the perturbation applied to the data during computation. Based on the properties of differential privacy, the paper proves theoretically that the whole algorithm satisfies ε-differential privacy. Experimental results show that the method offers strong privacy protection and efficient running time while preserving the usability of the clustering results.
Authors Shi Jiangnan (石江南); Peng Changgen (彭长根); Tan Weijie (谭伟杰) (State Key Laboratory of Public Big Data (Guizhou University), Guiyang 550025; Key Laboratory of Advanced Manufacturing Technology (Guizhou University), Ministry of Education, Guiyang 550025)
Source Journal of Information Security Research (《信息安全研究》), indexed in CSCD and the Peking University Core Journals list, 2024, No. 8, pp. 712-718 (7 pages)
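Funding National Natural Science Foundation of China (62272124, 62361010); National Key R&D Program of China (2022YFB2701401); Guizhou University Cultivation Project (贵大培育[2019]56号); Guizhou University Talent Introduction Research Project (贵大人基合字(2020)61号); 2021 Open Fund of the Key Laboratory of Advanced Manufacturing Technology, Ministry of Education (GZUAMT2021KF[01]).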
Keywords data mining; clustering algorithm; differential privacy; Spark framework; exponential mechanism
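
Illustrative sketch The abstract names two building blocks: the exponential mechanism for choosing initial centers and the Laplace mechanism for perturbing the iterative center updates. The following single-machine Python sketch shows one plausible way these two mechanisms fit into K-means++. It is not the authors' algorithm: all function names are hypothetical, the data are assumed to be normalized to [0, 1]^d, the utility score (squared distance to the nearest chosen center) and both sensitivity values (d for the score, d + 1 for the sums and counts) are assumptions, and Spark is omitted. In a Spark setting the per-cluster sums and counts would presumably be produced by RDD transformations and actions rather than by a local loop.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Pick index i with probability proportional to
    exp(epsilon * scores[i] / (2 * sensitivity))."""
    top = max(scores)  # shift by the maximum for numerical stability
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

def dp_init_centers(points, k, epsilon, rng):
    """K-means++-style seeding; each center after the first is chosen
    by the exponential mechanism (assumed score: distance to nearest center)."""
    dim = len(points[0])
    centers = [points[rng.randrange(len(points))]]  # first center: uniform draw
    while len(centers) < k:
        scores = [min(sq_dist(p, c) for c in centers) for p in points]
        # Assumption: points lie in [0, 1]^dim, so a score is at most dim.
        idx = exponential_mechanism(scores, epsilon, float(dim), rng)
        centers.append(points[idx])
    return centers

def dp_update(points, centers, epsilon, rng):
    """One Lloyd iteration with Laplace noise on per-cluster sums and counts."""
    dim = len(centers[0])
    scale = (dim + 1) / epsilon  # assumed sensitivity of (sum, count) is dim + 1
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    new_centers = []
    for j, old in enumerate(centers):
        noisy_count = counts[j] + laplace_noise(scale, rng)
        if noisy_count < 1.0:  # too little signal: keep the previous center
            new_centers.append(tuple(old))
            continue
        noisy = [(sums[j][d] + laplace_noise(scale, rng)) / noisy_count
                 for d in range(dim)]
        # Clamp back into the assumed data domain [0, 1].
        new_centers.append(tuple(min(1.0, max(0.0, x)) for x in noisy))
    return new_centers
```

In this sketch each call spends the `epsilon` it is given, so a caller would need to divide the total privacy budget across the seeding steps and the update iterations; how the paper allocates its budget is not stated in the abstract.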