Abstract
To address the tension between privacy and utility that differentially private clustering algorithms face when handling massive data, a K-means++ clustering algorithm supporting differential privacy in a distributed environment is proposed. The algorithm uses the in-memory computing engine Spark to create resilient distributed datasets (RDDs) and performs its computations with transformation and action operators. When selecting the initial centroids and when iteratively updating them, it combines the exponential mechanism with the Laplace mechanism to address both the sensitivity to initial cluster centers and the risk of privacy leakage, while reducing the perturbation applied to the data during computation. Based on the properties of differential privacy, the whole algorithm is proven theoretically to satisfy ε-differential privacy. Experimental results show that, while preserving the usability of the clustering results, the method offers strong privacy protection and runs efficiently.
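The two mechanisms named in the abstract can be illustrated with a small single-machine sketch. This is not the authors' Spark implementation: the function names, the even budget split between seeding and iteration, the assumption that every coordinate lies in [0, 1], and the sensitivity values (`dim` for the seeding score, `dim + 1` for the per-cluster sum and count) are all assumptions made here for illustration, and the sketch carries no formal privacy proof.

```python
import math
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. Exp(1) draws is Laplace(0, 1); rescale it.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def exp_mechanism_pick(points, centers, eps, rng):
    # Score: squared distance to the nearest chosen center (K-means++ weight).
    # Assumption: coordinates lie in [0, 1], so every score is at most dim.
    dim = len(points[0])
    scores = [min(sq_dist(p, c) for c in centers) for p in points]
    top = max(scores)  # subtracted for numerical stability only
    weights = [math.exp(eps * (s - top) / (2.0 * dim)) for s in scores]
    return rng.choices(points, weights=weights, k=1)[0]


def dp_kmeans_pp(points, k, eps, iters, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    # Assumed budget split: half for seeding, half for the update rounds.
    eps_seed = eps / 2.0 / max(k - 1, 1)
    eps_iter = eps / 2.0 / iters

    # Seeding: first center uniform, the rest via the exponential mechanism.
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        centers.append(list(exp_mechanism_pick(points, centers, eps_seed, rng)))

    # Updates: Laplace noise on each cluster's coordinate sums and count.
    scale = (dim + 1) / eps_iter
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += p[d]
        new_centers = []
        for j in range(k):
            noisy_count = counts[j] + laplace_noise(scale, rng)
            if noisy_count < 1.0:
                # Too little signal: keep the previous center.
                new_centers.append(centers[j])
                continue
            c = [(sums[j][d] + laplace_noise(scale, rng)) / noisy_count
                 for d in range(dim)]
            # Clip back into the assumed [0, 1] domain.
            new_centers.append([min(1.0, max(0.0, v)) for v in c])
        centers = new_centers
    return centers
```

Calling `dp_kmeans_pp(points, k=2, eps=1.0, iters=3)` on a list of coordinate tuples in [0, 1] returns `k` noisy centers. In a distributed setting, the per-cluster sums and counts in the inner loop are the part that would be computed across partitions, with noise added once to the aggregated values.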
Authors
石江南
彭长根
谭伟杰
Shi Jiangnan; Peng Changgen; Tan Weijie (State Key Laboratory of Public Big Data (Guizhou University), Guiyang 550025; Key Laboratory of Advanced Manufacturing Technology (Guizhou University), Ministry of Education, Guiyang 550025)
Source
《信息安全研究》
CSCD
PKU Core Journal (北大核心)
2024, Issue 8, pp. 712-718 (7 pages)
Journal of Information Security Research
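Funding
National Natural Science Foundation of China (62272124, 62361010)
National Key Research and Development Program of China (2022YFB2701401)
Guizhou University Cultivation Project (贵大培育[2019]56号)
Guizhou University Talent Introduction Research Project (贵大人基合字(2020)61号)
2021 Open Fund of the Key Laboratory of Advanced Manufacturing Technology, Ministry of Education (GZUAMT2021KF[01]).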