=Paper=
{{Paper
|id=Vol-3318/paper5
|storemode=property
|title=k-Means SubClustering: A Differentially Private Algorithm with Improved Clustering Quality
|pdfUrl=https://ceur-ws.org/Vol-3318/paper5.pdf
|volume=Vol-3318
|authors=Devvrat Joshi,Janvi Thakkar
|dblpUrl=https://dblp.org/rec/conf/cikm/JoshiT22
}}
==k-Means SubClustering: A Differentially Private Algorithm with Improved Clustering Quality==
Devvrat Joshi and Janvi Thakkar (Indian Institute of Technology Gandhinagar, India). Both authors contributed equally. Contact: devvrat.joshi@iitgn.ac.in (D. Joshi); janvi.thakkar@iitgn.ac.in (J. Thakkar).

Abstract

In today's data-driven world, the sensitivity of information has been a significant concern. With this data and additional information on a person's background, one can easily infer an individual's private data. Many differentially private iterative algorithms have been proposed in interactive settings to protect an individual's privacy from these inference attacks. The existing approaches compute differentially private (DP) centroids by iterating Lloyd's algorithm and perturbing the centroids with various DP mechanisms. These DP mechanisms do not guarantee convergence of the differentially private iterative algorithm and degrade the quality of the clusters. Thus, in this work, we further extend the previous work on "Differentially Private k-Means Clustering With Convergence Guarantee" by taking it as our baseline. The novelty of our approach is to sub-cluster the clusters and then select the centroid which has a higher probability of moving in the direction of the future centroid. At every Lloyd's step, the centroids are injected with noise using the exponential DP mechanism. The results of the experiments indicate that our approach outperforms the current state-of-the-art method, i.e., the baseline algorithm, in terms of clustering quality while maintaining the same differential privacy requirements. The clustering quality improved by 4.13 and 2.83 times over the baseline for the Wine and Breast_Cancer datasets, respectively.

Keywords: differential privacy, k-means clustering, convergence guarantee

1. Introduction

Achieving extraordinary results depends on the data on which machine learning models are trained. Data curators have a responsibility to provide datasets such that the privacy of the data is not compromised. However, attackers use other public datasets to perform inference and adversarial attacks to obtain information about an individual in the dataset. Differential privacy is a potential technique for giving customers a mathematical guarantee of the privacy of their data [1]. There are two fundamental settings in which differential privacy is used on data: in the interactive setting, the data curator holds the data and returns responses to the queries requested by third parties, while in the non-interactive setting the curator sanitizes the data before publishing it [2]. (The interactive setting implies that the dataset is not disclosed to the user; instead, the data curator returns the response to each query received from the user after manipulating it with a DP strategy.)

Iterative clustering algorithms provide important insights about a dataset, which helps in a large number of applications. They are prone to privacy threats because they can reveal information about an individual when combined with additional knowledge. Existing approaches obtain the set of centroids using Lloyd's k-means algorithm and then perturb them with a differentially private mechanism to add privacy [3]. In contrast to Lloyd's k-means algorithm, which guarantees convergence, these algorithms do not provide any convergence guarantee. Getting differentially private centroids might not help in getting quality inferences because of this non-convergence. We studied an existing approach that provides this guarantee and converges in twice the number of iterations of Lloyd's algorithm while maintaining the same differential privacy requirements as existing works [4][5]. Their algorithm perturbs the centroids in a random direction from the center of the cluster. However, this lowers the quality of clustering, which is necessary for making inferences.

In this work, we propose a variant of the existing approach which provides better clustering quality while using the same privacy budget. We use the intuition of Lloyd's algorithm that the next centroid will move in the direction where there is a higher number of data points. Finally, we give a mathematical proof that our approach, at any instance, gives better clustering quality than the existing approaches. We tested our approach on the Breast_Cancer, Wine, Iris, and Digits datasets and obtained a significant improvement over the previous approach in terms of clustering quality.
Our main contributions include:

1. We propose the SubClustering approach, which has better clustering quality than the baseline (the current SOTA in terms of clustering quality). For the Wine and Breast_Cancer datasets, the clustering quality improved by 4.13 and 2.83 times, respectively.
2. In addition to improving the clustering quality, our algorithm uses the same privacy budget as the existing work.

2. Related Work

The concept of differential privacy has inspired a plethora of studies, particularly in the area of differentially private k-means clustering [6][7][8] in an interactive setting. The important DP mechanisms in the literature include the Laplace mechanism (LapDP) [9], the exponential mechanism (ExpDP) [10], and the sample and aggregate framework [11]. To achieve differential privacy, many implementations infused Laplace noise into each iteration of Lloyd's algorithm, with the proportion of noise based on a fixed privacy budget. Some of the strategies for allocating the privacy budget split the overall budget uniformly across iterations [12]. However, this requires calculating the number of iterations needed for convergence prior to executing the algorithm, which increases the computational cost. Later work overcame this weakness with a theoretically guaranteed optimal allocation method [6], but its major assumption was that every cluster has the same size, which does not align with real-world datasets. In another work, Mohan et al. [8] proposed GUPT, which uses Lloyd's algorithm for local clustering of each bucket, where the items are uniformly sampled into different buckets; the final result is the mean of the locally clustered points in each bucket with added Laplace noise. However, the clustering quality of GUPT was unsatisfying because a large amount of noise was added in the aggregation stage.

Based on the study of past literature on differentially private k-means clustering, Zhigang et al. [3] concluded that convergence of an iterative algorithm is important to the clustering quality. To address this, they introduced the concepts of the convergent zone and the orientation controller, with which they create a sampling zone for selecting a potential centroid for the t-th iteration. The approach iteratively adds noise with an exponential mechanism (ExpDP) by using prior and future knowledge of the potential centroid at every step of Lloyd's algorithm. It maintains the same DP requirements as the existing literature, with guaranteed convergence and improvement in clustering quality. However, their algorithm perturbs the centroids in a random direction from the center of the cluster, degrading the quality of clustering. Thus, in this work, we further build upon this approach and significantly improve the clustering quality with the same epsilon privacy.

3. Preliminaries

The definitions used in this work are briefly discussed in this section. The following is the formal definition of differential privacy:

Definition 1 (ε-DP [9]). A randomised mechanism T is ε-differentially private if for all neighbouring datasets X and X' and for an arbitrary answer O ∈ Range(T), T satisfies Pr[T(X) = O] ≤ exp(ε) · Pr[T(X') = O], where ε is the privacy budget.

Here, X and X' differ by only one item. Smaller values of ε imply a better privacy guarantee, because the difference between the two neighbouring datasets is reflected through the privacy budget. In this work, we use ExpDP and LapDP. For non-numeric computation, exponential DP introduces the concept of a scoring function q(X, x), which represents the effectiveness of the pair (X, x); here X is the dataset and x is a candidate response of q(X, x) on X.

The formal definition of the exponential DP mechanism is as follows:

Definition 2 (Exponential Mechanism [10]). Given a scoring function q(X, x) over a dataset X, which reflects the quality of a query response x, the exponential mechanism T provides ε-differential privacy if T(X) = {Pr[x] ∝ exp(ε · q(X, x) / (2Δq))}, where Δq is the sensitivity of the scoring function q(X, x) and ε is the privacy budget.
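To make Definition 2 concrete, the following is a minimal Python sketch of how a candidate response could be sampled with the exponential mechanism. The candidate set, the scoring function, and the sensitivity value in the usage example are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def exponential_mechanism(candidates, score, delta_q, epsilon, rng=None):
    """Sample one candidate with probability proportional to
    exp(epsilon * q(X, x) / (2 * delta_q)), as in Definition 2.

    candidates : list of candidate responses x
    score      : callable giving the scoring function q(X, x) for a candidate
    delta_q    : sensitivity of the scoring function
    epsilon    : privacy budget for this invocation
    """
    rng = rng or np.random.default_rng()
    scores = np.array([score(x) for x in candidates], dtype=float)
    # Subtract the max score for numerical stability; this does not change
    # the normalised sampling probabilities.
    logits = epsilon * (scores - scores.max()) / (2.0 * delta_q)
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Hypothetical usage: prefer candidate points close to a target point.
if __name__ == "__main__":
    target = np.array([1.0, 2.0])
    candidates = [np.array([1.1, 2.0]), np.array([0.0, 0.0]), np.array([5.0, 5.0])]
    picked = exponential_mechanism(
        candidates, lambda x: -np.linalg.norm(x - target), delta_q=1.0, epsilon=1.0)
    print(picked)
```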
Definition 3 (Convergent & Sampling Zones [3]). A region whose points S satisfy the condition ||S - O_i^(t)|| < ||O_i^(t-1) - O_i^(t)|| is the convergent zone, where O_i^(t) is defined as the mean of C_i^(t). A sub-region inside the convergent zone is defined as a sampling zone.

Definition 4 (Orientation Controller [3]). θ_i^(t) is a direction from the center of the convergent zone to a point on its circumference. The direction along which the center of the sampling zone will be sampled is defined as the orientation controller.

4. Approach

In this section, we explain the baseline approach and our proposed approach.

4.1. Overview - KMeans Guarantee (Baseline)

We took "Differentially Private K-Means Clustering with Convergence Guarantee" [3] as our baseline and improved its clustering quality by further building on it. The key concept of the baseline algorithm is to use ExpDP to introduce bounded noise into the centroids at each iteration of Lloyd's algorithm. The technique is designed so that the new centroid is different from the centroid of Lloyd's algorithm while maintaining the constraint given in Lemma 1. The constraint guarantees that the perturbed centroid will eventually converge to the centroid of Lloyd's algorithm.

Their algorithm has four main steps to update the centroids at each Lloyd step t [3]; an overview of the approach is shown in Figure 1 (Overview of the KMeans Guarantee approach).

1. Let the differentially private centroid at iteration t-1 for a cluster i be Ō_i^(t-1). Using this centroid, run one iteration of Lloyd's algorithm to get the current Lloyd's centroid O_i^(t) for each cluster i.
2. Using O_i^(t) and O_i^(t-1), generate a convergent zone for each cluster i as described in Definition 3.
3. Generate a sampling zone inside the convergent zone and an orientation controller θ_i^(t) for each cluster i, as defined in Definitions 3 and 4, respectively.
4. Sample a differentially private Ō_i^(t) with ExpDP in the sampling zone generated in step 3.

The definitions of the convergent zone (for the convergence guarantee) and the sampling zone (for centroid updating) are given in Definition 3.
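As an illustration of steps 1 and 2 above, the following Python sketch performs one Lloyd recentering step and collects the points that fall inside the convergent zone of a cluster. The ball test (points closer to the new centroid than the distance between the old and new centroids) follows Definition 3; the function names and array layout are assumptions for illustration.

```python
import numpy as np

def lloyd_step(X, centroids):
    """One Lloyd iteration: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([
        X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
        for i in range(len(centroids))
    ])
    return labels, new_centroids

def convergent_zone(cluster_points, o_new, o_old):
    """Points of a cluster inside the ball centred at the new Lloyd centroid
    o_new with radius ||o_new - o_old|| (Definition 3).

    cluster_points : (n, d) array of the points assigned to this cluster.
    """
    radius = np.linalg.norm(o_new - o_old)
    mask = np.linalg.norm(cluster_points - o_new, axis=1) < radius
    return cluster_points[mask]
```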
4.2. Overview - SubCluster Guarantee

We build upon the KMeans Guarantee algorithm to achieve better clustering quality. Our idea differs from the baseline in how the sampling zone is created. For each cluster, we execute Lloyd's algorithm over its convergent zone to generate a sub-clustering. We then assign each sub-cluster a probability linearly proportional to the number of points it contains, sample one sub-cluster based on these probabilities, and define it as the sampling zone of the convergent zone. Drawing an analogy with the KMeans Guarantee algorithm, our orientation controller is this sub-clustering and sampling technique. Intuitively, our algorithm ensures that the sampling zone lies towards the region containing a higher number of data points in the expected case. With this, we guarantee that our differentially private centroid moves in the direction where the number of data points is higher, incorporating the intuition of Lloyd's algorithm without compromising on ε-differential privacy: the probability that the differentially private centroid at the (t-1)-th iteration moves in the direction of the more populated region at the t-th step of Lloyd's algorithm is also high. Thus, we introduce the concept of sub-clustering in the convergent zone and consequently sample one sub-cluster as our sampling zone.

We sample the centroid from the sampling zone using the ExpDP mechanism. Finally, we inject Laplace noise into the centroids of the clustering when our algorithm converges, because the differentially private centroids obtained are a subset of one of the local minima at which Lloyd's algorithm converges. An overview of the proposed approach is shown in Figure 2 (Overview of the SubCluster Guarantee approach). We show that a randomized iterative algorithm satisfying the invariant given in the claim of Lemma 1 always converges (proof: refer to Lemma 1), and that the SubCluster algorithm is a randomized iterative algorithm that satisfies this invariant (proof: refer to Lemma 2).

We have four main steps to update the centroids at each Lloyd step t:

1. Let the differentially private centroid at iteration t-1 for a cluster i be Ō_i^(t-1). Using this centroid, run one iteration of Lloyd's algorithm to get the current Lloyd's centroid O_i^(t) for each cluster i.
2. Using O_i^(t) and O_i^(t-1), generate a convergent zone for each cluster i as described in Definition 3.
3. Sub-cluster the convergent zone and sample one of the sub-clusters as our sampling zone, based on the probability assigned to each sub-cluster. The probability assignment is directly proportional to the number of points in each sub-cluster.
4. Sample a differentially private Ō_i^(t) with ExpDP in the sampling zone generated in step 3.

Our approach surpasses the baseline in terms of clustering quality while maintaining the same DP requirements as the KMeans Guarantee approach, which is evident from the results obtained (Figure 3). The better clustering quality is a result of our sub-clustering strategy, which perturbs the centroid with a higher probability than the baseline approach towards the direction of the actual centroid generated by Lloyd's algorithm. The pseudo-code of our approach is shown in Algorithm 1 and Algorithm 2.

Algorithm 1: Differentially Private k-Means SubClustering Algorithm
Input: X = {x_1, x_2, ..., x_N}: dataset with N data points; k: number of clusters; ε_exp: ExpDP privacy budget; ε_lap: Laplace privacy budget for the converged centroids; internalK: number of sub-clusters per cluster.
Output: S: final clustering centroids.
1. Select k centroids S = (O_1^(0), O_2^(0), ..., O_k^(0)) uniformly from X.
2. iterationsForLloyd = number of iterations to run the algorithm.
3. for iteration t in iterationsForLloyd do
4.   for each cluster i at iteration t do
5.     C_i^(t) <- assign each x_j to its closest centroid O_i^(t-1);
6.     O_i^(t) <- centroid of C_i^(t);
7.     ConvergentZone^(t) <- list of data points inside the spherical region having O_i^(t) and O_i^(t-1) as the endpoints of its radius;
8.     SamplingZone^(t) <- run Algorithm 2 using ConvergentZone^(t) and internalK;
9.     Ō_i^(t) <- sample from SamplingZone^(t) using ExpDP with ε_exp;
10.    O_i^(t) <- Ō_i^(t);
11.  Publish: SamplingZone^(t), ε_exp, O_i^(t).
12. S <- add Laplace noise with ε_lap to S^(t).

Algorithm 2: SubClusterSampling Algorithm
Input: ConvergentZone: convergent zone; internalK: sub-clustering K.
Output: SamplingZone_i^(t).
1. S^(t): mean of ConvergentZone^(t).
2. ConvergentZoneClusters <- cluster ConvergentZone using Lloyd's algorithm and internalK.
3. ConvergentZoneProbability <- assign probabilities to the ConvergentZoneClusters proportional to the number of points inside each cluster.
4. SamplingZone_i^(t) <- sample a cluster from the ConvergentZoneClusters using ConvergentZoneProbability.
5. Return: SamplingZone_i^(t).
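The following is a minimal Python sketch of the sub-cluster sampling step (Algorithm 2), assuming scikit-learn's KMeans as the inner Lloyd clustering over the convergent zone; the function and variable names are illustrative, not taken from the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def subcluster_sampling(convergent_zone, internal_k, rng=None):
    """Algorithm 2 (sketch): sub-cluster the convergent zone with Lloyd's
    algorithm, then pick one sub-cluster with probability proportional to
    its size and return it as the sampling zone.

    convergent_zone : (n, d) array of points inside the convergent zone.
    internal_k      : number of sub-clusters per cluster.
    """
    rng = rng or np.random.default_rng()
    n_points = len(convergent_zone)
    if n_points == 0:
        return convergent_zone  # empty zone: nothing to sample from
    k = min(internal_k, n_points)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(convergent_zone)
    sizes = np.bincount(labels, minlength=k).astype(float)
    probs = sizes / sizes.sum()      # probability proportional to sub-cluster size
    chosen = rng.choice(k, p=probs)  # sampled sub-cluster = sampling zone
    return convergent_zone[labels == chosen]
```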
Lemma 1 [3]. A randomised iterative algorithm T is convergent if, in C_i^(t) (cluster i at iteration t), Ō_i^(t) (the centroid sampled using T), O_i^(t-1) (the centroid before recentering) and O_i^(t) (the centroid of C_i^(t)) satisfy the invariant ||Ō_i^(t) - O_i^(t)|| < ||O_i^(t) - O_i^(t-1)|| in Euclidean distance, for all t, i.

We reproduce this lemma from our baseline approach [3]. Lemma 1 and Lemma 2 together provide the completeness and the proof of convergence of our approach. If the distance between the centroid Ō_i^(t) sampled from C_i^(t) and the new centroid O_i^(t) is less than the distance between the new centroid O_i^(t) and the old centroid O_i^(t-1), then the random iterative algorithm will always converge. Intuitively, the loss of C_i^(t) is minimum if the mean of C_i^(t) is taken as the centroid; if we shift slightly away from the mean of C_i^(t), the loss increases. However, if we can ensure that any sampled point from C_i^(t) fulfils the condition ||Ō_i^(t) - O_i^(t)|| < ||O_i^(t) - O_i^(t-1)||, it will lead to a smaller loss than J_i^(t-1) (the loss at the previous iteration), thus resulting in convergence of the randomised iterative algorithm. For the mathematical proof, refer to [3].
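This loss argument can be checked numerically: by the standard decomposition of the k-means cost, the cost of a cluster around any point p equals the cost around its mean plus n·||p - mean||², so any sampled centroid that is closer to the cluster mean than the previous centroid has a strictly smaller cluster loss. A small self-contained check, using synthetic data chosen purely for illustration:

```python
import numpy as np

def cluster_loss(points, center):
    """Sum of squared Euclidean distances of the cluster points to a center."""
    return float(np.sum(np.linalg.norm(points - center, axis=1) ** 2))

rng = np.random.default_rng(0)
cluster = rng.normal(size=(200, 2))        # synthetic cluster C_i^(t)
o_new = cluster.mean(axis=0)               # O_i^(t): Lloyd's centroid (the mean)
o_old = o_new + np.array([1.0, 0.0])       # O_i^(t-1): previous centroid

# A sampled centroid satisfying the invariant is strictly closer to the mean...
sampled = o_new + np.array([0.4, 0.2])
assert np.linalg.norm(sampled - o_new) < np.linalg.norm(o_new - o_old)

# ...and therefore has a strictly smaller cluster loss than the previous centroid.
assert cluster_loss(cluster, sampled) < cluster_loss(cluster, o_old)
print(cluster_loss(cluster, o_old), cluster_loss(cluster, sampled))
```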
Lemma 2. The Differentially Private k-Means SubClustering approach (SubClustering) is a randomised iterative algorithm that satisfies the invariant ||Ō_i^(t) - O_i^(t)|| < ||O_i^(t) - O_i^(t-1)||.

Proof. SubClustering is an iterative algorithm that samples a set of centroids at each iteration with the ExpDP mechanism, thus making it a randomised iterative algorithm. It sub-clusters the points lying inside ConvergentZone^(t). After sub-clustering, it samples one sub-cluster (the sampling zone) with the assigned probabilities (linearly proportional to the number of data points in each sub-cluster). Finally, it samples a data point from the sampled sub-cluster with ExpDP and calls it the centroid of ConvergentZone^(t). Thus, our sampling zone always lies inside ConvergentZone^(t). Therefore, the sampled point lies inside ConvergentZone^(t), and it satisfies the invariant ||Ō_i^(t) - O_i^(t)|| < ||O_i^(t) - O_i^(t-1)||.

5. Experimental Setup

5.1. Dataset Used

We used the following four datasets to test our SubCluster Guarantee approach against the baseline:

1. Iris [13]: 150 data points with four features and three classes.
2. Wine [13]: 178 data points with 13 features and three classes.
3. Breast_Cancer [13]: 569 data points with 30 features and two classes.
4. Digits [13]: 1797 data points with 64 dimensions and 10 classes.

5.2. Metric for Clustering Quality

To evaluate the clustering quality, we used the following equation to calculate the normalised difference between the cost of the differentially private algorithm (here, the SubCluster Guarantee approach), Cost_DP, and the cost of Lloyd's algorithm, Cost_Lloyd:

CostGap = |Cost_DP - Cost_Lloyd| / Cost_Lloyd    (1)

A smaller CostGap [3] represents a better quality of clustering. In the experiments, we compare the clustering quality of SubCluster Guarantee with KMeans Guarantee.
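A minimal sketch of how the CostGap metric in Eq. (1) could be computed, assuming the k-means cost is the sum of squared distances of each point to its nearest centroid; the helper names are illustrative and not taken from the paper's code.

```python
import numpy as np

def kmeans_cost(X, centroids):
    """Sum of squared distances of each point to its nearest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.sum(dists.min(axis=1) ** 2))

def cost_gap(X, dp_centroids, lloyd_centroids):
    """CostGap = |Cost_DP - Cost_Lloyd| / Cost_Lloyd  (Eq. 1)."""
    cost_dp = kmeans_cost(X, dp_centroids)
    cost_lloyd = kmeans_cost(X, lloyd_centroids)
    return abs(cost_dp - cost_lloyd) / cost_lloyd
```

For reference, the four UCI datasets listed in Section 5.1 are also available through scikit-learn's load_iris, load_wine, load_breast_cancer, and load_digits loaders.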
6. Results and Discussion

We tested our algorithm on four datasets, with dimensions ranging from 4 to 64 and training sets ranging from 150 to 1800 points. As defined by the metric, a smaller gap represents better clustering quality. From Figure 3 we can observe that the cost gap for all the datasets is smaller than or equal to the baseline; thus, our algorithm has better clustering quality than the existing work on all the datasets experimented on. We varied internalK (the parameter for the number of sub-clusters) from 2 to 5. Each experiment was conducted 30 times for the Iris, Wine, and Breast_Cancer datasets and 10 times for the Digits dataset due to computational constraints. Finally, for each dataset, we took the average over all the experiments as the final result for plotting the graphs.

Figure 3: CostGap versus epsilon budget for the two approaches, the baseline (KMeansGuarantee) and our approach (SubClusterGuarantee), tested on the Digits (top-left), Wine (top-right), Breast Cancer (bottom-left), and Iris (bottom-right) datasets.

Figure 4: CostGap versus epsilon budget for different values of internalK (2, 3, 4, 5) in the SubClusterGuarantee algorithm, for the Digits (top-left), Wine (top-right), Breast Cancer (bottom-left), and Iris (bottom-right) datasets. Note: K and internalK refer to the same parameter.

Comparing the SubCluster Guarantee approach (proposed) and the KMeans Guarantee approach (baseline) by averaging the cost gaps over the varied epsilon values and taking the ratio between the KMeans and SubCluster results:

1. For the Iris dataset, the cost gap is 1.1 times smaller than the baseline algorithm.
2. For the Wine dataset, the cost gap is 4.13 times smaller than the baseline algorithm.
3. For the Breast_Cancer dataset, the cost gap is 2.83 times smaller than the baseline algorithm.
4. For the Digits dataset, the cost gap is almost the same as that of the baseline algorithm.

6.1. Detailed Analysis

1. Iris: The Iris dataset has four dimensions and a very small training set of 150 data points. Our algorithm achieves better clustering quality than the baseline algorithm for smaller epsilon values. Since the number of data points in Iris is small, the impact of sub-clustering reduces, resulting in performance similar to that of the baseline approach. From Figure 4, we can observe that changing the value of internalK has a small impact on the costGap, due to the small number of points in each sub-cluster; there is a possibility that a sub-cluster contains no data point when internalK is increased, causing zero-probability sub-cluster regions.
2. Wine: The Wine dataset has 13 dimensions and 178 data points in the training set. Our algorithm performs significantly better than the baseline, as observed in Figure 3. This is because the baseline algorithm is constrained to choose a theta in an arbitrary direction over the range [-π/2, π/2], as shown in Figure 1. In contrast, our algorithm shifts the centroids in the direction where the future centroid of Lloyd's algorithm is more likely to move (in the expected case). From Figure 4, it is evident that internalK = 4 performs better than the other internalK values for the Wine dataset. Here the number of dimensions is larger than for Iris, so the spatial arrangement lies in an n-sphere, which allows better sub-clustering.
3. Breast_Cancer: The Breast_Cancer dataset has 569 data points in its training set and 30 dimensions. Our algorithm performs exceptionally better than the baseline, with internalK equal to 4. From Figure 3, we can observe that there is no monotonic trend for the costGap. Trends are visible in the other datasets due to their larger number of classification classes, whereas this dataset has only two classes; thus, adding Laplace noise does not have a clear relation to the clustering quality. Increasing internalK improves the clustering quality, with internalK of 4 having the least loss, because this dataset has a high number of dimensions and a larger number of training points than the other datasets.
4. Digits: The Digits dataset has 64 dimensions and 1797 data points in the training set. Although it has a large number of dimensions, our algorithm shows only a very small improvement over the baseline algorithm, as seen in Figure 3. Because of the higher time complexity of our algorithm, it is hard to tune the internalK parameter. As the number of samples in a dataset increases, internalK should increase, because a single cluster can contain a large number of data points; but due to limited computational resources we were not able to experiment with this further. We took internalK to be 5 for our experiments, as it performed best in the range [2, 5] (Figure 4). One intriguing finding in this dataset's results is that the curves for different internalK values show a clearly evident trend, which is a result of the large number of training data points.

Our proposed algorithm significantly improves over the baseline in terms of clustering quality, especially for the Wine and Breast_Cancer datasets. In addition, our algorithm maintains the same DP requirements as the existing works.

7. Conclusion

This work presents a novel method for improving the clustering quality of differentially private k-means algorithms while ensuring convergence. The novelty of our approach is the sub-clustering of each cluster to select the differentially private centroid, which has a higher probability of moving in the direction of the next centroid. We showed that our work surpasses the current state-of-the-art algorithms in terms of clustering quality; in particular, for the Wine and Breast_Cancer datasets, the clustering quality improved by 4.13 and 2.83 times over the baseline. In addition, we maintain the same DP requirements as the baseline and other existing approaches.

8. Future Work

- In this work, we supported our claim using empirical results. We further plan to validate the results by providing mathematical bounds on the convergence degree and rate of the SubClustering Lloyd's algorithm. In terms of clustering quality, the proposed algorithm is compared with the k-means guarantee clustering only; to further demonstrate the effectiveness of our work, we plan to experiment with other algorithms in the literature, including PrivGene [14], GUPT [8], and Dwork [7].
- The DP requirements in this work are the same as in past literature, but in the future we plan to explore ways to improve the current DP guarantees while maintaining the same clustering quality as in this work.
- We used the exponential and Laplace mechanisms of DP in the proposed approach; we further plan to explore the third mechanism, i.e., the sample and aggregate framework, by integrating it with the current algorithm.
- In our algorithm, the number of data points inside a cluster is variable. Thus we plan to choose an internalK custom to the size of each cluster to improve the clustering quality.

Acknowledgement

We would like to thank Prof. Anirban Dasgupta (IIT Gandhinagar) for his continuous support and guidance throughout the research.
References

[1] C. Dwork, Differential privacy: A survey of results, in: International Conference on Theory and Applications of Models of Computation, Springer, 2008, pp. 1-19.
[2] A. Narayanan, Data privacy: The non-interactive setting, The University of Texas at Austin, 2009.
[3] Z. Lu, H. Shen, Differentially private k-means clustering with convergence guarantee, IEEE Transactions on Dependable and Secure Computing (2020).
[4] D. Su, J. Cao, N. Li, E. Bertino, H. Jin, Differentially private k-means clustering, in: Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, 2016, pp. 26-37.
[5] J. Lei, Differentially private m-estimators, Advances in Neural Information Processing Systems 24 (2011).
[6] D. Su, J. Cao, N. Li, E. Bertino, M. Lyu, H. Jin, Differentially private k-means clustering and a hybrid approach to private optimization, ACM Transactions on Privacy and Security (TOPS) 20 (2017) 1-33.
[7] C. Dwork, A firm foundation for private data analysis, Communications of the ACM 54 (2011) 86-95.
[8] P. Mohan, A. Thakurta, E. Shi, D. Song, D. Culler, GUPT: privacy preserving data analysis made easy, in: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012, pp. 349-360.
[9] C. Dwork, F. McSherry, K. Nissim, A. Smith, Calibrating noise to sensitivity in private data analysis, in: Theory of Cryptography Conference, Springer, 2006, pp. 265-284.
[10] F. McSherry, K. Talwar, Mechanism design via differential privacy, in: 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), IEEE, 2007, pp. 94-103.
[11] K. Nissim, S. Raskhodnikova, A. Smith, Smooth sensitivity and sampling in private data analysis, in: Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, 2007, pp. 75-84.
[12] A. Blum, C. Dwork, F. McSherry, K. Nissim, Practical privacy: the SuLQ framework, in: Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2005, pp. 128-138.
[13] A. Asuncion, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html (2007).
[14] J. Zhang, X. Xiao, Y. Yang, Z. Zhang, M. Winslett, PrivGene: differentially private model fitting using genetic algorithms, in: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013, pp. 665-676.