<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>K-Means SubClustering: A Differentially Private Algorithm with Better Clustering Quality</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Devvrat Joshi</string-name>
          <email>devvrat.joshi@iitgn.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Janvi Thakkar</string-name>
          <email>janvi.thakkar@iitgn.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Indian Institute of Technology Gandhinagar</institution>, <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>In today's data-driven world, the sensitivity of information has been a significant concern. With this data and additional information on a person's background, one can easily infer an individual's private data. Many differentially private iterative algorithms have been proposed in interactive settings to protect an individual's privacy from these inference attacks. Existing approaches compute differentially private (DP) centroids with iterative Lloyd's algorithm and perturb the centroids with various DP mechanisms. These DP mechanisms do not guarantee convergence of the differentially private iterative algorithm and degrade the quality of the clusters. Thus, in this work, we extend the previous work 'Differentially Private K-Means Clustering With Convergence Guarantee', taking it as our baseline. The novelty of our approach is to sub-cluster the clusters and then select the centroid that has a higher probability of moving in the direction of the future centroid. At every Lloyd's step, the centroids are injected with noise using the exponential DP mechanism. The results of the experiments indicate that our approach outperforms the current state-of-the-art method, i.e., the baseline algorithm, in terms of clustering quality while maintaining the same differential privacy requirements. The clustering quality improved significantly, by 4.13 and 2.83 times over the baseline for the Wine and Breast_Cancer datasets, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>differential privacy</kwd>
        <kwd>k-means clustering</kwd>
        <kwd>convergence guarantee</kwd>
      </kwd-group>
      <conference>
        <conf-name>CIKM-PAS'22: Privacy Algorithms in Systems (PAS) Workshop</conf-name>
      </conference>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Achieving extraordinary results is dependent on the data
on which machine learning models are trained. Data
curators have a responsibility to provide datasets such
that the privacy of the data is not compromised. However,
attackers use other public datasets to perform inference
and adversarial attacks to get information about an
individual in the dataset. Differential privacy is a potential
technique for giving customers a mathematical guarantee
of privacy. There are two fundamental settings in which
differential privacy is used on data: in the interactive
setting, the data curator holds the data and returns
responses based on the queries requested by third parties,
while in the non-interactive setting the curator sanitizes
the data before publishing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Iterative clustering algorithms provide important
insights about the dataset, which helps in a large number of
applications. They are prone to privacy threats because
they can reveal information about an individual with
additional knowledge. Existing approaches obtain the set
of centroids using Lloyd's K-means algorithm and then
perturb them with a differentially private mechanism to
add noise. In contrast, our approach perturbs each centroid
at every iteration to move in the direction where there is
a higher number of data points, yielding better clustering
quality than the existing approaches. We have tested our
approach on four real-world datasets.
      </p>
      <sec id="sec-1-1">
        <title>Our main contribution includes: 1. We proposed SubClustering approach which has better clustering quality than the baseline (which is the current SOTA in terms of clustering qual</title>
        <p>Iterative clustering algorithms provide important in- in the direction where there is a higher number of data
they can reveal information about an individual with ad- than the existing approaches. We have tested our
apity). For the Wine and Breastcancer dataset, the
clustering quality improved by 4.13 and 2.83 times
respectively.
2. In addition to improving the clustering quality,
our algorithm used same privacy budget as that
of the existing work.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The concept of differential privacy has inspired a plethora
of studies, particularly in the area of differentially
private k-means clustering [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] in an interactive setting.
The important DP mechanisms in the literature include
the Laplace mechanism (LapDP) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the exponential
mechanism (ExpDP) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and the sample-and-aggregate
framework [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. To achieve differential privacy, many
implementations infuse Laplace noise into each
iteration of Lloyd's algorithm, with the proportion of
noise based on a fixed privacy budget. One strategy for
allocating the privacy budget is to split the overall budget
uniformly across iterations
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, this requires calculating the number of
iterations needed for convergence prior to executing the
algorithm, which increases the computational cost.
Later work overcame this weakness with a
theoretically guaranteed optimal allocation method [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
but the major assumption in this approach was that
every cluster has the same size, which does not align
with real-world datasets. In another work, Mohan
et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] proposed GUPT, which uses Lloyd's algorithm
for local clustering of each bucket, where items are
uniformly sampled into different buckets. The final result
is the mean of the locally computed cluster centers across
buckets, with added Laplace noise. However, the clustering
quality of GUPT was unsatisfying because a large amount
of noise was added in the aggregation stage.
      </p>
      <p>
        Based on the study of past literature on differentially
private k-means clustering, Lu et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] concluded
that convergence of an iterative algorithm is important for
clustering quality. To address this, they introduced the
concepts of the convergent zone and the orientation
controller, which together define a sampling zone for selecting
a potential centroid at the t-th iteration. The approach
iteratively adds noise with an exponential mechanism
(ExpDP) by using prior and future knowledge of the
potential centroid at every step of Lloyd's algorithm. It
maintains the same DP requirements as the existing
literature, with guaranteed convergence and
improved clustering quality. However, their algorithm
perturbs the centroids in a random direction from the
center of the cluster, degrading the quality of clustering.
Thus, in this work, we build upon this approach
and significantly improve the clustering quality under the
same epsilon privacy budget.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminaries</title>
      <p>The definitions used in this work are briefly discussed
in this section. The following is a formal definition of
Differential Privacy:</p>
      <p>
        Definition 1 ($\varepsilon$-DP [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). A randomised mechanism $T$
is $\varepsilon$-differentially private if for all neighbouring datasets
$X$ and $X'$ and for an arbitrary answer $x \in \mathrm{Range}(T)$, $T$
satisfies
$\Pr[T(X) = x] \le \exp(\varepsilon) \cdot \Pr[T(X') = x]$,
where $\varepsilon$ is the privacy budget.
      </p>
      <p>Here, $X$ and $X'$ differ by only one item. Smaller
values of $\varepsilon$ imply a better privacy guarantee, because
the difference between the two neighboring datasets is
bounded through the privacy budget. In this work, we use
ExpDP and LapDP. For non-numeric computation,
exponential DP introduces the concept of a scoring
function $q(X, x)$, which represents the effectiveness of the
pair $(X, x)$. Here $X$ is the dataset and $x$ is the response to
a query on $X$.</p>
      <p>The Exponential DP mechanism is formally defined
as follows:</p>
      <p>
        Definition 2 (Exponential Mechanism [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]).
Given a scoring function $q(X, x)$ of a dataset $X$, which
reflects the quality of query response $x$, the
exponential mechanism $T$ provides $\varepsilon$-differential privacy
if $T(X) = \{x : \Pr[x] \propto \exp\!\big(\tfrac{\varepsilon \cdot q(X, x)}{2\Delta q}\big)\}$, where $\Delta q$ is the
sensitivity of the scoring function $q(X, x)$ and $\varepsilon$ is the
privacy budget.
      </p>
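      <p>For concreteness, the following is a minimal Python sketch of the two mechanisms used in this work, LapDP and ExpDP. The function names and the use of a discrete candidate set are our illustrative assumptions; the algorithm in Section 4 samples from a continuous sampling zone.</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(value, sensitivity, epsilon):
    # LapDP: perturb a numeric answer with Laplace(sensitivity / epsilon) noise.
    return value + rng.laplace(scale=sensitivity / epsilon, size=np.shape(value))

def exponential_mechanism(candidates, scores, sensitivity, epsilon):
    # ExpDP: sample candidate x with Pr[x] proportional to
    # exp(epsilon * q(X, x) / (2 * sensitivity)), per Definition 2.
    logits = epsilon * np.asarray(scores, dtype=float) / (2.0 * sensitivity)
    logits -= logits.max()              # shift for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
      </preformat>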
      <p>
        Definition 3 (Convergent &amp; Sampling Zones [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]).
The region whose points satisfy the condition
$\{s : \|s - C_i^{(t)}\| &lt; \|c_i^{(t-1)} - C_i^{(t)}\|\}$ is the convergent zone,
where $C_i^{(t)}$ is defined as the mean of cluster $O_i^{(t)}$ and
$c_i^{(t-1)}$ is the previous differentially private centroid. A
sub-region inside the convergent zone is defined as a
sampling zone.
      </p>
      <p>
        Definition 4 (Orientation Controller [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). $\theta_i^{(t)}$ is a
direction from the center of the convergent zone to a point
on its circumference. This is the direction along which the
center of the sampling zone will be sampled, and it is
defined as the orientation controller.
      </p>
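      <p>A minimal NumPy sketch of these geometric objects (the helper names are ours, not from [3]): the convergent zone is the ball centred at the current Lloyd mean whose radius reaches the previous noisy centroid, and any point inside it satisfies the convergence invariant used later in Lemma 1.</p>
      <preformat>
import numpy as np

def convergent_zone(c_prev, C_curr):
    # Definition 3: open ball centred at the Lloyd mean C_curr with
    # radius ||c_prev - C_curr|| (c_prev lies on its boundary).
    center = np.asarray(C_curr, dtype=float)
    radius = np.linalg.norm(np.asarray(c_prev, dtype=float) - center)
    return center, radius

def satisfies_invariant(c_hat, c_prev, C_curr):
    # Lemma 1 invariant: ||c_hat - C_curr|| is strictly smaller
    # than ||c_prev - C_curr||.
    return np.less(np.linalg.norm(c_hat - C_curr),
                   np.linalg.norm(c_prev - C_curr))
      </preformat>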
    </sec>
    <sec id="sec-4">
      <title>4. Approach</title>
      <p>In this section, we explain our proposed approach and
the baseline approach.</p>
      <sec id="sec-4-1">
        <title>4.1. Overview - KMeans Guarantee (Baseline)</title>
        <p>
          We took "Differentially Private K-Means Clustering with
Convergence Guarantee" [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] as our baseline and
improved upon it. The baseline proceeds as follows:
        </p>
        <p>1. Let the differentially private centroid at iteration
t-1 be $c_i^{(t-1)}$. Using this centroid, run one iteration of
Lloyd's algorithm to get the current Lloyd's centroid
$C_i^{(t)}$ for each cluster $i$.
2. Using $C_i^{(t)}$ and $c_i^{(t-1)}$, generate a convergent zone
for each cluster $i$ as described in Definition 3, and an
orientation controller $\theta_i^{(t)}$ for each cluster, as
defined in Definitions 3 and 4.
3. Generate a sampling zone in the convergent zone.
4. Sample a differentially private $\hat{c}_i^{(t)}$ with ExpDP
in the sampling zone generated in step 3.</p>
        <p>The convergent zone (for the convergence
guarantee) and the sampling zone (for centroid updating) are
defined in Definition 3. A toy sketch of the baseline's
random orientation follows.</p>
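        <p>The following 2-D sketch illustrates the baseline's randomly oriented sampling direction; the rotation-based construction is our reading of Definition 4 and Figure 1 (the paper works in arbitrary dimension), and the function name is ours.</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)

def baseline_sampling_direction(c_prev, C_curr):
    # Toy 2-D sketch: take the reference direction from the Lloyd mean
    # C_curr towards the previous noisy centroid c_prev and rotate it
    # by a random angle theta drawn from [-pi/2, pi/2].
    ref = np.asarray(c_prev, dtype=float) - np.asarray(C_curr, dtype=float)
    theta = rng.uniform(-np.pi / 2, np.pi / 2)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return rot @ ref  # direction along which the sampling zone is placed
      </preformat>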
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Overview - SubCluster Guarantee</title>
        <p>We build upon the KMeans Guarantee algorithm to
achieve better clustering quality. Our idea differs from
the baseline in how the sampling zone is created. For
each cluster, we execute Lloyd's algorithm over its
convergent zone to generate its sub-clustering. Further, we
assign each sub-cluster a probability linearly
proportional to the number of points it contains. Finally, we
sample a sub-cluster based on the assigned probability
and define it as the sampling zone of the convergent zone.</p>
        <p>Drawing an analogy from the KMeans Guarantee
algorithm, our approach proceeds as follows:
1. Let the differentially private centroid at iteration
t-1 be $c_i^{(t-1)}$. Using this centroid, run one iteration of
Lloyd's algorithm to get the current Lloyd's centroid
$C_i^{(t)}$ for each cluster $i$.
2. Using $C_i^{(t)}$ and $c_i^{(t-1)}$, generate a convergent zone
for each cluster $i$.
3. SubCluster the convergent zone and sample one of the
sub-clusters as our sampling zone, based on the probability
assigned to each sub-cluster; the probability assignment is
directly proportional to the number of points in each
sub-cluster.
4. Sample a differentially private $\hat{c}_i^{(t)}$ with ExpDP
in the sampling zone generated in step 3.</p>
        <sec id="sec-4-2-1">
          <title>SubClustering Algorithm</title>
          <p>Algorithm 1: Differentially Private K-Means
SubClustering.
Input: $X = \{x_1, x_2, \ldots, x_N\}$: dataset with N data
points; $k$: number of clusters; $\varepsilon$: ExpDP privacy
budget; $\varepsilon_{lap}$: Laplacian privacy budget for the
converged centroids; internalK: number of sub-clusters per
cluster; iters: number of iterations to run the algorithm.
Output: S: final clustering centroids.
1 Select $k$ centroids $S^{(0)} = (c_1^{(0)}, c_2^{(0)}, \ldots, c_k^{(0)})$
uniformly from X;
2 for iteration t in iters do
3   for each cluster i at iteration t do
4     $O_i^{(t)}$ ← assign each $x_j$ to its closest centroid;
5     ConvergentZone$_i^{(t)}$ ← points of $O_i^{(t)}$ lying
inside the spherical region having $C_i^{(t)}$ and $c_i^{(t-1)}$
as the endpoints of its radius;
6     SamplingZone$_i^{(t)}$ ← run Algorithm 2 using
ConvergentZone$_i^{(t)}$ and internalK;
7     $\hat{c}_i^{(t)}$ ← sample from SamplingZone$_i^{(t)}$
using ExpDP with $\varepsilon$ and $q(\cdot)$;
8     $c_i^{(t)}$ ← $\hat{c}_i^{(t)}$;
9 S ← add Laplace noise with $\varepsilon_{lap}$ to $S^{(t)}$;
10 Publish: S</p>
          <p>Algorithm 2: SubClusterSampling.
Input: ConvergentZone$_i^{(t)}$: convergent zone;
internalK: sub-clustering K.
Output: SamplingZone$_i^{(t)}$.
1 $C_i^{(t)}$: mean of $O_i^{(t)}$;
2 ConvergentZoneClusters ← cluster ConvergentZone$_i^{(t)}$
using Lloyd's algorithm and internalK;
3 ConvergentZoneProbability ← assign probabilities to the
ConvergentZoneClusters proportional to the number of
points inside each cluster;
4 SamplingZone$_i^{(t)}$ ← sample a cluster from the
ConvergentZoneClusters using ConvergentZoneProbability;
5 Return: SamplingZone$_i^{(t)}$</p>
          <p>Our approach surpasses the baseline approach in terms
of clustering quality while maintaining the same DP
requirements as the KMeans Guarantee approach,
as is evident from the results obtained (Figure
3). The better clustering quality is a result of our
subclustering strategy, which perturbs the centroid towards
the direction of the actual centroid generated by Lloyd's
algorithm with a higher probability than the baseline
approach does.</p>
          <p>The pseudo-code of our approach is shown in
Algorithm 1 and Algorithm 2.</p>
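          <p>As a companion to the pseudo-code, here is a minimal Python sketch of one sampling step (Algorithm 2 plus the ExpDP draw). The scoring function that favours points near the Lloyd mean, and all function names, are our illustrative assumptions; scikit-learn's KMeans stands in for Lloyd's algorithm.</p>
          <preformat>
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def subcluster_sampling_zone(zone_points, internal_k):
    # Algorithm 2 sketch: sub-cluster the convergent zone, then pick one
    # sub-cluster with probability proportional to its size.
    labels = KMeans(n_clusters=internal_k, n_init=10,
                    random_state=0).fit_predict(zone_points)
    sizes = np.bincount(labels, minlength=internal_k).astype(float)
    chosen = rng.choice(internal_k, p=sizes / sizes.sum())
    return zone_points[labels == chosen]

def sample_dp_centroid(sampling_zone, C_curr, epsilon, sensitivity=1.0):
    # ExpDP draw: candidates are the points of the sampling zone, scored
    # by negative distance to the current Lloyd mean C_curr.
    scores = -np.linalg.norm(sampling_zone - C_curr, axis=1)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()
    probs = np.exp(logits)
    probs /= probs.sum()
    return sampling_zone[rng.choice(len(sampling_zone), p=probs)]
          </preformat>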
          <p>
            Lemma 1 [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]: A randomised iterative algorithm $A$ is
convergent if, in $O_i^{(t)}$ (cluster $i$ at iteration $t$),
$\hat{c}_i^{(t)}$ (centroid sampled using $A$), $C_i^{(t)}$
(centroid of $O_i^{(t)}$) and $c_i^{(t-1)}$ (centroid before
re-centering) satisfy the invariant
$\|\hat{c}_i^{(t)} - C_i^{(t)}\| &lt; \|c_i^{(t-1)} - C_i^{(t)}\|$
in Euclidean distance, $\forall i, t$.
          </p>
          <p>
            We reproduce this lemma from our baseline approach
[
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. Lemma 1 and Lemma 2 together provide the
completeness and the proof of convergence of our approach.
If the distance of the sampled centroid $\hat{c}_i^{(t)}$ from
$C_i^{(t)}$ exceeds $\|c_i^{(t-1)} - C_i^{(t)}\|$, then the loss
will increase. However, if we can ensure that any sampled
point from SamplingZone$_i^{(t)}$ fulfills the condition, it
will lead to a lower loss than $c_i^{(t-1)}$, thus resulting
in convergence of the randomised iterative algorithm. For
the mathematical proof, refer to [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ].
          </p>
          <p>Lemma 2: The Differentially Private K-Means
SubClustering approach (SubClustering) is a randomised
iterative algorithm that satisfies the invariant
$\|\hat{c}_i^{(t)} - C_i^{(t)}\| &lt; \|c_i^{(t-1)} - C_i^{(t)}\|$.
Proof: SubClustering is an iterative algorithm that
samples a set of centroids for each iteration with the
ExpDP mechanism, thus making it a randomised
iterative algorithm. It subclusters the points lying inside
ConvergentZone$_i^{(t)}$. After subclustering, it samples one
subcluster (the sampling zone) with the assigned
probabilities (linearly proportional to the number of data
points in each subcluster). Finally, it samples a datapoint
from the sampling zone using ExpDP. Therefore, the sampled
point lies inside the convergent zone and satisfies the
invariant.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>5.1. Dataset Used</title>
        <p>
          We used the following four datasets to test our
SubCluster Guarantee approach against the baseline:
1. Iris [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] comprises a total of 150 datapoints
with four features and three classes.
2. Wine [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] comprises a total of 178
datapoints with 13 features and three classes.
3. Breast Cancer [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] comprises a total of
569 datapoints with 30 features and two classes.
4. Digits [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] comprises 1797 datapoints
with 64 dimensions and 10 classes.
A loading sketch follows this list.
        </p>
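        <p>A minimal loading sketch; using scikit-learn's bundled copies of these UCI datasets is our assumption (the paper cites the UCI repository [13]):</p>
        <preformat>
from sklearn import datasets

# Each loader returns (X, y); shapes match the counts quoted above.
loaders = {
    "Iris": datasets.load_iris,                    # 150 x 4, 3 classes
    "Wine": datasets.load_wine,                    # 178 x 13, 3 classes
    "Breast_Cancer": datasets.load_breast_cancer,  # 569 x 30, 2 classes
    "Digits": datasets.load_digits,                # 1797 x 64, 10 classes
}
for name, load in loaders.items():
    X, y = load(return_X_y=True)
    print(name, X.shape, len(set(y)))
        </preformat>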
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Metric for Clustering Quality</title>
        <p>
          To evaluate the clustering quality, we used the
following equation to calculate the normalised difference
between the costs of the differentially private algorithm
(here, SubCluster Guarantee or KMeans Guarantee) and
non-private Lloyd's algorithm:
$\mathrm{CostGap} = \frac{|\,\mathrm{Cost}_{DP} - \mathrm{Cost}_{Lloyd}\,|}{\mathrm{Cost}_{Lloyd}}$ (1)
A smaller CostGap [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] represents a better quality of
clustering. In the experiments, we compare the clustering
quality of SubCluster Guarantee with KMeans Guarantee.
        </p>
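        <p>A short sketch of this metric (the helper names are ours; the cost is the usual k-means objective):</p>
        <preformat>
import numpy as np

def kmeans_cost(X, centroids):
    # Sum of squared distances from each point to its nearest centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def cost_gap(X, dp_centroids, lloyd_centroids):
    # Equation (1): normalised difference between DP and non-private costs.
    c_dp = kmeans_cost(X, np.asarray(dp_centroids))
    c_np = kmeans_cost(X, np.asarray(lloyd_centroids))
    return abs(c_dp - c_np) / c_np
        </preformat>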
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results and Discussion</title>
      <p>We tested our algorithm on four datasets. The datasets
have different dimensionalities, ranging from 4 to 64, and
training-set sizes ranging from 150 to about 1800. As
defined in the metric, a smaller gap represents better
clustering quality. From (Figure 3) we can observe
that the cost gap for every dataset is smaller than or equal
to the baseline's. Thus, it is evident that our algorithm has
better clustering quality than the existing work on all the
datasets experimented with. We varied internalK (the
parameter for the number of sub-clusters) from 2 to 5.</p>
      <p>Each experiment was conducted 30 times for the
Iris, Wine, and Breast_Cancer datasets, and 10 times
for the Digits dataset due to computational constraints.
Finally, for each dataset, we took the average over all
runs as our final result for plotting the graphs.</p>
      <p>Comparing the SubCluster Guarantee (proposed
approach) and the KMeans Guarantee approach (baseline) by
averaging the cost gaps over the varied epsilon values
and taking the ratio between the KMeans and SubCluster
approaches:
1. For the Iris dataset, the cost gap is 1.1 times
smaller than the baseline algorithm's.
2. For the Wine dataset, the cost gap is 4.13 times
smaller than the baseline algorithm's.
3. For the Breast_Cancer dataset, the cost gap is
2.83 times smaller than the baseline algorithm's.
4. For the Digits dataset, the cost gap is almost the
same as that of the baseline algorithm.</p>
      <sec id="sec-6-1">
        <title>6.1. Detailed Analysis</title>
        <p>
          1. Iris: The Iris dataset has four dimensions and a very
small training set of 150 data points. Our
algorithm achieves better clustering quality than
the baseline algorithm for smaller epsilon values.
Since the number of data points in Iris is small, the
impact of sub-clustering is reduced, making its
performance similar to that of the baseline
approach. From (Figure 4), we can observe that
changing the value of internalK has a small impact
on the costGap due to the small number of points
in each sub-cluster. This is because a sub-cluster
may contain no data points when internalK is
increased, causing zero-probability sub-cluster regions.
2. Wine: The Wine dataset has 13 dimensions and
178 data points in the training set. Our algorithm
performs significantly better than the baseline, as
observed in (Figure 3). This is because the
baseline algorithm is constrained to choose a theta in
an arbitrary direction in the range [-π/2, π/2], as
shown in (Figure 1). In contrast, our algorithm
shifts the centroids in the direction where the
future centroid of Lloyd's algorithm is more likely
to move (in the expected case). From (Figure 4),
it is evident that internalK=4 performs better for
the Wine dataset than the other internalK values.
Here, the number of dimensions is larger than
Iris's, so the spatial arrangement lies in an
n-sphere, which allows better sub-clustering.
3. Breast_Cancer: The Breast_Cancer dataset has 569
data points in its training set and 30 dimensions.
Our algorithm performs substantially better than
the baseline, with internalK equal to 4. From
(Figure 3), we can observe that there is no
monotonic trend in the costGap. Trends are
visible in the other datasets due to their larger
number of classification classes, whereas this dataset
has only two classes. Thus, adding Laplace noise
does not have a clear relation to the clustering quality.
Increasing internalK improves the clustering
quality, with internalK of 4 having the lowest
loss. This is because the dataset has a high number
of dimensions and more training points than the
other datasets.
4. Digits: Digits has 64 dimensions and 1797 data
points in the training set. Although it has a large
number of dimensions, our algorithm shows only a
very small improvement over the baseline algorithm,
as seen in (Figure 3). Because of the higher time
complexity of our algorithm, it is hard to tune
the internalK parameter. As the number of
samples in a dataset increases, internalK should
increase, because a single cluster can contain a
large number of data points. But, due to limited
computational resources, we were not able to
experiment with this further. We took internalK to
be 5 for our experiments, as it performed best in
the range [2, 5], as shown in (Figure 4). One
intriguing finding in this dataset's results is that
the curves for different internalK values show a
clearly evident trend, which is a result of the large
number of training data points.
        </p>
        <p>Our proposed algorithm significantly improves over the
baseline in terms of clustering quality, especially for the
Wine and Breast_Cancer datasets. In addition, our
algorithm maintains the same DP requirements as the
existing works.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This work presents a novel method for improving the
clustering quality of differentially private k-means
algorithms while ensuring convergence. The novelty of
our approach is the sub-clustering of each cluster to select
the differentially private centroid, which has a higher
probability of moving in the direction of the next
centroid. We demonstrated that our work surpasses the
current state-of-the-art algorithms in terms of clustering
quality. In particular, for the Wine and Breast_Cancer
datasets, the clustering quality improved by 4.13 and
2.83 times over the baseline. In addition, we maintain
the same DP requirements as the baseline and other
existing approaches.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Future Work</title>
      <p>
        • In this work, we supported our claim with empirical
results. We further plan to validate the results
by providing mathematical bounds for the
convergence degree and rate of the SubClustering
Lloyd's algorithm. In terms of clustering quality,
the proposed algorithm is compared with KMeans
Guarantee clustering only; to prove the
effectiveness of our work, we plan to experiment
with other algorithms in the literature, including
PrivGene [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], GUPT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and
Dwork [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
• The DP requirements in this work are the same
as in past literature, but in the future we plan
to explore ways to improve the current DP
guarantees while maintaining the same clustering
quality as in this work.
• We used the Exponential and Laplace mechanisms
of DP in the proposed approach; we further plan
to explore a third mechanism, i.e., the
sample-and-aggregate framework, by integrating it
with the current algorithm.
• In our algorithm, the number of data points inside
a cluster is variable. Thus, we plan to choose an
internalK custom to the size of each cluster to
improve the clustering quality.
      </p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgement</title>
      <p>We would like to thank Prof. Anirban Dasgupta
(IIT Gandhinagar) for his continuous support and
guidance throughout the research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <article-title>Differential privacy: A survey of results</article-title>
          ,
          <source>in: International conference on theory and applications of models of computation</source>
          , Springer,
          <year>2008</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>Data privacy: The noninteractive setting</article-title>
          , The University of Texas at Austin,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Differentially private k-means clustering with convergence guarantee</article-title>
          ,
          <source>IEEE Transactions on Dependable and Secure Computing</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bertino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>Differentially private k-means clustering</article-title>
          ,
          <source>in: Proceedings of the sixth ACM conference on data and application security and privacy</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <article-title>Differentially private m-estimators</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>24</volume>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bertino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>Differentially private k-means clustering and a hybrid approach to private optimization</article-title>
          ,
          <source>ACM Transactions on Privacy and Security (TOPS) 20</source>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <article-title>A firm foundation for private data analysis</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>54</volume>
          (
          <year>2011</year>
          )
          <fpage>86</fpage>
          -
          <lpage>95</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thakurta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Culler</surname>
          </string-name>
          ,
          <article-title>Gupt: privacy preserving data analysis made easy</article-title>
          ,
          <source>in: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>349</fpage>
          -
          <lpage>360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>McSherry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Calibrating noise to sensitivity in private data analysis</article-title>
          ,
          <source>in: Theory of cryptography conference</source>
          , Springer,
          <year>2006</year>
          , pp.
          <fpage>265</fpage>
          -
          <lpage>284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>McSherry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Talwar</surname>
          </string-name>
          ,
          <article-title>Mechanism design via differential privacy</article-title>
          ,
          <source>in: 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07)</source>
          , IEEE,
          <year>2007</year>
          , pp.
          <fpage>94</fpage>
          -
          <lpage>103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Raskhodnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Smooth sensitivity and sampling in private data analysis</article-title>
          ,
          <source>in: Proceedings of the thirtyninth annual ACM symposium on Theory of computing</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>84</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Blum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>McSherry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <article-title>Practical privacy: the sulq framework</article-title>
          ,
          <source>in: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Asuncion</surname>
          </string-name>
          ,
          <article-title>UCI machine learning repository</article-title>
          , University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Winslett</surname>
          </string-name>
          ,
          <article-title>Privgene: diferentially private model fitting using genetic algorithms</article-title>
          ,
          <source>in: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>665</fpage>
          -
          <lpage>676</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>