=Paper=
{{Paper
|id=Vol-3318/paper5
|storemode=property
|title=k-Means SubClustering: A Differentially Private Algorithm with Improved Clustering Quality
|pdfUrl=https://ceur-ws.org/Vol-3318/paper5.pdf
|volume=Vol-3318
|authors=Devvrat Joshi,Janvi Thakkar
|dblpUrl=https://dblp.org/rec/conf/cikm/JoshiT22
}}
==k-Means SubClustering: A Differentially Private Algorithm with Improved Clustering Quality==
Devvrat Joshi and Janvi Thakkar (Indian Institute of Technology Gandhinagar, India). Both authors contributed equally. Contact: devvrat.joshi@iitgn.ac.in (D. Joshi); janvi.thakkar@iitgn.ac.in (J. Thakkar).

Abstract

In today's data-driven world, the sensitivity of information has been a significant concern. With this data and additional information on a person's background, one can easily infer an individual's private data. Many differentially private iterative algorithms have been proposed in interactive settings to protect an individual's privacy from these inference attacks. The existing approaches compute differentially private (DP) centroids by iterating Lloyd's algorithm and perturbing the centroids with various DP mechanisms. These DP mechanisms do not guarantee convergence of the differentially private iterative algorithm and degrade the quality of the clusters. Thus, in this work, we further extend the previous work on "Differentially Private k-Means Clustering With Convergence Guarantee" by taking it as our baseline. The novelty of our approach is to sub-cluster the clusters and then select the centroid which has a higher probability of moving in the direction of the future centroid. At every Lloyd's step, the centroids are injected with noise using the exponential DP mechanism. The results of the experiments indicate that our approach outperforms the current state-of-the-art method, i.e., the baseline algorithm, in terms of clustering quality while maintaining the same differential privacy requirements. The clustering quality improved by 4.13 and 2.83 times over the baseline for the Wine and Breast_Cancer datasets, respectively.

Keywords: differential privacy, k-means clustering, convergence guarantee

1. Introduction

Achieving extraordinary results depends on the data on which machine learning models are trained. Data curators have a responsibility to provide datasets such that the privacy of the data is not compromised. However, attackers use other public datasets to perform inference and adversarial attacks to obtain information about an individual in the dataset. Differential privacy is a potential technique for giving customers a mathematical guarantee of the privacy of their data [1]. There are two fundamental settings in which differential privacy is used on data: in the interactive setting, the data curator holds the data and returns responses to the queries requested by third parties, while in the non-interactive setting the curator sanitizes the data before publishing it [2]. (The interactive setting implies that the dataset is not disclosed to the user; instead, the data curator returns the response to each query received from the user after manipulating it with a DP strategy.)

Iterative clustering algorithms provide important insights about a dataset, which helps in a large number of applications. They are prone to privacy threats because they can reveal information about an individual when combined with additional knowledge. Existing approaches obtain the set of centroids using Lloyd's k-means algorithm and then perturb them with a differentially private mechanism to add privacy [3]. In contrast to Lloyd's k-means algorithm, which guarantees convergence, these algorithms do not provide any convergence guarantee. Getting differentially private centroids might not help in getting quality inferences because of this non-convergence. We studied an existing approach that provides this guarantee and converges in twice the number of iterations of Lloyd's algorithm while maintaining the same differential privacy requirements as existing works [4][5]. Their algorithm perturbs the centroids in a random direction from the center of the cluster. However, this lowers the quality of clustering, which is necessary for making inferences.

In this work, we propose a variant of the existing approach which provides better clustering quality while using the same privacy budget. We use the intuition of Lloyd's algorithm that the next centroid will move in the direction where there is a higher number of data points. Finally, we give a mathematical proof that our approach, at any instance, gives better clustering quality than the existing approaches. We tested our approach on the Breast_Cancer, Wine, Iris, and Digits datasets and obtained a significant improvement over the previous approach in terms of clustering quality.
Our main contributions include:

1. We propose the SubClustering approach, which has better clustering quality than the baseline (the current SOTA in terms of clustering quality). For the Wine and Breast_Cancer datasets, the clustering quality improved by 4.13 and 2.83 times, respectively.
2. In addition to improving the clustering quality, our algorithm uses the same privacy budget as the existing work.

2. Related Work

The concept of differential privacy has inspired a plethora of studies, particularly in the area of differentially private k-means clustering [6][7][8] in an interactive setting. The important DP mechanisms in the literature include the Laplace mechanism (LapDP) [9], the exponential mechanism (ExpDP) [10], and the sample and aggregate framework [11]. To achieve differential privacy, many implementations infused Laplace noise into each iteration of Lloyd's algorithm, with the proportion of noise based on a fixed privacy budget. Some of the strategies for allocating the privacy budget split the overall budget uniformly across iterations [12]. However, this requires calculating the number of iterations needed for convergence prior to executing the algorithm, which increases the computational cost. Later work overcame this weakness with a theoretically guaranteed optimal allocation method [6], but its major assumption was that every cluster has the same size, which does not align with real-world datasets. In another work, Mohan et al. [8] proposed GUPT, which uses Lloyd's algorithm for local clustering of each bucket, where the items are uniformly sampled into different buckets; the final result is the mean of the locally clustered points in each bucket with added Laplace noise. However, the clustering quality of GUPT was unsatisfying because a large amount of noise was added in the aggregation stage.

Based on the study of past literature on differentially private k-means clustering, Zhigang et al. [3] concluded that convergence of an iterative algorithm is important to the clustering quality. To address this, they introduced the concepts of the convergent zone and the orientation controller, with which they create a sampling zone for selecting a potential centroid for the t-th iteration. The approach iteratively adds noise with an exponential mechanism (ExpDP) by using prior and future knowledge of the potential centroid at every step of Lloyd's algorithm. It maintains the same DP requirements as the existing literature, with guaranteed convergence and improvement in clustering quality. However, their algorithm perturbs the centroids in a random direction from the center of the cluster, degrading the quality of clustering. Thus, in this work, we further build upon this approach and significantly improve the clustering quality with the same epsilon privacy.

3. Preliminaries

The definitions used in this work are briefly discussed in this section. The following is the formal definition of differential privacy:

Definition 1 (ε-DP [9]). A randomised mechanism T is ε-differentially private if for all neighbouring datasets X and X' and for an arbitrary answer O ∈ Range(T), T satisfies Pr[T(X) = O] ≤ exp(ε) · Pr[T(X') = O], where ε is the privacy budget.

Here, X and X' differ by only one item. Smaller values of ε imply a better privacy guarantee, because the difference between the two neighbouring datasets is reflected through the privacy budget. In this work, we use ExpDP and LapDP. For non-numeric computation, exponential DP introduces the concept of a scoring function q(X, x), which represents the effectiveness of the pair (X, x); here X is the dataset and x is a candidate response of q(X, x) on X.

The formal definition of the exponential DP mechanism is as follows:

Definition 2 (Exponential Mechanism [10]). Given a scoring function q(X, x) over a dataset X, which reflects the quality of a query response x, the exponential mechanism T provides ε-differential privacy if T(X) = {Pr[x] ∝ exp(ε · q(X, x) / (2Δq))}, where Δq is the sensitivity of the scoring function q(X, x) and ε is the privacy budget.
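To make Definition 2 concrete, the following is a minimal Python sketch of how a candidate response could be sampled with the exponential mechanism. The candidate set, the scoring function, and the sensitivity value in the usage example are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def exponential_mechanism(candidates, score, delta_q, epsilon, rng=None):
    """Sample one candidate with probability proportional to
    exp(epsilon * q(X, x) / (2 * delta_q)), as in Definition 2.

    candidates : list of candidate responses x
    score      : callable giving the scoring function q(X, x) for a candidate
    delta_q    : sensitivity of the scoring function
    epsilon    : privacy budget for this invocation
    """
    rng = rng or np.random.default_rng()
    scores = np.array([score(x) for x in candidates], dtype=float)
    # Subtract the max score for numerical stability; this does not change
    # the normalised sampling probabilities.
    logits = epsilon * (scores - scores.max()) / (2.0 * delta_q)
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Hypothetical usage: prefer candidate points close to a target point.
if __name__ == "__main__":
    target = np.array([1.0, 2.0])
    candidates = [np.array([1.1, 2.0]), np.array([0.0, 0.0]), np.array([5.0, 5.0])]
    picked = exponential_mechanism(
        candidates, lambda x: -np.linalg.norm(x - target), delta_q=1.0, epsilon=1.0)
    print(picked)
```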
Definition 3 (Convergent & Sampling Zones [3]). A region whose points S satisfy the condition ||S - O_i^(t)|| < ||O_i^(t-1) - O_i^(t)|| is the convergent zone, where O_i^(t) is defined as the mean of C_i^(t). A sub-region inside the convergent zone is defined as a sampling zone.

Definition 4 (Orientation Controller [3]). θ_i^(t) is a direction from the center of the convergent zone to a point on its circumference. The direction along which the center of the sampling zone will be sampled is defined as the orientation controller.

4. Approach

In this section, we explain the baseline approach and our proposed approach.

4.1. Overview - KMeans Guarantee (Baseline)

We took "Differentially Private K-Means Clustering with Convergence Guarantee" [3] as our baseline and improved its clustering quality by further building on it. The key concept of the baseline algorithm is to use ExpDP to introduce bounded noise into the centroids at each iteration of Lloyd's algorithm. The technique is designed so that the new centroid is different from the centroid of Lloyd's algorithm while maintaining the constraint given in Lemma 1. The constraint guarantees that the perturbed centroid will eventually converge to the centroid of Lloyd's algorithm.

Their algorithm has four main steps to update the centroids at each Lloyd step t [3]; an overview of the approach is shown in Figure 1 (Overview of the KMeans Guarantee approach).

1. Let the differentially private centroid at iteration t-1 for a cluster i be Ō_i^(t-1). Using this centroid, run one iteration of Lloyd's algorithm to get the current Lloyd's centroid O_i^(t) for each cluster i.
2. Using O_i^(t) and O_i^(t-1), generate a convergent zone for each cluster i as described in Definition 3.
3. Generate a sampling zone inside the convergent zone and an orientation controller θ_i^(t) for each cluster i, as defined in Definitions 3 and 4, respectively.
4. Sample a differentially private Ō_i^(t) with ExpDP in the sampling zone generated in step 3.

The definitions of the convergent zone (for the convergence guarantee) and the sampling zone (for centroid updating) are given in Definition 3.
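As an illustration of steps 1 and 2 above, the following Python sketch performs one Lloyd recentering step and collects the points that fall inside the convergent zone of a cluster. The ball test (points closer to the new centroid than the distance between the old and new centroids) follows Definition 3; the function names and array layout are assumptions for illustration.

```python
import numpy as np

def lloyd_step(X, centroids):
    """One Lloyd iteration: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([
        X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
        for i in range(len(centroids))
    ])
    return labels, new_centroids

def convergent_zone(cluster_points, o_new, o_old):
    """Points of a cluster inside the ball centred at the new Lloyd centroid
    o_new with radius ||o_new - o_old|| (Definition 3).

    cluster_points : (n, d) array of the points assigned to this cluster.
    """
    radius = np.linalg.norm(o_new - o_old)
    mask = np.linalg.norm(cluster_points - o_new, axis=1) < radius
    return cluster_points[mask]
```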
4.2. Overview - SubCluster Guarantee

We build upon the KMeans Guarantee algorithm to achieve better clustering quality. Our idea differs from the baseline in how the sampling zone is created. For each cluster, we execute Lloyd's algorithm over its convergent zone to generate a sub-clustering. We then assign each sub-cluster a probability linearly proportional to the number of points it contains, sample one sub-cluster based on these probabilities, and define it as the sampling zone of the convergent zone. Drawing an analogy with the KMeans Guarantee algorithm, our orientation controller is this sub-clustering and sampling technique. Intuitively, our algorithm ensures that the sampling zone lies towards the region containing a higher number of data points in the expected case. With this, we guarantee that our differentially private centroid moves in the direction where the number of data points is higher, incorporating the intuition of Lloyd's algorithm without compromising on ε-differential privacy: the probability that the differentially private centroid at the (t-1)-th iteration moves in the direction of the more populated region at the t-th step of Lloyd's algorithm is also high. Thus, we introduce the concept of sub-clustering in the convergent zone and consequently sample one sub-cluster as our sampling zone.

We sample the centroid from the sampling zone using the ExpDP mechanism. Finally, we inject Laplace noise into the centroids of the clustering when our algorithm converges, because the differentially private centroids obtained are a subset of one of the local minima at which Lloyd's algorithm converges. An overview of the proposed approach is shown in Figure 2 (Overview of the SubCluster Guarantee approach). We show that a randomized iterative algorithm satisfying the invariant given in the claim of Lemma 1 always converges (proof: refer to Lemma 1), and that the SubCluster algorithm is a randomized iterative algorithm that satisfies this invariant (proof: refer to Lemma 2).

We have four main steps to update the centroids at each Lloyd step t:

1. Let the differentially private centroid at iteration t-1 for a cluster i be Ō_i^(t-1). Using this centroid, run one iteration of Lloyd's algorithm to get the current Lloyd's centroid O_i^(t) for each cluster i.
2. Using O_i^(t) and O_i^(t-1), generate a convergent zone for each cluster i as described in Definition 3.
3. Sub-cluster the convergent zone and sample one of the sub-clusters as our sampling zone, based on the probability assigned to each sub-cluster. The probability assignment is directly proportional to the number of points in each sub-cluster.
4. Sample a differentially private Ō_i^(t) with ExpDP in the sampling zone generated in step 3.

Our approach surpasses the baseline in terms of clustering quality while maintaining the same DP requirements as the KMeans Guarantee approach, which is evident from the results obtained (Figure 3). The better clustering quality is a result of our sub-clustering strategy, which perturbs the centroid with a higher probability than the baseline approach towards the direction of the actual centroid generated by Lloyd's algorithm. The pseudo-code of our approach is shown in Algorithm 1 and Algorithm 2.

Algorithm 1: Differentially Private k-Means SubClustering Algorithm
Input: X = {x_1, x_2, ..., x_N}: dataset with N data points; k: number of clusters; ε_exp: ExpDP privacy budget; ε_lap: Laplace privacy budget for the converged centroids; internalK: number of sub-clusters per cluster.
Output: S: final clustering centroids.
1. Select k centroids S = (O_1^(0), O_2^(0), ..., O_k^(0)) uniformly from X.
2. iterationsForLloyd = number of iterations to run the algorithm.
3. for iteration t in iterationsForLloyd do
4.   for each cluster i at iteration t do
5.     C_i^(t) <- assign each x_j to its closest centroid O_i^(t-1);
6.     O_i^(t) <- centroid of C_i^(t);
7.     ConvergentZone^(t) <- list of data points inside the spherical region having O_i^(t) and O_i^(t-1) as the endpoints of its radius;
8.     SamplingZone^(t) <- run Algorithm 2 using ConvergentZone^(t) and internalK;
9.     Ō_i^(t) <- sample from SamplingZone^(t) using ExpDP with ε_exp;
10.    O_i^(t) <- Ō_i^(t);
11.  Publish: SamplingZone^(t), ε_exp, O_i^(t).
12. S <- add Laplace noise with ε_lap to S^(t).

Algorithm 2: SubClusterSampling Algorithm
Input: ConvergentZone: convergent zone; internalK: sub-clustering K.
Output: SamplingZone_i^(t).
1. S^(t): mean of ConvergentZone^(t).
2. ConvergentZoneClusters <- cluster ConvergentZone using Lloyd's algorithm and internalK.
3. ConvergentZoneProbability <- assign probabilities to the ConvergentZoneClusters proportional to the number of points inside each cluster.
4. SamplingZone_i^(t) <- sample a cluster from the ConvergentZoneClusters using ConvergentZoneProbability.
5. Return: SamplingZone_i^(t).
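The following is a minimal Python sketch of the sub-cluster sampling step (Algorithm 2), assuming scikit-learn's KMeans as the inner Lloyd clustering over the convergent zone; the function and variable names are illustrative, not taken from the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def subcluster_sampling(convergent_zone, internal_k, rng=None):
    """Algorithm 2 (sketch): sub-cluster the convergent zone with Lloyd's
    algorithm, then pick one sub-cluster with probability proportional to
    its size and return it as the sampling zone.

    convergent_zone : (n, d) array of points inside the convergent zone.
    internal_k      : number of sub-clusters per cluster.
    """
    rng = rng or np.random.default_rng()
    n_points = len(convergent_zone)
    if n_points == 0:
        return convergent_zone  # empty zone: nothing to sample from
    k = min(internal_k, n_points)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(convergent_zone)
    sizes = np.bincount(labels, minlength=k).astype(float)
    probs = sizes / sizes.sum()      # probability proportional to sub-cluster size
    chosen = rng.choice(k, p=probs)  # sampled sub-cluster = sampling zone
    return convergent_zone[labels == chosen]
```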
Lemma 1 [3]. A randomised iterative algorithm T is convergent if, in C_i^(t) (cluster i at iteration t), Ō_i^(t) (the centroid sampled using T), O_i^(t-1) (the centroid before recentering) and O_i^(t) (the centroid of C_i^(t)) satisfy the invariant ||Ō_i^(t) - O_i^(t)|| < ||O_i^(t) - O_i^(t-1)|| in Euclidean distance, for all t, i.

We reproduce this lemma from our baseline approach [3]. Lemma 1 and Lemma 2 together provide the completeness and the proof of convergence of our approach. If the distance between the centroid Ō_i^(t) sampled from C_i^(t) and the new centroid O_i^(t) is less than the distance between the new centroid O_i^(t) and the old centroid O_i^(t-1), then the random iterative algorithm will always converge. Intuitively, the loss of C_i^(t) is minimum if the mean of C_i^(t) is taken as the centroid; if we shift slightly away from the mean of C_i^(t), the loss increases. However, if we can ensure that any sampled point from C_i^(t) fulfils the condition ||Ō_i^(t) - O_i^(t)|| < ||O_i^(t) - O_i^(t-1)||, it will lead to a smaller loss than J_i^(t-1) (the loss at the previous iteration), thus resulting in convergence of the randomised iterative algorithm. For the mathematical proof, refer to [3].
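This loss argument can be checked numerically: by the standard decomposition of the k-means cost, the cost of a cluster around any point p equals the cost around its mean plus n·||p - mean||², so any sampled centroid that is closer to the cluster mean than the previous centroid has a strictly smaller cluster loss. A small self-contained check, using synthetic data chosen purely for illustration:

```python
import numpy as np

def cluster_loss(points, center):
    """Sum of squared Euclidean distances of the cluster points to a center."""
    return float(np.sum(np.linalg.norm(points - center, axis=1) ** 2))

rng = np.random.default_rng(0)
cluster = rng.normal(size=(200, 2))        # synthetic cluster C_i^(t)
o_new = cluster.mean(axis=0)               # O_i^(t): Lloyd's centroid (the mean)
o_old = o_new + np.array([1.0, 0.0])       # O_i^(t-1): previous centroid

# A sampled centroid satisfying the invariant is strictly closer to the mean...
sampled = o_new + np.array([0.4, 0.2])
assert np.linalg.norm(sampled - o_new) < np.linalg.norm(o_new - o_old)

# ...and therefore has a strictly smaller cluster loss than the previous centroid.
assert cluster_loss(cluster, sampled) < cluster_loss(cluster, o_old)
print(cluster_loss(cluster, o_old), cluster_loss(cluster, sampled))
```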
Lemma 2. The Differentially Private k-Means SubClustering approach (SubClustering) is a randomised iterative algorithm that satisfies the invariant ||Ō_i^(t) - O_i^(t)|| < ||O_i^(t) - O_i^(t-1)||.

Proof. SubClustering is an iterative algorithm that samples a set of centroids at each iteration with the ExpDP mechanism, thus making it a randomised iterative algorithm. It sub-clusters the points lying inside ConvergentZone^(t). After sub-clustering, it samples one sub-cluster (the sampling zone) with the assigned probabilities (linearly proportional to the number of data points in each sub-cluster). Finally, it samples a data point from the sampled sub-cluster with ExpDP and calls it the centroid of ConvergentZone^(t). Thus, our sampling zone always lies inside ConvergentZone^(t). Therefore, the sampled point lies inside ConvergentZone^(t), and it satisfies the invariant ||Ō_i^(t) - O_i^(t)|| < ||O_i^(t) - O_i^(t-1)||.

5. Experimental Setup

5.1. Dataset Used

We used the following four datasets to test our SubCluster Guarantee approach against the baseline:

1. Iris [13]: 150 data points with four features and three classes.
2. Wine [13]: 178 data points with 13 features and three classes.
3. Breast_Cancer [13]: 569 data points with 30 features and two classes.
4. Digits [13]: 1797 data points with 64 dimensions and 10 classes.

5.2. Metric for Clustering Quality

To evaluate the clustering quality, we used the following equation to calculate the normalised difference between the cost of the differentially private algorithm (here, the SubCluster Guarantee approach), Cost_DP, and the cost of Lloyd's algorithm, Cost_Lloyd:

CostGap = |Cost_DP - Cost_Lloyd| / Cost_Lloyd    (1)

A smaller CostGap [3] represents a better quality of clustering. In the experiments, we compare the clustering quality of SubCluster Guarantee with KMeans Guarantee.
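A minimal sketch of how the CostGap metric in Eq. (1) could be computed, assuming the k-means cost is the sum of squared distances of each point to its nearest centroid; the helper names are illustrative and not taken from the paper's code.

```python
import numpy as np

def kmeans_cost(X, centroids):
    """Sum of squared distances of each point to its nearest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.sum(dists.min(axis=1) ** 2))

def cost_gap(X, dp_centroids, lloyd_centroids):
    """CostGap = |Cost_DP - Cost_Lloyd| / Cost_Lloyd  (Eq. 1)."""
    cost_dp = kmeans_cost(X, dp_centroids)
    cost_lloyd = kmeans_cost(X, lloyd_centroids)
    return abs(cost_dp - cost_lloyd) / cost_lloyd
```

For reference, the four UCI datasets listed in Section 5.1 are also available through scikit-learn's load_iris, load_wine, load_breast_cancer, and load_digits loaders.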
6. Results and Discussion

We tested our algorithm on four datasets, with dimensions ranging from 4 to 64 and training sets ranging from 150 to 1800 points. As defined by the metric, a smaller gap represents better clustering quality. From Figure 3 we can observe that the cost gap for all the datasets is smaller than or equal to the baseline; thus, our algorithm has better clustering quality than the existing work on all the datasets experimented on. We varied internalK (the parameter for the number of sub-clusters) from 2 to 5. Each experiment was conducted 30 times for the Iris, Wine, and Breast_Cancer datasets and 10 times for the Digits dataset due to computational constraints. Finally, for each dataset, we took the average over all the experiments as the final result for plotting the graphs.

Figure 3: CostGap versus epsilon budget for the two approaches, the baseline (KMeansGuarantee) and our approach (SubClusterGuarantee), tested on the Digits (top-left), Wine (top-right), Breast Cancer (bottom-left), and Iris (bottom-right) datasets.

Figure 4: CostGap versus epsilon budget for different values of internalK (2, 3, 4, 5) in the SubClusterGuarantee algorithm, for the Digits (top-left), Wine (top-right), Breast Cancer (bottom-left), and Iris (bottom-right) datasets. Note: K and internalK refer to the same parameter.

Comparing the SubCluster Guarantee approach (proposed) and the KMeans Guarantee approach (baseline) by averaging the cost gaps over the varied epsilon values and taking the ratio between the KMeans and SubCluster results:

1. For the Iris dataset, the cost gap is 1.1 times smaller than the baseline algorithm.
2. For the Wine dataset, the cost gap is 4.13 times smaller than the baseline algorithm.
3. For the Breast_Cancer dataset, the cost gap is 2.83 times smaller than the baseline algorithm.
4. For the Digits dataset, the cost gap is almost the same as that of the baseline algorithm.

6.1. Detailed Analysis

1. Iris: The Iris dataset has four dimensions and a very small training set of 150 data points. Our algorithm achieves better clustering quality than the baseline algorithm for smaller epsilon values. Since the number of data points in Iris is small, the impact of sub-clustering reduces, resulting in performance similar to that of the baseline approach. From Figure 4, we can observe that changing the value of internalK has a small impact on the costGap, due to the small number of points in each sub-cluster; there is a possibility that a sub-cluster contains no data point when internalK is increased, causing zero-probability sub-cluster regions.
2. Wine: The Wine dataset has 13 dimensions and 178 data points in the training set. Our algorithm performs significantly better than the baseline, as observed in Figure 3. This is because the baseline algorithm is constrained to choose a theta in an arbitrary direction over the range [-π/2, π/2], as shown in Figure 1. In contrast, our algorithm shifts the centroids in the direction where the future centroid of Lloyd's algorithm is more likely to move (in the expected case). From Figure 4, it is evident that internalK = 4 performs better than the other internalK values for the Wine dataset. Here the number of dimensions is larger than for Iris, so the spatial arrangement lies in an n-sphere, which allows better sub-clustering.
3. Breast_Cancer: The Breast_Cancer dataset has 569 data points in its training set and 30 dimensions. Our algorithm performs exceptionally better than the baseline, with internalK equal to 4. From Figure 3, we can observe that there is no monotonic trend for the costGap. Trends are visible in the other datasets due to their larger number of classification classes, whereas this dataset has only two classes; thus, adding Laplace noise does not have a clear relation to the clustering quality. Increasing internalK improves the clustering quality, with internalK of 4 having the least loss, because this dataset has a high number of dimensions and a larger number of training points than the other datasets.
4. Digits: The Digits dataset has 64 dimensions and 1797 data points in the training set. Although it has a large number of dimensions, our algorithm shows only a very small improvement over the baseline algorithm, as seen in Figure 3. Because of the higher time complexity of our algorithm, it is hard to tune the internalK parameter. As the number of samples in a dataset increases, internalK should increase, because a single cluster can contain a large number of data points; but due to limited computational resources we were not able to experiment with this further. We took internalK to be 5 for our experiments, as it performed best in the range [2, 5] (Figure 4). One intriguing finding in this dataset's results is that the curves for different internalK values show a clearly evident trend, which is a result of the large number of training data points.

Our proposed algorithm significantly improves over the baseline in terms of clustering quality, especially for the Wine and Breast_Cancer datasets. In addition, our algorithm maintains the same DP requirements as the existing works.

7. Conclusion

This work presents a novel method for improving the clustering quality of differentially private k-means algorithms while ensuring convergence. The novelty of our approach is the sub-clustering of each cluster to select the differentially private centroid, which has a higher probability of moving in the direction of the next centroid. We showed that our work surpasses the current state-of-the-art algorithms in terms of clustering quality; in particular, for the Wine and Breast_Cancer datasets, the clustering quality improved by 4.13 and 2.83 times over the baseline. In addition, we maintain the same DP requirements as the baseline and other existing approaches.

8. Future Work

- In this work, we supported our claim using empirical results. We further plan to validate the results by providing mathematical bounds on the convergence degree and rate of the SubClustering Lloyd's algorithm. In terms of clustering quality, the proposed algorithm is compared with the k-means guarantee clustering only; to further demonstrate the effectiveness of our work, we plan to experiment with other algorithms in the literature, including PrivGene [14], GUPT [8], and Dwork [7].
- The DP requirements in this work are the same as in past literature, but in the future we plan to explore ways to improve the current DP guarantees while maintaining the same clustering quality as in this work.
- We used the exponential and Laplace mechanisms of DP in the proposed approach; we further plan to explore the third mechanism, i.e., the sample and aggregate framework, by integrating it with the current algorithm.
- In our algorithm, the number of data points inside a cluster is variable. Thus we plan to choose an internalK custom to the size of each cluster to improve the clustering quality.

Acknowledgement

We would like to thank Prof. Anirban Dasgupta (IIT Gandhinagar) for his continuous support and guidance throughout the research.
References

[1] C. Dwork, Differential privacy: A survey of results, in: International Conference on Theory and Applications of Models of Computation, Springer, 2008, pp. 1-19.
[2] A. Narayanan, Data privacy: The non-interactive setting, The University of Texas at Austin, 2009.
[3] Z. Lu, H. Shen, Differentially private k-means clustering with convergence guarantee, IEEE Transactions on Dependable and Secure Computing (2020).
[4] D. Su, J. Cao, N. Li, E. Bertino, H. Jin, Differentially private k-means clustering, in: Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, 2016, pp. 26-37.
[5] J. Lei, Differentially private m-estimators, Advances in Neural Information Processing Systems 24 (2011).
[6] D. Su, J. Cao, N. Li, E. Bertino, M. Lyu, H. Jin, Differentially private k-means clustering and a hybrid approach to private optimization, ACM Transactions on Privacy and Security (TOPS) 20 (2017) 1-33.
[7] C. Dwork, A firm foundation for private data analysis, Communications of the ACM 54 (2011) 86-95.
[8] P. Mohan, A. Thakurta, E. Shi, D. Song, D. Culler, GUPT: privacy preserving data analysis made easy, in: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012, pp. 349-360.
[9] C. Dwork, F. McSherry, K. Nissim, A. Smith, Calibrating noise to sensitivity in private data analysis, in: Theory of Cryptography Conference, Springer, 2006, pp. 265-284.
[10] F. McSherry, K. Talwar, Mechanism design via differential privacy, in: 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), IEEE, 2007, pp. 94-103.
[11] K. Nissim, S. Raskhodnikova, A. Smith, Smooth sensitivity and sampling in private data analysis, in: Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, 2007, pp. 75-84.
[12] A. Blum, C. Dwork, F. McSherry, K. Nissim, Practical privacy: the SuLQ framework, in: Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2005, pp. 128-138.
[13] A. Asuncion, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html (2007).
[14] J. Zhang, X. Xiao, Y. Yang, Z. Zhang, M. Winslett, PrivGene: differentially private model fitting using genetic algorithms, in: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013, pp. 665-676.