=Paper= {{Paper |id=Vol-3630/paper34 |storemode=property |title=Fast k-Nearest-Neighbor-Consistent Clustering |pdfUrl=https://ceur-ws.org/Vol-3630/LWDA2023-paper34.pdf |volume=Vol-3630 |authors=Lars Lenssen,Niklas Strahmann,Erich Schubert |dblpUrl=https://dblp.org/rec/conf/lwa/LenssenSS23 }} ==Fast k-Nearest-Neighbor-Consistent Clustering== https://ceur-ws.org/Vol-3630/LWDA2023-paper34.pdf
                                Fast k-Nearest-Neighbor-Consistent Clustering
                                Lars Lenssen1 , Niklas Strahmann1 and Erich Schubert1
                                1
                                    TU Dortmund University, Informatik VIII, Dortmund, 44221, Germany


                                                                         Abstract
                                                                         There are many ways to measure the quality of a clustering, both extrinsic (when labels are known) and
                                                                         intrinsic (when no labels are available). In this article, we focus on the k-Nearest-Neighbor Consistency
                                                                         measure, which considers a clustering as good if each object is within the same cluster as its nearest
                                                                         neighbors, and hence does not need labels. We propose a variant of the K-means clustering algorithm
                                                                         that uses the k-Nearest-Neighbor Consistency as a constraint while optimizing the sum-of-squares as in
                                                                         regular K-means, resulting in K-means clustering where the nearest neighbors are guaranteed to be in
                                                                         the same cluster. The new version provably yields the same results as the original consistency-preserving
                                                                         K-means algorithm of Ding and He, but needs fewer computation.

                                                                         Keywords
                                                                         k-Nearest-Neighbor Consistency, Cluster Analysis, Clustering-Quality Measure




                                1. Introduction
                                Evaluating the quality of a result in unsupervised learning is difficult, as we do not have labels.
                                Many different clustering objectives have been proposed, and we have just as many different
                                evaluation measures. Bonner [1] noted that “none of the many specific definitions that might
                                be given seems ‘best’ in any general sense”, and Estivill-Castro [2] wrote that clustering is “in
                                the eye of the beholder”, and every researcher and user may have different believes on what
                                constitutes a cluster. Common principles for clustering include that nearby objects should be in
                                the same cluster, whereas far away objects should belong to different clusters. In agglomerative
                                hierarchical clustering, for example, we always merge the closest two clusters, K-means assigns
                                points to the nearest cluster, DBSCAN connects neighboring points in dense areas, and spectral
                                clustering attempts to find a minimum cut of the nearest-neighbor graph. Similar principles can
                                be found in clustering quality measures (CQM) [3] such as the Silhouette [4], the Davies-Bouldin
                                index [5], the Variance-Ratio criterion [6], the Dunn index [7], and many more. The Silhouette,
                                for example, compares the average distance to points of the same cluster with the average
                                distance to points of the nearest other cluster. Any unsupervised clustering quality measure
                                implies an “optimal” clustering, that could also be found by exhaustive enumeration; whereas
                                clustering algorithms often only find an approximation to the “optimal” clustering, but in an
                                acceptable run time. For example, K-means may find different solutions when run multiple
                                times because it gets stuck in local optima. Similarly, K-medoid algorithms based on swapping

                                LWDA’23: Lernen, Wissen, Daten, Analysen. October 09–11, 2023, Marburg, Germany
                                Envelope-Open lars.lenssen@tu-dortmund.de (L. Lenssen); niklas.strahmann@tu-dortmund.de (N. Strahmann);
                                erich.schubert@tu-dortmund.de (E. Schubert)
                                Orcid 0000-0003-0037-0418 (L. Lenssen); 0009-0000-1034-6097 (N. Strahmann); 0000-0001-9143-4880 (E. Schubert)
                                                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                    CEUR
                                    Workshop
                                    Proceedings
                                                  http://ceur-ws.org
                                                  ISSN 1613-0073
                                                                       CEUR Workshop Proceedings (CEUR-WS.org)




CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
can get stuck in local optima [8]. We can also optimize the Medoid Silhouette in a similar
fashion as an example where an evaluation criterion was turned into an efficient algorithm [9].
But an undesired trivial clustering may score well for surprisingly many quality measures. For
example the within-cluster-sum-of-squares (WCSS) criterion used by K-means clustering is
optimal when every point is its own cluster. Hence we may need to add additional constraints
such as keeping the number of clusters fixed. In other cases, it is common to use one criterion
to find candidate solutions (e.g., optimizing WCSS with K-means), and then use a different
criterion for model selection.
   In this article, we are interested in a relatively unknown quality criterion called the k-Nearest-
Neighbor Consistency [10], which measures if the nearest neighbors of each point belong to the
same cluster. This is closely related to the central idea of clustering that similar objects should
be in the same cluster, but also to nearest-neighbor classification, as we would then predict
the correct class. Ding and He [10] also proposed two clustering algorithms to optimize for
this criterion. In this article, we propose a fast variant of the consistency-preserving K-means
algorithm, that provably yields the same results.
   We first introduce nearest-neighbor consistency in Section 2 and clustering with consistency
in Section 3. We then propose a faster algorithm in Section 4. We show experimental evidence
of the performance benefits in Section 5 and conclude the paper in Section 6.


2. k-Nearest-Neighbor Consistency
The k-Nearest-Neighbor Consistency is motivated by k-Nearest-Neighbor classification, but
it can also be used to evaluate clustering validity, as proposed by Ding and He [10]. In the
following, we only consider clustering that form a total partitioning of the data set, encoded
as a cluster label 𝑙𝑖 for each sample 𝑖. Overlapping, hierarchical, or incomplete clustering (e.g.,
noise in DBSCAN) solutions are not considered. For the given samples 𝑋 = {𝑥1 , … , 𝑥𝑛 }, and the
cluster labels 𝐿 = {𝑙1 , … , 𝑙𝑛 }, the 𝑘 NN(𝑥𝑖 ) are the 𝑘 nearest neighbors of 𝑥𝑖 . One single sample 𝑖
is considered to be 𝑘-nearest-neighbor consistent if all neighbors 𝑝 ∈ 𝑘 NN(𝑥𝑖 ) have the same
cluster label as the sample 𝑖:

                                               1   if ∀𝑥𝑗 ∈ 𝑘 NN(𝑥𝑖 ) ∶ 𝑙𝑗 = 𝑙𝑖
                             𝑘𝑐𝑖 (𝑋 , 𝐿) = {
                                               0   otherwise
  To measure the 𝑘 NN consistency of a clustering where not all points are consistent, Ding
and He [10]also use a fractional 𝑘 NN consistency
                                                    1  𝑛
                                   𝐾 𝐶(𝑋 , 𝐿) =       ∑𝑖=1 𝑘𝑐𝑖 (𝑋 , 𝐿).
                                                    𝑛
To use the consistency as a quality measure, Handl and Knowles [11] developed a connectivity-
based measure out of the 𝑘 NN consistency, which gives more emphasis to the nearest neighbors.
The definition of the 𝑘 NN consistency can also be applied to other neighborhood concepts, for
example, k-Mutual-Nearest-Neighbors [10], which symmetrizes k-Nearest-Neighbors and is
denoted by 𝑘 MN(𝑥):
                         𝑥𝑗 ∈ 𝑘 MN(𝑥𝑖 ) ⟺ 𝑥𝑗 ∈ 𝑘 NN(𝑥𝑖 ) ∧ 𝑥𝑖 ∈ 𝑘 NN(𝑥𝑗 ).
 Algorithm 1: ENFORCE
1 𝑙 ← cluster assignments by input clustering algorithm;
2 𝑆 ← calculate sets of closed neighborhoods;
3 𝑁 ←null;
4 foreach 𝑆𝑖 ∈ 𝑆 = {𝑆1 , … , 𝑆ℎ } do                            // closed neighborhoods
 5    (𝑁 ) ←null;
 6    foreach 𝑥𝑗 ∈ 𝑆𝑖 do                                  // count cluster assignments
 7        𝑁𝑙𝑗 + +;
 8    𝑜 ← arg max 𝑁;
 9    foreach 𝑥𝑗 ∈ 𝑆𝑖 do 𝑙𝑗 ← 𝑜;                       // assign closed neighborhoods
10 return 𝐿;




Mutual nearest neighbors have the benefit of being a symmetric relation, but also being less
sensitive to outliers and more likely to contain multiple components. Intuitively, we remove all
unidirectional edges from the k-nearest-neighbor graph. Consider a data set where we have
one dense cluster and a far away outlier. The outlier’s neighbors will be points of the cluster,
but it will not be in the nearest neighbors of the cluster points. Both the 𝑘 NN and the 𝑘 MN are
often used in the context of spectral clustering, which in turn is related to DBSCAN [12, 13].
   Ding and He [10] aim at enforcing 𝑘 NN consistency in their algorithms. We can interpret
this as a form of constrained clustering, where we add must-link constraints for neighbors.
Such constraints have been previously used, e.g., for model selection [14] and for integrating
external constraints into K-means [15, 16].


3. Clustering with Nearest Neighbor Consistency
The k-Nearest-Neighbor consistency was originally proposed by Ding and He [10] along with
two algorithms to create k-Nearest-Neighbor-Consistent clustering, ENFORCE and consistency-
preserving K-means (K-means-CP). The ENFORCE algorithm modifies an existing clustering,
which can be created by any cluster analysis method, to improve its consistency by reassigning
neighborhoods to the most common cluster. To achieve this, ENFORCE uses closed neighbor-
hoods. A set 𝑆 ⊂ 𝑋 is a closed neighborhood if for all 𝑥 ∈ 𝑆 holds 𝑘 NN(𝑥) ⊂ 𝑆 and for every
two elements 𝑥𝑖 , 𝑥𝑗 ∈ 𝑆 there is a path of elements from 𝑘 NN(𝑥𝑖 ) or 𝑘 NN(𝑥𝑗 ). The definition
can be easily extended to any neighborhood function such as 𝑘 MN. It is easy to see that all
points in a closed neighborhood must be in the same cluster for the clustering to be consistent
and that every point can only belong to one closed neighborhood. Figure 1 shows the closed
neighbor sets on an example data set using mutual nearest neighbors with 𝑘 = 3. To enforce a
𝑘 NN-consistent clustering, ENFORCE iterates through all sets of closed neighborhoods and
counts the respective cluster labels. Then it assigns the entire set to the label that occurs most
often. There is also an interesting parallel to DBSCAN clustering here, which computes a similar
transitive closure, but with a density-based notion of the neighborhood. DBSCAN clusters are
closed neighborhoods with respect to density reachability.
   In contrast to ENFORCE, consistency-preserving K-means (K-means-CP) does not work on an
Figure 1: Closed neighbor sets for mutual nearest neighbors with 𝑘 = 3.
For neighbor-consistent clustering, these sets must be assigned to the same cluster.


existing clustering, but integrates consistency into the standard K-means procedure. K-means
uses the quadratic Euclidean distance, although there are other approaches, such as spherical
K-means [17, 18], they will not be considered here. First, initial cluster centers Μ = {𝜇1 , ..., 𝜇𝑘 }
are chosen with any of the standard heuristics. Then an alternating optimization as in the
standard algorithm is performed. But in order to obtain a 𝑘 NN-consistent result, K-means-CP,
like ENFORCE, always assigns entire closed neighborhoods to the same cluster. Thus, Ding and
He [10] define the nearest cluster center nearest(𝑆𝑖 ) of a closed neighborhood 𝑆𝑖 by the sum of
squared euclidean distances
                                                                    2
                               nearest(𝑆𝑖 ) = arg min𝑗 ∑ ‖𝑥 − 𝜇𝑗 ‖ .
                                                        𝑥∈𝑆𝑖

All points in the closed neighborhood 𝑆𝑖 are then assigned to this cluster. The assignment step
is alternated with a recalculation of the cluster centers. A new cluster center 𝜇𝑗 of the cluster 𝐶𝑗
is determined as usual in K-means using the arithmetic average of the assigned data
                                                   1
                                           𝜇𝑗 =         ∑𝑥.
                                                  |𝐶𝑗 | 𝑥∈𝐶
                                                        𝑗


These steps are repeated until no point is reassigned (and hence the cluster centers do not
change anymore). Convergence guarantees of the standard algorithm still apply. This algorithm
computes exactly 𝑁 ⋅ 𝐾 distances in each step to determine the nearest clusters, and these
distance computations make up a major part of the algorithm’s run time. We can also optimize
the recomputation of the cluster centers by using an incremental computation, but this has
much less effect. In the following, we discuss why many of the distance computations performed
above are unnecessary, and we can hence improve the run time of this algorithm.
 Algorithm 2: K-means-CP
1 𝐶 ← initialize 𝑘 cluster;
2 𝑆 ← calculate sets of closed neighborhoods;
3 𝑙 ← dummy assignments to non-existent cluster;
4 repeat
 5    foreach 𝑆𝑖 ∈ 𝑆 = {𝑆1 , … , 𝑆ℎ } do                           // closed neighborhoods
 6         𝑁 ←null;
 7         foreach 𝜇𝑗 ∈ 𝑀 = {𝜇1 , … , 𝜇𝑘 } do                 // determine nearest cluster
 8              foreach 𝑥𝑢 ∈ 𝑆𝑖 do
                                        2
 9                 𝑁𝑗 ← 𝑁𝑗 + ‖𝑥𝑢 − 𝜇𝑗 ‖ ;
10         𝑜 ← arg min 𝑁;
11         foreach 𝑥𝑗 ∈ 𝑆𝑖 do 𝑙𝑗 ← 𝑜;                    // assign closed neighborhoods
12    if 𝑙 is unchanged then break;
13    𝑀 ← calculate cluster centers for 𝐶;
14 return 𝐿;




4. Fast k-Nearest-Neighbor-Consistent Clustering
The K-means-CP algorithm determines the nearest cluster of a closed neighborhood set by the
sum of the squared Euclidean distances of all elements of a closed neighborhood set to the
different cluster centers. Using the parallel axis theorem of König, Huygens and Steiner, we
can prove that instead of considering the distances of all samples in a closed neighborhood set
𝑆𝑖 , it is sufficient to consider only the mean vector 𝑠𝑖̄ = |𝑆1 | ∑𝑥∈𝑆𝑖 𝑥. We can also trivially use
                                                                𝑖
the mean vector for updating the cluster centers, if we weight it by the number of points in
the closed neighborhood set. The same property has been previously used by Lee et al. [19] to
reduce uncertain UK-means clustering to regular K-means clustering. Our improved Neighbor-
consistent K-Means (NCK-means) hence uses the mean vectors of the closed neighborhood sets.
This yields a faster variant of consistency-preserving K-means with a significantly lower run
time and the guaranteed same result.
    We will reference the set which contains all closed neighborhood sets by 𝑆. Because all
elements of a closed neighborhood set will be assigned to the same cluster, we will reference
the set which contains all closed neighborhood sets whose elements have cluster label 𝑖 by 𝐶𝑖 .

4.1. Proof of correctness
To prove the equivalence of the results of NCK-means and K-means-CP, we show that the
substeps of the algorithm always produce the same results. We first show that for given cluster
centers, NCK-means and K-means-CP assign the same cluster labels to a closed neighborhood
set. We start by considering how the original K-means-CP assigns cluster labels:

                              𝑆𝑗 ∈ 𝐶𝑖 ⟺ 𝑖 = arg min𝑖 ∑ ‖𝑥 − 𝜇𝑖 ‖2 .
                                                       𝑥∈𝑆𝑗
 Algorithm 3: NCK-means
1 𝐶 ← initialize 𝑘 cluster;
2 𝑆 ̄ ← calculate representatives of closed neighborhood sets;
3 repeat
4     foreach 𝑠𝑖̄ ∈ 𝑆 ̄ = {𝑠1̄ , … , 𝑠ℎ̄ } do // representatives of closed neighborhoods
 5         𝑁 ←null;
 6         foreach 𝜇𝑗 ∈ 𝑀 = {𝜇1 , … , 𝜇𝑘 } do               // determine nearest cluster
                                         2
 7              𝑁𝑗 ← 𝑁𝑗 + ‖𝑠𝑖̄ − 𝜇𝑗 ‖ ;
 8         𝑜 ← arg min 𝑁;
 9         foreach 𝑥𝑗 ∈ 𝑆𝑖 do 𝑙𝑗 ← 𝑜;                     // assign closed neighborhoods
10    if 𝑙 is unchanged then break;
11    𝑀 ← calculate cluster centers for 𝐶;
12 return 𝐿;




NCK-means assigns its cluster labels by the equivalent optimization:
                                                                         2
                                𝑆𝑗 ∈ 𝐶𝑖 ⟺ 𝑖 = arg min𝑖 ‖𝑠𝑗̄ − 𝜇𝑖 ‖ .

    This equivalence easily follows from the following version of the parallel axis theorem:

                                  ∑ |𝑥 − 𝑎|2 = ∑ (|𝑥 − 𝑋 ̄ |2 ) + 𝑁 |𝑎 − 𝑋̄ |2
                                 𝑥∈𝑋               𝑥∈𝑋
by using the vector form (a sum over all components) and then 𝑎 = 𝜇𝑖 and 𝑋 = 𝑆𝑗 :
                                                                   2              2
                                ∑ ‖𝑥 − 𝜇𝑖 ‖2 = ∑ (‖𝑥 − 𝑠𝑗̄ ‖ ) + |𝑆𝑗 | ⋅ ‖𝜇𝑖 − 𝑠𝑗̄ ‖ .
                                𝑥∈𝑆𝑗               𝑥∈𝑆𝑗
Because only the last term depends on 𝑖, we can omit the others for the minimization over 𝑖:
                                                                         2
                      arg min𝑖 ∑ ‖𝑥 − 𝜇𝑖 ‖2 = arg min𝑖 ‖𝑠𝑗̄ − 𝜇𝑖 ‖ .
                                𝑥∈𝑆𝑗

    We can prove the above version of the parallel axis theorem as follows:
                                            2
    ∑ |𝑥 − 𝑎|2 = ∑ |(𝑥 − 𝑋 ̄ ) − (𝑎 − 𝑋 ̄ )| = ∑ ((𝑥 − 𝑋̄ )2 − 2(𝑥 − 𝑋 ̄ )(𝑎 − 𝑋 ̄ ) + (𝑎 − 𝑋 ̄ )2 )
    𝑥∈𝑋            𝑥∈𝑋                             𝑥∈𝑋
                = ∑ (𝑥 − 𝑋 ̄ )2 − 2(𝑎 − 𝑋 ̄ ) ∑ (𝑥 − 𝑋 ̄ ) +𝑁 |𝑎 − 𝑋̄ |2 = ∑ |𝑥 − 𝑋̄ |2 + 𝑁 |𝑎 − 𝑋 ̄ |2
                  𝑥∈𝑋                        ⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟
                                              𝑥∈𝑋                          𝑥∈𝑋
                                                      =0

The classic parallel axis theorem is obtained for 𝑎 = 0.
  We now consider the second step, updating the cluster centers. We start with the original
calculation of the cluster centers 𝜇𝑖 . K-means-CP calculates this via the mean of the elements.
                                                     1
                                       𝜇𝑖 =                      ∑ 𝑥
                                              |{𝑥𝑗 ∣ 𝑙𝑗 = 𝑖}| 𝑥 ∈𝑋 ∣𝑙 =𝑖 𝑗
                                                               𝑗    𝑗
Because the labels of each closed set are consistent, we can rewrite this to:
                                             1
                                   𝜇𝑖 =               ∑ ∑𝑥
                                        ∑𝑆𝑗 ∈𝐶𝑖 |𝑆𝑗 | 𝑆 ∈𝐶 𝑥 ∈𝑆 𝑘
                                                           𝑗   𝑖 𝑘   𝑗

and by using 𝑠𝑗̄ = |𝑆1 | ∑𝑥𝑘 ∈𝑆𝑗 𝑥𝑘 , we obtain
                      𝑗
                                                   1
                                       𝜇𝑖 =                 ∑ |𝑆 |𝑠 ̄ .
                                              ∑𝑆𝑗 ∈𝐶𝑖 |𝑆𝑗 | 𝑆 ∈𝐶 𝑗 𝑗
                                                           𝑗   𝑖


  This shows that for closed neighborhood sets with given cluster labels, NCK-means creates
the same cluster centers as K-means-CP. Because both steps of the algorithms produce the same
results, we have proven that the final result of the algorithms is the same.

4.2. Expected run time improvement
The run time improvement depends on the data reduction using closed neighbor sets. The
standard K-means algorithm has a run time of 𝑂(𝑁 𝐾 𝑑𝑖) where 𝑁 is the number of points, 𝐾 is
the number of clusters, 𝑑 is the dimensionality, and 𝑖 is the number of iterations. Finding the
closed neighbor sets additionally takes 𝑂(𝑁 2 ) time (although indexes for similarity search may
yield a considerable speedup), and K-means-CP of Ding and He [10] hence has a complexity of
𝑂(𝑁 2 + 𝑁 𝑘𝑑𝑖). NCK-means reduces this to 𝑂(𝑁 2 + |𝑆|𝐾 𝑑𝑖) where |𝑆| is the number of closed
neighbor sets, and we must assume |𝑆| ∈ 𝑂(𝑁 ). In a worst-case asymptotic analysis, the
improvements hence are likely negligible because the cost of finding the closed neighbor sets
dominates, but in practice it usually offers a decent run time improvement on the order of
|𝑆|/𝑁 in the clustering step, at only the cost of little additional memory to store the means of
each closed neighbor set. Hence, there is no reason not to use it. As it is a best practice to run
K-means several times and keep the best outcome [20], the cost to find the closed neighbor sets
can also be amortized over multiple restarts.
   The choice of the neighborhood has an important effect on the run time. Finding the nearest
neighbors becomes more expensive with a larger neighborhood size, but at the same time
increasing the number of edges decreases the number of closed neighbor sets. For NCK-means,
a lower number is beneficial, and both algorithms also benefit from an often lower number
of iterations because reassignments are less likely to happen with the additional consistency
constraints.

4.3. Further run-time improvements
Because our proof shows that it is sufficient to use the mean of each closed neighbor set, it is
trivially possible to combine this with acceleration techniques of 𝑘-means such as the algorithms
of Elkan [21], Hamerly [22], or the more recent Exponion [23] and Shallot [24] algorithms. The
weight can also be used to build a BETULA tree [25]. We have not experimentally verified these
additional speedups, as this would distract from the main objective of this paper. Furthermore,
the differences between these algorithms will often be small compared to the cost of finding the
nearest neighbors of all points to construct the closed neighborhoods in the beginning.
                       6000
                                          1NN K-means-CP
                                          1NN NCK-means
                                          2NN K-means-CP
   run time (ms)


                       4000               2NN NCK-means
                                          3NN K-means-CP
                                          3NN NCK-means
                       2000



                            0
                                   1000     2000   3000    4000     5000    6000    7000                          8000   9000   10000
                                                                  number of samples
Figure 2: Run time of K-means-CP and NCK-means for different number of samples of MNIST, for
k-nearest neighbor neighborhood with 𝑘 = 1, 2, 3.


                                                                       number of CNS (log scale)
                       40                                                                          1000   1NN
number of iterations




                                  1NN
                                  2NN                                                                     2NN
                                  3NN                                                               100   3NN
                       20
                                                                                                    10

                                                                                                      1
                       0
                                2000    4000 6000 8000 10000                                              2000    4000 6000 8000 10000
                                       number of samples                                                         number of samples

(a) Number of iterations required by the K-means- (b) Number of closed neighborhood sets (CNS) for
    CP for different number of samples for k-nearest  different number of samples for k-nearest neigh-
    neighbor with 𝑘 = 1, 2, 3.                        bor neighborhood with 𝑘 = 1, 2, 3.

Figure 3: Number of iterations and closed neighborhood sets on MNIST data


5. Experiments
To empirically verify the run time improvements, we implemented both versions with Java in the
ELKI 0.8.0 data mining framework [26]. All implementations are within the same codebase to
reduce confounding factors and improve comparability [27]. We perform 10 restarts on an Intel
i4690K processor using a single thread, and evaluate the average values. We analyze the run time
for the MNIST 784 data set. MNIST 784 contains grayscale pictures of handwritten digits with a
picture size of 28 × 28 pixels, which is vectorized to a vector of length 784. We compared the
algorithms with subsets of sizes 𝑛 = 1000, 2000, … , 10000 for the 𝑘 NN and 𝑘 MN neighborhood
for 𝑘 = 1, 2, 3 neighbors. The initial cluster centers were chosen with K-means++ [28]. The
number of clusters is set to 𝐾 = 10 because there are ten digits in the data set.
                              1MN K-means-CP
                              1MN NCK-means
                4000
                              2MN K-means-CP
run time (ms)


                              2MN NCK-means
                              3MN K-means-CP
                              3MN NCK-means
                2000



                   0
                       1000     2000   3000    4000     5000    6000    7000   8000   9000   10000
                                                      number of samples
Figure 4: Run time of K-means-CP and NCK-means for different number of samples for k-mutual-nearest
neighbor with 𝑘 = 1, 2, 3.


   The time measured is the time required for calculating the closed neighborhood sets and until
the algorithm converges. The time to calculate the 𝑘 NN of the data set is omitted because it is
required for both algorithms and is the most time-consuming part. Therefore time variances
in the calculation of the 𝑘 NNs would overshadow the actual time difference of the algorithms.
The ELKI framework automatically uses a k-d-tree or a vantage-point tree to accelerate nearest
neighbor search [26], depending on data set characteristics and the distance function used.
Because of the dimensionality, ELKI chose a VP-tree automatically for this data set. In Figure 6,
we plot the run time of 1-mutual-nearest neighbor computation and NCK-means for different
number of samples of MNIST data. For MNIST with 784 dimensions, the kNN/kMN computation
takes a large part compared to the clustering computation. For MNIST with 784 dimensions,
the kNN/kMN computation takes a large share. We expect the kNN/kMN computation to be
much more lightly weighted for lower dimensions.
   In Figure 2, we plot the results for asymmetric 𝑘 NN. While the run time improved in all
cases, there are only noticeable differences for 𝑘 = 1, which get larger at 𝑛 = 8000. The little
improvements for 𝑘 = 2, 3 are caused by 𝑘 NN inducing only very few closed neighborhood sets.
For 𝑘 = 2, the maximum amount of six closed neighborhood sets is reached with 𝑛 = 5000. For
𝑘 = 3 all data sets are covered by a single closed neighborhood set. This leads to algorithms
only needing one or two iterations until convergence. The spike for 𝑘 = 1 can also be explained
by the iterations needed for the algorithms to converge, which reaches its clear maximum for
𝑛 = 8000. In this case, the run time improves by over 34%.
   In Figure 4, we plot the results for symmetric 𝑘 MN. There are significant differences for
all values of 𝑘 used. This is because the amount of induced closed neighborhood sets seems
to scale approximately linearly for 𝑘 MN. The spikes in the run time differences can again be
explained by the varying number of iterations required for convergence. The greatest relative
improvement is reached by 𝑘 = 3 for 𝑛 = 8000, reducing the run time by over 56%. Across all 𝑘
and 𝑛, we observe an average run time reduction of 30%.
                                                                                   number of CNS (log scale)
                                                                                                               10000
                                                                                                                       1MN
number of iterations
                       60                 1MN
                                          2MN                                                                          2MN
                                          3MN                                                                          3MN
                       40
                                                                                                                1000
                       20

                                     2000    4000 6000 8000 10000                                                      2000    4000 6000 8000 10000
                                            number of samples                                                                 number of samples

(a) Number of iterations required by the K-means- (b) Number of closed neighborhood sets for different
    CP for different number of samples for k-mutual-  number of samples for k-mutual-nearest neigh-
    nearest neighbor with 𝑘 = 1, 2, 3.                bor with 𝑘 = 1, 2, 3..

Figure 5: Number of iterations and closed neighborhood sets on MNIST data.

                                   ⋅104

                               6                 NCK-means
                                                 1-mutual-nearest neighbor
               run time (ms)




                               4


                               2


                               0
                                          1000      2000    3000    4000       5000    6000    7000                           8000   9000   10000
                                                                             number of samples
Figure 6: Run time of 1-mutual-nearest neighbor computation and NCK-means for different number of
samples of MNIST data.


   The run time improvement per iteration would be the largest if there are very few closed
neighborhood sets because, in this case we save a lot of time by using the representative. But
if there are too few closed neighborhood sets, the algorithm converges very fast, e.g., in the
case of 3NN, where the initial clustering is already converged. The clustering quality for very
few closed neighborhood sets is also questionable but sometimes surprisingly good. This is
probably because of the relationship to DBSCAN, which also builds closed neighborhoods, and
spectral clustering, which partitions the nearest-neighbor graph.
6. Conclusion
We improve the K-means-CP algorithm for clustering with k-Nearest-Neighbor consistency by
avoiding unnecessary distance computations. Using the parallel axis theorem, we prove that
we can reduce all closed neighborhoods to a single representative, and use these in clustering
instead of considering all samples individually. This allows clustering with nearest-neighbor
consistency on larger data sets than before.
   The technique could be integrated in further algorithms. In COP-K-Means [15] and PCK-
means [16], for example, this allows us to reduce must-link constraints into using weighted
means, too. But the speedup is only noticeable if we have many such constraints, and the typical
scenario for constrained clustering is with only a few constraints given by the user to guide the
clustering in a semi-supervised way.


References
 [1] R. E. Bonner, On some clustering techniques, IBM Journal of Research and Development 8
     (1964) 22–32. doi:10.1147/rd.81.0022 .
 [2] V. Estivill-Castro, Why so many clustering algorithms – a position paper, SIGKDD
     Explorations 4 (2002) 65–75. doi:10.1145/568574.568575 .
 [3] M. Ackerman, S. Ben-David, Measures of clustering quality: A working set of axioms for
     clustering, in: NIPS, 2008, pp. 121–128.
 [4] P. J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation of cluster
     analysis, J. Comput. Appl. Math. 20 (1987) 53–65.
 [5] D. L. Davies, D. W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Mach.
     Intell. (1979) 224–227. doi:10.1109/TPAMI.1979.4766909 .
 [6] T. Calinski, J. Harabasz, A dendrite method for cluster analysis, Communications in
     Statistics 3 (1974) 1–27. doi:10.1080/03610927408827101 .
 [7] J. C. Dunn, Well-separated clusters and optimal fuzzy partitions, Journal of Cybernetics 4
     (1974) 95–104. doi:10.1080/01969727408546059 .
 [8] E. Schubert, P. J. Rousseeuw, Fast and eager k-medoids clustering: O(k) runtime improve-
     ment of the PAM, CLARA, and CLARANS algorithms, Information Systems 101 (2021)
     101804. doi:10.1016/j.is.2021.101804 .
 [9] L. Lenssen, E. Schubert, Clustering by direct optimization of the medoid silhouette, in:
     Similarity Search and Applications, 2022, pp. 190–204. doi:10.1007/978- 3- 031- 17849- 8_
     15 .
[10] C. Ding, X. He, K-nearest-neighbor consistency in data clustering: Incorporating local
     information into global optimization, in: Symp. Applied Computing, SAC, 2004, p. 584–589.
     doi:10.1145/967900.968021 .
[11] J. Handl, J. Knowles, An evolutionary approach to multiobjective clustering, IEEE Trans.
     Evol. Comput. 11 (2007) 56–76. doi:10.1109/TEVC.2006.877146 .
[12] E. Schubert, S. Hess, K. Morik, The relationship of DBSCAN to matrix factorization and
     spectral clustering, in: Lernen, Wissen, Daten, Analysen, 2018, pp. 330–334.
[13] A. Beer, A. Draganov, E. Hohma, P. Jahn, C. M. Frey, I. Assent, Connecting the dots
     – density-connectivity distance unifies DBSCAN, k-center and spectral clustering, in:
     Knowledge Discovery and Data Mining (KDD), 2023. doi:10.1145/3580305.3599283 , to
     appear.
[14] M. Pourrajabi, D. Moulavi, R. J. G. B. Campello, A. Zimek, J. Sander, R. Goebel, Model
     selection for semi-supervised clustering, in: Extending Database Technology, EDBT, 2014,
     pp. 331–342. doi:10.5441/002/edbt.2014.31 .
[15] K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, Constrained k-means clustering with back-
     ground knowledge, in: Int. Conf. Machine Learning (ICML), 2001, pp. 577–584.
[16] S. Basu, A. Banerjee, R. J. Mooney, Active semi-supervision for pairwise constrained cluster-
     ing, in: SIAM Data Mining (SDM), 2004, pp. 333–344. doi:10.1137/1.9781611972740.31 .
[17] I. S. Dhillon, D. S. Modha, Concept decompositions for large sparse text data using
     clustering, Mach. Learn. 42 (2001) 143–175. doi:10.1023/A:1007612920971 .
[18] E. Schubert, A. Lang, G. Feher, Accelerating spherical k-means, in: Similarity Search and
     Applications (SISAP), 2021, pp. 217–231. doi:10.1007/978- 3- 030- 89657- 7_17 .
[19] S. D. Lee, B. Kao, R. Cheng, Reducing UK-means to K-means, in: Workshops Int. Conf.
     Data Mining (ICDM), 2007, pp. 483–488. doi:10.1109/ICDMW.2007.40 .
[20] E. Schubert, Stop using the elbow criterion for k-means and how to choose the number of
     clusters instead, SIGKDD Explorations 25 (2023) 36–42. doi:10.1145/3606274.3606278 .
[21] C. Elkan, Using the triangle inequality to accelerate k-means, in: Int. Conf. Machine
     Learning (ICML), 2003, pp. 147–153.
[22] G. Hamerly, Making k-means even faster, in: SIAM Data Mining (SDM), 2010, pp. 130–140.
     doi:10.1137/1.9781611972801.12 .
[23] J. Newling, F. Fleuret, Fast k-means with accurate bounds, in: Int. Conf. Machine Learning
     (ICML), 2016, pp. 936–944.
[24] C. Borgelt, Even faster exact k-means clustering, in: Symp. Intelligent Data Analysis
     (IDA), 2020, pp. 93–105. doi:10.1007/978- 3- 030- 44584- 3_8 .
[25] A. Lang, E. Schubert, BETULA: fast clustering of large data with improved BIRCH CF-trees,
     Inf. Syst. 108 (2022) 101918. doi:10.1016/j.is.2021.101918 .
[26] E. Schubert, Automatic indexing for similarity search in ELKI, in: Similarity Search and
     Applications (SISAP), 2022. doi:10.1007/978- 3- 031- 17849- 8_16 .
[27] H. Kriegel, E. Schubert, A. Zimek, The (black) art of runtime evaluation: Are we compar-
     ing algorithms or implementations?, Knowl. Inf. Syst. 52 (2017) 341–378. doi:10.1007/
     s10115- 016- 1004- 2 .
[28] D. Arthur, S. Vassilvitskii, k-means++: the advantages of careful seeding, in: Symp.
     Discrete Algorithms, SODA, 2007, pp. 1027–1035.