<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Fast k-Nearest-Neighbor-Consistent Clustering</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Lars</forename><surname>Lenssen</surname></persName>
							<email>lars.lenssen@tu-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="department">Informatik VIII</orgName>
								<orgName type="institution">TU Dortmund University</orgName>
								<address>
									<postCode>44221</postCode>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Niklas</forename><surname>Strahmann</surname></persName>
							<email>niklas.strahmann@tu-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="department">Informatik VIII</orgName>
								<orgName type="institution">TU Dortmund University</orgName>
								<address>
									<postCode>44221</postCode>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Erich</forename><surname>Schubert</surname></persName>
							<email>erich.schubert@tu-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="department">Informatik VIII</orgName>
								<orgName type="institution">TU Dortmund University</orgName>
								<address>
									<postCode>44221</postCode>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Fast k-Nearest-Neighbor-Consistent Clustering</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">6EB5D374E9199BFE157EF000619E53B4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>k-Nearest-Neighbor Consistency</term>
					<term>Cluster Analysis</term>
					<term>Clustering-Quality Measure</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>There are many ways to measure the quality of a clustering, both extrinsic (when labels are known) and intrinsic (when no labels are available). In this article, we focus on the k-Nearest-Neighbor Consistency measure, which considers a clustering as good if each object is within the same cluster as its nearest neighbors, and hence does not need labels. We propose a variant of the K-means clustering algorithm that uses the k-Nearest-Neighbor Consistency as a constraint while optimizing the sum-of-squares as in regular K-means, resulting in a K-means clustering where the nearest neighbors are guaranteed to be in the same cluster. The new version provably yields the same results as the original consistency-preserving K-means algorithm of Ding and He, but requires fewer computations.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Evaluating the quality of a result in unsupervised learning is difficult, as we do not have labels. Many different clustering objectives have been proposed, and we have just as many different evaluation measures. Bonner <ref type="bibr" target="#b0">[1]</ref> noted that "none of the many specific definitions that might be given seems 'best' in any general sense", and Estivill-Castro <ref type="bibr" target="#b1">[2]</ref> wrote that clustering is "in the eye of the beholder"; every researcher and user may have different beliefs about what constitutes a cluster. Common principles for clustering include that nearby objects should be in the same cluster, whereas far-away objects should belong to different clusters. In agglomerative hierarchical clustering, for example, we always merge the closest two clusters, K-means assigns points to the nearest cluster center, DBSCAN connects neighboring points in dense areas, and spectral clustering attempts to find a minimum cut of the nearest-neighbor graph. Similar principles can be found in clustering quality measures (CQM) <ref type="bibr" target="#b2">[3]</ref> such as the Silhouette <ref type="bibr" target="#b3">[4]</ref>, the Davies-Bouldin index <ref type="bibr" target="#b4">[5]</ref>, the Variance-Ratio criterion <ref type="bibr" target="#b5">[6]</ref>, the Dunn index <ref type="bibr" target="#b6">[7]</ref>, and many more. The Silhouette, for example, compares the average distance to points of the same cluster with the average distance to points of the nearest other cluster. Any unsupervised clustering quality measure implies an "optimal" clustering, which could also be found by exhaustive enumeration; clustering algorithms, in contrast, often find only an approximation to this "optimal" clustering, but in an acceptable run time. For example, K-means may find different solutions when run multiple times because it gets stuck in local optima. Similarly, K-medoid algorithms based on swapping can get stuck in local optima <ref type="bibr" target="#b7">[8]</ref>. The Medoid Silhouette can be optimized in a similar fashion, as an example where an evaluation criterion was turned into an efficient algorithm <ref type="bibr" target="#b8">[9]</ref>. But an undesired trivial clustering may score well for surprisingly many quality measures. For example, the within-cluster sum of squares (WCSS) criterion used by K-means clustering is optimal when every point is its own cluster. Hence, we may need to add additional constraints, such as keeping the number of clusters fixed. In other cases, it is common to use one criterion to find candidate solutions (e.g., optimizing WCSS with K-means), and then use a different criterion for model selection.</p><p>In this article, we are interested in a relatively unknown quality criterion called the k-Nearest-Neighbor Consistency <ref type="bibr" target="#b9">[10]</ref>, which measures whether the nearest neighbors of each point belong to the same cluster. This is closely related to the central idea of clustering that similar objects should be in the same cluster, but also to nearest-neighbor classification, as we would then predict the correct class. Ding and He <ref type="bibr" target="#b9">[10]</ref> also proposed two clustering algorithms to optimize for this criterion. In this article, we propose a fast variant of the consistency-preserving K-means algorithm that provably yields the same results.</p><p>We first introduce nearest-neighbor consistency in Section 2 and clustering with consistency in Section 3. We then propose a faster algorithm in Section 4. We show experimental evidence of the performance benefits in Section 5 and conclude the paper in Section 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">k-Nearest-Neighbor Consistency</head><p>The k-Nearest-Neighbor Consistency is motivated by k-Nearest-Neighbor classification, but it can also be used to evaluate clustering validity, as proposed by Ding and He <ref type="bibr" target="#b9">[10]</ref>. In the following, we only consider clusterings that form a total partitioning of the data set, encoded as a cluster label 𝑙 𝑖 for each sample 𝑖. Overlapping, hierarchical, or incomplete clustering solutions (e.g., with noise as in DBSCAN) are not considered. For the given samples 𝑋 = {𝑥 1 , … , 𝑥 𝑛 } and the cluster labels 𝐿 = {𝑙 1 , … , 𝑙 𝑛 }, 𝑘 NN(𝑥 𝑖 ) denotes the 𝑘 nearest neighbors of 𝑥 𝑖 . A single sample 𝑖 is considered to be 𝑘-nearest-neighbor consistent if all neighbors 𝑥 𝑗 ∈ 𝑘 NN(𝑥 𝑖 ) have the same cluster label as the sample 𝑖:</p><formula xml:id="formula_0">𝑘𝑐 𝑖 (𝑋 , 𝐿) = { 1 if ∀𝑥 𝑗 ∈ 𝑘 NN(𝑥 𝑖 ) ∶ 𝑙 𝑗 = 𝑙 𝑖 0 otherwise</formula><p>To measure the 𝑘 NN consistency of a clustering where not all points are consistent, Ding and He <ref type="bibr" target="#b9">[10]</ref> also use a fractional 𝑘 NN consistency</p><formula>𝐾 𝐶(𝑋 , 𝐿) = 1 𝑛 ∑ 𝑛 𝑖=1 𝑘𝑐 𝑖 (𝑋 , 𝐿) .</formula><p>To use the consistency as a quality measure, Handl and Knowles <ref type="bibr" target="#b10">[11]</ref> developed a connectivity-based measure out of the 𝑘 NN consistency, which gives more emphasis to the nearest neighbors. The definition of the 𝑘 NN consistency can also be applied to other neighborhood concepts, for example, k-Mutual-Nearest-Neighbors <ref type="bibr" target="#b9">[10]</ref>, which symmetrizes k-Nearest-Neighbors and is denoted by 𝑘 MN(𝑥):</p><formula>𝑥 𝑗 ∈ 𝑘 MN(𝑥 𝑖 ) ⟺ 𝑥 𝑗 ∈ 𝑘 NN(𝑥 𝑖 ) ∧ 𝑥 𝑖 ∈ 𝑘 NN(𝑥 𝑗 ) .</formula><p>Mutual nearest neighbors have the benefit of being a symmetric relation, of being less sensitive to outliers, and of more likely splitting the data into multiple components. Intuitively, we remove all unidirectional edges from the k-nearest-neighbor graph. Consider a data set with one dense cluster and a far-away outlier: the outlier's neighbors will be points of the cluster, but it will not be among the nearest neighbors of the cluster points. Both the 𝑘 NN and the 𝑘 MN are often used in the context of spectral clustering, which in turn is related to DBSCAN <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>.</p><p>Ding and He <ref type="bibr" target="#b9">[10]</ref> aim at enforcing 𝑘 NN consistency in their algorithms. We can interpret this as a form of constrained clustering, where we add must-link constraints for neighbors. Such constraints have previously been used, e.g., for model selection <ref type="bibr" target="#b13">[14]</ref> and for integrating external constraints into K-means <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b15">16]</ref>.</p></div>
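<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the definitions concrete, the following minimal Java sketch computes the per-sample consistency 𝑘𝑐 𝑖 and the fractional consistency 𝐾 𝐶(𝑋 , 𝐿) using a brute-force neighbor search, and symmetrizes 𝑘 NN to 𝑘 MN. This is an illustration only, not the ELKI implementation used later; all class and method names are ours.</p>
<code xml:space="preserve">
import java.util.*;

/** Illustrative sketch (not the authors' ELKI code): kNN consistency of a labeled data set. */
public class KnnConsistency {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDist(double[] a, double[] b) {
        double sum = 0;
        for (int v = 0; v < a.length; v++) {
            double diff = a[v] - b[v];
            sum += diff * diff;
        }
        return sum;
    }

    /** Brute-force k nearest neighbors of every point (O(n^2 log n); an index would be faster). */
    static int[][] kNearestNeighbors(double[][] x, int k) {
        int n = x.length;
        int[][] knn = new int[n][k];
        for (int i = 0; i < n; i++) {
            final int q = i; // effectively final copy for the lambda below
            Integer[] order = new Integer[n];
            for (int j = 0; j < n; j++) order[j] = j;
            Arrays.sort(order, Comparator.comparingDouble(j -> squaredDist(x[q], x[j])));
            int c = 0;
            for (int j = 0; j < n && c < k; j++)
                if (order[j] != q) knn[i][c++] = order[j]; // skip the query point itself
        }
        return knn;
    }

    /** Fractional kNN consistency KC(X, L): share of points whose k neighbors all share their label. */
    static double fractionalConsistency(int[][] knn, int[] label) {
        int consistent = 0;
        outer:
        for (int i = 0; i < knn.length; i++) {
            for (int j : knn[i])
                if (label[j] != label[i]) continue outer; // kc_i = 0: some neighbor disagrees
            consistent++; // kc_i = 1: all neighbors agree
        }
        return consistent / (double) knn.length;
    }

    /** k-mutual-nearest neighbors kMN: keep only kNN edges that exist in both directions. */
    static int[][] mutualNeighbors(int[][] knn) {
        int[][] kmn = new int[knn.length][];
        for (int i = 0; i < knn.length; i++) {
            List<Integer> keep = new ArrayList<>();
            for (int j : knn[i])
                for (int b : knn[j])
                    if (b == i) { keep.add(j); break; } // edge i -> j is mutual
            kmn[i] = keep.stream().mapToInt(Integer::intValue).toArray();
        }
        return kmn;
    }
}
</code></div>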
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Clustering with Nearest Neighbor Consistency</head><p>The k-Nearest-Neighbor consistency was originally proposed by Ding and He <ref type="bibr" target="#b9">[10]</ref> along with two algorithms to create k-Nearest-Neighbor-consistent clusterings, ENFORCE and consistency-preserving K-means (K-means-CP). The ENFORCE algorithm modifies an existing clustering, which can be created by any cluster analysis method, to improve its consistency by reassigning neighborhoods to the most common cluster. To achieve this, ENFORCE uses closed neighborhoods. A set 𝑆 ⊂ 𝑋 is a closed neighborhood if 𝑘 NN(𝑥) ⊂ 𝑆 holds for all 𝑥 ∈ 𝑆, and for every two elements 𝑥 𝑖 , 𝑥 𝑗 ∈ 𝑆 there is a path between them along the 𝑘 NN relation. The definition can easily be extended to any neighborhood function such as 𝑘 MN. It is easy to see that all points in a closed neighborhood must be in the same cluster for the clustering to be consistent, and that every point can only belong to one closed neighborhood. Figure <ref type="figure" target="#fig_0">1</ref> shows the closed neighbor sets on an example data set using mutual nearest neighbors with 𝑘 = 3. To enforce a 𝑘 NN-consistent clustering, ENFORCE iterates through all sets of closed neighborhoods and counts the respective cluster labels. Then it assigns the entire set to the label that occurs most often. There is also an interesting parallel to DBSCAN clustering here, which computes a similar transitive closure, but with a density-based notion of the neighborhood: DBSCAN clusters are closed neighborhoods with respect to density reachability.</p><p>In contrast to ENFORCE, consistency-preserving K-means (K-means-CP) does not work on an existing clustering, but integrates consistency into the standard K-means procedure. K-means uses the squared Euclidean distance; other approaches, such as spherical K-means <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>, will not be considered here. First, initial cluster centers 𝑀 = {𝜇 1 , … , 𝜇 𝑘 } are chosen with any of the standard heuristics. Then an alternating optimization as in the standard algorithm is performed. But in order to obtain a 𝑘 NN-consistent result, K-means-CP, like ENFORCE, always assigns entire closed neighborhoods to the same cluster. Thus, Ding and He <ref type="bibr" target="#b9">[10]</ref> define the nearest cluster center nearest(𝑆 𝑖 ) of a closed neighborhood 𝑆 𝑖 by the sum of squared Euclidean distances</p><formula xml:id="formula_1">nearest(𝑆 𝑖 ) = arg min 𝑗 ∑ 𝑥∈𝑆 𝑖 ‖𝑥 − 𝜇 𝑗 ‖ 2 .</formula><p>All points in the closed neighborhood 𝑆 𝑖 are then assigned to this cluster. The assignment step is alternated with a recalculation of the cluster centers. A new cluster center 𝜇 𝑗 of the cluster 𝐶 𝑗 is determined as usual in K-means using the arithmetic mean of the assigned data</p><formula xml:id="formula_2">𝜇 𝑗 = 1 |𝐶 𝑗 | ∑ 𝑥∈𝐶 𝑗 𝑥 .</formula><p>These steps are repeated until no point is reassigned (and hence the cluster centers do not change anymore). The convergence guarantees of the standard algorithm still apply. This algorithm computes exactly 𝑁 ⋅ 𝐾 distances in each assignment step to determine the nearest clusters, and these distance computations make up a major part of the algorithm's run time. We can also optimize the recomputation of the cluster centers by using an incremental computation, but this has much less effect. In the following, we discuss why many of the distance computations performed above are unnecessary, and how we can hence improve the run time of this algorithm.</p></div>
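<div xmlns="http://www.tei-c.org/ns/1.0"><p>Both algorithms first need the closed neighborhood sets. Because a closed set must contain the neighbors of all of its members, the closed neighborhood sets are exactly the (weakly) connected components of the neighbor graph when its edges are treated as undirected. The following Java sketch (our illustration, continuing the sketch class above, not the original implementation) computes a component id per point with a union-find structure; it works unchanged for 𝑘 NN and 𝑘 MN neighbor lists.</p>
<code xml:space="preserve">
/** Sketch: closed neighborhood sets as connected components of the neighbor graph.
 *  neighbors[i] may be the kNN or the kMN list of point i (see the sketch above). */
static int[] closedNeighborhoods(int[][] neighbors) {
    int n = neighbors.length;
    int[] parent = new int[n];
    for (int i = 0; i < n; i++) parent[i] = i; // every point starts as its own set
    // Union each point with all of its neighbors; directed kNN edges are treated as
    // undirected, which yields the transitive closure used by ENFORCE and K-means-CP.
    for (int i = 0; i < n; i++)
        for (int j : neighbors[i]) union(parent, i, j);
    int[] setId = new int[n];
    for (int i = 0; i < n; i++) setId[i] = find(parent, i);
    return setId; // setId[i] identifies the closed neighborhood set of point i
}

/** Union-find: find with path halving. */
static int find(int[] parent, int i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]]; // shortcut one level
        i = parent[i];
    }
    return i;
}

/** Union-find: link the root of i under the root of j. */
static void union(int[] parent, int i, int j) {
    parent[find(parent, i)] = find(parent, j);
}
</code></div>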
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Fast k-Nearest-Neighbor-Consistent Clustering</head><p>The K-means-CP algorithm determines the nearest cluster of a closed neighborhood set by the sum of the squared Euclidean distances of all elements of the set to the different cluster centers. Using the parallel axis theorem of König, Huygens, and Steiner, we can prove that instead of considering the distances of all samples in a closed neighborhood set 𝑆 𝑖 , it is sufficient to consider only the mean vector s̄ 𝑖 = 1 |𝑆 𝑖 | ∑ 𝑥∈𝑆 𝑖 𝑥. We can also trivially use the mean vector for updating the cluster centers, if we weight it by the number of points in the closed neighborhood set. The same property has previously been used by Lee et al. <ref type="bibr" target="#b18">[19]</ref> to reduce UK-means clustering of uncertain data to regular K-means clustering. Our improved Neighbor-consistent K-means (NCK-means) hence uses the mean vectors of the closed neighborhood sets. This yields a variant of consistency-preserving K-means with a significantly lower run time and the guaranteed same result.</p><p>We denote the set that contains all closed neighborhood sets by 𝑆. Because all elements of a closed neighborhood set are assigned to the same cluster, we denote the set of all closed neighborhood sets whose elements have cluster label 𝑖 by 𝐶 𝑖 .</p></div>
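<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the resulting procedure is given below: K-means-style alternating optimization, but over the size-weighted mean vectors of the closed neighborhood sets instead of the individual points. The sketch is ours (not the authors' pseudocode), continues the illustrative class above, and assumes setMeans, setSizes, and initial centers (e.g., from K-means++) are given.</p>
<code xml:space="preserve">
/** Sketch of NCK-means: weighted K-means over the means of the closed neighborhood sets.
 *  Returns one cluster label per closed neighborhood set; all members inherit that label. */
static int[] nckMeans(double[][] setMeans, int[] setSizes, double[][] centers, int maxIter) {
    int h = setMeans.length, k = centers.length, d = setMeans[0].length;
    int[] label = new int[h];
    Arrays.fill(label, -1); // dummy assignment to a non-existent cluster
    for (int iter = 0; iter < maxIter; iter++) {
        boolean changed = false;
        for (int i = 0; i < h; i++) { // assign each set to its nearest center
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int j = 0; j < k; j++) {
                double dist = squaredDist(setMeans[i], centers[j]);
                if (dist < bestDist) { bestDist = dist; best = j; }
            }
            if (label[i] != best) { label[i] = best; changed = true; }
        }
        if (!changed) break; // converged: no set was reassigned
        // update step: each center becomes the size-weighted mean of its sets' means
        double[][] sum = new double[k][d];
        double[] weight = new double[k];
        for (int i = 0; i < h; i++) {
            weight[label[i]] += setSizes[i];
            for (int v = 0; v < d; v++) sum[label[i]][v] += setSizes[i] * setMeans[i][v];
        }
        for (int j = 0; j < k; j++)
            if (weight[j] > 0)
                for (int v = 0; v < d; v++) centers[j][v] = sum[j][v] / weight[j];
    }
    return label;
}
</code></div>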
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Proof of correctness</head><p>To prove the equivalence of the results of NCK-means and K-means-CP, we show that the sub-steps of the algorithms always produce the same results. We first show that for given cluster centers, NCK-means and K-means-CP assign the same cluster labels to a closed neighborhood set. We start by considering how the original K-means-CP assigns cluster labels:</p><formula xml:id="formula_3">𝑆 𝑗 ∈ 𝐶 𝑖 ⟺ 𝑖 = arg min 𝑖 ∑ 𝑥∈𝑆 𝑗 ‖𝑥 − 𝜇 𝑖 ‖ 2 .</formula><p>NCK-means assigns its cluster labels by the equivalent optimization:</p><formula xml:id="formula_4">𝑆 𝑗 ∈ 𝐶 𝑖 ⟺ 𝑖 = arg min 𝑖 ‖ s̄ 𝑗 − 𝜇 𝑖 ‖ 2 .</formula><p>This equivalence follows easily from the following version of the parallel axis theorem:</p><formula xml:id="formula_5">∑ 𝑥∈𝑋 |𝑥 − 𝑎| 2 = ∑ 𝑥∈𝑋 |𝑥 − X̄ | 2 + 𝑁 |𝑎 − X̄ | 2 ,</formula><p>where X̄ = 1 𝑁 ∑ 𝑥∈𝑋 𝑥 denotes the mean of the 𝑁 elements of 𝑋, by using the vector form (a sum over all components) and then 𝑎 = 𝜇 𝑖 and 𝑋 = 𝑆 𝑗 :</p><formula xml:id="formula_6">∑ 𝑥∈𝑆 𝑗 ‖𝑥 − 𝜇 𝑖 ‖ 2 = ∑ 𝑥∈𝑆 𝑗 ‖𝑥 − s̄ 𝑗 ‖ 2 + |𝑆 𝑗 | ⋅ ‖𝜇 𝑖 − s̄ 𝑗 ‖ 2 .</formula><p>Because only the last term depends on 𝑖, we can omit the others for the minimization over 𝑖:</p><formula xml:id="formula_7">arg min 𝑖 ∑ 𝑥∈𝑆 𝑗 ‖𝑥 − 𝜇 𝑖 ‖ 2 = arg min 𝑖 ‖ s̄ 𝑗 − 𝜇 𝑖 ‖ 2 .</formula><p>We can prove the above version of the parallel axis theorem as follows:</p><formula xml:id="formula_8">∑ 𝑥∈𝑋 |𝑥 − 𝑎| 2 = ∑ 𝑥∈𝑋 |(𝑥 − X̄ ) − (𝑎 − X̄ )| 2 = ∑ 𝑥∈𝑋 ( |𝑥 − X̄ | 2 − 2(𝑥 − X̄ )(𝑎 − X̄ ) + |𝑎 − X̄ | 2 ) = ∑ 𝑥∈𝑋 |𝑥 − X̄ | 2 − 2(𝑎 − X̄ ) ∑ 𝑥∈𝑋 (𝑥 − X̄ ) + 𝑁 |𝑎 − X̄ | 2 = ∑ 𝑥∈𝑋 |𝑥 − X̄ | 2 + 𝑁 |𝑎 − X̄ | 2 ,</formula><p>where the middle term vanishes because ∑ 𝑥∈𝑋 (𝑥 − X̄ ) = 0. The classic parallel axis theorem is obtained for 𝑎 = 0.</p><p>We now consider the second step, updating the cluster centers. We start with the original calculation of the cluster centers 𝜇 𝑖 : K-means-CP calculates them as the mean of the assigned elements,</p><formula xml:id="formula_9">𝜇 𝑖 = 1 |{𝑥 𝑗 ∣ 𝑙 𝑗 = 𝑖}| ∑ 𝑥 𝑗 ∈𝑋 , 𝑙 𝑗 =𝑖 𝑥 𝑗 .</formula><p>Because the labels within each closed set are consistent, we can rewrite this to</p><formula xml:id="formula_10">𝜇 𝑖 = 1 ∑ 𝑆 𝑗 ∈𝐶 𝑖 |𝑆 𝑗 | ∑ 𝑆 𝑗 ∈𝐶 𝑖 ∑ 𝑥 𝑘 ∈𝑆 𝑗 𝑥 𝑘</formula><p>and by using s̄ 𝑗 = 1 |𝑆 𝑗 | ∑ 𝑥 𝑘 ∈𝑆 𝑗 𝑥 𝑘 , we obtain</p><formula xml:id="formula_11">𝜇 𝑖 = 1 ∑ 𝑆 𝑗 ∈𝐶 𝑖 |𝑆 𝑗 | ∑ 𝑆 𝑗 ∈𝐶 𝑖 |𝑆 𝑗 | s̄ 𝑗 .</formula><p>This shows that for closed neighborhood sets with given cluster labels, NCK-means computes the same cluster centers as K-means-CP. Because both steps of the algorithms produce the same results, we have proven that the final result of the algorithms is the same.</p></div>
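<div xmlns="http://www.tei-c.org/ns/1.0"><p>The identity at the heart of the proof is easy to verify numerically. The following snippet (ours, continuing the illustrative class above) compares both sides of the parallel axis identity for an arbitrary point set and reference vector 𝑎; the difference should be zero up to floating-point rounding.</p>
<code xml:space="preserve">
/** Numeric sanity check of the parallel axis identity:
 *  sum_x ||x - a||^2 == sum_x ||x - mean||^2 + N * ||a - mean||^2. */
static void checkParallelAxis(double[][] points, double[] a) {
    int n = points.length, d = points[0].length;
    double[] mean = new double[d];
    for (double[] x : points)
        for (int v = 0; v < d; v++) mean[v] += x[v] / n;
    double lhs = 0, scatter = 0;
    for (double[] x : points) {
        lhs += squaredDist(x, a);        // left-hand side: squared distances to a
        scatter += squaredDist(x, mean); // within-set scatter around the mean
    }
    double rhs = scatter + n * squaredDist(a, mean);
    System.out.printf("lhs = %f, rhs = %f, difference = %e%n", lhs, rhs, lhs - rhs);
}
</code></div>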
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Expected run time improvement</head><p>The run time improvement depends on the data reduction achieved by the closed neighbor sets. The standard K-means algorithm has a run time of 𝑂(𝑁 𝐾 𝑑𝑖), where 𝑁 is the number of points, 𝐾 is the number of clusters, 𝑑 is the dimensionality, and 𝑖 is the number of iterations. Finding the closed neighbor sets additionally takes 𝑂(𝑁 2 ) time (although indexes for similarity search may yield a considerable speedup), and K-means-CP of Ding and He <ref type="bibr" target="#b9">[10]</ref> hence has a complexity of 𝑂(𝑁 2 + 𝑁 𝐾 𝑑𝑖). NCK-means reduces this to 𝑂(𝑁 2 + |𝑆|𝐾 𝑑𝑖), where |𝑆| is the number of closed neighbor sets, and we must assume |𝑆| ∈ 𝑂(𝑁 ). In a worst-case asymptotic analysis, the improvements hence are likely negligible because the cost of finding the closed neighbor sets dominates; in practice, however, the method usually offers a decent run time improvement on the order of |𝑆|/𝑁 in the clustering step, at the cost of only a little additional memory to store the means of the closed neighbor sets. Hence, there is no reason not to use it. As it is best practice to run K-means several times and keep the best outcome <ref type="bibr" target="#b19">[20]</ref>, the cost of finding the closed neighbor sets can also be amortized over multiple restarts.</p><p>The choice of the neighborhood has an important effect on the run time. Finding the nearest neighbors becomes more expensive with a larger neighborhood size, but at the same time, increasing the number of edges decreases the number of closed neighbor sets. For NCK-means, a lower number of closed neighbor sets is beneficial, and both algorithms also benefit from an often lower number of iterations, because reassignments are less likely to happen with the additional consistency constraints.</p></div>
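<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a worked example with hypothetical numbers (not taken from the experiments in Section 5): if the neighbor graph reduces 𝑁 = 10000 points to |𝑆| = 3000 closed neighbor sets, then for 𝐾 = 10 each assignment step computes |𝑆| ⋅ 𝐾 = 30000 instead of 𝑁 ⋅ 𝐾 = 100000 distances, i.e., the clustering step needs only |𝑆|/𝑁 = 30% of the distance computations.</p></div>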
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Further run-time improvements</head><p>Because our proof shows that it is sufficient to use the mean of each closed neighbor set, it is trivially possible to combine this with acceleration techniques for K-means such as the algorithms of Elkan <ref type="bibr" target="#b20">[21]</ref> and Hamerly <ref type="bibr" target="#b21">[22]</ref>, or the more recent Exponion <ref type="bibr" target="#b22">[23]</ref> and Shallot <ref type="bibr" target="#b23">[24]</ref> algorithms. The weights can also be used to build a BETULA tree <ref type="bibr" target="#b24">[25]</ref>. We have not experimentally verified these additional speedups, as this would distract from the main objective of this paper. Furthermore, the differences between these algorithms will often be small compared to the cost of finding the nearest neighbors of all points to construct the closed neighborhoods in the beginning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head><p>To empirically verify the run time improvements, we implemented both versions in Java in the ELKI 0.8.0 data mining framework <ref type="bibr" target="#b25">[26]</ref>. All implementations are within the same codebase to reduce confounding factors and improve comparability <ref type="bibr" target="#b26">[27]</ref>. We perform 10 restarts on an Intel i5-4690K processor using a single thread, and report average values. We analyze the run time on the MNIST 784 data set, which contains grayscale pictures of handwritten digits of size 28 × 28 pixels, each vectorized to a vector of length 784. We compared the algorithms on subsets of sizes 𝑛 = 1000, 2000, … , 10000 for the 𝑘 NN and 𝑘 MN neighborhoods with 𝑘 = 1, 2, 3 neighbors. The initial cluster centers were chosen with K-means++ <ref type="bibr" target="#b27">[28]</ref>. The number of clusters is set to 𝐾 = 10 because there are ten digits in the data set. The time measured covers the calculation of the closed neighborhood sets and the iterations until the algorithm converges. The time to calculate the 𝑘 NN of the data set is omitted because it is required for both algorithms and is the most time-consuming part; time variances in the calculation of the 𝑘 NNs would otherwise overshadow the actual time difference between the algorithms. The ELKI framework automatically uses a k-d-tree or a vantage-point tree to accelerate nearest neighbor search <ref type="bibr" target="#b25">[26]</ref>, depending on data set characteristics and the distance function used; because of the high dimensionality, ELKI chose a VP-tree for this data set. In Figure <ref type="figure">6</ref>, we plot the run time of the 1-mutual-nearest-neighbor computation and NCK-means for different numbers of samples of MNIST data. For MNIST with 784 dimensions, the kNN/kMN computation takes a large share of the total time compared to the clustering computation; we expect it to account for a much smaller share in lower dimensions.</p><p>In Figure <ref type="figure" target="#fig_1">2</ref>, we plot the results for the asymmetric 𝑘 NN. While the run time improved in all cases, there are only noticeable differences for 𝑘 = 1, which get larger at 𝑛 = 8000. The small improvements for 𝑘 = 2, 3 are caused by 𝑘 NN inducing only very few closed neighborhood sets. For 𝑘 = 2, the maximum number of six closed neighborhood sets is reached at 𝑛 = 5000. For 𝑘 = 3, all data sets are covered by a single closed neighborhood set. This leads to the algorithms needing only one or two iterations until convergence. The spike for 𝑘 = 1 can also be explained by the number of iterations needed for the algorithms to converge, which reaches its clear maximum at 𝑛 = 8000. In this case, the run time improves by over 34%.</p><p>In Figure <ref type="figure" target="#fig_3">4</ref>, we plot the results for the symmetric 𝑘 MN. There are significant differences for all values of 𝑘 used. This is because the number of induced closed neighborhood sets appears to scale approximately linearly for 𝑘 MN. The spikes in the run time differences can again be explained by the varying number of iterations required for convergence. The greatest relative improvement is reached for 𝑘 = 3 at 𝑛 = 8000, reducing the run time by over 56%. Across all 𝑘 and 𝑛, we observe an average run time reduction of 30%. The run time improvement per iteration is largest if there are very few closed neighborhood sets, because in this case we save a lot of time by using the representative. But if there are too few closed neighborhood sets, the algorithm converges very fast, e.g., in the case of 3NN, where the initial clustering already converged. The clustering quality for very few closed neighborhood sets is also questionable, but sometimes surprisingly good. This is probably because of the relationship to DBSCAN, which also builds closed neighborhoods, and to spectral clustering, which partitions the nearest-neighbor graph.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>We improve the K-means-CP algorithm for clustering with k-Nearest-Neighbor consistency by avoiding unnecessary distance computations. Using the parallel axis theorem, we prove that we can reduce each closed neighborhood to a single representative, and use these representatives in clustering instead of considering all samples individually. This allows clustering with nearest-neighbor consistency on larger data sets than before.</p><p>The technique could be integrated into further algorithms. In COP-K-means <ref type="bibr" target="#b14">[15]</ref> and PCK-means <ref type="bibr" target="#b15">[16]</ref>, for example, this allows us to reduce must-link constraints to weighted means, too. But the speedup is only noticeable if we have many such constraints, and the typical scenario for constrained clustering is with only a few constraints given by the user to guide the clustering in a semi-supervised way.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Closed neighbor sets for mutual nearest neighbors with 𝑘 = 3. For neighbor-consistent clustering, these sets must be assigned to the same cluster.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Run time of K-means-CP and NCK-means for different numbers of samples of MNIST, for the k-nearest-neighbor neighborhood with 𝑘 = 1, 2, 3.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Number of iterations and closed neighborhood sets on MNIST data</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Run time of K-means-CP and NCK-means for different numbers of samples for k-mutual-nearest neighbor with 𝑘 = 1, 2, 3.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head></head><label></label><figDesc>Number of iterations required by K-means-CP for different numbers of samples for k-mutual-nearest neighbor with 𝑘 = 1, 2, 3. Number of closed neighborhood sets for different numbers of samples for k-mutual-nearest neighbor with 𝑘 = 1, 2, 3.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 : Figure 6 :</head><label>5, 6</label><figDesc>Figure 5: Number of iterations and closed neighborhood sets on MNIST data. Figure 6: Run time of the 1-mutual-nearest-neighbor computation and NCK-means for different numbers of samples of MNIST data.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Algorithm 1: ENFORCE</head><label></label><figDesc>ENFORCE reassigns each closed neighborhood set to its most frequent cluster label.</figDesc><note>
1 𝑙 ← cluster assignments by input clustering algorithm;
2 𝑆 ← calculate sets of closed neighborhoods;
3 𝑁 ← null;
4 foreach 𝑆 𝑖 ∈ 𝑆 = {𝑆 1 , … , 𝑆 ℎ } do // closed neighborhoods
5     𝑁 ← null;
6     foreach 𝑥 𝑗 ∈ 𝑆 𝑖 do // count cluster assignments
7         𝑁 𝑙 𝑗 ++;
8     𝑜 ← arg max 𝑁;
9     foreach 𝑥 𝑗 ∈ 𝑆 𝑖 do 𝑙 𝑗 ← 𝑜; // assign closed neighborhood
10 return 𝐿;
</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Algorithm 2: K-means-CP</head><label></label><figDesc>Consistency-preserving K-means: entire closed neighborhood sets are assigned to the nearest cluster center.</figDesc><note>
1 𝐶 ← initialize 𝑘 clusters;
2 𝑆 ← calculate sets of closed neighborhoods;
3 𝑙 ← dummy assignments to non-existent cluster;
4 repeat
5     foreach 𝑆 𝑖 ∈ 𝑆 = {𝑆 1 , … , 𝑆 ℎ } do // closed neighborhoods
6         𝑁 ← null;
7         foreach 𝜇 𝑗 ∈ 𝑀 = {𝜇 1 , … , 𝜇 𝑘 } do // determine nearest cluster
8             foreach 𝑥 𝑢 ∈ 𝑆 𝑖 do
9                 𝑁 𝑗 ← 𝑁 𝑗 + ‖𝑥 𝑢 − 𝜇 𝑗 ‖²;
10        𝑜 ← arg min 𝑁;
11        foreach 𝑥 𝑗 ∈ 𝑆 𝑖 do 𝑙 𝑗 ← 𝑜; // assign closed neighborhoods
12    if 𝑙 is unchanged then break;
13    𝑀 ← calculate cluster centers for 𝐶;
14 return 𝐿;
</note></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">On some clustering techniques</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">E</forename><surname>Bonner</surname></persName>
		</author>
		<idno type="DOI">10.1147/rd.81.0022</idno>
	</analytic>
	<monogr>
		<title level="j">IBM Journal of Research and Development</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="22" to="32" />
			<date type="published" when="1964">1964</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Why so many clustering algorithms -a position paper</title>
		<author>
			<persName><forename type="first">V</forename><surname>Estivill-Castro</surname></persName>
		</author>
		<idno type="DOI">10.1145/568574.568575</idno>
	</analytic>
	<monogr>
		<title level="j">SIGKDD Explorations</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="65" to="75" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Measures of clustering quality: A working set of axioms for clustering</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ackerman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ben-David</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NIPS</title>
		<imprint>
			<biblScope unit="page" from="121" to="128" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Silhouettes: A graphical aid to the interpretation and validation of cluster analysis</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Rousseeuw</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Comput. Appl. Math</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="53" to="65" />
			<date type="published" when="1987">1987</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A cluster separation measure</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">L</forename><surname>Davies</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Bouldin</surname></persName>
		</author>
		<idno type="DOI">10.1109/TPAMI.1979.4766909</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Pattern Anal. Mach. Intell</title>
		<imprint>
			<biblScope unit="page" from="224" to="227" />
			<date type="published" when="1979">1979</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A dendrite method for cluster analysis</title>
		<author>
			<persName><forename type="first">T</forename><surname>Calinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Harabasz</surname></persName>
		</author>
		<idno type="DOI">10.1080/03610927408827101</idno>
	</analytic>
	<monogr>
		<title level="j">Communications in Statistics</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="1" to="27" />
			<date type="published" when="1974">1974</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Well-separated clusters and optimal fuzzy partitions</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Dunn</surname></persName>
		</author>
		<idno type="DOI">10.1080/01969727408546059</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Cybernetics</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="95" to="104" />
			<date type="published" when="1974">1974</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Fast and eager k-medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms</title>
		<author>
			<persName><forename type="first">E</forename><surname>Schubert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Rousseeuw</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.is.2021.101804</idno>
	</analytic>
	<monogr>
		<title level="j">Information Systems</title>
		<imprint>
			<biblScope unit="volume">101</biblScope>
			<biblScope unit="page">101804</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Clustering by direct optimization of the medoid silhouette</title>
		<author>
			<persName><forename type="first">L</forename><surname>Lenssen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schubert</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-17849-8_15</idno>
	</analytic>
	<monogr>
		<title level="m">Similarity Search and Applications</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="190" to="204" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">K-nearest-neighbor consistency in data clustering: Incorporating local information into global optimization</title>
		<author>
			<persName><forename type="first">C</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<idno type="DOI">10.1145/967900.968021</idno>
	</analytic>
	<monogr>
		<title level="m">Symp. Applied Computing, SAC</title>
				<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="584" to="589" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">An evolutionary approach to multiobjective clustering</title>
		<author>
			<persName><forename type="first">J</forename><surname>Handl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Knowles</surname></persName>
		</author>
		<idno type="DOI">10.1109/TEVC.2006.877146</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Evol. Comput</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="56" to="76" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The relationship of DBSCAN to matrix factorization and spectral clustering</title>
		<author>
			<persName><forename type="first">E</forename><surname>Schubert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Morik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Lernen, Wissen, Daten, Analysen</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="330" to="334" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Connecting the dots -density-connectivity distance unifies DBSCAN, k-center and spectral clustering</title>
		<author>
			<persName><forename type="first">A</forename><surname>Beer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Draganov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hohma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jahn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Frey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Assent</surname></persName>
		</author>
		<idno type="DOI">10.1145/3580305.3599283</idno>
	</analytic>
	<monogr>
		<title level="m">Knowledge Discovery and Data Mining (KDD)</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>to appear</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Model selection for semi-supervised clustering</title>
		<author>
			<persName><forename type="first">M</forename><surname>Pourrajabi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Moulavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J G B</forename><surname>Campello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zimek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sander</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Goebel</surname></persName>
		</author>
		<idno type="DOI">10.5441/002/edbt.2014.31</idno>
	</analytic>
	<monogr>
		<title level="m">Extending Database Technology, EDBT</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="331" to="342" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Constrained k-means clustering with background knowledge</title>
		<author>
			<persName><forename type="first">K</forename><surname>Wagstaff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cardie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rogers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Schrödl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Int. Conf. Machine Learning (ICML)</title>
				<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="577" to="584" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Active semi-supervision for pairwise constrained clustering</title>
		<author>
			<persName><forename type="first">S</forename><surname>Basu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Mooney</surname></persName>
		</author>
		<idno type="DOI">10.1137/1.9781611972740.31</idno>
	</analytic>
	<monogr>
		<title level="j">SIAM Data Mining (SDM)</title>
		<imprint>
			<biblScope unit="page" from="333" to="344" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Concept decompositions for large sparse text data using clustering</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">S</forename><surname>Dhillon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Modha</surname></persName>
		</author>
		<idno type="DOI">10.1023/A:1007612920971</idno>
	</analytic>
	<monogr>
		<title level="j">Mach. Learn</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="143" to="175" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Accelerating spherical k-means</title>
		<author>
			<persName><forename type="first">E</forename><surname>Schubert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Feher</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-89657-7_17</idno>
	</analytic>
	<monogr>
		<title level="m">Similarity Search and Applications</title>
				<imprint>
			<publisher>SISAP</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="217" to="231" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Reducing UK-means to K-means</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">D</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cheng</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICDMW.2007.40</idno>
	</analytic>
	<monogr>
		<title level="m">Workshops Int. Conf. Data Mining (ICDM)</title>
				<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="483" to="488" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Stop using the elbow criterion for k-means and how to choose the number of clusters instead</title>
		<author>
			<persName><forename type="first">E</forename><surname>Schubert</surname></persName>
		</author>
		<idno type="DOI">10.1145/3606274.3606278</idno>
	</analytic>
	<monogr>
		<title level="j">SIGKDD Explorations</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="36" to="42" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Using the triangle inequality to accelerate k-means</title>
		<author>
			<persName><forename type="first">C</forename><surname>Elkan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Int. Conf. Machine Learning (ICML)</title>
				<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="147" to="153" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Making k-means even faster</title>
		<author>
			<persName><forename type="first">G</forename><surname>Hamerly</surname></persName>
		</author>
		<idno type="DOI">10.1137/1.9781611972801.12</idno>
	</analytic>
	<monogr>
		<title level="m">SIAM Data Mining (SDM)</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="130" to="140" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Fast k-means with accurate bounds</title>
		<author>
			<persName><forename type="first">J</forename><surname>Newling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fleuret</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Int. Conf. Machine Learning (ICML)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="936" to="944" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Even faster exact k-means clustering</title>
		<author>
			<persName><forename type="first">C</forename><surname>Borgelt</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-44584-3_8</idno>
	</analytic>
	<monogr>
		<title level="m">Symp. Intelligent Data Analysis (IDA)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="93" to="105" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">BETULA: fast clustering of large data with improved BIRCH CF-trees</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schubert</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.is.2021.101918</idno>
	</analytic>
	<monogr>
		<title level="j">Inf. Syst</title>
		<imprint>
			<biblScope unit="volume">108</biblScope>
			<biblScope unit="page">101918</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Automatic indexing for similarity search in ELKI</title>
		<author>
			<persName><forename type="first">E</forename><surname>Schubert</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-17849-8_16</idno>
	</analytic>
	<monogr>
		<title level="m">Similarity Search and Applications</title>
				<meeting><address><addrLine>SISAP</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">The (black) art of runtime evaluation: Are we comparing algorithms or implementations?</title>
		<author>
			<persName><forename type="first">H</forename><surname>Kriegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schubert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zimek</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10115-016-1004-2</idno>
	</analytic>
	<monogr>
		<title level="j">Knowl. Inf. Syst</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page" from="341" to="378" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">k-means++: the advantages of careful seeding</title>
		<author>
			<persName><forename type="first">D</forename><surname>Arthur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vassilvitskii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Symp. Discrete Algorithms, SODA</title>
				<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="1027" to="1035" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
