=Paper= {{Paper |id=Vol-2572/paper7 |storemode=property |title=KD-means: Clustering Method for Massive Data based on KD-tree |pdfUrl=https://ceur-ws.org/Vol-2572/paper7.pdf |volume=Vol-2572 |authors=Nabil El Malki,Franck Ravat,Olivier Teste |dblpUrl=https://dblp.org/rec/conf/dolap/MalkiRT20 }} ==KD-means: Clustering Method for Massive Data based on KD-tree== https://ceur-ws.org/Vol-2572/paper7.pdf

KD-means: clustering method for massive data based on
kd-tree
Nabil El malki Franck Ravat Olivier Teste
Capgemini, Université de Toulouse, Université de Toulouse, UT1, Université de Toulouse, UT2J,
UT2J, IRIT(CNRS/UMR5505) IRIT(CNRS/UMR 5505) IRIT(CNRS/UMR 5505)
Toulouse, France Toulouse, France Toulouse, France
nabil.el-malki@capgemini.com franck.ravat@irit.fr olivier.teste@irit.fr

ABSTRACT solutions have been proposed to estimate the relevant number of
K-means clustering is a popular unsupervised classification algo- clusters 𝑘 [10, 16, 17]. Among the limitations of these solutions,
rithm employed in several domains, e.g., imaging, segmentation, we show three major ones that challenge their normal process:
or compression. Nevertheless, the number of clusters k, fixed apri- • they unroll k-means several times on the same detailed
ori, affects mainly the clustering quality. Current State-of-the-art data. However, k-means has difficulties in scaling because
k-means implementations could automatically set of the number it has a temporal complexity proportional to 𝑘𝑛𝑡 with 𝑡
of clusters. However, they result in unreasonable processing time the number of iterations [11]. Consequently, in a context
while classifying large volumes of data. In this paper, we propose of massive data, repetitive access to data caused by the
a novel solution based on kd-tree to determine the number of repetitive use of k-means on the same data make them
cluster k in the context of massive data for preprocessing data computational time consuming or unusable for applica-
science projects or in near-real-time applications. We demon- tions that require near-real-time responses;
strate how our solution outperforms current solutions in terms • they hardly supports overlaped clusters. This problem can
of clustering quality, and processing time on massive data. lead to two different cases. i) overestimating the number
of clusters exaggeratedly, compared to the actual number,
1 INTRODUCTION thus increasing the processing time significantly. ii) un-
derestimating the value of k, thus producing a very poor
Up to now, data clustering, also known as cluster analysis, has clustering quality;
been one of the most important tasks in exploratory data analy- • if we consider clusters that approach the sphere form,
sis. It is also applied in a variety of applications, e.g., web page which k-means identifies, then these k estimation solu-
clustering, pattern recognition, image segmentation, data com- tions tend to capture clusters of this form. While, in real
pression and nearest neighbor search [12]. Various clustering data sets, there are also clusters of spherical shapes that are
algorithms have been available since the early 1950s. The goal of more or less elongated, approaching the elliptical shape
data clustering is to classify a set of patterns, points or objects and its declines. These clusters could also have different di-
into groups known as clusters. Data of each group are as similar rections. We consider a cluster to be strictly spherical if it
as possible to one another, and different groups are as dissimilar represents a perfectly or almost spherical shape. Similarly,
as possible from one another [11]. In this paper, we consider one a cluster not strictly or less spherical when it has a shape
of the most widely used data clustering algorithm, i.e., k-means that tends towards the elliptical. All these clusters of dif-
[6, 8]. Due to its simplicity and understandability, this algorithm ferent sphericity will be called natural clusters. Formally,
has been popular for more than four decades [11][13]. Given a when a cluster has a strictly spherical data distribution
set of points 𝑋 = {𝑥1, ..., 𝑥𝑛 } where each point is defined in the then the variances of each data dimension are equal. If the
d-dimensional space R𝑑 and 𝑘 an integer, k-means aims to parti- cluster has a non-strict sphericity, then the dimensions
tion 𝑋 into a set of clusters 𝐶 = {𝐶1, ..., 𝐶𝑘 } by minimizing the have different variances. Statistically, these clusters can
euclidean distance between the points (data) within each cluster: be assimilated to gaussian distributions.
𝑘
Õ Õ In this paper, we propose a solution to these problems. We
||𝑥 − 𝐺𝑘 || 2 (1)
introduce an algorithm, based on the use of kd-tree[3], that esti-
𝑗=1 ∀𝑥 ∈𝐶𝑘
mates the number of 𝑘 clusters for massive data. This estimation
where 𝐺𝑘 the centroid of the cluster 𝐶𝑘 and ||.|| the euclidean performs in reasonable processing time, and thus it is possi-
distance. Minimizing this function requires an iterative process ble to integrate such solutions into data-intensive applications,
that ends when centroids do not change between two iterations. e.g., near-real-time applications, or the preprocessing stage of
This process consists of two steps: a data analysis project. Our solution is an algorithm, based on
• assigning each point to the nearest centroid using the kd-tree(k dimentional tree), which hierarchically structures data
euclidean distance; aggregates, e.g., clusters of data points and their corresponding
• recalculating the centroids, i.e., the average of the point metadata, with different levels of precision details. Furthermore,
values, of the newly formed clusters. we define several new clusters merge criteria to support more
Disadvantageously, k-means requires that the user should pre- efficiently overlapping and natural clusters.
define the value of 𝑘. Thus, determining the inaccurate value of We compare our solution to known methods of estimating 𝑘
k has a direct negative impact on the clustering quality. Different from literature (x-means [16] and g-means [7, 10]). We show that
our solution is faster than current methods while ensuring better
© Copyright 2020 for this paper held by its author(s). Published in the proceedings of clustering quality on massive data.
DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT
2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution
This paper is organized as follows. In Section 2, we study the
4.0 International (CC BY 4.0). state-of-the-art solutions. Next, in Section 3 we introduce our
Figure 1: Estimation of k by each of the three algorithms when clusters overlap. The respective estimates of x-means and g-means are
67 and 26. The centroids are represented by black crosses. The horizontal and vertical axes refer to both dimensions of the used dataset.

contribution. In Section 4, we compare our solution with that of • k-means is executed with k=2, two children’s centroids
our competitors on data sets containing up to 4 million points. are obtained
Finally we conclude in Section 5. • the bic of the same region is calculated but taking into
account the two children’s centroids instead of their parent
centroid.
2 BACKGROUND
For several decades, the problem of estimating the number of If the previous Bic is smaller than the next Bic, then the parent
clusters 𝑘 remains significant in the literature. Several studies centroid is replaced by its two children’s centroids or the parent
have been conducted to prevent the user from defining an inex- centroid is retained.
act value of 𝑘, thus leading to counterintuitive results, and it is G-MEANS. This algorithm [10] wraps k-means and adopts
disparate from the expected value of k. We identify two types of almost the same approach as x-means. Instead of the BIC criterion
clustering solutions close to k-means. Some integrate the estima- the gaussianity test (at the level of 𝛼 confidence defined by the
tion of the value of 𝑘. Others do not, as in the case of Birch [22]. user) on clusters is used to decide whether to divide a centroid
The latter is a hierarchical algorithm based on a micro-clustering in two.
approach. It consists of two main phases. The first tends to pro- It starts from a set of centroids of size 𝑘𝑚𝑖𝑛 and it increases
duce many small spherical clusters without having to specify the set size during the process. The value of 𝑘𝑚𝑖𝑛 is initialized to
the number to be produced. The second uses another clustering 1 because usually we have no a priori knowledge of the possible
algorithm to perform macro-clustering. We have not retained it values of k. In this case, the first centroid corresponds to the
in our study because in the second phase an algorithm should be artihmetic mean of the values of all data points. A list is defined
used to apply it on the representatives of small clusters (usually to contain the centroids whose sets of points surrounding them
centroids). The most used algorithms in the second phase and are gaussians so as not to be processed in subsequent iterations.
adapted for the first phase are often k-means or the hierarchical At each iteration, the centroids not stored in the list called
ascendant clustering [12] to which the value of 𝑘 should be given parent centroids, are divided into two children’s centroids c1 and
a priori. We focus on two algorithms specialized in the automatic c2. This division is operated via the principal component analysis
estimation of k: x-means [16] and g-means [7, 10]. (pca). The pca method is a time complexity of 𝑂 (𝑛𝑑 2 + 𝑑 3 ) [21]
X-MEANS. X-means [16] is a k-means-based algorithm that with 𝑑 the data point dimension and 𝑛 the size of the dataset. So
consists of searching for the set of centroids that best fit the data. the authors recommended using accelerated versions of pca such
Starting from a minimum 𝑘 (𝑘𝑚𝑖𝑛 ), generally equal to one, the as pca based on the power method [4]. The children’s centroids
algorithm iteratively adds new centroids if necessary until the are then refined through the execution of 2-means (k=2). Then
maximum value of k (𝑘𝑚𝑎𝑥 ) is reached. At each iteration, the the gaussianity test, via the Anderson-Darling test [19] (only
algorithm identifies the subset of centroids that must be divided works on points defined in one dimension), is performed on
in two. This identification is done using a statistical criterion points around the parent centroid. To do this, the points around
called bayesian information criterion (BIC). This is used for model the parent centroid are first projected on the vector 𝑣 = 𝑐1 − 𝑐2
selection and is based on the likelihood function. linking the two children centroids to obtain points defined on a
The x-means process consists of two operations that are iter- single dimension. If these points follow a gaussian distribution
ated until completion: Imporve_param and improve_structure. then the centroid is added to the list of centroids that will no
Imporve_param is only the execution of k-means with the cur- longer be processed in the following iterations. Otherwise, it is
rent k as parameter. Improve_structure researches where new replaced by its two children’s centroids. The value assigned to 𝛼
centroids must be added. For each centroid and its associated by the authors [10] is 0.0001 to test their solution on synthetic
region of points, three sub-operations follow each other: and real data.
LIMITATIONS. The limitations of the above methods are
• the corresponding bic is calculated related to the difficulty of dealing with overlapping clusters, the
Figure 2: Overall solution for estimating the value of k.

quality of clustering (and therefore also the value of k) they step is a regularization phase (Subsection 3.3). It decides whether
produce as well as the execution time to provide a result. the small cluster is a separate cluster or should be merged with
The both methods are of different computational complexity another cluster. This phase avoids having an overestimation of k
and each can identify only a limited number of more or less spher- due to small clusters. Note that the nodes are processed according
ical cluster types. X-means is suitable for data whose clusters to the post-fixed path, i.e., each node is processed after each of
are strictly spherical [10]. If other forms of clusters are present, its children is processed. This path avoids exploring a node un-
then the estimate of k could be overestimated. On the other hand, necessarily (time consuming task because it involves conditional
g-means could identify clusters whose spherical shape is less tests) in case it will not be processed because none of its children
strict than that identified by x-means. If the clusters are well are processed yet.
spaced apart, g-means could provide the relevant value of k [10].
In addition to this distribution, the strict sphericity requirement 3.1 KD-TREE
must be added to the clusters for x-means to have the same per-
Kd-tree puts points that are spatially close together in the same
formance. If the clusters overlap then none of them estimates the
node. This property is exploited by several machine learning
correct value of k. Under these conditions x-means and g-means
methods to accelerate calculations related to point space. One
tend to overestimate k.
of the best known cases is the k-closest neighbors algorithm [5].
These cases are illustrated by Figure 1. There are 5 gaussian
This property is advantageous to us in two cases. First of all,
clusters that have been generated but with a more or less different
it partly addresses the problem of k-means, which is to group
sphericity. Two clusters are well separated from each other. The
together the most similar points possible in the same cluster.
other three overlap. All three algorithms were run on this set of
Second, it provides an optimized spatial subdivision of space to
20,000 data. The 𝑘𝑚𝑎𝑥 has been set at 140. G-means identified 26
accelerate data processing. It recursively splits the whole dataset
clusters but detected the 2 clusters well separated from the others.
into partitions, in a manner similar to a decision tree acting on nu-
However, there is overfitting on the remaining three clusters, an
merical datasets [9]. The root node is the whole data space, while
overestimation of 24 clusters instead of only estimating 3. X-
the leaves are the smallest possible partitions of the data space.
means estimated k at 76. Overestimation is slightly higher for
A node is associated with several closed rectangular regions of
overlapping clusters than for separate clusters.
space, called hyper-rectangles. Traditionally, in the literature, the
In terms of execution time, the two algorithms are not suitable
hyper-rectangle is only used to represent the subspace that the
when the data are massive and it is even more accentuated when
node represents. In addition, we will also use it for other pur-
the estimated value of k tends to be large.
poses in 3.1.1 because of its advantages in terms of calculation
and representation of sets.
3 CONTRIBUTION
In this section, we will define our solution that addresses the 3.1.1 HYPER-RECTANGLE. Formally, it corresponds to the
problem of estimating k in a context of massive data and over- smallest possible rectangle (generalized to 𝑛 dimensions) covering
lapping clusters. Figure 2 illustrates the different steps of our all the data points of a set. It is defined by the following equation:
solution.
The solution is composed of two main parts; i) the storage and 𝐻 = {𝑋 |𝑋𝑚𝑖𝑛𝑠 ⩽ 𝑋𝑖 ⩽ 𝑋𝑚𝑎𝑥𝑒𝑠 ∀𝑖} (2)
organization of the 𝑋 data provided by the user in a data structure,
and ii) the processing is done on this structure to estimate 𝑘 where, 𝑚𝑖𝑛𝑠 and 𝑚𝑎𝑥𝑒𝑠 respectively represent the lower and
(Merging task and Cluster regularization). upper limits of the hyper-rectangle. Indeed, 𝑚𝑖𝑛𝑠(𝑚𝑎𝑥𝑒𝑠) is a
We opted for the kd-tree[3] data structure to fulfill the roles vector corresponding to the minimum (maximum) values of each
of storage and data organization. A kd-tree is a binary tree that dimension of the 𝑋 data set (see Figure 3).
represents a hierarchical subdivision of space using splitting In our case, we consider a hyper-rectangle as a geometric
planes that are orthogonal to the coordinate dimensions (do not object to approximate the overall spatial distribution of a set of
confuse the k of kd-tree which is just the dimension of the data points contained in the node. In other words, each of the clusters
stored in the tree and the k of k-means which corresponds to the of a node is geometrically represented by a hyper-rectangle. From
number of clusters). In Subsection 3.1 we discuss the advantages this representation results, in our solution, a time saving on the
of kd-tree as well as the method that builds the kd-tree in 3.1.2. calculations that involves sets (a set is a cluster of points). For
Concerning the processing part, first of all we proceed to the example, to calculate a distance between two sets or perform
estimation of k clusters in each of the leaves (see 3.1.4). This op- a set operation involving at least two sets, their corresponding
eration results, in each leaf, in the constitution of several clusters. hyper-rectangles could be used. Therefore, instead of visiting all
Then the clusters of the nodes are merged recursively from the the data points of the sets, it is only necessary to use the 𝑚𝑖𝑛𝑠 and
leaves to the root according to rules defined in Subsection 3.2.2. 𝑚𝑎𝑥𝑒𝑠 vectors of the hyper-rectangles for the above calculations.
These rules are built to manage overlapping clusters. The final This greatly saves a lot of time on large amounts of data.
associated with a list of indices of the data points that it rep-
resents, the arithmetic mean of these data and the sum of the
squared differences (𝑠𝑠𝑒) between each point and the arithmetic
mean. The data points are not stored in the tree but are accessed
via the indices. The internal nodes have in addition the splitting
dimension index 𝑠𝑑 as well as the associated value 𝑠𝑣.

3.1.4 INITIALIZATION OF LEAVES. During the tree building
Figure 3: A 2-dimensional hyper-rectangle 𝐻 and the associated and more precisely during the instantiation of the leaf, we es-
bounds 𝑚𝑎𝑥𝑒𝑠 and 𝑚𝑖𝑛𝑠 located on the corners (upper right and timate the number of clusters present in the subset of the data
lower left). Note that 𝑋 is the dataset with two dimensions (d0 represented by this leaf (see Algorithm 1). At the same time the
and d1) and the circles represent the points of X. The 𝑚𝑎𝑥(𝑚𝑖𝑛) clustering of these data is carried out, which results in a set of
function returns the maximum (minimum) of both dimensions. clusters. When clusters are close to each other or overlap, they
could overestimate the value of 𝑘 in an exaggerated way. For
this, the value of 𝑘 was limited to a maximum value (𝑙𝑒𝑎𝑓 _𝑘𝑚𝑎𝑥)
3.1.2 SPILITTING METHOD. The partitioning of a data space so that g-means (this version of g-means is called 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒𝐾 in
of an internal node (i.e., not leaf) is performed mainly based Algorithm 1) cannot return more than a limited number of clus-
on two cutting elements that must be specified: a given data ters. This number is called 𝑘𝑡 with 𝑡 ∈ N∗ the 𝑖𝑡 ℎ iteration of
dimension (𝑠𝑑) of the node space and a value of this dimension 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒𝐾. Indeed, at each iteration 𝑡 we test if 𝑘𝑡 >= 𝑙𝑒𝑎𝑓 _𝑘𝑚𝑎𝑥 .
(𝑠𝑣). Thus the points whose value at the dimension 𝑠𝑑 is less The value of 𝑘𝑡 could be up to 2𝑡 . This case only occurs if all
(resp. greater) than 𝑠𝑣 then they will be assigned to the child’s the centroids have been split in two because they have not been
node which is called 𝑙𝑒𝑠𝑠𝑐ℎ𝑖𝑙𝑑 (resp. 𝑔𝑟𝑒𝑎𝑡𝑒𝑟𝑐ℎ𝑖𝑙𝑑). This process evaluated as gaussian. Note that even if the total number of clus-
is carried out recursively from the root to the leaves. ters in all leaves exceeds 𝑘𝑚𝑎𝑥 then our merging algorithms will
Several rules for splitting nodes are proposed to choose the op- approach this number towards the real k of the dataset.
timal dimension and associated value for correct data separation. Let 𝑤 be the number of leaves that kd-tree has, 𝑙𝑒𝑎𝑓 _𝑘𝑚𝑎𝑥 is
Among them, we opted for sliding midpoint splitting rule [18] then defined as follows:
because it provides an optimized data organization than other (√ √
classical rules [14]. This performance on others is explained be- 𝑘𝑚𝑎𝑥 , if 𝑤 >= 𝑘𝑚𝑎𝑥
cause it does not produce empty nodes or nodes whose data 𝑙𝑒𝑎𝑓 _𝑘𝑚𝑎𝑥 = 𝑘𝑚𝑎𝑥 (3)
𝑤 , otherwise
space is very sparse (i.e., a very large space, specifically at the
sides, compared to the data it represents when it should be small). If we assume that in Equation 3 only the second condition ex-
In addition, the classical rules choose the median as the cutting ists and 𝑙𝑒𝑎𝑓 _𝑘𝑚𝑎𝑥 will always be equal to the result of this same
value 𝑠𝑣 while the sliding midpoint splitting rule chooses the mid- condition then this would induce 𝑙𝑒𝑎𝑓 _𝑘𝑚𝑎𝑥 to tend towards a
dle of the points ((𝑚𝑎𝑥 + 𝑚𝑖𝑛)/2) which is less expensive. The value of 1 or 0. Indeed for a given value of 𝑘𝑚𝑎𝑥 , the greater 𝑤 is
dimension 𝑠𝑑 chosen is the one that is the longest (𝑚𝑎𝑥 − 𝑚𝑖𝑛). the greater the 𝑙𝑒𝑎𝑓 _𝑘𝑚𝑎𝑥 tends towards 1 if 𝑤 <= 𝑘𝑚𝑎𝑥 . When
Note that theoretically there is no guarantee on the limit of 𝑤 > 𝑘𝑚𝑎𝑥 then 𝑙𝑒𝑎𝑓 _𝑘𝑚𝑎𝑥 tends towards 0. These cases could
the depth that kd-tree can have, the trees could be very deep (so make the solution ineffective because they would underestimate
the time of tree construction is increased). As a result, we have the number of clusters in the leaf and by extension underestimate
added two stopping conditions to avoid deep trees. These two the number of clusters in the tree dataset. The first condition
conditions are performed at the beginning of the method that is essential to maintain a balance between having an underesti-
partitions the node in two. If one of the conditions is verified mation of 𝑘 if there is only the second condition and having an
then the node will not be divided and will be considered as a leaf: overestimation of k if there is no condition that would limit the
• the depth of the leaf is limited to 𝑙𝑜𝑔(𝑛) to save construc- value of k.
tion time compared to the normal time taken if the depth After the estimation of k, each cluster of data points is repre-
is not limited. The root has a depth of 1; sented by a hyper-rectangle as well as the associated aggregated
• each node is limited by a minimum number of points, ψ. calculations (arithmetic mean (G) and the difference between the
It is not interesting to have leaves that have a number points and G (𝑠𝑠𝑒)).
of points close to one in a context of massive data. This
would result in a very long kd-tree construction time and
therefore make our solution unsuitable for near real-time Algorithm 1: InitializeLeaf
applications. In addition, if the sum of the number of points input :𝑘𝑚𝑖𝑛 ,𝑘𝑚𝑎𝑥 ,𝑑𝑎𝑡𝑎,𝑙𝑒𝑎𝑓
of the node is less than 2 ∗ ψ then it will be considered output : 𝐿𝐻
as a leaf. Indeed, two cases are possible if this limit is 1 𝐿𝐷 ← 𝑑𝑎𝑡𝑎𝑙𝑒𝑎𝑓 .𝑖𝑛𝑑𝑒𝑥𝑒𝑠
not defined. Either the two children will be leaves if their
2 𝐺, 𝑘, 𝐼 ← 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒𝐾 (𝐿𝐷, 𝑘𝑚𝑖𝑛 , 𝑘𝑚𝑎𝑥 )
number of points (size) is less than ψ. Either the size of
one will be greater than ψ and therefore the other is a leaf. 3 for 𝑗 = 1 to 𝑗 = 𝑘 do
In the latter case a difference in depth will occur which 4 𝐿𝐻 𝑗 .𝑖𝑛𝑑𝑒𝑥𝑒𝑠 ← 𝐼 𝑗
brings a certain imbalance of the tree. This condition limits 5 𝐿𝐻 𝑗 .𝐺 ← 𝐺 𝑗
the number of small leaves, limits the depth of the tree 6 𝐿𝐻 𝑗 .𝑚𝑎𝑥𝑒𝑠 ← 𝑚𝑎𝑥 (𝐿𝐷 𝐼 𝑗 )
and contributes to the balance of the tree. 7 𝐿𝐻 𝑗 .𝑚𝑖𝑛𝑠 ← 𝑚𝑖𝑛(𝐿𝐷 𝐼 𝑗 )
3.1.3 NODE COMPONENTS. Each node contains a set of
hyper-rectangles. We call this set 𝐿𝐻 . Each hyper-rectangle is
3.2 NODE MERGING The minimum distance between two hyper-rectangles x and y
At the end of node initialization, clusters are obtained in each is calculated as follows:
node that are considered as subclusters of the final clusters. These 𝑚𝑖𝑛𝐻 (𝑥, 𝑦) = ||0−𝑚𝑎𝑥 (0, 𝑚𝑎𝑥 (𝑥 .𝑚𝑖𝑛𝑠−𝑦.𝑚𝑎𝑥𝑒𝑠, 𝑦.𝑚𝑖𝑛𝑠−𝑥 .𝑚𝑎𝑥𝑒𝑠))||
latter are located in the root and are considered to be the closest to (8)
real clusters. The merging of clusters is necessary to achieve the where the function 𝑚𝑎𝑥 () returns the maximum of the two num-
final clusters. In this subsection Algorithms 3 and 2 for merging bers given as parameters.
clusters will be presented. But first, we will list the functions and
definitions that will be used in these algorithms. Algorithm 2: Local merger
3.2.1 DEFINITIONS. input : 𝐿𝐻 , τ, 𝑙𝑒𝑛𝑔ℎ𝑡, 𝑡ℎ_𝑒𝑣𝑜
output : 𝐿𝐻
Definition 3.1. Let 𝐿𝐻 be a set of hyper-rectangles, the 𝑚𝑎𝑥𝑒𝑠
(resp. 𝑚𝑖𝑛𝑠) vector associated with 𝐿𝐻 is the maximum (resp. 1 𝑚𝑒𝑟𝑔𝑒𝑑 ← 𝑇𝑟𝑢𝑒

minimum) of each dimension of the 𝑚𝑎𝑥𝑒𝑠 (resp. 𝑚𝑖𝑛𝑠) vectors 2 𝑖𝑡 ← 0
of the hyper-rectangles of 𝐿𝐻 : 3 𝑜𝑙𝑑_𝑐𝑎𝑟𝑑_𝑙ℎ ← 𝑐𝑎𝑟𝑑 (𝐿𝐻 )

( 4 while 𝑚𝑒𝑟𝑔𝑒𝑑 = 𝑇𝑟𝑢𝑒 do
𝑚𝑖𝑛𝑠 = 𝑚𝑖𝑛({𝐿𝐻ℎ .𝑚𝑖𝑛𝑠 |ℎ ∈ [1; 𝑐𝑎𝑟𝑑 (𝐿𝐻 )]}) 5 𝑥 ← 𝐿𝐻𝑖𝑡
𝑔𝑒𝑡𝐵𝑜𝑢𝑛𝑑𝑠 (𝐿𝐻 ) = 6 𝑇 ← {𝑖𝑡 }
𝑚𝑎𝑥𝑒𝑠 = 𝑚𝑎𝑥 ({𝐿𝐻ℎ .𝑚𝑎𝑥𝑒𝑠 |ℎ ∈ [1; 𝑐𝑎𝑟𝑑 (𝐿𝐻 )])
(4) 7 𝑡𝑚𝑝_𝑖𝑛𝑑𝑒𝑥𝑒𝑠 ← {0, ..., 𝑜𝑙𝑑_𝑐𝑎𝑟𝑑_𝑙ℎ}
8 𝑡𝑚𝑝_𝑖𝑛𝑑𝑒𝑥𝑒𝑠 ← 𝑡𝑚𝑝_𝑖𝑛𝑑𝑒𝑥𝑒𝑠\{𝑖𝑡 }
Definition 3.2. Let 𝑃 = {(𝑢, 𝑣)|𝑢, 𝑣 ∈ N ∩ [1; 𝑘] and 𝑢 ≠ 𝑣 }
9 for 𝑖 𝑖𝑛 𝑡𝑚𝑝_𝑖𝑛𝑑𝑒𝑥𝑒𝑠 do
a list of index pairs of hyper-rectangles of a node. We consider
10 𝑦 ← 𝐿𝐻𝑖
(𝑢, 𝑣) = (𝑣, 𝑢) and consequently P contains only one of them. The
11 Δ𝐺 ← ||𝑥 .𝐺 − 𝑦.𝐺 ||
average of the pairwise distances is calculated by the following
function: 12 𝑟𝑎𝑡𝑖𝑜 ← Δ𝐺/𝑙𝑒𝑛𝑔ℎ𝑡
Í 13 if 𝑟𝑎𝑡𝑖𝑜 < τ /𝜖 then
(𝑢,𝑣) ∈𝑃 ||𝐿𝐻𝑢 .𝐺 − 𝐿𝐻 𝑣 .𝐺 ||
𝑎𝑝𝑑 (𝐿𝐻 ) = (5) 14 Δ𝐶 ← ||((𝑥 .𝑚𝑖𝑛𝑠 + 𝑥 .𝑚𝑎𝑥𝑒𝑠)/2) −
𝑐𝑎𝑟𝑑 (𝑃)
((𝑦.𝑚𝑖𝑛𝑠 + 𝑦.𝑚𝑎𝑥𝑒𝑠)/2)||
Lemma 3.3. Given two datasets 𝑈 ∈ R𝑑 and 𝑃 ∈ R𝑑 . 𝑛 1 and 𝑛 2 15 𝑒 ← 𝐹𝑎𝑙𝑠𝑒
are respectively the number of points contained in 𝑈 and 𝑃. If the 16 if Δ𝐺 < Δ𝐶 then
sum of squared errors of 𝑈 and 𝑃 are as follow: 17 𝑚𝑖𝑛𝐷𝑖𝑠𝑡𝑎𝑛𝑐𝑒𝑋𝑌 ← 𝑚𝑖𝑛𝐻 (𝑥, 𝑦)
𝑛1
Õ 18 if 𝑚𝑖𝑛𝐷𝑖𝑠𝑡𝑎𝑛𝑐𝑒𝑋𝑌 = 0 and
𝑆𝑆𝐸𝑈 = ||𝑢𝑖 − 𝑢¯ || 2 Δ𝐺 < Δ𝐶 ∗ 𝜆 then
𝑖=1
19 𝑒 ← 𝑇𝑟𝑢𝑒
𝑛2
Õ 20 if 𝑒 = 𝑇𝑟𝑢𝑒 and 𝑒𝑣𝑜 <= 𝑡ℎ_𝑒𝑣𝑜 and
𝑆𝑆𝐸𝑃 = ||𝑝𝑖 − 𝑝¯ || 2
𝑖=1
𝑚𝑖𝑛𝐷𝑖𝑠𝑡𝑎𝑛𝑐𝑒𝑋𝑌 /𝑙𝑒𝑛𝑔ℎ𝑡 <= τ /𝜖 then
then the the sum of squared errors of 𝑉 = 𝑈 ∪ 𝑃 is: 21 𝑒 ← 𝑇𝑟𝑢𝑒
22 ‘if 𝑒 = 𝑇𝑟𝑢𝑒 then
𝑆𝑆𝐸𝑉 = 𝑆𝑆𝐸𝑈 + 𝑆𝑆𝐸𝑃 + 𝑛 1 ||𝑈¯ − 𝑉¯ || 2 + 𝑛 2 ∗ ||𝑃¯ − 𝑉¯ || 2
(6)
23 𝑇 ←𝑇 ∪𝑖
where 𝑉 can computed as the average of the weighted averages
¯ 24 if 𝑐𝑎𝑟𝑑 (𝑇 ) > 1 then
𝑛 1𝑈¯ + 𝑛 2 𝑃¯ 25 𝐿𝐻 ← 𝑚𝑒𝑟𝑔𝑒 (𝑇 )
(7)
𝑛1 + 𝑛2 26 //The new hyper-rectangles are placed at
the end of the list
Proof. We are trying to calculate the sum of two sse: 27 𝑛𝑒𝑤_𝑐𝑎𝑟𝑑_𝑙ℎ ← 𝑐𝑎𝑟𝑑 (𝐿𝐻 )
𝑛1 𝑛2
Õ Õ 28 if 𝑜𝑙𝑑_𝑐𝑎𝑟𝑑_𝑙ℎ < 𝑛𝑒𝑤_𝑐𝑎𝑟𝑑_𝑙ℎ then
𝑆𝑆𝐸𝑉 = ||𝑈𝑖 − 𝑉¯ || 2 + ||𝑃𝑖 − 𝑉¯ || 2 29 𝑜𝑙𝑑_𝑐𝑎𝑟𝑑_𝑙ℎ ← 𝑛𝑒𝑤_𝑐𝑎𝑟𝑑_𝑙ℎ
𝑖=1 𝑖=1
| {z } | {z } 30 𝑖𝑡 ← 0
𝑆𝑆𝐸𝑈 𝑉 𝑆𝑆𝐸𝑃𝑉 31 else
Let focus on the first term of this equation: 32 if 𝑖𝑡 + 1 < 𝑛𝑒𝑤_𝑐𝑎𝑟𝑑_𝑙ℎ then
Õ𝑛1 𝑛1
Õ 33 𝑖𝑡 ← 𝑖𝑡 + 1
𝑆𝑆𝐸𝑈 𝑉 = ||𝑈𝑖 − 𝑉¯ || 2 = ||(𝑈𝑖 − 𝑈¯ ) + (𝑈¯ − 𝑉¯ )|| 2 34 else
𝑖=1 𝑖=1 35 𝑚𝑒𝑟𝑔𝑒𝑑 ← 𝐹𝑎𝑙𝑠𝑒
𝑛1
Õ 𝑛1
Õ
= ||𝑈𝑖 − 𝑈¯ || 2 + 2 (𝑈𝑖 − 𝑈¯ ) (𝑈¯ − 𝑉¯ ) + 𝑛 1 ||𝑈¯ − 𝑉¯ || 2
𝑖=1 𝑖=1 3.2.2 PROPOSED MERGE ALGORITHMS. It cannot be objec-
But: tively confirmed that two clusters naturally correspond to the
𝑛1
Õ 𝑛1
Õ 𝑛1
Õ same cluster (i.e., they must form a single cluster) because this
(𝑈𝑖 − 𝑈¯ ) = 𝑈𝑖 − 𝑈¯ = 𝑛 1𝑈¯ − 𝑛 1𝑈¯ = 0 confirmation implies the use of clustering related concepts such
𝑖=1 𝑖=1 𝑖=1 as close, similar or complementary whose definition is also sub-
So 𝑆𝑆𝐸𝑈 𝑉 = 𝑆𝑆𝐸𝑈 + 𝑛 1 ||𝑈¯ − 𝑉¯ || 2 . If we repeat the same cal- jective. Consequently, we focus on the detection of clusters that
culation for the second term 𝑆𝑆𝐸𝑃𝑉 then we obtain Equation k-means could detect (i.e., clusters close to strict sphericity) but
6. also less spherical clusters as present in real data. So our goal is,
first, to capture natural clusters as defined in Section 1. Secondly,
τ
to identify these clusters even if they overlap with each other. 𝑟𝑎𝑡𝑖𝑜 < (9)
𝜖
As a result, a natural cluster is a cluster with a certain sphericity 𝑎𝑝𝑑 (𝐿𝐻 ) | |𝑥 .𝐺−𝑦.𝐺 | |
and at the same time it has low connectivity with another cluster where τ = | |𝑚𝑎𝑥𝑒𝑠−𝑚𝑖𝑛𝑠 | | , 𝑟𝑎𝑡𝑖𝑜 = | |𝑚𝑎𝑥𝑒𝑠−𝑚𝑖𝑛𝑠 | | and 𝜖 ∈ N∗ .
if it overlaps with it. By connectivity between two clusters we Criterion 1 is based on the proximity measure. It checks whether
mean of how close two clusters are in form. If two clusters of dif- two hyper-rectangles are close enough to be possibly considered
ferent shapes overlap considerably, they could be considered as mergeable. It is based on the average of the pairwise distances
two separate clusters. The distance between clusters alone is not between centroids (𝑎𝑝𝑑) to approximate the average distance
enough. Note that considering only the distance between clusters between clusters of a node without calculating the distances
to conclude that two clusters form a single cluster could lead between all point pairs of all clusters. The pairwise distances
to the following situation: if two clusters that do not naturally reduces the bias brought by the very distant centroids from the
correspond to the same cluster (i.e., they have low connectivity) majority of other centroids. The values of 𝑟𝑎𝑡𝑖𝑜 and τ of the
and whose distance is zero or partially overlapping, could be inequality are normalized between 0 and 1 by the maximum
considered as a single cluster. This could perhaps contradict the length of the data space to be compared. The value of 𝜖 controls
reality of the meaning given to the data points. By taking into the boundary between the qualifiers "distant" and "close" when
account the distance and connectivity between two clusters, we referring to the distance between clusters of 𝑥 and 𝑦. If this crite-
consider that if two clusters are of the same shape and overlap rion is verified then other criteria must be verified to confirm the
only partially then they are two different clusters that could not fusion between 𝑥 and 𝑦. Note that the higher the value of 𝜖 is, the
be merged. So it is necessary to take into account the connectivity fewer hyper-rectangles validate the criterion. Moreover, if 𝜖 = 1,
between clusters in addition to distance when deciding to merge then a significant amount of hyper-rectangles will validate the
two clusters. criterion, which could lead to further tests and probably lead to
We define a set of criteria to evaluate how much two clus- a non fusion result because if two clusters are very distant then
ters correspond to the same cluster in order to decide whether there will be no fusion. The value of 𝜖 is to be adjusted according
two clusters can be merged in accordance with the definition of to the strictness level of the "enough close" notion required by
natural clusters that we have mentioned. These criteria involve the criterion including this parameter.
measures of proximity and connectivity between clusters. The
fusion algorithms 3 and 2 incorporate these criteria. criterion 2. Let 𝑥 and 𝑦 two hyper-rectangles, Δ𝐶 the distance
Note that merging several clusters signifies merging the asso- between the centers of x and y and Δ𝐺 the distance between the
ciated hyper-rectangles. It consists of calculating the weighted centroids of the x and y clusters. If Δ𝐺 < Δ𝐶 then x and y are
mean of the means of the point values contained in the hyper- possibly mergeable.
rectangles already calculated, concatenating the lists of the point Criterion 2 refines the set of hyper-rectangles from the first
indices, calculating the 𝑠𝑠𝑒 based on Equation 6 and to calculate criterion. It estimates how strong the connectivity between two
the 𝑚𝑖𝑛𝑠 and 𝑚𝑎𝑥𝑒𝑠 vectors according to Equation 4. clusters is. Knowing that a hyper-rectangle corresponds to the
Algorithm 2 (⁀local merger) allows us to merge clusters whose smallest rectangular envelope covering all the data points of
representative hyper-rectangles are included in the 𝐿𝐻 set. 𝐿𝐻 a cluster, then it could be ruled that the center of the hyper-
hyper-rectangles are associated with a list of ascending ordered rectangle would roughly correspond to the centroid of the cluster
numerical indices. Thus the indices of the first and last elements if the points of the cluster are uniformly distributed with a certain
of LH are respectively 0 and 𝑐𝑎𝑟𝑑 (𝐿𝐻 ) − 1. We consider 𝑥 to be sphericity. So the cluster with these characteristics is probably a
a temporary variable that runs through the 𝐿𝐻 elements. The al- natural cluster apart. As a result the closer the distance between
gorithm process is as follows. First we assign the first element of centroids (Δ𝐺) is to the distance between centers (Δ𝐶) the smaller
𝐿𝐻 to 𝑥. This one is now a hyper-rectangle representing a cluster. the probability of merging 𝑥 and 𝑦. This probability is considered
We then check if all the other clusters represented by hyper- null when Δ𝐺 >= Δ𝐶.
rectangles 𝑦 are mergeable with the cluster represented by 𝑥. If The first two criteria make it possible to identify relatively
𝑥 is not mergeable to any of them then the next hyper-rectangle close clusters but also to consider that two clusters are distant if
in 𝐿𝐻 is assigned to 𝑥. On the other hand, if it is mergeable to at their centroids are also distant even if the two clusters are con-
least one hyper-rectangle, then the concerned hyper-rectangles tiguous. The latter could occur if the majority of points surround
are merged resulting in a new hyper-rectangle (merging operated the centroid while a minority are far from the centroid. However,
by the 𝑚𝑒𝑟𝑔𝑒 function). In this case the first item of freshly recon- these criteria are based only on the distances applied to centroids
stituted 𝐿𝐻 is reassigned to 𝑥. This procedure is repeated until 𝑥 and centers. Two clusters could be considered as close, but one
is the second last element of 𝐿𝐻 and there are no more mergers. or more clusters could be located between them. In this case
During the fusion test between two hyper-rectangles 𝑥 and 𝑦, they cannot be merged directly. They could only be if at least
several criteria, formed from inequalities and equations, must be one cluster between them merges with them. The following two
verified. The first two criteria only select the hyper-rectangles criteria require stricter distances and stronger connectivity to
candidates 𝑦 for merger and at the same time they reduce the avoid directly merging two nearby clusters separated by other
number of hyper-rectangles to be processed in the other two clusters.
criteria. While the last two decide whether to merge 𝑥 and 𝑦 or
criterion 3. Let be 𝑥 and 𝑦 two hyper-rectangles. We merge
not.
𝑥 and 𝑦 when Criteria 1 and 2 are verified and if 𝑚𝑖𝑛𝐻 (𝑥, 𝑦) = 0
criterion 1. Let 𝑚𝑖𝑛𝑠 and 𝑚𝑎𝑥𝑒𝑠 the limits of a defined data and Δ𝐺 < Δ𝐶 ∗ 𝜆 with 𝜆 ∈]0; 1]
space ∈ R𝑑 and 𝐿𝐻 the list of hyper-retangles belonging to this Criterion 3 only deals with contiguous or overlapping hyper-
space then two clusters represented by the hyper-rectangles 𝑥 ∈ 𝐿𝐻 rectangles. Indeed 𝑚𝑖𝑛𝐻 (𝑥, 𝑦) returns 0 if the two hyper-rectangles
and 𝑦 ∈ 𝐿𝐻 are possibly mergeable if: overlap or if their distance is zero. Also, a connectivity constraint
is added. It results in a constraint on the proximity between cen- children are closer to each other than to the other child’s clusters.
troids: for two hyper-rectangles 𝑥 and 𝑦 the distance between So we process the clusters of a child node locally but in the data
centroids must be less than Δ𝐶 ∗𝜆. The smaller 𝜆, the more the no- space of the parent node. Hence the proposal of the global merger
tion of "close enough" between clusters to the point of matching algorithm. It performs the merging of clusters node by node
the same cluster is strict. using the local merger algorithm. It starts first with the children
of the "internal" node and then concatenates the hyper-rectangles
criterion 4. Let 𝑥 and 𝑦 two hyper-rectangles and 𝜖 ∈ N∗ . We
of the both children to form the list of hyper-rectangles of the
merge 𝑥 and 𝑦 when Criteria 1 and 2 are verified and if 𝑒𝑣𝑜 <=
internal node. Finally, local merger is applied to internal. Before
𝑡ℎ_𝑒𝑣𝑜 and 𝑚𝑖𝑛𝐻 (𝑥, 𝑦) normalized is less than τ/𝜖 with 𝑒𝑣𝑜 the
the children node is processed, the 𝑎𝑝𝑑 function is normalized
sum of the squared errors (𝑠𝑠𝑒) of the 𝑥 and 𝑦 data points union.
by the maximum length of the parent node hyper-rectangle. This
Unlike Criterion 3, to validate Criterion 4 hyper-rectangles normalization allows to have the same distance ratio in the three
are not necessary to be contiguous but a maximum distance is nodes (𝑖𝑛𝑡𝑒𝑟𝑛𝑎𝑙,𝑔𝑟𝑒𝑎𝑡𝑒𝑟 and 𝑙𝑒𝑠𝑠). On the other hand, the value
required. The normalized distance 𝑚𝑖𝑛𝐻 (𝑥, 𝑦) must be less than of 𝑎𝑝𝑑 changes from one node to another because it depends
τ/𝜖. It may be that two clusters are not contiguous but almost only on the clusters of the concerned node. Consequently, if
when the distance is as small as it does not allow a cluster to two clusters have not been merged into one of the two children
interpose itself between two clusters concerned by the calcu- nodes, they could be merged into the parent node. Note that the
lations of the criterion. In addition to this, the criterion adds a processing of the children nodes can even be parallelized because
requirement on the compactness of 𝑥 and 𝑦 clusters. To do this, they are independent.
it uses the homogeneity measure called sum of square error (𝑠𝑠𝑒).
In this criterion we have the choice to limit the quantity of 𝑠𝑠𝑒 for
3.3 CLUSTER REGULARIZATION
possible mergers. Indeed, two clusters are merged only if the 𝑠𝑠𝑒
of their data is less than the limit is 𝑡ℎ_𝑒𝑣𝑜. Its value is entered by This subsection discusses the strategy we have defined to identify
the user. However, 𝑡ℎ_𝑒𝑣𝑜 is adapted according to the value of Δ𝑘 clusters that should not be as such but rather be part of another
provided Equation 10. The latter evaluates the difference between cluster. The algorithms of the previous step could produce small
the value of 𝑘𝑚𝑎𝑥 and the total sum of the values of k specific to (in number of points) but compact clusters. A small cluster is
Í
each leaf ( 𝑘 ϑ ). The purpose of this adaptation is to approach located in three cases: either it is a natural cluster different from
the estimated value k on the entire dataset at 𝑘𝑚𝑎𝑥 and avoid an the others, or it is part of an agglomeration of small clusters
overestimation of 𝑘 compared to 𝑘𝑚𝑎𝑥 . If 𝛿𝑘 >= 1 it means 𝑘 ϑ
Í that represent a single cluster, or it is part of a large cluster. The
is at least equal to 𝑘𝑚𝑎𝑥 and therefore 𝑡ℎ_𝑒𝑣𝑜 will be unchanged. definition of a small cluster has no objective purpose. We consider
Í
Otherwise, if 𝑘𝑚𝑎𝑥 > 𝑘 ϑ then 𝑡ℎ_𝑒𝑣𝑜 is recalculated according the definition of the small cluster resulting from the work [1] as
√
Equation 11. a cluster with a number of points less than 𝑛 with 𝑛 the size of
𝑘𝑚𝑎𝑥 the cluster.
𝛿𝑘 = 1 − Í (10) We have developed an algorithm that makes it possible to
𝑘ϑ
( match a small cluster to one of the three cases. The goal of the
𝑡ℎ_𝑒𝑣𝑜, if 𝛿𝑘 >= 1 algorithm is to get as many natural clusters as possible. Knowing
𝑡ℎ_𝑒𝑣𝑜 = (11) that the final result of our solution is in the root of the tree
𝑡ℎ_𝑒𝑣𝑜 + 𝛿𝑘, otherwise
then this algorithm is unrolled only in this one. The algorithm
consists of two parts that are executed consecutively once. The
Algorithm 3: Global merger first identifies the small clusters and then merges those that are
input :𝑖𝑛𝑡𝑒𝑟𝑛𝑎𝑙,𝑡ℎ_𝑒𝑣𝑜 close to each other. The second re-identifies small clusters from
output : 𝐿𝐻 the new list of small clusters from the first part and affects those
Ð that are close to large clusters. If a small cluster has not undergone
1 𝐿𝐻 ← 𝑖𝑛𝑡𝑒𝑟𝑛𝑎𝑙 .𝑙𝑒𝑠𝑠.𝐿𝐻 𝑖𝑛𝑡𝑒𝑟𝑛𝑎𝑙 .𝑔𝑟𝑒𝑎𝑡𝑒𝑟 .𝐿𝐻 a merger operation in both parts then it will be considered as a
2 𝑚𝑖𝑛𝑠, 𝑚𝑎𝑥𝑒𝑠 ← 𝑔𝑒𝑡𝐵𝑜𝑢𝑛𝑑𝑠 (𝐿𝐻 ) separate natural cluster.
3 𝑙𝑒𝑛𝑔ℎ𝑡 ← ||𝑚𝑎𝑥𝑒𝑠 − 𝑚𝑖𝑛𝑠 || One of the reasons for the presence of small clusters is due to
4 if card(internal.greater.LH)>1 then Criteria 3 and 4 where the limit of the minimum distance between
5 τ ← 𝑎𝑝𝑑 (𝑖𝑛𝑡𝑒𝑟𝑛𝑎𝑙 .𝑔𝑟𝑒𝑎𝑡𝑒𝑟 .𝐿𝐻 )/𝑙𝑒𝑛𝑔ℎ𝑡 hyper-rectangles is very strict. Indeed, if the minimum distance
6 𝑖𝑛𝑡𝑒𝑟𝑛𝑎𝑙 .𝑔𝑟𝑒𝑎𝑡𝑒𝑟 .𝐿𝐻 ← between two hyper-rectangles, at least one of which is a small
𝑙𝑜𝑐𝑎𝑙𝑀𝑒𝑟𝑔𝑒𝑟 (𝑖𝑛𝑡𝑒𝑟𝑛𝑎𝑙 .𝑔𝑟𝑒𝑎𝑡𝑒𝑟 .𝐿𝐻, τ , 𝑙𝑒𝑛𝑔ℎ𝑡, 𝑡ℎ_𝑒𝑣𝑜) cluster, is greater than this limit, then they cannot be merged.
This limit is respectively equal to 0 in Criterion 3 and τ /𝜖 in
7 if card(internal.less.LH)>1 then Criterion 4. The τ /𝜖 limit is more flexible than the first one, it
8 τ ← 𝑎𝑝𝑑 (𝑖𝑛𝑡𝑒𝑟𝑛𝑎𝑙 .𝑙𝑒𝑠𝑠.𝐿𝐻 )/𝑙𝑒𝑛𝑔ℎ𝑡 was used in both parts of the algorithm to define the boundary
9 𝑖𝑛𝑡𝑒𝑟𝑛𝑎𝑙 .𝑙𝑒𝑠𝑠.𝐿𝐻 ← between "close" and "distant" regarding the distance between
𝑙𝑜𝑐𝑎𝑙𝑀𝑒𝑟𝑔𝑒𝑟 (𝑖𝑛𝑡𝑒𝑟𝑛𝑎𝑙 .𝑙𝑒𝑠𝑠.𝐿𝐻, τ , 𝑙𝑒𝑛𝑔ℎ𝑡, 𝑡ℎ_𝑒𝑣𝑜) hyper-rectangles.
Ð
10 𝐿𝐻 ← 𝑖𝑛𝑡𝑒𝑟𝑛𝑎𝑙 .𝑙𝑒𝑠𝑠.𝐿𝐻 𝑖𝑛𝑡𝑒𝑟𝑛𝑎𝑙 .𝑔𝑟𝑒𝑎𝑡𝑒𝑟 .𝐿𝐻
11 τ ← 𝑎𝑝𝑑 (𝐿𝐻 )/𝑙𝑒𝑛𝑔ℎ𝑡 4 EXPERIMENTAL ASSESSMENTS
12 𝐿𝐻 ← 𝑙𝑜𝑐𝑎𝑙𝑀𝑒𝑟𝑔𝑒𝑟 (𝐿𝐻, τ , 𝑙𝑒𝑛𝑔ℎ𝑡, 𝑡ℎ_𝑒𝑣𝑜)
In this section, we carried out experiments to study the perfor-
mance of our solution, to demonstrate its effectiveness compared
The global merger fusion algorithm can be seen as a layer to the algorithms listed in Section 2(x-means and g-means) and
that envelops the algorithm local merger. Intuitively, for a given to give recommendations on the values that the parameters of
internal and parent node, a good part of the clusters of one of its our solution could take (the minimum size of an internal node
ψ and limit 𝑡ℎ_𝑒𝑣𝑜 of Equation 11). In the experiments, real data • clusters do not have the same number of points. The car-
and synthetic data were used. dinalities are chosen at random so that their sum is equal
to the total number of points 𝑛 defined previously.
4.1 EXPERIMENTAL PROTOCOL 4.1.2 REAL DATA. The real data comes from three known
Our solution and the compared algorithms were implemented sources: Openml, UCI and Kaggle. They are all of a numerical
in python 3.6. All of them use the Elkan version of k-means [6]. type. These data are used in other research projects to perform
It corresponds to an exact version of standard k-means but is benchmarking machine learning methods [20]. The values of real
suitable for massive data in execution time. Moreover, k-means++ 𝑘, 𝑛 and 𝑑 are given in Table 2. The data sets used are diverse and
[2] is used for initializing the centroids before applying Elkan they can be organized into several groups:
algorithm. • black and white or grayscale images; clustering is ap-
Different values are assigned to parameters 𝑘,𝑑, 𝑛, 𝑡ℎ_𝑒𝑣𝑜 and plied in this case directly to the pixels. 𝐸𝑚𝑛𝑖𝑠𝑡 and 𝐹𝑎𝑠ℎ𝑖𝑜𝑛−
ψ to study the sensitivity of our algorithm to these parameters. 𝑚𝑛𝑖𝑠𝑡 contain images respectfully on the first 10 digits and
All combinations of the values of these parameters have been clothing and shoes. Each image was flattened to form a
tested: single vector. And each vector has been assigned its corre-
• 𝑘 ∈ [10, 32, 80] sponding class (label);
• 𝑑 ∈ [4, 6, 8] • 4-band multispectral images; they characterize differ-
• 𝑛 ∈ [1 × 106, 2 × 106, 4 × 106 ] ent types of soil. The corresponding dataset is 𝑠𝑎𝑡𝑖𝑚𝑎𝑔𝑒;
• 𝑡ℎ_𝑒𝑣𝑜 ∈ [0.05, 0.10, 0.15, 0.20, 0.25, 0.30] • sounds; the 𝐽𝑎𝑝𝑎𝑛𝑒𝑠𝑒𝑉 𝑜𝑤𝑒𝑙𝑠 dataset is a set of digitized
• ψ ∈ [4, 8, 10, 12] Japanese vowel sounds. Each sound is associated with a
speaker;
An experiment is structured as follows:
• historization of people’s activities; the 𝑙𝑑𝑝𝑎 dataset is
• first of all a combination of the above-mentioned parame- a collection of the positions of sensors present on people
ters is defined. Parameter 𝑘 represents the real number of in order to identify the movement they perform at a given
clusters in a dataset. It should be noted that parameters 𝑘, time. The 𝑤𝑎𝑙𝑘𝑖𝑛𝑔 dataset is a set of people’s activities
𝑑 and 𝑛 above refer only to synthetic data. The values of designed to determine the authors;
these parameters are defined in the characteristics of the • images represented by characteristics extracted from
real datasets, the object to be identified; for example, in the 𝐹𝑜𝑢𝑟𝑖𝑒𝑟
• then follows the data generation in the case of synthetic dataset each record is a set of coefficients of the charac-
data, otherwise the real data is loaded, ter (one of the first 10 digits) shapes; similarly in 𝑍𝑒𝑟𝑛𝑖𝑘𝑒
• finally, the estimation of 𝑘 is performed by the three algo- each instance of a digit is of rotation invariant Zernike
rithms. moments of the digit (between 0 and 9); in 𝑃𝑒𝑛𝑑𝑖𝑔𝑖𝑡𝑠 each
For each combination of these parameters the experiment is individual is a positions sequence characterizing a digit;
conducted five times. in 𝐿𝑒𝑡𝑡𝑒𝑟 each instance of a letter of the English alpha-
We set the values of 𝜖 and 𝜆 in the different criteria as follows: bet contains statistical moments and edge counts of the
given letter; in 𝑉 𝑒ℎ𝑖𝑐𝑙𝑒 dataset a data point is just a set of
• 𝜖 has been defined as 2 and 10 respectively in Criteria 1 geometric characteristics of a given vehicle.
and 4. The value 2 is the least strict while 10 is the most
strict; 4.1.3 METRICS FOR EVALUATING EXPERIMENTAL RESULTS.
• in the cluster regularization algorithm, 𝜖 = 8 in the first To evaluate the results of the algorithms we are comparing, three
part of the algorithm. If there are still small clusters left, metrics were used. First, the execution time (Δ𝑡) of the algorithm,
then the second part is executed. In this case 𝜖 = 4; the relative difference (Δ𝑘) between the actual value of k and the
• in Criterion 3, 𝜆 = 0.75. As a result, Δ𝐺 must be less than estimated value of k and finally the distance between the actual
75% of Δ𝐶. partitioning and the partitioning proposed by the algorithm.
The relative difference is calculated as follows:
We allow our algorithm and x-means to search up to 𝑘𝑚𝑎𝑥 =
|𝑘 − 𝑘𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒𝑑 |
5𝑘 centroids. Δ𝑘 = 𝑟𝑒𝑎𝑙 (12)
𝑘𝑟𝑒𝑎𝑙
4.1.1 SYNTHETIC DATA. In order to get closer to the real The variation of the information (𝑣𝑖) [15] was used as a dis-
data, natural and overlapping clusters were generated. These tance between two partitionings. It measures the amount of infor-
synthetic data have the following properties: mation gained and lost for a dataset passing from a partition A to
• a cluster is a set of data that follows a normal distribution. another partition B. It respects the three properties of triangular
To generate this set, a semi positive covariance matrix and inequality. So the smaller the 𝑣𝑖 is, the closer the partitionings
an average (a vector) are generated at random. Indeed, the are to each other.
values of the covariance matrix allow to define a cluster
shape that is not isotropic (strictly spherical) and that can 4.2 RUNTIME ANALYSIS
have different variances separately on an arbitrary basis of If we consider the performance of the three algorithms according
directions and not necessarily on those of the dimensions; to the metric Δ𝑡, our algorithm takes the least time to provide
• there are at least two overlapping clusters; a clustering result. This is true in both synthetic and real data
• if the centroid of one cluster is in the hyperectangle of and regardless of the value of 𝑘, 𝑑 and 𝑛. Our algorithm is 5.5 to
another cluster then we consider that the maximum degree 12.9 times faster than g-means and 15.4 to 75.2 times faster than
of overlap has been exceeded. So as a result at least one of x-means in synthetic data, respectively. The same trend occurs
the clusters concerned is replaced by a new cluster; in real data, but from 16.3 to 1566.6 times compared to g-means
KD-means G-means X-means
k d n k(Δ𝑘 ) vi Δ𝑡 k(Δ𝑘 ) vi Δ𝑡 k(Δ𝑘 ) vi Δ𝑡
10 4 1 × 106 8.4(0.17) 0.4 39.7 74.0(6.4) 2.2 344.3(8.7) 45.7(3.57) 2.5 816.2(20.6)
2 × 106 8.3(0.21) 0.5 65.4 91.0(8.1) 2.4 774.1(11.8) 45.7(3.57) 2.5 1420.5(21.7)
4 × 106 8.3(0.18) 0.5 117.4 87.0(7.7) 2.3 1517.3(12.9) 44.6(3.46) 2.4 2203.6(18.8)
6 1 × 106 8.6(0.15) 0.4 41.9 69.2(5.92) 2.2 396.4(9.5) 46.0(3.6) 2.3 1101.5(26.3)
2 × 106 8.1(0.2) 0.5 72.9 89.0(7.9) 2.5 926.9(12.7) 46.1(3.61) 2.3 1869.9(25.7)
4 × 106 8.0(0.23) 0.5 133.7 79.1(6.91) 2.2 1628.0(12.2) 45.9(3.59) 2.3 3061.4(22.9)
8 1 × 106 7.4(0.39) 0.8 45.9 63.2(5.32) 2.6 386.0(8.4) 45.9(3.59) 1.5 789.4(17.2)
2 × 106 6.9(0.37) 0.8 81.2 71.3(6.13) 2.8 823.1(10.1) 45.5(3.55) 1.6 1253.5(15.4)
4 × 106 6.0(0.41) 0.9 153.6 78.9(6.89) 2.8 1727.2(11.2) 46.2(3.62) 1.6 2378.8(15.5)
32 4 1 × 106 35.4(0.38) 0.9 77.6 99.3(2.1) 1.9 535.6(6.9) 148.4(3.64) 2.9 2817.2(36.3)
2 × 106 31.4(0.35) 0.8 132.5 114.1(2.57) 2.0 1152.8(8.7) 147.6(3.61) 2.9 4831.0(36.5)
4 × 106 27.3(0.36) 1.0 233.7 132.1(3.13) 2.1 2468.4(10.6) 145.2(3.54) 2.9 7628.1(32.6)
6 1 × 106 40.2(0.47) 0.9 79.2 101.0(2.16) 1.8 586.7(7.4) 150.2(3.69) 2.7 3689.9(46.6)
2 × 106 34.6(0.44) 0.9 146.9 122.5(2.83) 2.1 1375.6(9.4) 149.1(3.66) 2.7 7880.7(53.7)
4 × 106 29.1(0.44) 1.0 260.2 107.9(2.37) 1.7 2485.1(9.5) 135.9(3.25) 2.7 7901.8(30.4)
8 1 × 106 35.7(0.36) 1.0 82.2 115.2(2.6) 2.2 665.7(8.1) 149.0(3.66) 1.9 2491.6(30.3)
2 × 106 37.2(0.42) 1.0 154.9 126.8(2.96) 2.3 1383.3(8.9) 148.1(3.63) 1.8 3740.0(24.2)
4 × 106 36.4(0.46) 0.9 299.5 136.5(3.27) 2.6 2883.7(9.6) 133.8(3.18) 1.8 5037.7(16.8)
80 4 1 × 106 90.5(0.24) 0.9 141.5 144.1(0.8) 1.5 781.9(5.5) 368.2(3.6) 3.3 7807.2(55.2)
2 × 106 88.3(0.29) 1.0 283.2 170.4(1.13) 1.8 1673.6(5.9) 373.1(3.66) 3.3 14147.0(50.0)
4 × 106 83.0(0.3) 1.2 570.8 186.9(1.34) 2.0 3417.8(6.0) 366.6(3.58) 3.4 21887.1(38.3)
6 1 × 106 96.2(0.37) 0.9 148.3 167.0(1.09) 1.6 893.0(6.0) 376.1(3.7) 3.2 10385.8(70.1)
2 × 106 97.4(0.39) 0.9 285.4 186.2(1.33) 1.8 1892.4(6.6) 375.0(3.69) 3.2 22403.6(78.5)
4 × 106 92.8(0.4) 1.1 578.0 200.4(1.51) 1.7 4039.2(7.0) 363.3(3.54) 2.9 37716.6(65.3)
8 1 × 106 57.5(0.28) 1.2 151.5 207.8(1.6) 2.0 1016.0(6.7) 377.2(3.71) 2.2 7834.6(51.7)
2 × 106 98.6(0.43) 0.9 308.2 150.2(0.88) 1.4 1708.6(5.5) 373.6(3.67) 3.2 21824.6(70.8)
4 × 106 99.2(0.47) 1.0 626.4 166.9(1.09) 1.5 3581.8(5.7) 377.7(3.72) 3.2 47130.8(75.2)
Table 1: Comparison of the three algorithms (our KD-means algorithm, g-means and x-means) executed on synthetic data. Note that Δ𝑡
is expressed in seconds. In the Δ𝑡 sub-column of the g-means and x-means columns, the number in brackets represents how many times
our algorithm is faster than the compared algorithm.

KD G-means X-means
dataname k d n k(Δ𝑘 ) vi Δ𝑡 k(Δ𝑘 ) vi Δ𝑡 k(Δ𝑘 ) vi Δ𝑡
vehicle 4 18 1 × 106 3.0(0.28) 2.9 38.6 2226(555.5) 11.8 60530.8(1566.6) 16.0(3.0) 5.3 281.5(7.3)
satimage 6 36 1 × 106 28.9(3.81) 2.7 54.5 1177(195.17) 8.8 34595.7(635.3) 28.7(3.78) 3.8 1151.0(21.1)
japaneseVowels 9 14 1 × 106 9.4(0.83) 4.5 75.6 3377(374.22) 12.5 55220.1(730.5) 32.0(2.56) 7.7 534.8(7.1)
pendigits 10 16 1 × 106 29.1(1.91) 3.6 31.0 2574(256.4) 9.1 40806.3(1316.9) 32.0(2.2) 4.0 759.2(24.5)
fourier 10 76 1 × 106 39.2(2.92) 3.0 91.6 1309(129.9) 8.0 38210.3(417.0) 57.3(4.73) 3.9 2791.2(30.5)
fashion-mnist 10 784 70000 10.9(0.11) 3.3 58.6 533(52.3) 7.1 954.1(16.3) 32.0(2.2) 4.0 697.0(11.9)
ldpa 11 5 164860 32.5(1.96) 6.3 12.5 1766(159.55) 10.8 5996.6(480.7) 47.2(3.3) 7.1 132.3(10.6)
walking 22 4 149332 39.6(1.35) 6.7 14.7 1257(56.14) 10.0 3254.1(220.8) 100.7(3.58) 8.4 93.7(6.4)
letter 26 16 999999 59.7(1.3) 7.2 57.8 1581(59.81) 11.1 22502.6(389.5) 117.0(3.5) 8.6 6272.4(108.6)
zernike 47 47 1 × 106 86.9(1.08) 5.3 118.1 1157(23.62) 9.0 35710.8(302.3) 128.0(1.72) 6.4 5583.1(47.3)
emnist 62 784 697932 45.6(0.26) 6.0 948.6 3073(48.56) 8.6 46224.8(48.7) 256.0(3.13) 6.7 36029.7(38.0)
Table 2: Comparison of the three algorithms executed on real data.

and from 6.4 to 108.6 times compared to x-means. Maximum presence of many gaussian clusters that are well separated from
execution times were reached for all three algorithms at 𝑑 = 8, each other. KD-means has a better estimate of good quality than
𝑛 = and 𝑘 = 80 in the synthetic data. In this combination, the the others. In terms of 𝑣𝑖, it is on average 2.46 and 3.05 better
elapsed execution times are 10.43 minutes for our algorithm, than g-means and x-means. Same performance in Δ𝑘, better on
59.41 minutes for g-means and 13h05 for x-means respectively. average by 2.83 (g-means) and 4.29 (x-means). It could be pointed
In the real data, the maximums are reached in the configuration out that g-means is better than x-means in terms of quality (𝑣𝑖)
𝑑 = 784, 𝑛 = 697932 and 𝑘 = 62 for x-means and KD-means. That in several combinations of (𝑘, 𝑑, 𝑛).
is 15.48 minutes for KD-means and 10 hours for x-means. In the In real data, the KD maxima do not exceed Δ𝑘 = 3.81 and
case of g-means, the maximum is even higher than that of the 𝑣𝑖 = 7.2. They are higher for x-means and g-means, they reach
others, it is reached at 16.48 hours (𝑘 = 4, 𝑑 = 18, 𝑛 = 1 × 106 ). respectively (Δ𝑘 = 4.73, 𝑣𝑖 = 8.6) and (Δ𝑘 = 555.5, 𝑣𝑖 = 12.5). The
values of Δ𝑘 and 𝑣𝑖 are higher in real data than in synthetic data.
4.3 CLUSTER QUALITY ANALYSIS (Δ𝑘/𝑣𝑖) Indeed, the distributions of clusters in real data are more complex
than synthetic clusters (gaussian). As a result, these complex rep-
KD-means is better than other algorithms in estimating k with
resentations of clusters are difficult to capture completely point
good clustering quality in synthetic and real data.
by point. This increase is much higher for g-means because it
In the synthetic data, KD-means does not exceed 0.47 in Δ𝑘
tends to identify gaussian clusters in particular. However, clus-
and 1.2 in 𝑣𝑖. X-means even reaches Δ𝑘 = 3.72 and 𝑣𝑖 = 3.4. The
ters in real data are not necessarily gaussian because they are
same for g-means with Δ𝑘 = 8.1 and 𝑣𝑖 = 2.8. The decrease in
not constituted by gaussian distribution generators. As a result,
Δ𝑘 g-means when 𝑘 increased in the synthetic data is due to the
Synthetic data Real data for scaling up and overlapping clusters. We plan to improve our
th_evo Δ𝑘 vi Δ𝑡 Δ𝑘 vi Δ𝑡 solution in several areas:
0.05 0.3 0.8 213.2 1.7 4.9 132.5 • implement a strategy that manages nested indexing. Be-
0.10 0.3 0.8 203.1 1.6 4.8 144.5 cause the indices of a node are also in the parent node;
0.15 0.3 0.8 198.2 1.5 4.7 134.0
0.20 0.3 0.8 193.5 1.4 4.7 141.1 • use random kd-tree and multiple kd-tree in case the num-
0.25 0.3 0.9 188.6 1.2 4.5 136.5 ber of dimensions is even larger to have more very good
0.30 0.4 1.0 183.8 1.3 4.6 130.5
performance in searching for information stored in kd-
Table 3: Performance of our algorithm as a function of the value tree or to have the best data separation. This performance
of 𝑡ℎ_𝑒𝑣𝑜 will help our solution to make better decisions on cluster
Synthetic data Real data merging.
ψ Δ𝑘 vi Δ𝑡 Δ𝑘 vi Δ𝑡
ACKNOWLEDGMENTS
4 0.5 1.1 195.0 1.3 4.2 131.1
8 0.3 0.8 201.5 1.2 4.4 141.3 These studies are supported by the ANRT and Capgemini funding
10 0.3 0.7 200.0 1.4 4.8 136.1 under CIFRE-Capgemini partnership.
12 0.3 0.8 190.5 1.9 5.3 137.5
Table 4: Performance of our algorithm as a function of the value REFERENCES
of ψ [1] Nir Ailon, Yudong Chen, and Huan Xu. 2013. Breaking the Small Cluster
Barrier of Graph Clustering. CoRR abs/1302.4549 (2013). arXiv:1302.4549
g-means makes an execessive overestimation of k (therefore Δ𝑘 [2] David Arthur and Sergei Vassilvitskii. 2007. k-means++: the advantages of
careful seeding. In ACM-SIAM symposium on Discrete algorithms. 1027–1025.
high) compared to KD. The KD-means algorithm is the least arXiv:1212.1121
affected by the gaussianity of clusters because in its operation [3] Jon Louis Bentley. 1975. Multidimensional binary search trees used for as-
gaussianity is not explicitly sought. sociative searching. Commun. ACM 18 (September 1975), 509–517. Issue 9.
https://doi.org/10.1145/361002.361007
[4] Gianna M. Del Corso. 1997. Estimating an Eigenvector by the Power Method
4.3.1 PARAMETER SENSITIVITY ANAlYSIS. We analyze the with a Random Start. SIAM J. Matrix Anal. Appl. 18, 4 (Oct. 1997), 913–937.
three metrics (Δ𝑡, 𝑣𝑖 and Δ𝑘) according to 𝑡ℎ_𝑒𝑣𝑜 and ψ. This is https://doi.org/10.1137/S0895479895296689
done in both data contexts (real and synthetic). [5] T. Cover and P. Hart. 2006. Nearest Neighbor Pattern Classification. IEEE Trans.
Inf. Theor. 13, 1 (Sept. 2006), 21–27. https://doi.org/10.1109/TIT.1967.1053964
In Tables 3 and 4, 𝑡ℎ_𝑒𝑣𝑜 and ψ do not have a particular in- [6] Charles Elkan. 2003. Using the Triangle Inequality to Accelerate K-means. In
fluence on the execution time of KD-means. Indeed, the range Proceedings of the Twentieth International Conference on International Confer-
ence on Machine Learning (ICML’03). AAAI Press, 147–153.
(difference between the maximum and minimum values) of Δ𝑡 [7] Aislan G. Foina, Judit Planas, Rosa M. Badia, and Francisco Javier Ramirez-
does not exceed 10.2 seconds with respect to ψ. It is about 30 Fernandez. 2011. P-means, a parallel clustering algorithm for a heterogeneous
seconds for 𝑡ℎ_𝑒𝑣𝑜. If we take into account the Δ𝑡 magnitude of multi-processor environment. In Proceedings of the 2011 International Con-
ference on High Performance Computing and Simulation, HPCS 2011. IEEE,
the three algorithms, these differences are negligible. 239–248.
In the synthetic data, concerning ψ, the values of Δ𝑘 are iden- [8] E.W. Forgy. 1965. Cluster analysis of multivariate data: efficiency versus
tical except for ψ = 4 where the value is slightly higher than the interpretability of classifications. Biometrics (1965).
[9] Guojun Gan, Chaoqun Ma, and Jianhong Wu. 2007. Data Clustering: The-
others by 0.2. The values of 𝑣𝑖 have a maximum difference of 0.4. ory, Algorithms, and Applications (ASA-SIAM Series on Statistics and Applied
This difference is 0.1 when ψ ∈ [8, 10, 12] for which 𝑣𝑖 is minimal. Probability). Society for Industrial and Applied Mathematics.
[10] Greg Hamerly and Charles Elkan. 2003. Learning the K in K-means. In Pro-
At ψ = 4 Δ𝑘 is slightly higher by 0.3 compared to the rest. In real ceedings of the 16th International Conference on Neural Information Processing
data, the range is 1.1 for 𝑣𝑖 but at 0.2 if ψ ∈ [4, 8]. This range is Systems (NIPS’03). MIT Press, 281–288.
0.7 for Δ𝑘 but drops to 0.2 for ψ ∈ [4, 8, 10]. [11] Anil K. Jain. 2010. Data clustering: 50 years beyond K-means. Pattern Recog-
nition Letters 31, 8 (jun 2010), 651–666.
Concerning 𝑡ℎ_𝑒𝑣𝑜 in the synthetic data, the values of Δ𝑘 and [12] A K Jain, M N Murty, and P. J. Flynn. 1999. Data clustering: a review.
𝑣𝑖 have a range of 0.2. In real data, the respective ranges of Δ𝑘 [13] Chuanren Liu, Tianming Hu, Yong Ge, and Hui Xiong. 2012. Which distance
and 𝑣𝑖 are 0.5 and 0.4. They drop to 0.2 when ψ ∈ [0.20, 0.25, 0.30] metric is right: An evolutionary k-means view. In Proceedings of the 12th SIAM
International Conference on Data Mining, SDM 2012.
for Δ𝑘 and ψ ∈ [0.15, 0.20, 0.25, 0.30] for 𝑣𝑖. [14] Songrit Maneewongvatana and David M. Mount. 1999. It’s Okay to Be Skinny,
The observed ranges of the two metrics 𝑣𝑖 and Δ𝑘 are lower If Your Friends Are Fat. In Center for Geometric Computing 4th Annual Work-
shop on Computational Geometry.
than the values of the same metrics of the competitors algo- [15] Marina Meilă. 2003. Comparing Clusterings by the Variation of Information.
rithms in the real and synthetic data. But if we are stricter by In Learning Theory and Kernel Machines, Bernhard Schölkopf and Manfred K.
just accepting ranges less than or equal to 0.2 then the optimal Warmuth (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 173–187.
values, when the data are gaussian, are ψ ∈ [8, 10, 12] and 𝑡ℎ_𝑒𝑣𝑜
[16] Dau Pelleg and Andrew Moore. 2000. X-means: Extending K-means with
Efficient Estimation of the Number of Clusters. In In Proceedings of the 17th
taking all the values tested. When the data is real (not neces- International Conf. on Machine Learning. Morgan Kaufmann, 727–734.
sarily gaussian clusters), the optimal values are ψ ∈ [4.8] and [17] D T Pham, S S Dimov, and C D Nguyen. 2005. Selection of K in K-means
clustering. Proceedings of the Institution of Mechanical Engineers, Part C: Journal
𝑡ℎ_𝑒𝑣𝑜 ∈ [0.20, 0.25, 0.30]. of Mechanical Engineering Science 219, 1 (jan 2005), 103–119.
[18] Hanan Samet. 2005. Foundations of Multidimensional and Metric Data Structures.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
5 CONCLUSION AND PERSPECTIVES [19] M. A. Stephens. 1974. EDF Statistics for Goodness of Fit and Some Comparisons.
The estimation of the number of clusters (𝑘), that k-means must J. Amer. Statist. Assoc. 69, 347 (1974), 730–737.
[20] Jan N. van Rijn, Geoffrey Holmes, Bernhard Pfahringer, and Joaquin Van-
find, is a major problem in the literature. And it is more significant schoren. 2014. Algorithm Selection on Data Streams. In Discovery Science, Sašo
when data is massive and clusters overlap. We provided a solution, Džeroski, Panče Panov, Dragi Kocev, and Ljupčo Todorovski (Eds.). Springer
International Publishing, 325–336.
based on kd-tree, that automatically estimates the value of 𝑘 on [21] Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component
massive and high dimensional data. It is robust even if clusters analysis. Chemometrics and Intelligent Laboratory Systems 2, 1 (1987), 37 –
overlap. Four criteria were defined to guide the procedure for 52. Proceedings of the Multivariate Statistical Workshop for Geologists and
Geochemists.
estimating k. Through experiments, we have shown that our [22] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: An Efficient
solution is very competitive compared to the known x-means Data Clustering Method for Very Large Databases. SIGMOD Record 25, 2 (jun
and g-means algorithms, which have proven to be unsuitable 1996), 103–114.