<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>KD-means: clustering method for massive data based on kd-tree</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Nabil El Malki</string-name>
          <email>nabil.el-malki@capgemini.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franck Ravat</string-name>
          <email>franck.ravat@irit.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Teste</string-name>
          <email>olivier.teste@irit.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Capgemini, Université de Toulouse</institution>
          ,
          <addr-line>UT2J, IRIT(CNRS/UMR5505), Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université de Toulouse</institution>
          ,
          <addr-line>UT1, IRIT(CNRS/UMR 5505), Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Université de Toulouse</institution>
          ,
          <addr-line>UT2J, IRIT(CNRS/UMR 5505), Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>K-means clustering is a popular unsupervised classification algorithm employed in several domains, e.g., imaging, segmentation, or compression. Nevertheless, the number of clusters k, fixed a priori, strongly affects the clustering quality. Current state-of-the-art k-means implementations can automatically set the number of clusters. However, they incur unreasonable processing time when classifying large volumes of data. In this paper, we propose a novel solution based on the kd-tree to determine the number of clusters k in the context of massive data, for the preprocessing stage of data science projects or for near-real-time applications. We demonstrate how our solution outperforms current solutions in terms of clustering quality and processing time on massive data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Up to now, data clustering, also known as cluster analysis, has
been one of the most important tasks in exploratory data
analysis. It is also applied in a variety of applications, e.g., web page
clustering, pattern recognition, image segmentation, data
compression and nearest neighbor search [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Various clustering
algorithms have been available since the early 1950s. The goal of
data clustering is to classify a set of patterns, points or objects
into groups known as clusters. Data within a group are as similar
as possible to one another, and different groups are as dissimilar
as possible from one another [
        <xref ref-type="bibr" rid="ref11">11</xref>
]. In this paper, we consider one
of the most widely used data clustering algorithms, i.e., k-means
[
        <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
        ]. Due to its simplicity and understandability, this algorithm
has been popular for more than four decades [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ][
        <xref ref-type="bibr" rid="ref13">13</xref>
]. Given a
set of points $X = \{x_1, \ldots, x_n\}$, where each point is defined in the
d-dimensional space $\mathbb{R}^d$, and an integer $k$, k-means aims to
partition $X$ into a set of clusters $C = \{c_1, \ldots, c_k\}$ by minimizing the
sum of squared euclidean distances between the points (data) within each cluster:

$$\sum_{i=1}^{k} \sum_{\forall x \in c_i} ||x - \mu_i||^2 \qquad (1)$$

where $\mu_i$ is the centroid of the cluster $c_i$ and $||.||$ the euclidean
distance. Minimizing this function requires an iterative process
that ends when centroids do not change between two iterations.
This process consists of two steps:
• assigning each point to the nearest centroid using the
euclidean distance;
• recalculating the centroids, i.e., the average of the point
values, of the newly formed clusters.
      </p>
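      <p>As a minimal illustration of these two steps (our sketch in numpy, not the implementation used in the paper), one such iteration can be written as:</p>
      <preformat>
import numpy as np

def kmeans_step(X, centroids):
    """One k-means iteration: assign points, then recompute centroids."""
    # Step 1: assign each point to its nearest centroid (euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: recompute each centroid as the mean of its assigned points.
    new_centroids = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return labels, new_centroids
      </preformat>
      <p>Iterating kmeans_step until the centroids no longer move minimizes Equation 1 locally.</p>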
      <p>
Disadvantageously, k-means requires that the user should
predefine the value of k. Setting an inaccurate value of
k has a direct negative impact on the clustering quality. Different
solutions have been proposed to estimate the relevant number of
clusters k [
        <xref ref-type="bibr" rid="ref10 ref16 ref17">10, 16, 17</xref>
]. Among the limitations of these solutions,
we highlight three major ones that challenge their normal process:
• they unroll k-means several times on the same detailed
data. However, k-means has difficulty scaling because
it has a time complexity proportional to nkdi, with i
the number of iterations [
        <xref ref-type="bibr" rid="ref11">11</xref>
]. Consequently, in a context
of massive data, the repetitive access to data caused by the
repetitive use of k-means on the same data makes them
computationally expensive or unusable for
applications that require near-real-time responses;
• they hardly support overlapping clusters. This problem can
lead to two different cases: i) overestimating the number
of clusters exaggeratedly compared to the actual number,
thus increasing the processing time significantly; ii)
underestimating the value of k, thus producing a very poor
clustering quality;
• if we consider clusters that approach the spherical form,
which k-means identifies, then these k-estimation
solutions tend to capture clusters of this form only. In real
data sets, however, there are also spherical clusters that are
more or less elongated, approaching the elliptical shape
and its variants. These clusters can also have different
orientations. We consider a cluster to be strictly spherical if it
has a perfectly or almost spherical shape. Similarly,
a cluster is not strictly (or less) spherical when it has a shape
that tends towards the elliptical. All these clusters of
different sphericity will be called natural clusters. Formally,
when a cluster has a strictly spherical data distribution,
the variances of each data dimension are equal. If the
cluster has a non-strict sphericity, then the dimensions
have different variances. Statistically, these clusters can
be assimilated to gaussian distributions.
      </p>
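      <p>To make this sphericity definition concrete, a quick numpy check (an illustration of ours, not part of the method) compares per-dimension variances:</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)
spherical = rng.normal(0.0, 1.0, size=(10000, 3))    # equal variances
elongated = spherical * np.array([1.0, 1.0, 4.0])    # one stretched dimension

for name, cluster in [("spherical", spherical), ("elongated", elongated)]:
    var = cluster.var(axis=0)
    # Strictly spherical: per-dimension variances are (nearly) equal.
    print(name, var, np.allclose(var, var.mean(), rtol=0.1))
      </preformat>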
      <p>
In this paper, we propose a solution to these problems. We
introduce an algorithm, based on the kd-tree [
        <xref ref-type="bibr" rid="ref3">3</xref>
], that
estimates the number of clusters k for massive data. This estimation
is performed in reasonable processing time, making it
possible to integrate such a solution into data-intensive applications,
e.g., near-real-time applications, or the preprocessing stage of
a data analysis project. Our solution is an algorithm, based on
the kd-tree (k-dimensional tree), which hierarchically structures data
aggregates, e.g., clusters of data points and their corresponding
metadata, at different levels of precision. Furthermore,
we define several new cluster merge criteria to support
overlapping and natural clusters more efficiently.
      </p>
      <p>
We compare our solution to known methods of estimating k
from the literature (x-means [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and g-means [
        <xref ref-type="bibr" rid="ref10 ref7">7, 10</xref>
        ]). We show that
our solution is faster than current methods while ensuring better
clustering quality on massive data.
      </p>
      <p>This paper is organized as follows. In Section 2, we study the
state-of-the-art solutions. Next, in Section 3, we introduce our
contribution. In Section 4, we compare our solution with those of
our competitors on data sets containing up to 4 million points.
Finally, we conclude in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>BACKGROUND</title>
      <p>
For several decades, the problem of estimating the number of
clusters k has remained significant in the literature. Several studies
have been conducted to prevent the user from defining an
inexact value of k, which leads to counterintuitive results
disparate from those of the expected value of k. We identify two types of
clustering solutions close to k-means. Some integrate the
estimation of the value of k. Others do not, as in the case of Birch [
        <xref ref-type="bibr" rid="ref22">22</xref>
].
The latter is a hierarchical algorithm based on a micro-clustering
approach. It consists of two main phases. The first tends to
produce many small spherical clusters without having to specify
the number to be produced. The second uses another clustering
algorithm to perform macro-clustering. We have not retained it
in our study because the second phase requires an algorithm to be
applied to the representatives of the small clusters (usually their
centroids). The algorithms most used in the second phase and
suited to the first phase are often k-means or hierarchical
agglomerative clustering [
        <xref ref-type="bibr" rid="ref12">12</xref>
] to which the value of k should be given
a priori. We focus on two algorithms specialized in the automatic
estimation of k: x-means [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and g-means [
        <xref ref-type="bibr" rid="ref10 ref7">7, 10</xref>
        ].
      </p>
      <p>
        X-MEANS. X-means [
        <xref ref-type="bibr" rid="ref16">16</xref>
] is a k-means-based algorithm that
searches for the set of centroids that best fits the data.
Starting from a minimum k (kmin), generally equal to one, the
algorithm iteratively adds new centroids as necessary until the
maximum value of k (kmax) is reached. At each iteration, the
algorithm identifies the subset of centroids that must be divided
in two. This identification is done using a statistical criterion
called the bayesian information criterion (BIC), which is used for
model selection and is based on the likelihood function.
      </p>
      <p>The x-means process consists of two operations that are
iterated until completion: Improve_Params and Improve_Structure.
Improve_Params is simply the execution of k-means with the
current k as parameter. Improve_Structure searches where new
centroids must be added. For each centroid and its associated
region of points, three sub-operations follow each other:
• the corresponding BIC is calculated;
• k-means is executed with k=2, yielding two child
centroids;
• the BIC of the same region is calculated again, taking into
account the two child centroids instead of their parent
centroid.</p>
      <p>If the former BIC is smaller than the latter BIC, then the parent
centroid is replaced by its two child centroids; otherwise the parent
centroid is retained.</p>
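      <p>A simplified sketch of this split decision (our assumption: each region is modelled as an isotropic gaussian, rather than reproducing the exact BIC formula of the x-means paper):</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans

def bic_isotropic(X, centroids, labels):
    """BIC of a spherical-gaussian mixture fit; higher is better here."""
    n, d = X.shape
    k = len(centroids)
    sse = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(k))
    variance = max(sse / (d * max(n - k, 1)), 1e-12)
    log_lik = -0.5 * n * d * np.log(2 * np.pi * variance) - 0.5 * sse / variance
    n_params = k * (d + 1)                      # centroids + mixing weights
    return log_lik - 0.5 * n_params * np.log(n)

def should_split(region):
    """True if two child centroids fit the region better than their parent."""
    parent_bic = bic_isotropic(region, region.mean(axis=0)[None, :],
                               np.zeros(len(region), dtype=int))
    km = KMeans(n_clusters=2, n_init=10).fit(region)
    child_bic = bic_isotropic(region, km.cluster_centers_, km.labels_)
    return child_bic &gt; parent_bic
      </preformat>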
      <p>
        G-MEANS. This algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
] wraps k-means and adopts
almost the same approach as x-means. Instead of the BIC criterion,
a gaussianity test (at a confidence level α defined by the
user) on clusters is used to decide whether to split a centroid
in two.
      </p>
      <p>It starts from a set of centroids of size kmin and increases
the set size during the process. The value of kmin is initialized to
1 because usually we have no a priori knowledge of the possible
values of k. In this case, the first centroid corresponds to the
arithmetic mean of the values of all data points. A list is defined
to contain the centroids whose surrounding sets of points are
gaussian, so that they are not processed in subsequent iterations.</p>
      <p>
At each iteration, the centroids not stored in the list, called
parent centroids, are each divided into two child centroids c1 and
c2. This division is operated via principal component analysis
(PCA). The PCA method has a time complexity of O(nd² + d³) [
        <xref ref-type="bibr" rid="ref21">21</xref>
]
with d the data point dimension and n the size of the dataset. The
authors therefore recommended using accelerated versions of PCA, such
as PCA based on the power method [
        <xref ref-type="bibr" rid="ref4">4</xref>
]. The child centroids
are then refined through the execution of 2-means (k=2). Then
the gaussianity test, via the Anderson-Darling test [
        <xref ref-type="bibr" rid="ref19">19</xref>
] (which only
works on points defined in one dimension), is performed on the
points around the parent centroid. To do this, the points around
the parent centroid are first projected on the vector v = c1 − c2
linking the two child centroids, to obtain points defined on a
single dimension. If these points follow a gaussian distribution,
then the centroid is added to the list of centroids that will no
longer be processed in the following iterations. Otherwise, it is
replaced by its two child centroids. The value assigned to α
by the authors [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is 0.0001 to test their solution on synthetic
and real data.
      </p>
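      <p>A compact sketch of this projection-and-test step (assuming scipy's Anderson-Darling implementation as a stand-in; scipy only tabulates critical values down to the 1% level, not α = 0.0001):</p>
      <preformat>
import numpy as np
from scipy.stats import anderson

def looks_gaussian(points, c1, c2, level_idx=4):
    """Project points on the vector linking the two child centroids, then
    run the Anderson-Darling normality test on the 1-d projection."""
    v = c1 - c2
    projected = points @ v / np.dot(v, v)   # 1-d coordinates along v
    result = anderson(projected, dist='norm')
    # Gaussian if the statistic stays below the chosen critical value.
    return result.statistic &lt; result.critical_values[level_idx]
      </preformat>
      <p>In the algorithm, a centroid passing this test is frozen; a failing one is replaced by its two refined children.</p>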
      <p>LIMITATIONS. The limitations of the above methods are
related to the difficulty of dealing with overlapping clusters, the
quality of the clustering (and therefore also the value of k) they
produce, as well as the execution time needed to provide a result.</p>
      <p>
The two methods are of different computational complexity,
and each can identify only a limited number of more or less
spherical cluster types. X-means is suitable for data whose clusters
are strictly spherical [
        <xref ref-type="bibr" rid="ref10">10</xref>
]. If other forms of clusters are present,
then the estimate of k could be too high. On the other hand,
g-means can identify clusters whose spherical shape is less
strict than those identified by x-means. If the clusters are well
spaced apart, g-means can provide the relevant value of k [
        <xref ref-type="bibr" rid="ref10">10</xref>
].
In addition to this spacing, the strict sphericity requirement
must be added to the clusters for x-means to achieve the same
performance. If the clusters overlap, then neither of them estimates
the correct value of k. Under these conditions, x-means and g-means
tend to overestimate k.
      </p>
      <p>These cases are illustrated by Figure 1. Five gaussian
clusters have been generated, with more or less different
sphericity. Two clusters are well separated from each other; the
other three overlap. All three algorithms were run on this set of
20,000 data points. The kmax has been set to 140. G-means identified 26
clusters, and it detected the 2 clusters well separated from the others.
However, there is overfitting on the remaining three clusters: an
estimation of 24 clusters where 3 would suffice.
X-means estimated k at 76. Overestimation is slightly higher for
overlapping clusters than for separate clusters.</p>
      <p>In terms of execution time, the two algorithms are not suitable
when the data are massive, and this is even more accentuated when
the estimated value of k tends to be large.</p>
    </sec>
    <sec id="sec-3">
      <title>CONTRIBUTION</title>
      <p>In this section, we define our solution, which addresses the
problem of estimating k in a context of massive data and
overlapping clusters. Figure 2 illustrates the different steps of our
solution.</p>
      <p>The solution is composed of two main parts: i) the storage and
organization of the n data points provided by the user in a data structure,
and ii) the processing done on this structure to estimate k
(merging task and cluster regularization).</p>
      <p>
        We opted for the kd-tree[
        <xref ref-type="bibr" rid="ref3">3</xref>
] data structure to fulfill the roles
of storage and data organization. A kd-tree is a binary tree that
represents a hierarchical subdivision of space using splitting
planes that are orthogonal to the coordinate dimensions (do not
confuse the k of kd-tree, which is just the dimension of the data
stored in the tree, with the k of k-means, which corresponds to the
number of clusters). In Subsection 3.1 we discuss the advantages
of the kd-tree, and the method that builds the kd-tree in 3.1.2.
      </p>
      <p>Concerning the processing part, we first estimate
the number of clusters in each of the leaves (see 3.1.4). This
operation results, in each leaf, in the constitution of several clusters.
Then the clusters of the nodes are merged recursively from the
leaves to the root according to rules defined in Subsection 3.2.2.
These rules are built to manage overlapping clusters. The final
step is a regularization phase (Subsection 3.3). It decides whether
a small cluster is a separate cluster or should be merged with
another cluster. This phase avoids an overestimation of k
due to small clusters. Note that the nodes are processed in
post-order, i.e., each node is processed after all of
its children have been processed. This order avoids exploring a node
unnecessarily (a time-consuming task because it involves conditional
tests) when it cannot yet be processed because its children
have not been processed.</p>
    </sec>
    <sec id="sec-4">
      <title>KD-TREE</title>
      <p>
        Kd-tree puts points that are spatially close together in the same
node. This property is exploited by several machine learning
methods to accelerate calculations related to point space. One
of the best known cases is the k-nearest neighbors algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
This property is advantageous to us in two ways. First,
it partly addresses the goal of k-means, which is to group
the most similar points possible together in the same cluster.
Second, it provides an optimized spatial subdivision of the space to
accelerate data processing. The kd-tree recursively splits the whole dataset
into partitions, in a manner similar to a decision tree acting on
numerical datasets [
        <xref ref-type="bibr" rid="ref9">9</xref>
]. The root node represents the whole data space, while
the leaves are the smallest partitions of the data space.
A node is associated with several closed rectangular regions of
space, called hyper-rectangles. Traditionally, in the literature, the
hyper-rectangle is only used to represent the subspace that the
node covers. In addition, we will also use it for other
purposes in 3.1.1 because of its advantages in terms of calculation
and representation of sets.
      </p>
      <p>3.1.1 HYPER-RECTANGLE. Formally, it corresponds to the
smallest possible rectangle (generalized to d dimensions) covering
all the data points of a set. It is defined by the following equation:

$$H = \{x \mid L_j \leqslant x_j \leqslant U_j \ \forall j\} \qquad (2)$$

where L and U respectively represent the lower and
upper limits of the hyper-rectangle. Indeed, L (resp. U) is a
vector of the minimum (resp. maximum) values over each
dimension of the data set (see Figure 3).</p>
      <p>In our case, we consider a hyper-rectangle as a geometric
object approximating the overall spatial distribution of the set of
points it contains. In other words, each of the clusters
of a node is geometrically represented by a hyper-rectangle.
In our solution, this representation saves time on
the calculations that involve sets (a set being a cluster of points). For
example, to calculate a distance between two sets or perform
a set operation involving at least two sets, their corresponding
hyper-rectangles can be used. Instead of visiting all
the data points of the sets, it is only necessary to use the L and
U vectors of the hyper-rectangles for the above calculations.
This saves a lot of time on large amounts of data.</p>
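      <p>A minimal hyper-rectangle sketch (our own illustration of Equation 2, with names of our choosing):</p>
      <preformat>
import numpy as np

class HyperRectangle:
    """Smallest axis-aligned box covering a set of points (Equation 2)."""
    def __init__(self, points):
        self.L = points.min(axis=0)   # lower limit, one value per dimension
        self.U = points.max(axis=0)   # upper limit, one value per dimension

    def union(self, other):
        """Box covering both boxes, computed without revisiting any point."""
        box = object.__new__(HyperRectangle)
        box.L = np.minimum(self.L, other.L)
        box.U = np.maximum(self.U, other.U)
        return box
      </preformat>
      <p>Set-level operations then touch only the 2d values of L and U instead of every point, which is the source of the time savings mentioned above.</p>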
      <p>3.1.2 SPLITTING METHOD. The partitioning of the data space
of an internal (i.e., non-leaf) node is performed based
on two cutting elements that must be specified: a data
dimension (ds) of the node space and a value of this dimension
(vs). The points whose value in dimension ds is less
(resp. greater) than vs are assigned to the child
node called less (resp. greater). This process
is carried out recursively from the root to the leaves.</p>
      <p>
Several rules for splitting nodes have been proposed to choose the
optimal dimension and associated value for a correct data separation.
Among them, we opted for the sliding midpoint splitting rule [
        <xref ref-type="bibr" rid="ref18">18</xref>
]
because it provides a better data organization than other
classical rules [
        <xref ref-type="bibr" rid="ref14">14</xref>
]. This advantage over the others is explained by the fact
that it does not produce empty nodes or nodes whose data
space is very sparse (i.e., a very large space, especially at the
sides, compared to the data it represents, when it should be small).
In addition, the classical rules choose the median as the cutting
value vs, while the sliding midpoint splitting rule chooses the
middle of the points ((L + U)/2), which is less expensive. The
dimension ds chosen is the one with the longest side (U − L).
      </p>
      <p>Note that theoretically there is no guarantee on the limit of
the depth a kd-tree can have; trees can be very deep (so
the tree construction time increases). As a result, we have
added two stopping conditions to avoid deep trees. These two
conditions are checked at the beginning of the method that
partitions a node in two. If one of the conditions is verified,
then the node is not divided and is considered a leaf:
• the depth of the leaf is limited to O(log n), to save
construction time compared to the time taken if the depth
is not limited. The root has a depth of 1;
• each node is limited to a minimum number of points, ψ.</p>
      <p>It is not interesting, in a context of massive data, to have
leaves containing a number of points close to one. This
would result in a very long kd-tree construction time and
therefore make our solution unsuitable for near-real-time
applications. In addition, if the number of points
of a node is less than 2ψ, then it is considered
a leaf. Indeed, two cases are possible if this limit is
not set. Either both children are leaves, because their
numbers of points (sizes) are less than ψ; or the size of
one is greater than ψ and the other is a leaf.
In the latter case, a difference in depth occurs, which
brings a certain imbalance to the tree. This condition limits
the number of small leaves, limits the depth of the tree,
and contributes to the balance of the tree.</p>
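      <p>The following sketch (our reading of the construction procedure, with hypothetical names) combines the midpoint choice of the cutting elements with the two stopping conditions:</p>
      <preformat>
import numpy as np

def build(points, indices, depth, max_depth, psi):
    """Recursively split a node; return a leaf when a stop condition holds."""
    # Stop conditions: depth budget reached, or too few points to split.
    if depth &gt;= max_depth or len(indices) &lt; 2 * psi:
        return {"leaf": True, "indices": indices}
    sub = points[indices]
    L, U = sub.min(axis=0), sub.max(axis=0)
    ds = int(np.argmax(U - L))          # longest side of the hyper-rectangle
    vs = (L[ds] + U[ds]) / 2.0          # midpoint rather than the median
    mask = sub[:, ds] &lt; vs
    return {
        "leaf": False, "ds": ds, "vs": vs,
        "less": build(points, indices[mask], depth + 1, max_depth, psi),
        "greater": build(points, indices[~mask], depth + 1, max_depth, psi),
    }
      </preformat>
      <p>Only point indices are stored in the nodes, matching the node components described next; the "sliding" part of the rule (moving vs onto the nearest point when one side would be empty) is omitted for brevity.</p>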
      <p>3.1.3 NODE COMPONENTS. Each node contains a set of
hyper-rectangles. We call this set LH. Each hyper-rectangle is
associated with a list of indices of the data points that it
represents, the arithmetic mean of these data, and the sum of the
squared differences (SSE) between each point and the arithmetic
mean. The data points are not stored in the tree but are accessed
via the indices. The internal nodes additionally hold the splitting
dimension index ds as well as the associated value vs.</p>
      <p>3.1.4 INITIALIZATION OF LEAVES. During the tree building,
and more precisely during the instantiation of a leaf, we
estimate the number of clusters present in the subset of the data
represented by this leaf (see Algorithm 1). At the same time, the
clustering of these data is carried out, which results in a set of
clusters. When clusters are close to each other or overlap, they
could lead to an exaggerated overestimation of the value of k. For
this reason, the value of k is limited to a maximum value (kmax_leaf),
so that g-means (the capped version of g-means invoked in
Algorithm 1) cannot return more than a limited number of
clusters. This number is called ki, with i ∈ N* the i-th iteration of
the capped g-means. Indeed, at each iteration i we test whether
ki &gt;= kmax_leaf. The value of ki can reach up to 2^i; this case
only occurs if all the centroids have been split in two because
none of them was evaluated as gaussian. Note that even if the total
number of clusters over all leaves exceeds k, our merging algorithms
will bring this number towards the real k of the dataset.</p>
      <p>Let nl be the number of leaves of the kd-tree; kmax_leaf is
then defined as follows:

$$kmax\_leaf = \begin{cases} \sqrt{kmax}, &amp; \text{if } nl \geqslant \sqrt{kmax} \\ kmax/nl, &amp; \text{otherwise} \end{cases} \qquad (3)$$</p>
      <p>If we assume that in Equation 3 only the second condition
exists, i.e., kmax_leaf is always equal to kmax/nl, then
kmax_leaf would tend towards 1 or 0. Indeed, for a given value of
kmax, the greater nl is, the more kmax_leaf tends towards 1 as
long as nl &lt;= kmax. When nl &gt; kmax, kmax_leaf tends towards 0.
These cases could make the solution ineffective because they would
underestimate the number of clusters in each leaf and, by extension,
underestimate the number of clusters in the whole dataset. The first
condition is essential to maintain a balance between the
underestimation of k that the second condition alone would cause and the
overestimation of k that would occur with no condition limiting the
value of k.</p>
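      <p>Equation 3 is a one-liner in code (names ours):</p>
      <preformat>
import math

def kmax_leaf(kmax, n_leaves):
    """Per-leaf cap on k: floor the ratio kmax/n_leaves at sqrt(kmax)."""
    root = math.sqrt(kmax)
    return root if n_leaves &gt;= root else kmax / n_leaves
      </preformat>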
      <p>After the estimation of k, each cluster of data points is
represented by a hyper-rectangle as well as the associated aggregated
values (the arithmetic mean G and the sum of squared differences
between the points and G, i.e., the SSE).</p>
      <p>Algorithm 1: InitializeLeaf
input : leaf data points, kmax_leaf, α
output : LH
1 k, centroids, clusters ← capped g-means on the leaf data
2 for i = 1 to k do
3   LH[i].mean ← centroids[i]
4   LH[i].indices ← indices of the points of clusters[i]
5   LH[i].SSE ← SSE(clusters[i])
6   LH[i].L, LH[i].U ← min and max vectors of clusters[i]</p>
      <p>At the end of node initialization, clusters are obtained in each
node that are considered subclusters of the final clusters. The
latter are located in the root and are considered the closest to the
real clusters. The merging of clusters is necessary to achieve the
final clusters. In this subsection, Algorithms 3 and 2 for merging
clusters are presented. But first, we list the functions and
definitions that are used in these algorithms.</p>
      <p>Definition 3.1. Let LH be a set of hyper-rectangles. The U
(resp. L) vector associated with LH is the maximum (resp.
minimum), over each dimension, of the U (resp. L) vectors
of the hyper-rectangles of LH:

$$maxmin(LH) = \begin{cases} U = \max(\{h.U \mid h \in [1; card(LH)]\}) \\ L = \min(\{h.L \mid h \in [1; card(LH)]\}) \end{cases} \qquad (4)$$</p>
      <p>Definition 3.2. Let P = {(i, j) | i, j ∈ N ∩ [1; card(LH)] and i ≠ j}
be a list of index pairs of hyper-rectangles of a node. We consider
(i, j) = (j, i), so P contains only one of them. The
average of the pairwise distances is calculated by the following
function:

$$apd(P) = \frac{\sum_{(i,j) \in P} ||LH_i.mean - LH_j.mean||}{card(P)} \qquad (5)$$</p>
      <p>Lemma 3.3. Given two datasets X ∈ R^d and Y ∈ R^d, let n1 and n2
be respectively the number of points contained in X and Y. If the
sums of squared errors of X and Y are as follows:

$$SSE_X = \sum_{i=1}^{n_1} ||x_i - \bar{x}||^2, \qquad SSE_Y = \sum_{i=1}^{n_2} ||y_i - \bar{y}||^2$$

then the sum of squared errors of Z = X ∪ Y is:

$$SSE_Z = SSE_X + SSE_Y + n_1 ||\bar{x} - \bar{z}||^2 + n_2 ||\bar{y} - \bar{z}||^2 \qquad (6)$$

where z̄ can be computed as the average of the weighted averages:

$$\bar{z} = \frac{n_1 \bar{x} + n_2 \bar{y}}{n_1 + n_2} \qquad (7)$$

Proof. We are trying to calculate the sum of two SSEs:

$$SSE_Z = \underbrace{\sum_{i=1}^{n_1} ||x_i - \bar{z}||^2}_{SSE_{XZ}} + \underbrace{\sum_{i=1}^{n_2} ||y_i - \bar{z}||^2}_{SSE_{YZ}}$$

Let us focus on the first term of this equation:

$$SSE_{XZ} = \sum_{i=1}^{n_1} ||(x_i - \bar{x}) + (\bar{x} - \bar{z})||^2 = \sum_{i=1}^{n_1} ||x_i - \bar{x}||^2 + 2(\bar{x} - \bar{z})^\top \sum_{i=1}^{n_1} (x_i - \bar{x}) + n_1 ||\bar{x} - \bar{z}||^2$$

But:

$$\sum_{i=1}^{n_1} (x_i - \bar{x}) = \sum_{i=1}^{n_1} x_i - n_1 \bar{x} = n_1 \bar{x} - n_1 \bar{x} = 0$$

So SSE_XZ = SSE_X + n1 ||x̄ − z̄||². Repeating the same
calculation for the second term SSE_YZ, we obtain Equation
6.</p>
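      <p>Lemma 3.3 is what allows two clusters to be merged from their aggregates alone; a direct transcription (our sketch):</p>
      <preformat>
import numpy as np

def merge_stats(n1, mean1, sse1, n2, mean2, sse2):
    """Merged size, mean (Eq. 7) and SSE (Eq. 6) of the union of two
    clusters, computed without touching the underlying points."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    sse = (sse1 + sse2
           + n1 * np.sum((mean1 - mean) ** 2)
           + n2 * np.sum((mean2 - mean) ** 2))
    return n, mean, sse
      </preformat>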
      <p>The minimum distance between two hyper-rectangles x and y
is calculated as follows:

$$mindist(x, y) = ||\max(\vec{0}, \; \max(x.L - y.U, \; y.L - x.U))|| \qquad (8)$$

where the function max() returns, dimension by dimension, the
maximum of the vectors given as parameters.</p>
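      <p>In code, Equation 8 reads (our sketch):</p>
      <preformat>
import numpy as np

def mindist(xL, xU, yL, yU):
    """Minimum euclidean distance between two axis-aligned boxes;
    0 if they touch or overlap (Equation 8)."""
    gap = np.maximum(0.0, np.maximum(xL - yU, yL - xU))
    return float(np.linalg.norm(gap))
      </preformat>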
      <p>Algorithm 2: Local merger
input : LH, τ, h, th_evo
output : LH</p>
      <p>3.2.2 PROPOSED MERGE ALGORITHMS. It cannot be
objectively confirmed that two clusters naturally correspond to the
same cluster (i.e., that they must form a single cluster), because this
confirmation involves clustering-related notions such as
close, similar or complementary, whose definitions are also
subjective. Consequently, we focus on the detection of the clusters that
k-means can detect (i.e., clusters close to strict sphericity), but
also of less spherical clusters, as present in real data. Our goal is,
first, to capture natural clusters as defined in Section 1 and, second,
to identify these clusters even if they overlap with each other.
As a result, a natural cluster is a cluster with a certain sphericity
which, at the same time, has low connectivity with any cluster
it overlaps. By connectivity between two clusters we
mean how close the two clusters are in shape. If two clusters of
different shapes overlap considerably, they can still be considered
two separate clusters; the distance between clusters alone is not
enough. Note that considering only the distance between clusters
to conclude that two clusters form a single cluster could lead
to the following situation: two clusters that do not naturally
correspond to the same cluster (i.e., they have low connectivity),
but whose distance is zero or which partially overlap, would be
considered a single cluster. This could contradict the
real meaning given to the data points. By taking into
account both the distance and the connectivity between two clusters, we
consider that if two clusters are of the same shape and overlap
only partially, then they are two different clusters that must not
be merged. So it is necessary to take the connectivity
between clusters into account, in addition to the distance, when
deciding to merge two clusters.</p>
      <p>We define a set of criteria to evaluate how much two
clusters correspond to the same cluster, in order to decide whether
they can be merged in accordance with the definition of
natural clusters given above. These criteria involve
measures of proximity and connectivity between clusters. The
fusion Algorithms 3 and 2 incorporate these criteria.</p>
      <p>Note that merging several clusters means merging the
associated hyper-rectangles. It consists of calculating the weighted
mean of the already-computed means of the point values contained in
the hyper-rectangles, concatenating the lists of point
indices, calculating the SSE based on Equation 6, and calculating
the L and U vectors according to Equation 4.</p>
      <p>Algorithm 2 (local merger) merges clusters whose
representative hyper-rectangles are included in the LH set. The
hyper-rectangles of LH are associated with a list of ascending
numerical indices; the indices of the first and last elements
of LH are respectively 0 and card(LH) − 1. We consider T to be
a temporary variable that runs through the LH elements. The
algorithm proceeds as follows. First, we assign the first element of
LH to T. This is now a hyper-rectangle representing a cluster.
We then check whether all the other clusters represented by
hyper-rectangles of LH are mergeable with the cluster represented by T. If
T is not mergeable with any of them, then the next hyper-rectangle
in LH is assigned to T. On the other hand, if it is mergeable with at
least one hyper-rectangle, then the concerned hyper-rectangles
are merged, resulting in a new hyper-rectangle (the merging is
operated by the merge function). In this case the first item of the
freshly reconstituted LH is reassigned to T. This procedure is repeated
until T is the second-to-last element of LH and there are no more mergers.
During the merge test between two hyper-rectangles x and y,
several criteria, formed from inequalities and equations, must be
verified. The first two criteria only select the hyper-rectangle
candidates y for the merger, and at the same time they reduce the
number of hyper-rectangles to be processed by the other two
criteria, while the last two decide whether to merge x and y or
not.</p>
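      <p>A schematic transcription of this loop (our sketch; mergeable() stands for the four criteria below and merge() for the hyper-rectangle merge described above):</p>
      <preformat>
def local_merger(LH, mergeable, merge):
    """Repeatedly merge the head hyper-rectangle with any mergeable peer,
    restarting from the head of the rebuilt list after each merge."""
    i = 0
    while i &lt; len(LH) - 1:
        head = LH[i]
        partners = [h for h in LH[i + 1:] if mergeable(head, h)]
        if partners:
            merged = merge([head] + partners)
            LH = [merged] + [h for h in LH
                             if h is not head and all(h is not p for p in partners)]
            i = 0          # restart from the freshly reconstituted list
        else:
            i += 1         # head merges with nothing; move on
    return LH
      </preformat>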
      <p>criterion 1. Let L and U be the limits of a defined data
space ∈ R^d, and LH the list of hyper-rectangles belonging to this
space. Two clusters represented by the hyper-rectangles x ∈ LH
and y ∈ LH are possibly mergeable if:

$$d_{xy} &lt; \frac{\tau}{s} \qquad (9)$$

where τ = apd(P)/||U − L||, d_xy = ||x.mean − y.mean||/||U − L||,
and s ∈ N*.</p>
      <p>Criterion 1 is based on the proximity measure. It checks whether
two hyper-rectangles are close enough to be possibly considered
mergeable. It relies on the average of the pairwise distances
between centroids (apd) to approximate the average distance
between the clusters of a node without calculating the distances
between all point pairs of all clusters. The pairwise averaging
reduces the bias brought by centroids that are very distant from the
majority of other centroids. The values d_xy and τ in the
inequality are normalized between 0 and 1 by the maximum
length of the data space, ||U − L||. The value of s controls
the boundary between the qualifiers "distant" and "close" when
referring to the distance between the clusters of x and y. If this
criterion is verified, then other criteria must be verified to confirm
the fusion of x and y. Note that the higher the value of s, the
fewer hyper-rectangles validate the criterion. Conversely, if s = 1,
a significant number of hyper-rectangles validate the
criterion, which leads to further tests that often conclude with a
non-fusion, because two very distant clusters will not be merged
anyway. The value of s is to be adjusted according
to the strictness of the "close enough" notion required by
the criterion.</p>
      <p>criterion 2. Let x and y be two hyper-rectangles, Δc the distance
between the centers of x and y, and Δm the distance between the
centroids of the x and y clusters. If Δm &lt; Δc, then x and y are
possibly mergeable.</p>
      <p>Criterion 2 refines the set of hyper-rectangles from the first
criterion. It estimates how strong the connectivity between two
clusters is. Knowing that a hyper-rectangle corresponds to the
smallest rectangular envelope covering all the data points of
a cluster, it could be ruled that the center of the
hyper-rectangle roughly corresponds to the centroid of the cluster
if the points of the cluster are uniformly distributed with a certain
sphericity. A cluster with these characteristics is probably a
natural cluster of its own. As a result, the closer the distance between
centroids (Δm) is to the distance between centers (Δc), the smaller
the probability of merging x and y. This probability is considered
null when Δm &gt;= Δc.</p>
      <p>The first two criteria make it possible to identify relatively
close clusters, but also to consider two clusters as distant if
their centroids are distant, even if the two clusters are
contiguous. The latter can occur when the majority of points surround
the centroid while a minority are far from it. However,
these criteria are based only on distances between centroids
and centers. Two clusters could be considered close while one
or more clusters are located between them. In this case
they cannot be merged directly; they can only be merged if at least
one cluster between them merges with them. The following two
criteria require stricter distances and stronger connectivity to
avoid directly merging two nearby clusters separated by other
clusters.</p>
      <p>criterion 3. Let x and y be two hyper-rectangles. We merge
x and y when Criteria 1 and 2 are verified and if mindist(x, y) = 0
and Δm &lt; Δc · β, with β ∈ ]0; 1].</p>
      <p>Criterion 3 only deals with contiguous or overlapping
hyper-rectangles. Indeed, mindist(x, y) returns 0 if the two
hyper-rectangles overlap or if their distance is zero. A connectivity
constraint is also added, resulting in a constraint on the proximity
between centroids: for two hyper-rectangles x and y, the distance
between centroids must be less than Δc · β. The smaller β, the
stricter the notion of clusters "close enough" to correspond to
the same cluster.</p>
      <p>criterion 4. Let x and y be two hyper-rectangles and s ∈ N*. We
merge x and y when Criteria 1 and 2 are verified and if SSE_Z &lt;=
th_evo and mindist(x, y), normalized, is less than τ/s, with SSE_Z
the sum of the squared errors (SSE) of the union of the x and y
data points.</p>
      <p>Unlike Criterion 3, Criterion 4 does not require the
hyper-rectangles to be contiguous, but a maximum distance is
required: the normalized distance mindist(x, y) must be less than
τ/s. Two clusters may not be contiguous but almost so,
when their distance is so small that it does not allow another
cluster to interpose itself between the two clusters concerned by
the calculations of the criterion. In addition to this, the criterion
adds a requirement on the compactness of the x and y clusters. To do
this, it uses the homogeneity measure called sum of squared errors
(SSE). In this criterion we choose to limit the quantity of SSE
allowed for possible mergers: two clusters are merged only if the SSE
of their combined data is less than the limit th_evo, whose value is
provided by the user. However, th_evo is adapted according to the
value of Δk given by Equation 10. The latter evaluates the difference
between the value of kmax and the total sum of the values of k
specific to each leaf (Σ ϑ). The purpose of this adaptation is to
bring the value of k estimated on the entire dataset towards kmax
and avoid an overestimation of k compared to kmax. If Σϑ/kmax &gt;= 1,
it means Σϑ is at least equal to kmax, and therefore th_evo is
unchanged. Otherwise, i.e., if kmax &gt; Σϑ, th_evo is recalculated
according to Equation 11.</p>
      <p>$$\Delta k = 1 - \frac{\sum \vartheta}{kmax} \qquad (10)$$

$$th\_evo = \begin{cases} th\_evo, &amp; \text{if } \sum \vartheta / kmax \geqslant 1 \\ th\_evo + \Delta k, &amp; \text{otherwise} \end{cases} \qquad (11)$$</p>
      <p>Algorithm 3: Global merger
input : internal, th_evo
output : LH
1 LH ← internal.less.LH ∪ internal.greater.LH
2 L, U ← maxmin(LH)
3 h ← ||U − L||
4 if card(internal.greater.LH) &gt; 1 then
5   τ ← apd(internal.greater.LH)/h
6   internal.greater.LH ← LocalMerger(internal.greater.LH, τ, h, th_evo)
7 if card(internal.less.LH) &gt; 1 then
8   τ ← apd(internal.less.LH)/h
9   internal.less.LH ← LocalMerger(internal.less.LH, τ, h, th_evo)
10 LH ← internal.less.LH ∪ internal.greater.LH
11 τ ← apd(LH)/h
12 LH ← LocalMerger(LH, τ, h, th_evo)</p>
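      <p>The adaptation of Equations 10 and 11 reads as follows in code (our sketch; leaf_ks holds the per-leaf estimated values ϑ):</p>
      <preformat>
def adapt_th_evo(th_evo, kmax, leaf_ks):
    """Relax th_evo when the leaves jointly found fewer clusters than kmax."""
    ratio = sum(leaf_ks) / kmax
    if ratio &gt;= 1:
        return th_evo               # enough clusters found: keep th_evo
    return th_evo + (1 - ratio)     # add delta_k = 1 - ratio (Eqs. 10 and 11)
      </preformat>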
      <p>The global merger algorithm can be seen as a layer
that wraps the local merger algorithm. Intuitively, for a given
internal (parent) node, a good part of the clusters of one of its
children are closer to each other than to the other child's clusters.
So we process the clusters of a child node locally, but in the data
space of the parent node: hence the global merger
algorithm. It performs the merging of clusters node by node
using the local merger algorithm. It starts with the children
of the internal node and then concatenates the hyper-rectangles
of both children to form the list of hyper-rectangles of the
internal node. Finally, local merger is applied to the internal node.
Before the children nodes are processed, the apd function is normalized
by the maximum length of the parent node hyper-rectangle. This
normalization gives the same distance ratio in the three
nodes (internal, less and greater). On the other hand, the value
of apd changes from one node to another because it depends
only on the clusters of the concerned node. Consequently, if
two clusters have not been merged in one of the two children
nodes, they can still be merged in the parent node. Note that the
processing of the children nodes can even be parallelized because
they are independent.</p>
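      <p>The post-order traversal mentioned in Section 3 drives these merges; a minimal sketch (names ours, reusing the tree of the build sketch):</p>
      <preformat>
def merge_bottom_up(node, th_evo):
    """Process children before their parent (post-order), merging
    clusters at each internal node with the global merger."""
    if node["leaf"]:
        return node["LH"]           # clusters estimated at leaf creation
    # The two children are independent; these calls could run in parallel.
    node["less"]["LH"] = merge_bottom_up(node["less"], th_evo)
    node["greater"]["LH"] = merge_bottom_up(node["greater"], th_evo)
    return global_merger(node, th_evo)   # Algorithm 3 on this internal node
      </preformat>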
    </sec>
    <sec id="sec-5">
      <title>CLUSTER REGULARIZATION</title>
      <p>
This subsection discusses the strategy we have defined to identify
clusters that should not exist as such, but should rather be part of
another cluster. The algorithms of the previous step can produce small
(in number of points) but compact clusters. A small cluster falls
into one of three cases: either it is a natural cluster different from
the others, or it is part of an agglomeration of small clusters
that together represent a single cluster, or it is part of a large
cluster. The definition of a small cluster is not an objective one. We
adopt the definition of a small cluster resulting from the work [
        <xref ref-type="bibr" rid="ref1">1</xref>
] as
a cluster with a number of points less than √n, with n the size of
the dataset.
      </p>
      <p>We have developed an algorithm that matches
a small cluster to one of the three cases. The goal of the
algorithm is to get as many natural clusters as possible. Since
the final result of our solution lies in the root of the tree,
this algorithm is unrolled only there. The algorithm
consists of two parts that are executed consecutively, once. The
first identifies the small clusters and then merges those that are
close to each other. The second re-identifies small clusters from
the new list of small clusters produced by the first part, and assigns
those that are close to large clusters to them. If a small cluster has
not undergone a merge operation in either part, then it is considered
a separate natural cluster.</p>
      <p>One of the reasons for the presence of small clusters lies in
Criteria 3 and 4, where the limit on the minimum distance between
hyper-rectangles is very strict. Indeed, if the minimum distance
between two hyper-rectangles, at least one of which is a small
cluster, is greater than this limit, then they cannot be merged.
This limit is equal to 0 in Criterion 3 and to τ/s in
Criterion 4. The τ/s limit, being more flexible than the first one,
is used in both parts of the algorithm to define the boundary
between "close" and "distant" regarding the distance between
hyper-rectangles.</p>
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTAL ASSESSMENTS</title>
      <p>
In this section, we carry out experiments to study the
performance of our solution, to demonstrate its effectiveness compared
to the algorithms listed in Section 2 (x-means and g-means), and
to give recommendations on the values that the parameters of
our solution can take (the minimum size of an internal node,
ψ, and the limit th_evo of Equation 11). In the experiments, real data
and synthetic data were used.
Our solution and the compared algorithms were implemented
in python 3.6. All of them use the Elkan version of k-means [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
It corresponds to an exact version of standard k-means but is
suited to massive data in terms of execution time. Moreover, k-means++
[
        <xref ref-type="bibr" rid="ref2">2</xref>
] is used for initializing the centroids before applying the Elkan
algorithm.
      </p>
      <p>
Different values are assigned to parameters k, d, n, th_evo and
ψ to study the sensitivity of our algorithm to these parameters.
All combinations of the values of these parameters have been
tested:
• k ∈ [10, 32, 80]
• d ∈ [4, 6, 8]
• n ∈ [1 × 10⁶, 2 × 10⁶, 4 × 10⁶]
• th_evo ∈ [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
• ψ ∈ [4, 8, 10, 12]
An experiment is structured as follows:
• first of all, a combination of the above-mentioned
parameters is defined. Parameter k represents the real number of
clusters in a dataset. It should be noted that parameters k,
d and n above refer only to synthetic data; for real datasets,
the values of these parameters are fixed by the characteristics
of the data;
• then follows the data generation in the case of synthetic
data; otherwise the real data is loaded;
• finally, the estimation of k is performed by the three
algorithms.
      </p>
      <p>For each combination of these parameters the experiment is
conducted five times.</p>
      <p>We set the values of s and β in the different criteria as follows:
• s has been set to 2 and 10 respectively in Criteria 1
and 4. The value 2 is the least strict, while 10 is the most
strict;
• in the cluster regularization algorithm, s = 8 in the first
part of the algorithm. If there are still small clusters left,
then the second part is executed, with s = 4;
• in Criterion 3, β = 0.75. As a result, Δm must be less than
75% of Δc.</p>
      <p>We allow our algorithm and x-means to search for up to kmax =
5k centroids.</p>
      <p>4.1.1 SYNTHETIC DATA. In order to get closer to real
data, natural and overlapping clusters were generated. These
synthetic data have the following properties (see the sketch after
this list):
• a cluster is a set of data that follows a normal distribution.</p>
      <p>To generate this set, a positive semi-definite covariance matrix and
a mean (a vector) are generated at random. Indeed, the
values of the covariance matrix define a cluster
shape that is not isotropic (strictly spherical) and that can
have different variances on an arbitrary basis of
directions, not necessarily those of the dimensions;
• there are at least two overlapping clusters;
• if the centroid of one cluster is in the hyper-rectangle of
another cluster, then we consider that the maximum degree
of overlap has been exceeded, so at least one of
the clusters concerned is replaced by a new cluster;
• clusters do not have the same number of points. The
cardinalities are chosen at random so that their sum is equal
to the total number of points n defined previously.</p>
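      <p>A generator along these lines (our sketch of the described procedure, not the authors' exact code):</p>
      <preformat>
import numpy as np

def random_gaussian_cluster(n_points, d, rng):
    """Random mean and random PSD covariance: anisotropic gaussian cluster."""
    mean = rng.uniform(-10, 10, size=d)
    A = rng.normal(size=(d, d))
    cov = A @ A.T                   # A·A^T is positive semi-definite
    return rng.multivariate_normal(mean, cov, size=n_points)

def synthetic_dataset(n, k, d, seed=0):
    """k anisotropic gaussian clusters with random cardinalities summing to n."""
    rng = np.random.default_rng(seed)
    sizes = rng.multinomial(n, np.ones(k) / k)   # random sizes, sum == n
    return np.vstack([random_gaussian_cluster(s, d, rng) for s in sizes])
      </preformat>
      <p>The overlap constraint (rejecting a cluster whose centroid falls inside another cluster's hyper-rectangle) would be enforced by regenerating the offending clusters; it is omitted here.</p>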
      <p>
4.1.2 REAL DATA. The real data come from three known
sources: OpenML, UCI and Kaggle. They are all of numerical
type. These data are used in other research projects to
benchmark machine learning methods [
        <xref ref-type="bibr" rid="ref20">20</xref>
]. The values of the real
k, d and n are given in Table 2. The data sets used are diverse and
can be organized into several groups:
• black and white or grayscale images: clustering is
applied in this case directly to the pixels. Two datasets
contain images of, respectively, the first 10 digits and of
clothing and shoes. Each image was flattened to form a
single vector, and each vector was assigned its
corresponding class (label);
• 4-band multispectral images: they characterize
different types of soil;
• sounds: one dataset is a set of digitized
Japanese vowel sounds, each sound being associated with a
speaker;
• historization of people's activities: one dataset is
a collection of the positions of sensors worn by people,
used to identify the movement they perform at a given
time; another is a set of people's activities
designed to identify their authors;
• images represented by characteristics extracted from
the object to be identified: for example, in one
dataset each record is a set of coefficients of the
character (one of the first 10 digits) shapes; in another,
each instance of a digit is made of rotation-invariant Zernike
moments of the digit (between 0 and 9); in a third, each
individual is a sequence of positions characterizing a digit;
in another, each instance of a letter of the English
alphabet contains statistical moments and edge counts of the
given letter; in the last dataset, a data point is a set of
geometric characteristics of a given vehicle.
      </p>
      <p>4.1.3 METRICS FOR EVALUATING EXPERIMENTAL RESULTS.
To evaluate the results of the compared algorithms, three
metrics were used: the execution time (Δt) of the algorithm,
the relative difference (Δk) between the actual value of k and the
estimated value of k, and the distance between the actual
partitioning and the partitioning proposed by the algorithm.</p>
      <p>The relative difference is calculated as follows:

$$\Delta k = \frac{|k - \hat{k}|}{k} \qquad (12)$$</p>
      <p>
The variation of information (VI) [
        <xref ref-type="bibr" rid="ref15">15</xref>
] was used as the
distance between two partitionings. It measures the amount of
information gained and lost when a dataset passes from a partition A
to another partition B. It is a true metric; in particular it respects
the triangle inequality. The smaller the VI, the closer the
partitionings are to each other.
      </p>
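      <p>VI can be computed from entropies and mutual information; a small sketch (our choice of scipy/scikit-learn helpers, not necessarily the implementation used here):</p>
      <preformat>
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A) + H(B) - 2 I(A, B); 0 iff the partitions coincide."""
    h_a = entropy(np.bincount(labels_a))        # entropy of partition A (nats)
    h_b = entropy(np.bincount(labels_b))        # entropy of partition B (nats)
    mi = mutual_info_score(labels_a, labels_b)  # mutual information (nats)
    return h_a + h_b - 2.0 * mi
      </preformat>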
    </sec>
    <sec id="sec-7">
      <title>RUNTIME ANALYSIS</title>
      <p>If we consider the performance of the three algorithms according
to the metric Δt, our algorithm takes the least time to provide
a clustering result. This is true on both synthetic and real data,
regardless of the values of k, d and n. Our algorithm is 5.5 to
12.9 times faster than g-means and 15.4 to 75.2 times faster than
x-means on synthetic data. The same trend occurs
on real data, but with factors from 16.3 to 1566.6 compared to g-means
and from 6.4 to 108.6 compared to x-means. (In the corresponding
results table, Δt is expressed in seconds; in the Δt sub-column of the
g-means and x-means columns, the number in brackets represents how many
times our algorithm is faster than the compared algorithm.) Maximum
execution times were reached for all three algorithms at d = 8,
k = 80 and the largest n in the synthetic data. In this combination,
the elapsed execution times are 10.43 minutes for our algorithm,
59.41 minutes for g-means and 13 hours 5 minutes for x-means.</p>
      <p>In the real data, the maxima are reached in the configuration
d = 784, n = 697932 and k = 62 for x-means and KD-means:
15.48 minutes for KD-means and 10 hours for x-means. In the
case of g-means, the maximum is even higher than for the
others; it reaches 16.48 hours (d = 4, k = 18, n = 1 × 10⁶).</p>
    </sec>
    <sec id="sec-8">
      <title>CLUSTER QUALITY ANALYSIS (Δk / VI)</title>
      <p>KD-means is better than the other algorithms at estimating k with
a good clustering quality, on both synthetic and real data.</p>
      <p>On the synthetic data, KD-means does not exceed 0.47 in Δk
and 1.2 in VI. X-means reaches Δk = 3.72 and VI = 3.4, and
g-means Δk = 8.1 and VI = 2.8. The decrease in Δk
for g-means when k increases in the synthetic data is due to the
presence of many gaussian clusters that are well separated from
each other. KD-means gives a better estimate, of better quality, than
the others: in terms of VI, it is on average 2.46 and 3.05 better
than g-means and x-means, and similarly in Δk, better on
average by 2.83 (g-means) and 4.29 (x-means). It can be pointed
out that g-means is better than x-means in terms of quality (VI)
in several combinations of (k, d, n).</p>
      <p>On real data, the KD-means maxima do not exceed Δk = 3.81 and
VI = 7.2. They are higher for x-means and g-means, which reach
respectively (Δk = 4.73, VI = 8.6) and (Δk = 555.5, VI = 12.5). The
values of Δk and VI are higher on real data than on synthetic data.
Indeed, the distributions of clusters in real data are more complex
than synthetic (gaussian) clusters. As a result, these complex
cluster shapes are difficult to capture completely, point
by point. This increase is much higher for g-means because it
specifically tends to identify gaussian clusters. However,
clusters in real data are not necessarily gaussian, because they are
not produced by gaussian distribution generators. As a result,
g-means makes an excessive overestimation of k (hence the high
Δk) compared to KD-means. The KD-means algorithm is the least
affected by the gaussianity of clusters because its operation
does not explicitly seek gaussianity.</p>
      <p>PARAMETER SENSITIVITY ANALYSIS. We analyze the
three metrics (Δt, VI and Δk) according to th_evo and ψ, in
both data contexts (real and synthetic).</p>
      <p>As Tables 3 and 4 show, th_evo and ψ do not have a particular
influence on the execution time of KD-means. Indeed, the range
(difference between the maximum and minimum values) of Δt
does not exceed 10.2 seconds with respect to ψ, and is about 30
seconds for th_evo. Given the Δt magnitude of
the three algorithms, these differences are negligible.</p>
      <p>In the synthetic data, concerning ψ, the values of Δk are
identical except for ψ = 4, where the value is slightly higher than
the others by 0.2. The values of VI have a maximum difference of 0.4;
this difference is 0.1 when ψ ∈ [8, 10, 12], for which VI is minimal.
At ψ = 4, Δk is slightly higher by 0.3 compared to the rest. In real
data, the range is 1.1 for VI but 0.2 if ψ ∈ [4, 8]. This range is
0.7 for Δk but drops to 0.2 for ψ ∈ [4, 8, 10].</p>
      <p>Concerning th_evo, in the synthetic data the values of Δk and
VI have a range of 0.2. In real data, the respective ranges of Δk
and VI are 0.5 and 0.4. They drop to 0.2 when th_evo ∈ [0.20, 0.25, 0.30]
for Δk and th_evo ∈ [0.15, 0.20, 0.25, 0.30] for VI.</p>
      <p>
The observed ranges of the two metrics VI and Δk are lower
than the values of the same metrics of the competitor
algorithms on the real and synthetic data. But if we are stricter and
only accept ranges less than or equal to 0.2, then the optimal
values, when the data are gaussian, are ψ ∈ [8, 10, 12] with th_evo
taking any of the tested values. When the data are real (not
necessarily gaussian clusters), the optimal values are ψ ∈ [4, 8] and
th_evo ∈ [0.20, 0.25, 0.30].
      </p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION AND PERSPECTIVES</title>
      <p>The estimation of the number of clusters (k) that k-means must
find is a major problem in the literature, and it is all the more
significant when data is massive and clusters overlap. We provided a
solution, based on the kd-tree, that automatically estimates the value
of k on massive and high-dimensional data. It is robust even if
clusters overlap. Four criteria were defined to guide the procedure
for estimating k. Through experiments, we have shown that our
solution is very competitive compared to the known x-means
and g-means algorithms, which have proven to be unsuitable
for scaling up and for overlapping clusters. We plan to improve our
solution in several areas:
• implement a strategy that manages nested indexing,
because the indices of a node are also in its parent node;
• use randomized kd-trees and multiple kd-trees when the
number of dimensions is even larger, to achieve better
performance in searching for information stored in the
kd-tree, or to obtain a better data separation. This performance
will help our solution make better decisions on cluster
merging.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>These studies are supported by ANRT and Capgemini funding
under the CIFRE-Capgemini partnership.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Nir</given-names>
            <surname>Ailon</surname>
          </string-name>
          , Yudong Chen, and
          <string-name>
            <given-names>Huan</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Breaking the Small Cluster Barrier of Graph Clustering</article-title>
          .
          <source>CoRR abs/1302</source>
          .4549 (
          <year>2013</year>
          ). arXiv:
          <volume>1302</volume>
          .
          <fpage>4549</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>David</given-names>
            <surname>Arthur</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sergei</given-names>
            <surname>Vassilvitskii</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>k-means++: the advantages of careful seeding</article-title>
          .
          <source>In ACM-SIAM symposium on Discrete algorithms</source>
          .
          <fpage>1027</fpage>
          -
<lpage>1035</lpage>
          . arXiv:
          <volume>1212</volume>
          .
          <fpage>1121</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jon Louis</given-names>
            <surname>Bentley</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>Multidimensional binary search trees used for associative searching</article-title>
          .
          <source>Commun. ACM</source>
          <volume>18</volume>
          (
          <year>September 1975</year>
          ),
          <fpage>509</fpage>
          -
          <lpage>517</lpage>
          . Issue 9. https://doi.org/10.1145/361002.361007
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Gianna M.</given-names> <surname>Del Corso</surname></string-name>.
          <year>1997</year>.
          <article-title>Estimating an Eigenvector by the Power Method with a Random Start</article-title>.
          <source>SIAM J. Matrix Anal. Appl.</source> <volume>18</volume>, <issue>4</issue> (Oct. <year>1997</year>), <fpage>913</fpage>-<lpage>937</lpage>. https://doi.org/10.1137/S0895479895296689
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>T.</given-names> <surname>Cover</surname></string-name> and
          <string-name><given-names>P.</given-names> <surname>Hart</surname></string-name>.
          <year>1967</year>.
          <article-title>Nearest Neighbor Pattern Classification</article-title>.
          <source>IEEE Trans. Inf. Theor.</source> <volume>13</volume>, <issue>1</issue> (<year>1967</year>), <fpage>21</fpage>-<lpage>27</lpage>. https://doi.org/10.1109/TIT.1967.1053964
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Charles</given-names>
            <surname>Elkan</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Using the Triangle Inequality to Accelerate K-means</article-title>
          .
          <source>In Proceedings of the Twentieth International Conference on International Conference on Machine Learning (ICML'03)</source>
          . AAAI Press,
          <fpage>147</fpage>
          -
          <lpage>153</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Aislan G.</given-names> <surname>Foina</surname></string-name>,
          <string-name><given-names>Judit</given-names> <surname>Planas</surname></string-name>,
          <string-name><given-names>Rosa M.</given-names> <surname>Badia</surname></string-name>, and
          <string-name><given-names>Francisco Javier</given-names> <surname>Ramirez-Fernandez</surname></string-name>.
          <year>2011</year>.
          <article-title>P-means, a parallel clustering algorithm for a heterogeneous multi-processor environment</article-title>.
          <source>In Proceedings of the 2011 International Conference on High Performance Computing and Simulation, HPCS 2011</source>. IEEE, <fpage>239</fpage>-<lpage>248</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.W.</given-names>
            <surname>Forgy</surname>
          </string-name>
          .
          <year>1965</year>
          .
          <article-title>Cluster analysis of multivariate data: efficiency versus interpretability of classifications</article-title>
          .
          <source>Biometrics</source>
          (
          <year>1965</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Guojun</given-names> <surname>Gan</surname></string-name>,
          <string-name><given-names>Chaoqun</given-names> <surname>Ma</surname></string-name>, and
          <string-name><given-names>Jianhong</given-names> <surname>Wu</surname></string-name>.
          <year>2007</year>.
          <article-title>Data Clustering: Theory, Algorithms, and Applications</article-title> (ASA-SIAM Series on Statistics and Applied Probability).
          <source>Society for Industrial and Applied Mathematics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Greg</given-names>
            <surname>Hamerly</surname>
          </string-name>
          and
          <string-name>
            <given-names>Charles</given-names>
            <surname>Elkan</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Learning the K in K-means</article-title>
          .
          <source>In Proceedings of the 16th International Conference on Neural Information Processing Systems (NIPS'03)</source>
          . MIT Press,
          <fpage>281</fpage>
          -
          <lpage>288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>Anil K.</given-names> <surname>Jain</surname></string-name>.
          <year>2010</year>.
          <article-title>Data clustering: 50 years beyond K-means</article-title>.
          <source>Pattern Recognition Letters</source> <volume>31</volume>, <issue>8</issue> (June <year>2010</year>), <fpage>651</fpage>-<lpage>666</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>A. K.</given-names> <surname>Jain</surname></string-name>,
          <string-name><given-names>M. N.</given-names> <surname>Murty</surname></string-name>, and
          <string-name><given-names>P. J.</given-names> <surname>Flynn</surname></string-name>.
          <year>1999</year>.
          <article-title>Data clustering: a review</article-title>.
          <source>ACM Computing Surveys</source> <volume>31</volume>, <issue>3</issue> (<year>1999</year>), <fpage>264</fpage>-<lpage>323</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>Chuanren</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>Tianming</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>Yong</given-names> <surname>Ge</surname></string-name>, and
          <string-name><given-names>Hui</given-names> <surname>Xiong</surname></string-name>.
          <year>2012</year>.
          <article-title>Which distance metric is right: An evolutionary k-means view</article-title>.
          <source>In Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012</source>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Songrit</given-names>
            <surname>Maneewongvatana</surname>
          </string-name>
          and
          <string-name>
            <given-names>David M.</given-names>
            <surname>Mount</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>It's Okay to Be Skinny, If Your Friends Are Fat</article-title>
          .
          <source>In Center for Geometric Computing 4th Annual Workshop on Computational Geometry.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Marina</given-names>
            <surname>Meilă</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Comparing Clusterings by the Variation of Information</article-title>
          .
          <source>In Learning Theory and Kernel Machines</source>,
          <string-name><given-names>Bernhard</given-names> <surname>Schölkopf</surname></string-name> and
          <string-name><given-names>Manfred K.</given-names> <surname>Warmuth</surname></string-name> (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg,
          <fpage>173</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>Dan</given-names> <surname>Pelleg</surname></string-name> and
          <string-name><given-names>Andrew</given-names> <surname>Moore</surname></string-name>.
          <year>2000</year>.
          <article-title>X-means: Extending K-means with Efficient Estimation of the Number of Clusters</article-title>.
          <source>In Proceedings of the 17th International Conf. on Machine Learning</source>. Morgan Kaufmann, <fpage>727</fpage>-<lpage>734</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>D. T.</given-names> <surname>Pham</surname></string-name>,
          <string-name><given-names>S. S.</given-names> <surname>Dimov</surname></string-name>, and
          <string-name><given-names>C. D.</given-names> <surname>Nguyen</surname></string-name>.
          <year>2005</year>.
          <article-title>Selection of K in K-means clustering</article-title>.
          <source>Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science</source> <volume>219</volume>, <issue>1</issue> (Jan. <year>2005</year>), <fpage>103</fpage>-<lpage>119</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Hanan</given-names>
            <surname>Samet</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Foundations of Multidimensional and Metric Data Structures</article-title>
          . Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Stephens</surname>
          </string-name>
          .
          <year>1974</year>
          .
          <article-title>EDF Statistics for Goodness of Fit and Some Comparisons</article-title>
          .
          <source>J. Amer. Statist. Assoc</source>
          .
          <volume>69</volume>
          ,
          <issue>347</issue>
          (
          <year>1974</year>
          ),
          <fpage>730</fpage>
          -
          <lpage>737</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name><given-names>Jan N.</given-names> <surname>van Rijn</surname></string-name>,
          <string-name><given-names>Geoffrey</given-names> <surname>Holmes</surname></string-name>,
          <string-name><given-names>Bernhard</given-names> <surname>Pfahringer</surname></string-name>, and
          <string-name><given-names>Joaquin</given-names> <surname>Vanschoren</surname></string-name>.
          <year>2014</year>.
          <article-title>Algorithm Selection on Data Streams</article-title>.
          <source>In Discovery Science</source>, Sašo Džeroski, Panče Panov, Dragi Kocev, and Ljupčo Todorovski (Eds.). Springer International Publishing, <fpage>325</fpage>-<lpage>336</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name><given-names>Svante</given-names> <surname>Wold</surname></string-name>,
          <string-name><given-names>Kim</given-names> <surname>Esbensen</surname></string-name>, and
          <string-name><given-names>Paul</given-names> <surname>Geladi</surname></string-name>.
          <year>1987</year>.
          <article-title>Principal component analysis</article-title>.
          <source>Chemometrics and Intelligent Laboratory Systems</source> <volume>2</volume>, <issue>1</issue> (<year>1987</year>), <fpage>37</fpage>-<lpage>52</lpage>. Proceedings of the Multivariate Statistical Workshop for Geologists and Geochemists.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name><given-names>Tian</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Raghu</given-names> <surname>Ramakrishnan</surname></string-name>, and
          <string-name><given-names>Miron</given-names> <surname>Livny</surname></string-name>.
          <year>1996</year>.
          <article-title>BIRCH: An Efficient Data Clustering Method for Very Large Databases</article-title>.
          <source>SIGMOD Record</source> <volume>25</volume>, <issue>2</issue> (June <year>1996</year>), <fpage>103</fpage>-<lpage>114</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>