KD-means: clustering method for massive data based on kd-tree

Nabil El Malki, Capgemini, Université de Toulouse, UT2J, IRIT (CNRS/UMR 5505), Toulouse, France, nabil.el-malki@capgemini.com
Franck Ravat, Université de Toulouse, UT1, IRIT (CNRS/UMR 5505), Toulouse, France, franck.ravat@irit.fr
Olivier Teste, Université de Toulouse, UT2J, IRIT (CNRS/UMR 5505), Toulouse, France, olivier.teste@irit.fr

ABSTRACT
K-means clustering is a popular unsupervised classification algorithm employed in several domains, e.g., imaging, segmentation, or compression. Nevertheless, the number of clusters k, fixed a priori, strongly affects the clustering quality. Current state-of-the-art k-means implementations can automatically set the number of clusters. However, they result in unreasonable processing times when classifying large volumes of data. In this paper, we propose a novel solution based on kd-tree to determine the number of clusters k in the context of massive data, for preprocessing in data science projects or in near-real-time applications. We demonstrate how our solution outperforms current solutions in terms of clustering quality and processing time on massive data.

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
Up to now, data clustering, also known as cluster analysis, has been one of the most important tasks in exploratory data analysis. It is applied in a variety of applications, e.g., web page clustering, pattern recognition, image segmentation, data compression and nearest neighbor search [12]. Various clustering algorithms have been available since the early 1950s. The goal of data clustering is to classify a set of patterns, points or objects into groups known as clusters, such that data of each group are as similar as possible to one another, and different groups are as dissimilar as possible from one another [11]. In this paper, we consider one of the most widely used data clustering algorithms, i.e., k-means [6, 8]. Due to its simplicity and understandability, this algorithm has been popular for more than four decades [11][13]. Given a set of points X = {x_1, ..., x_n}, where each point is defined in the d-dimensional space R^d, and an integer k, k-means aims to partition X into a set of clusters C = {C_1, ..., C_k} by minimizing the euclidean distance between the points (data) within each cluster:

\sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - G_j \rVert^2    (1)

where G_j is the centroid of the cluster C_j and ||.|| the euclidean distance. Minimizing this function requires an iterative process that ends when the centroids do not change between two iterations. This process consists of two steps:
• assigning each point to the nearest centroid using the euclidean distance;
• recalculating the centroids, i.e., the average of the point values, of the newly formed clusters.
A minimal sketch of this iterative process is given at the end of this section.

Disadvantageously, k-means requires that the user predefine the value of k, and an inaccurate value of k has a direct negative impact on the clustering quality. Different solutions have been proposed to estimate the relevant number of clusters k [10, 16, 17]. Among the limitations of these solutions, we highlight three major ones that challenge their normal process:
• they unroll k-means several times on the same detailed data. However, k-means has difficulties in scaling because it has a temporal complexity proportional to k·n·t, with t the number of iterations [11]. Consequently, in a context of massive data, the repetitive access to data caused by the repetitive use of k-means on the same data makes them time-consuming or unusable for applications that require near-real-time responses;
• they hardly support overlapping clusters. This problem can lead to two different cases: i) overestimating the number of clusters exaggeratedly compared to the actual number, thus increasing the processing time significantly; ii) underestimating the value of k, thus producing a very poor clustering quality;
• if we consider clusters that approach the spherical form, which k-means identifies, then these k-estimation solutions tend to capture clusters of this form only. In real data sets, however, there are also clusters of spherical shapes that are more or less elongated, approaching the elliptical shape and its variants. These clusters can also have different orientations. We consider a cluster to be strictly spherical if it has a perfectly or almost spherical shape, and not strictly (or less) spherical when it has a shape that tends towards the elliptical. All these clusters of different sphericity will be called natural clusters. Formally, when a cluster has a strictly spherical data distribution, the variances of each data dimension are equal; if the cluster has a non-strict sphericity, the dimensions have different variances. Statistically, these clusters can be assimilated to gaussian distributions.

In this paper, we propose a solution to these problems. We introduce an algorithm, based on the use of kd-tree [3], that estimates the number of clusters k for massive data. This estimation is performed in reasonable processing time, and thus it is possible to integrate such a solution into data-intensive applications, e.g., near-real-time applications, or the preprocessing stage of a data analysis project. Our solution is an algorithm, based on kd-tree (k-dimensional tree), which hierarchically structures data aggregates, e.g., clusters of data points and their corresponding metadata, with different levels of precision. Furthermore, we define several new cluster merge criteria to support overlapping and natural clusters more efficiently.

We compare our solution to known methods of estimating k from the literature (x-means [16] and g-means [7, 10]). We show that our solution is faster than current methods while ensuring better clustering quality on massive data.

This paper is organized as follows. In Section 2, we study the state-of-the-art solutions. Next, in Section 3 we introduce our contribution. In Section 4, we compare our solution with that of our competitors on data sets containing up to 4 million points. Finally we conclude in Section 5.
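As announced above, the following is a minimal NumPy sketch of this iterative process (assignment to the nearest centroid, then recomputation of the centroids). It is only an illustration: the function and variable names are ours, not those of the Elkan-based implementation used later in the experiments.

```python
import numpy as np

def kmeans_iteration(X, centroids):
    """One iteration: assign every point to its nearest centroid (step 1),
    then recompute each centroid as the mean of its assigned points (step 2)."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_centroids = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return labels, new_centroids

def kmeans(X, k, max_iter=100, seed=0):
    """Plain Lloyd-style k-means; stops when the centroids no longer move."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        labels, new_centroids = kmeans_iteration(X, centroids)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```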
2 BACKGROUND
For several decades, the problem of estimating the number of clusters k has remained significant in the literature. Several studies have been conducted to prevent the user from defining an inexact value of k, which leads to counterintuitive results that are disparate from the expected value of k. We identify two types of clustering solutions close to k-means. Some integrate the estimation of the value of k; others do not, as in the case of Birch [22]. The latter is a hierarchical algorithm based on a micro-clustering approach. It consists of two main phases. The first tends to produce many small spherical clusters without having to specify the number to be produced. The second uses another clustering algorithm to perform macro-clustering. We have not retained it in our study because in the second phase an algorithm must be applied to the representatives of the small clusters (usually centroids). The algorithms most often used in the second phase, and suited to following the first phase, are k-means or hierarchical ascendant clustering [12], to which the value of k must be given a priori. We focus on two algorithms specialized in the automatic estimation of k: x-means [16] and g-means [7, 10].

X-MEANS. X-means [16] is a k-means-based algorithm that consists of searching for the set of centroids that best fits the data. Starting from a minimum k (kmin), generally equal to one, the algorithm iteratively adds new centroids if necessary until the maximum value of k (kmax) is reached. At each iteration, the algorithm identifies the subset of centroids that must be divided in two. This identification is done using a statistical criterion called the bayesian information criterion (BIC), which is used for model selection and is based on the likelihood function. The x-means process consists of two operations that are iterated until completion: Improve_param and Improve_structure. Improve_param is simply the execution of k-means with the current k as parameter. Improve_structure searches for where new centroids must be added. For each centroid and its associated region of points, three sub-operations follow each other:
• the corresponding BIC is calculated;
• k-means is executed with k=2, and two children's centroids are obtained;
• the BIC of the same region is calculated again, but taking into account the two children's centroids instead of their parent centroid.
If the previous BIC is smaller than the new BIC, then the parent centroid is replaced by its two children's centroids; otherwise the parent centroid is retained.

G-MEANS. This algorithm [10] wraps k-means and adopts almost the same approach as x-means. Instead of the BIC criterion, a gaussianity test (at the confidence level α defined by the user) on clusters is used to decide whether to divide a centroid in two. It starts from a set of centroids of size kmin and increases the set size during the process. The value of kmin is initialized to 1 because usually we have no a priori knowledge of the possible values of k. In this case, the first centroid corresponds to the arithmetic mean of the values of all data points. A list is defined to contain the centroids whose surrounding sets of points are gaussian, so that they are not processed in subsequent iterations. At each iteration, the centroids not stored in the list, called parent centroids, are divided into two children's centroids c1 and c2. This division is operated via principal component analysis (PCA). The PCA method has a time complexity of O(nd^2 + d^3) [21], with d the data point dimension and n the size of the dataset, so the authors recommend using accelerated versions of PCA such as PCA based on the power method [4]. The children's centroids are then refined through the execution of 2-means (k=2). Then the gaussianity test, via the Anderson-Darling test [19] (which only works on points defined in one dimension), is performed on the points around the parent centroid. To do this, the points around the parent centroid are first projected on the vector v = c1 − c2 linking the two children centroids, to obtain points defined on a single dimension. If these points follow a gaussian distribution then the centroid is added to the list of centroids that will no longer be processed in the following iterations. Otherwise, it is replaced by its two children's centroids. The value assigned to α by the authors [10] is 0.0001 to test their solution on synthetic and real data.

Figure 1: Estimation of k by each of the three algorithms when clusters overlap. The respective estimates of x-means and g-means are 67 and 26. The centroids are represented by black crosses. The horizontal and vertical axes refer to both dimensions of the used dataset.

LIMITATIONS. The limitations of the above methods are related to the difficulty of dealing with overlapping clusters, the quality of clustering (and therefore also the value of k) they produce, as well as the execution time needed to provide a result. The two methods have different computational complexities and each can identify only a limited number of more or less spherical cluster types. X-means is suitable for data whose clusters are strictly spherical [10]. If other forms of clusters are present, then the estimate of k could be overestimated. On the other hand, g-means can identify clusters whose spherical shape is less strict than that identified by x-means. If the clusters are well spaced apart, g-means can provide the relevant value of k [10]; in addition to this distribution, the strict sphericity requirement must be added to the clusters for x-means to reach the same performance. If the clusters overlap, neither of them estimates the correct value of k; under these conditions x-means and g-means tend to overestimate k. These cases are illustrated by Figure 1. Five gaussian clusters were generated with more or less different sphericity. Two clusters are well separated from each other; the other three overlap. All three algorithms were run on this set of 20,000 data points, with kmax set at 140. G-means identified 26 clusters but detected the 2 clusters well separated from the others.

Figure 2: Overall solution for estimating the value of k.

The last step of our solution is a regularization phase (Subsection 3.3). It decides whether a small cluster is a separate cluster or should be merged with another cluster. This phase avoids an overestimation of k due to small clusters. Note that the nodes are processed according to the post-fixed (post-order) path, i.e., each node is processed after each of its children has been processed. This path avoids exploring a node unnecessarily (a time-consuming task because it involves conditional tests) when it will not be processed because none of its children have been processed yet.

3.1 KD-TREE
Kd-tree puts points that are spatially close together in the same node. This property is exploited by several machine learning methods to accelerate calculations related to point space; one of the best known cases is the k-nearest neighbors algorithm [5]. This property is advantageous to us in two cases. First of all, it partly addresses the problem of k-means, which is to group together the most similar points possible in the same cluster. Second, it provides an optimized spatial subdivision of space to accelerate data processing. It recursively splits the whole dataset
into partitions, in a manner similar to a decision tree acting on nu- However, there is overfitting on the remaining three clusters, an merical datasets [9]. The root node is the whole data space, while overestimation of 24 clusters instead of only estimating 3. X- the leaves are the smallest possible partitions of the data space. means estimated k at 76. Overestimation is slightly higher for A node is associated with several closed rectangular regions of overlapping clusters than for separate clusters. space, called hyper-rectangles. Traditionally, in the literature, the In terms of execution time, the two algorithms are not suitable hyper-rectangle is only used to represent the subspace that the when the data are massive and it is even more accentuated when node represents. In addition, we will also use it for other pur- the estimated value of k tends to be large. poses in 3.1.1 because of its advantages in terms of calculation and representation of sets. 3 CONTRIBUTION In this section, we will define our solution that addresses the 3.1.1 HYPER-RECTANGLE. Formally, it corresponds to the problem of estimating k in a context of massive data and over- smallest possible rectangle (generalized to 𝑛 dimensions) covering lapping clusters. Figure 2 illustrates the different steps of our all the data points of a set. It is defined by the following equation: solution. The solution is composed of two main parts; i) the storage and 𝐻 = {𝑋 |π‘‹π‘šπ‘–π‘›π‘  β©½ 𝑋𝑖 β©½ π‘‹π‘šπ‘Žπ‘₯𝑒𝑠 βˆ€π‘–} (2) organization of the 𝑋 data provided by the user in a data structure, and ii) the processing is done on this structure to estimate π‘˜ where, π‘šπ‘–π‘›π‘  and π‘šπ‘Žπ‘₯𝑒𝑠 respectively represent the lower and (Merging task and Cluster regularization). upper limits of the hyper-rectangle. Indeed, π‘šπ‘–π‘›π‘ (π‘šπ‘Žπ‘₯𝑒𝑠) is a We opted for the kd-tree[3] data structure to fulfill the roles vector corresponding to the minimum (maximum) values of each of storage and data organization. A kd-tree is a binary tree that dimension of the 𝑋 data set (see Figure 3). represents a hierarchical subdivision of space using splitting In our case, we consider a hyper-rectangle as a geometric planes that are orthogonal to the coordinate dimensions (do not object to approximate the overall spatial distribution of a set of confuse the k of kd-tree which is just the dimension of the data points contained in the node. In other words, each of the clusters stored in the tree and the k of k-means which corresponds to the of a node is geometrically represented by a hyper-rectangle. From number of clusters). In Subsection 3.1 we discuss the advantages this representation results, in our solution, a time saving on the of kd-tree as well as the method that builds the kd-tree in 3.1.2. calculations that involves sets (a set is a cluster of points). For Concerning the processing part, first of all we proceed to the example, to calculate a distance between two sets or perform estimation of k clusters in each of the leaves (see 3.1.4). This op- a set operation involving at least two sets, their corresponding eration results, in each leaf, in the constitution of several clusters. hyper-rectangles could be used. Therefore, instead of visiting all Then the clusters of the nodes are merged recursively from the the data points of the sets, it is only necessary to use the π‘šπ‘–π‘›π‘  and leaves to the root according to rules defined in Subsection 3.2.2. 
π‘šπ‘Žπ‘₯𝑒𝑠 vectors of the hyper-rectangles for the above calculations. These rules are built to manage overlapping clusters. The final This greatly saves a lot of time on large amounts of data. associated with a list of indices of the data points that it rep- resents, the arithmetic mean of these data and the sum of the squared differences (𝑠𝑠𝑒) between each point and the arithmetic mean. The data points are not stored in the tree but are accessed via the indices. The internal nodes have in addition the splitting dimension index 𝑠𝑑 as well as the associated value 𝑠𝑣. 3.1.4 INITIALIZATION OF LEAVES. During the tree building Figure 3: A 2-dimensional hyper-rectangle 𝐻 and the associated and more precisely during the instantiation of the leaf, we es- bounds π‘šπ‘Žπ‘₯𝑒𝑠 and π‘šπ‘–π‘›π‘  located on the corners (upper right and timate the number of clusters present in the subset of the data lower left). Note that 𝑋 is the dataset with two dimensions (d0 represented by this leaf (see Algorithm 1). At the same time the and d1) and the circles represent the points of X. The π‘šπ‘Žπ‘₯(π‘šπ‘–π‘›) clustering of these data is carried out, which results in a set of function returns the maximum (minimum) of both dimensions. clusters. When clusters are close to each other or overlap, they could overestimate the value of π‘˜ in an exaggerated way. For this, the value of π‘˜ was limited to a maximum value (π‘™π‘’π‘Žπ‘“ _π‘˜π‘šπ‘Žπ‘₯) 3.1.2 SPILITTING METHOD. The partitioning of a data space so that g-means (this version of g-means is called π‘’π‘ π‘‘π‘–π‘šπ‘Žπ‘‘π‘’πΎ in of an internal node (i.e., not leaf) is performed mainly based Algorithm 1) cannot return more than a limited number of clus- on two cutting elements that must be specified: a given data ters. This number is called π‘˜π‘‘ with 𝑑 ∈ Nβˆ— the 𝑖𝑑 β„Ž iteration of dimension (𝑠𝑑) of the node space and a value of this dimension π‘’π‘ π‘‘π‘–π‘šπ‘Žπ‘‘π‘’πΎ. Indeed, at each iteration 𝑑 we test if π‘˜π‘‘ >= π‘™π‘’π‘Žπ‘“ _π‘˜π‘šπ‘Žπ‘₯ . (𝑠𝑣). Thus the points whose value at the dimension 𝑠𝑑 is less The value of π‘˜π‘‘ could be up to 2𝑑 . This case only occurs if all (resp. greater) than 𝑠𝑣 then they will be assigned to the child’s the centroids have been split in two because they have not been node which is called π‘™π‘’π‘ π‘ π‘β„Žπ‘–π‘™π‘‘ (resp. π‘”π‘Ÿπ‘’π‘Žπ‘‘π‘’π‘Ÿπ‘β„Žπ‘–π‘™π‘‘). This process evaluated as gaussian. Note that even if the total number of clus- is carried out recursively from the root to the leaves. ters in all leaves exceeds π‘˜π‘šπ‘Žπ‘₯ then our merging algorithms will Several rules for splitting nodes are proposed to choose the op- approach this number towards the real k of the dataset. timal dimension and associated value for correct data separation. Let 𝑀 be the number of leaves that kd-tree has, π‘™π‘’π‘Žπ‘“ _π‘˜π‘šπ‘Žπ‘₯ is Among them, we opted for sliding midpoint splitting rule [18] then defined as follows: because it provides an optimized data organization than other (√ √ classical rules [14]. This performance on others is explained be- π‘˜π‘šπ‘Žπ‘₯ , if 𝑀 >= π‘˜π‘šπ‘Žπ‘₯ cause it does not produce empty nodes or nodes whose data π‘™π‘’π‘Žπ‘“ _π‘˜π‘šπ‘Žπ‘₯ = π‘˜π‘šπ‘Žπ‘₯ (3) 𝑀 , otherwise space is very sparse (i.e., a very large space, specifically at the sides, compared to the data it represents when it should be small). 
If we assume that in Equation 3 only the second condition ex- In addition, the classical rules choose the median as the cutting ists and π‘™π‘’π‘Žπ‘“ _π‘˜π‘šπ‘Žπ‘₯ will always be equal to the result of this same value 𝑠𝑣 while the sliding midpoint splitting rule chooses the mid- condition then this would induce π‘™π‘’π‘Žπ‘“ _π‘˜π‘šπ‘Žπ‘₯ to tend towards a dle of the points ((π‘šπ‘Žπ‘₯ + π‘šπ‘–π‘›)/2) which is less expensive. The value of 1 or 0. Indeed for a given value of π‘˜π‘šπ‘Žπ‘₯ , the greater 𝑀 is dimension 𝑠𝑑 chosen is the one that is the longest (π‘šπ‘Žπ‘₯ βˆ’ π‘šπ‘–π‘›). the greater the π‘™π‘’π‘Žπ‘“ _π‘˜π‘šπ‘Žπ‘₯ tends towards 1 if 𝑀 <= π‘˜π‘šπ‘Žπ‘₯ . When Note that theoretically there is no guarantee on the limit of 𝑀 > π‘˜π‘šπ‘Žπ‘₯ then π‘™π‘’π‘Žπ‘“ _π‘˜π‘šπ‘Žπ‘₯ tends towards 0. These cases could the depth that kd-tree can have, the trees could be very deep (so make the solution ineffective because they would underestimate the time of tree construction is increased). As a result, we have the number of clusters in the leaf and by extension underestimate added two stopping conditions to avoid deep trees. These two the number of clusters in the tree dataset. The first condition conditions are performed at the beginning of the method that is essential to maintain a balance between having an underesti- partitions the node in two. If one of the conditions is verified mation of π‘˜ if there is only the second condition and having an then the node will not be divided and will be considered as a leaf: overestimation of k if there is no condition that would limit the β€’ the depth of the leaf is limited to π‘™π‘œπ‘”(𝑛) to save construc- value of k. tion time compared to the normal time taken if the depth After the estimation of k, each cluster of data points is repre- is not limited. The root has a depth of 1; sented by a hyper-rectangle as well as the associated aggregated β€’ each node is limited by a minimum number of points, ψ. calculations (arithmetic mean (G) and the difference between the It is not interesting to have leaves that have a number points and G (𝑠𝑠𝑒)). of points close to one in a context of massive data. This would result in a very long kd-tree construction time and therefore make our solution unsuitable for near real-time Algorithm 1: InitializeLeaf applications. In addition, if the sum of the number of points input :π‘˜π‘šπ‘–π‘› ,π‘˜π‘šπ‘Žπ‘₯ ,π‘‘π‘Žπ‘‘π‘Ž,π‘™π‘’π‘Žπ‘“ of the node is less than 2 βˆ— ψ then it will be considered output : 𝐿𝐻 as a leaf. Indeed, two cases are possible if this limit is 1 𝐿𝐷 ← π‘‘π‘Žπ‘‘π‘Žπ‘™π‘’π‘Žπ‘“ .𝑖𝑛𝑑𝑒π‘₯𝑒𝑠 not defined. Either the two children will be leaves if their 2 𝐺, π‘˜, 𝐼 ← π‘’π‘ π‘‘π‘–π‘šπ‘Žπ‘‘π‘’πΎ (𝐿𝐷, π‘˜π‘šπ‘–π‘› , π‘˜π‘šπ‘Žπ‘₯ ) number of points (size) is less than ψ. Either the size of one will be greater than ψ and therefore the other is a leaf. 3 for 𝑗 = 1 to 𝑗 = π‘˜ do In the latter case a difference in depth will occur which 4 𝐿𝐻 𝑗 .𝑖𝑛𝑑𝑒π‘₯𝑒𝑠 ← 𝐼 𝑗 brings a certain imbalance of the tree. This condition limits 5 𝐿𝐻 𝑗 .𝐺 ← 𝐺 𝑗 the number of small leaves, limits the depth of the tree 6 𝐿𝐻 𝑗 .π‘šπ‘Žπ‘₯𝑒𝑠 ← π‘šπ‘Žπ‘₯ (𝐿𝐷 𝐼 𝑗 ) and contributes to the balance of the tree. 7 𝐿𝐻 𝑗 .π‘šπ‘–π‘›π‘  ← π‘šπ‘–π‘›(𝐿𝐷 𝐼 𝑗 ) 3.1.3 NODE COMPONENTS. Each node contains a set of hyper-rectangles. We call this set 𝐿𝐻 . 
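The following sketch assembles the elements of 3.1.2 and 3.1.3 into a recursive construction with the two stopping conditions (bounded depth and a minimum of 2ψ points per divisible node). The Node fields and the build function are our illustration, not the actual implementation, and the per-leaf cluster estimation of Algorithm 1 is left out.

```python
import math
import numpy as np
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    indexes: np.ndarray                 # indices of the data points covered by this node
    depth: int
    sd: int = -1                        # splitting dimension (internal nodes only)
    sv: float = 0.0                     # splitting value (internal nodes only)
    lesser: Optional["Node"] = None
    greater: Optional["Node"] = None
    LH: List = field(default_factory=list)   # hyper-rectangles (clusters) attached to a leaf

def build(X, indexes, depth, max_depth, psi):
    """Recursive kd-tree construction with the two stopping conditions of 3.1.2."""
    node = Node(indexes=indexes, depth=depth)
    if depth >= max_depth or len(indexes) < 2 * psi:
        return node                      # leaf: its clusters are estimated by Algorithm 1
    pts = X[indexes]
    mins, maxes = pts.min(axis=0), pts.max(axis=0)
    node.sd = int(np.argmax(maxes - mins))                     # longest dimension
    node.sv = float((maxes[node.sd] + mins[node.sd]) / 2.0)    # midpoint split value
    mask = pts[:, node.sd] <= node.sv
    node.lesser = build(X, indexes[mask], depth + 1, max_depth, psi)
    node.greater = build(X, indexes[~mask], depth + 1, max_depth, psi)
    return node

# Example (the root has depth 1 and the depth is bounded by log(n)):
# root = build(X, np.arange(len(X)), 1, max_depth=int(math.log(len(X))), psi=8)
```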
Each hyper-rectangle is 3.2 NODE MERGING The minimum distance between two hyper-rectangles x and y At the end of node initialization, clusters are obtained in each is calculated as follows: node that are considered as subclusters of the final clusters. These π‘šπ‘–π‘›π» (π‘₯, 𝑦) = ||0βˆ’π‘šπ‘Žπ‘₯ (0, π‘šπ‘Žπ‘₯ (π‘₯ .π‘šπ‘–π‘›π‘ βˆ’π‘¦.π‘šπ‘Žπ‘₯𝑒𝑠, 𝑦.π‘šπ‘–π‘›π‘ βˆ’π‘₯ .π‘šπ‘Žπ‘₯𝑒𝑠))|| latter are located in the root and are considered to be the closest to (8) real clusters. The merging of clusters is necessary to achieve the where the function π‘šπ‘Žπ‘₯ () returns the maximum of the two num- final clusters. In this subsection Algorithms 3 and 2 for merging bers given as parameters. clusters will be presented. But first, we will list the functions and definitions that will be used in these algorithms. Algorithm 2: Local merger 3.2.1 DEFINITIONS. input : 𝐿𝐻 , Ο„, π‘™π‘’π‘›π‘”β„Žπ‘‘, π‘‘β„Ž_π‘’π‘£π‘œ output : 𝐿𝐻 Definition 3.1. Let 𝐿𝐻 be a set of hyper-rectangles, the π‘šπ‘Žπ‘₯𝑒𝑠 (resp. π‘šπ‘–π‘›π‘ ) vector associated with 𝐿𝐻 is the maximum (resp. 1 π‘šπ‘’π‘Ÿπ‘”π‘’π‘‘ ← π‘‡π‘Ÿπ‘’π‘’ minimum) of each dimension of the π‘šπ‘Žπ‘₯𝑒𝑠 (resp. π‘šπ‘–π‘›π‘ ) vectors 2 𝑖𝑑 ← 0 of the hyper-rectangles of 𝐿𝐻 : 3 π‘œπ‘™π‘‘_π‘π‘Žπ‘Ÿπ‘‘_π‘™β„Ž ← π‘π‘Žπ‘Ÿπ‘‘ (𝐿𝐻 ) ( 4 while π‘šπ‘’π‘Ÿπ‘”π‘’π‘‘ = π‘‡π‘Ÿπ‘’π‘’ do π‘šπ‘–π‘›π‘  = π‘šπ‘–π‘›({πΏπ»β„Ž .π‘šπ‘–π‘›π‘  |β„Ž ∈ [1; π‘π‘Žπ‘Ÿπ‘‘ (𝐿𝐻 )]}) 5 π‘₯ ← 𝐿𝐻𝑖𝑑 π‘”π‘’π‘‘π΅π‘œπ‘’π‘›π‘‘π‘  (𝐿𝐻 ) = 6 𝑇 ← {𝑖𝑑 } π‘šπ‘Žπ‘₯𝑒𝑠 = π‘šπ‘Žπ‘₯ ({πΏπ»β„Ž .π‘šπ‘Žπ‘₯𝑒𝑠 |β„Ž ∈ [1; π‘π‘Žπ‘Ÿπ‘‘ (𝐿𝐻 )]) (4) 7 π‘‘π‘šπ‘_𝑖𝑛𝑑𝑒π‘₯𝑒𝑠 ← {0, ..., π‘œπ‘™π‘‘_π‘π‘Žπ‘Ÿπ‘‘_π‘™β„Ž} 8 π‘‘π‘šπ‘_𝑖𝑛𝑑𝑒π‘₯𝑒𝑠 ← π‘‘π‘šπ‘_𝑖𝑛𝑑𝑒π‘₯𝑒𝑠\{𝑖𝑑 } Definition 3.2. Let 𝑃 = {(𝑒, 𝑣)|𝑒, 𝑣 ∈ N ∩ [1; π‘˜] and 𝑒 β‰  𝑣 } 9 for 𝑖 𝑖𝑛 π‘‘π‘šπ‘_𝑖𝑛𝑑𝑒π‘₯𝑒𝑠 do a list of index pairs of hyper-rectangles of a node. We consider 10 𝑦 ← 𝐿𝐻𝑖 (𝑒, 𝑣) = (𝑣, 𝑒) and consequently P contains only one of them. The 11 Δ𝐺 ← ||π‘₯ .𝐺 βˆ’ 𝑦.𝐺 || average of the pairwise distances is calculated by the following function: 12 π‘Ÿπ‘Žπ‘‘π‘–π‘œ ← Δ𝐺/π‘™π‘’π‘›π‘”β„Žπ‘‘ Í 13 if π‘Ÿπ‘Žπ‘‘π‘–π‘œ < Ο„ /πœ– then (𝑒,𝑣) βˆˆπ‘ƒ ||𝐿𝐻𝑒 .𝐺 βˆ’ 𝐿𝐻 𝑣 .𝐺 || π‘Žπ‘π‘‘ (𝐿𝐻 ) = (5) 14 Δ𝐢 ← ||((π‘₯ .π‘šπ‘–π‘›π‘  + π‘₯ .π‘šπ‘Žπ‘₯𝑒𝑠)/2) βˆ’ π‘π‘Žπ‘Ÿπ‘‘ (𝑃) ((𝑦.π‘šπ‘–π‘›π‘  + 𝑦.π‘šπ‘Žπ‘₯𝑒𝑠)/2)|| Lemma 3.3. Given two datasets π‘ˆ ∈ R𝑑 and 𝑃 ∈ R𝑑 . 𝑛 1 and 𝑛 2 15 𝑒 ← πΉπ‘Žπ‘™π‘ π‘’ are respectively the number of points contained in π‘ˆ and 𝑃. 
If the 16 if Δ𝐺 < Δ𝐢 then sum of squared errors of π‘ˆ and 𝑃 are as follow: 17 π‘šπ‘–π‘›π·π‘–π‘ π‘‘π‘Žπ‘›π‘π‘’π‘‹π‘Œ ← π‘šπ‘–π‘›π» (π‘₯, 𝑦) 𝑛1 Γ• 18 if π‘šπ‘–π‘›π·π‘–π‘ π‘‘π‘Žπ‘›π‘π‘’π‘‹π‘Œ = 0 and π‘†π‘†πΈπ‘ˆ = ||𝑒𝑖 βˆ’ 𝑒¯ || 2 Δ𝐺 < Δ𝐢 βˆ— πœ† then 𝑖=1 19 𝑒 ← π‘‡π‘Ÿπ‘’π‘’ 𝑛2 Γ• 20 if 𝑒 = π‘‡π‘Ÿπ‘’π‘’ and π‘’π‘£π‘œ <= π‘‘β„Ž_π‘’π‘£π‘œ and 𝑆𝑆𝐸𝑃 = ||𝑝𝑖 βˆ’ 𝑝¯ || 2 𝑖=1 π‘šπ‘–π‘›π·π‘–π‘ π‘‘π‘Žπ‘›π‘π‘’π‘‹π‘Œ /π‘™π‘’π‘›π‘”β„Žπ‘‘ <= Ο„ /πœ– then then the the sum of squared errors of 𝑉 = π‘ˆ βˆͺ 𝑃 is: 21 𝑒 ← π‘‡π‘Ÿπ‘’π‘’ 22 β€˜if 𝑒 = π‘‡π‘Ÿπ‘’π‘’ then 𝑆𝑆𝐸𝑉 = π‘†π‘†πΈπ‘ˆ + 𝑆𝑆𝐸𝑃 + 𝑛 1 ||π‘ˆΒ― βˆ’ 𝑉¯ || 2 + 𝑛 2 βˆ— ||𝑃¯ βˆ’ 𝑉¯ || 2 (6) 23 𝑇 ←𝑇 βˆͺ𝑖 where 𝑉 can computed as the average of the weighted averages Β― 24 if π‘π‘Žπ‘Ÿπ‘‘ (𝑇 ) > 1 then 𝑛 1π‘ˆΒ― + 𝑛 2 𝑃¯ 25 𝐿𝐻 ← π‘šπ‘’π‘Ÿπ‘”π‘’ (𝑇 ) (7) 𝑛1 + 𝑛2 26 //The new hyper-rectangles are placed at the end of the list Proof. We are trying to calculate the sum of two sse: 27 𝑛𝑒𝑀_π‘π‘Žπ‘Ÿπ‘‘_π‘™β„Ž ← π‘π‘Žπ‘Ÿπ‘‘ (𝐿𝐻 ) 𝑛1 𝑛2 Γ• Γ• 28 if π‘œπ‘™π‘‘_π‘π‘Žπ‘Ÿπ‘‘_π‘™β„Ž < 𝑛𝑒𝑀_π‘π‘Žπ‘Ÿπ‘‘_π‘™β„Ž then 𝑆𝑆𝐸𝑉 = ||π‘ˆπ‘– βˆ’ 𝑉¯ || 2 + ||𝑃𝑖 βˆ’ 𝑉¯ || 2 29 π‘œπ‘™π‘‘_π‘π‘Žπ‘Ÿπ‘‘_π‘™β„Ž ← 𝑛𝑒𝑀_π‘π‘Žπ‘Ÿπ‘‘_π‘™β„Ž 𝑖=1 𝑖=1 | {z } | {z } 30 𝑖𝑑 ← 0 π‘†π‘†πΈπ‘ˆ 𝑉 𝑆𝑆𝐸𝑃𝑉 31 else Let focus on the first term of this equation: 32 if 𝑖𝑑 + 1 < 𝑛𝑒𝑀_π‘π‘Žπ‘Ÿπ‘‘_π‘™β„Ž then Õ𝑛1 𝑛1 Γ• 33 𝑖𝑑 ← 𝑖𝑑 + 1 π‘†π‘†πΈπ‘ˆ 𝑉 = ||π‘ˆπ‘– βˆ’ 𝑉¯ || 2 = ||(π‘ˆπ‘– βˆ’ π‘ˆΒ― ) + (π‘ˆΒ― βˆ’ 𝑉¯ )|| 2 34 else 𝑖=1 𝑖=1 35 π‘šπ‘’π‘Ÿπ‘”π‘’π‘‘ ← πΉπ‘Žπ‘™π‘ π‘’ 𝑛1 Γ• 𝑛1 Γ•  = ||π‘ˆπ‘– βˆ’ π‘ˆΒ― || 2 + 2 (π‘ˆπ‘– βˆ’ π‘ˆΒ― ) (π‘ˆΒ― βˆ’ 𝑉¯ ) + 𝑛 1 ||π‘ˆΒ― βˆ’ 𝑉¯ || 2 𝑖=1 𝑖=1 3.2.2 PROPOSED MERGE ALGORITHMS. It cannot be objec- But: tively confirmed that two clusters naturally correspond to the 𝑛1 Γ• 𝑛1 Γ• 𝑛1 Γ• same cluster (i.e., they must form a single cluster) because this (π‘ˆπ‘– βˆ’ π‘ˆΒ― ) = π‘ˆπ‘– βˆ’ π‘ˆΒ― = 𝑛 1π‘ˆΒ― βˆ’ 𝑛 1π‘ˆΒ― = 0 confirmation implies the use of clustering related concepts such 𝑖=1 𝑖=1 𝑖=1 as close, similar or complementary whose definition is also sub- So π‘†π‘†πΈπ‘ˆ 𝑉 = π‘†π‘†πΈπ‘ˆ + 𝑛 1 ||π‘ˆΒ― βˆ’ 𝑉¯ || 2 . If we repeat the same cal- jective. Consequently, we focus on the detection of clusters that culation for the second term 𝑆𝑆𝐸𝑃𝑉 then we obtain Equation k-means could detect (i.e., clusters close to strict sphericity) but 6. also less spherical clusters as present in real data. So our goal is, first, to capture natural clusters as defined in Section 1. Secondly, Ο„ to identify these clusters even if they overlap with each other. π‘Ÿπ‘Žπ‘‘π‘–π‘œ < (9) πœ– As a result, a natural cluster is a cluster with a certain sphericity π‘Žπ‘π‘‘ (𝐿𝐻 ) | |π‘₯ .πΊβˆ’π‘¦.𝐺 | | and at the same time it has low connectivity with another cluster where Ο„ = | |π‘šπ‘Žπ‘₯π‘’π‘ βˆ’π‘šπ‘–π‘›π‘  | | , π‘Ÿπ‘Žπ‘‘π‘–π‘œ = | |π‘šπ‘Žπ‘₯π‘’π‘ βˆ’π‘šπ‘–π‘›π‘  | | and πœ– ∈ Nβˆ— . if it overlaps with it. By connectivity between two clusters we Criterion 1 is based on the proximity measure. It checks whether mean of how close two clusters are in form. If two clusters of dif- two hyper-rectangles are close enough to be possibly considered ferent shapes overlap considerably, they could be considered as mergeable. It is based on the average of the pairwise distances two separate clusters. 
The distance between clusters alone is not between centroids (π‘Žπ‘π‘‘) to approximate the average distance enough. Note that considering only the distance between clusters between clusters of a node without calculating the distances to conclude that two clusters form a single cluster could lead between all point pairs of all clusters. The pairwise distances to the following situation: if two clusters that do not naturally reduces the bias brought by the very distant centroids from the correspond to the same cluster (i.e., they have low connectivity) majority of other centroids. The values of π‘Ÿπ‘Žπ‘‘π‘–π‘œ and Ο„ of the and whose distance is zero or partially overlapping, could be inequality are normalized between 0 and 1 by the maximum considered as a single cluster. This could perhaps contradict the length of the data space to be compared. The value of πœ– controls reality of the meaning given to the data points. By taking into the boundary between the qualifiers "distant" and "close" when account the distance and connectivity between two clusters, we referring to the distance between clusters of π‘₯ and 𝑦. If this crite- consider that if two clusters are of the same shape and overlap rion is verified then other criteria must be verified to confirm the only partially then they are two different clusters that could not fusion between π‘₯ and 𝑦. Note that the higher the value of πœ– is, the be merged. So it is necessary to take into account the connectivity fewer hyper-rectangles validate the criterion. Moreover, if πœ– = 1, between clusters in addition to distance when deciding to merge then a significant amount of hyper-rectangles will validate the two clusters. criterion, which could lead to further tests and probably lead to We define a set of criteria to evaluate how much two clus- a non fusion result because if two clusters are very distant then ters correspond to the same cluster in order to decide whether there will be no fusion. The value of πœ– is to be adjusted according two clusters can be merged in accordance with the definition of to the strictness level of the "enough close" notion required by natural clusters that we have mentioned. These criteria involve the criterion including this parameter. measures of proximity and connectivity between clusters. The fusion algorithms 3 and 2 incorporate these criteria. criterion 2. Let π‘₯ and 𝑦 two hyper-rectangles, Δ𝐢 the distance Note that merging several clusters signifies merging the asso- between the centers of x and y and Δ𝐺 the distance between the ciated hyper-rectangles. It consists of calculating the weighted centroids of the x and y clusters. If Δ𝐺 < Δ𝐢 then x and y are mean of the means of the point values contained in the hyper- possibly mergeable. rectangles already calculated, concatenating the lists of the point Criterion 2 refines the set of hyper-rectangles from the first indices, calculating the 𝑠𝑠𝑒 based on Equation 6 and to calculate criterion. It estimates how strong the connectivity between two the π‘šπ‘–π‘›π‘  and π‘šπ‘Žπ‘₯𝑒𝑠 vectors according to Equation 4. clusters is. Knowing that a hyper-rectangle corresponds to the Algorithm 2 (⁀local merger) allows us to merge clusters whose smallest rectangular envelope covering all the data points of representative hyper-rectangles are included in the 𝐿𝐻 set. 
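A possible reading of this merging step in code, using an illustrative HyperRectangle record: the merged centroid follows Equation 7, the merged sse Equation 6, the merged bounds Equation 4, and min_h implements Equation 8. Field and function names are ours.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HyperRectangle:
    indexes: list        # indices of the points of the cluster
    G: np.ndarray        # centroid (arithmetic mean of the points)
    sse: float           # sum of squared errors around G
    mins: np.ndarray
    maxes: np.ndarray

def min_h(x, y):
    """Equation 8: minimum distance between two boxes; 0 if they touch or overlap."""
    gap = np.maximum(0.0, np.maximum(x.mins - y.maxes, y.mins - x.maxes))
    return float(np.linalg.norm(gap))

def merge(x, y):
    """Merge two hyper-rectangles without revisiting their points."""
    n1, n2 = len(x.indexes), len(y.indexes)
    g = (n1 * x.G + n2 * y.G) / (n1 + n2)                                   # Equation 7
    sse = (x.sse + y.sse
           + n1 * np.sum((x.G - g) ** 2) + n2 * np.sum((y.G - g) ** 2))     # Equation 6
    return HyperRectangle(indexes=x.indexes + y.indexes, G=g, sse=float(sse),
                          mins=np.minimum(x.mins, y.mins),                  # Equation 4
                          maxes=np.maximum(x.maxes, y.maxes))
```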
𝐿𝐻 a cluster, then it could be ruled that the center of the hyper- hyper-rectangles are associated with a list of ascending ordered rectangle would roughly correspond to the centroid of the cluster numerical indices. Thus the indices of the first and last elements if the points of the cluster are uniformly distributed with a certain of LH are respectively 0 and π‘π‘Žπ‘Ÿπ‘‘ (𝐿𝐻 ) βˆ’ 1. We consider π‘₯ to be sphericity. So the cluster with these characteristics is probably a a temporary variable that runs through the 𝐿𝐻 elements. The al- natural cluster apart. As a result the closer the distance between gorithm process is as follows. First we assign the first element of centroids (Δ𝐺) is to the distance between centers (Δ𝐢) the smaller 𝐿𝐻 to π‘₯. This one is now a hyper-rectangle representing a cluster. the probability of merging π‘₯ and 𝑦. This probability is considered We then check if all the other clusters represented by hyper- null when Δ𝐺 >= Δ𝐢. rectangles 𝑦 are mergeable with the cluster represented by π‘₯. If The first two criteria make it possible to identify relatively π‘₯ is not mergeable to any of them then the next hyper-rectangle close clusters but also to consider that two clusters are distant if in 𝐿𝐻 is assigned to π‘₯. On the other hand, if it is mergeable to at their centroids are also distant even if the two clusters are con- least one hyper-rectangle, then the concerned hyper-rectangles tiguous. The latter could occur if the majority of points surround are merged resulting in a new hyper-rectangle (merging operated the centroid while a minority are far from the centroid. However, by the π‘šπ‘’π‘Ÿπ‘”π‘’ function). In this case the first item of freshly recon- these criteria are based only on the distances applied to centroids stituted 𝐿𝐻 is reassigned to π‘₯. This procedure is repeated until π‘₯ and centers. Two clusters could be considered as close, but one is the second last element of 𝐿𝐻 and there are no more mergers. or more clusters could be located between them. In this case During the fusion test between two hyper-rectangles π‘₯ and 𝑦, they cannot be merged directly. They could only be if at least several criteria, formed from inequalities and equations, must be one cluster between them merges with them. The following two verified. The first two criteria only select the hyper-rectangles criteria require stricter distances and stronger connectivity to candidates 𝑦 for merger and at the same time they reduce the avoid directly merging two nearby clusters separated by other number of hyper-rectangles to be processed in the other two clusters. criteria. While the last two decide whether to merge π‘₯ and 𝑦 or criterion 3. Let be π‘₯ and 𝑦 two hyper-rectangles. We merge not. π‘₯ and 𝑦 when Criteria 1 and 2 are verified and if π‘šπ‘–π‘›π» (π‘₯, 𝑦) = 0 criterion 1. Let π‘šπ‘–π‘›π‘  and π‘šπ‘Žπ‘₯𝑒𝑠 the limits of a defined data and Δ𝐺 < Δ𝐢 βˆ— πœ† with πœ† ∈]0; 1] space ∈ R𝑑 and 𝐿𝐻 the list of hyper-retangles belonging to this Criterion 3 only deals with contiguous or overlapping hyper- space then two clusters represented by the hyper-rectangles π‘₯ ∈ 𝐿𝐻 rectangles. Indeed π‘šπ‘–π‘›π» (π‘₯, 𝑦) returns 0 if the two hyper-rectangles and 𝑦 ∈ 𝐿𝐻 are possibly mergeable if: overlap or if their distance is zero. Also, a connectivity constraint is added. It results in a constraint on the proximity between cen- children are closer to each other than to the other child’s clusters. 
troids: for two hyper-rectangles π‘₯ and 𝑦 the distance between So we process the clusters of a child node locally but in the data centroids must be less than Δ𝐢 βˆ—πœ†. The smaller πœ†, the more the no- space of the parent node. Hence the proposal of the global merger tion of "close enough" between clusters to the point of matching algorithm. It performs the merging of clusters node by node the same cluster is strict. using the local merger algorithm. It starts first with the children of the "internal" node and then concatenates the hyper-rectangles criterion 4. Let π‘₯ and 𝑦 two hyper-rectangles and πœ– ∈ Nβˆ— . We of the both children to form the list of hyper-rectangles of the merge π‘₯ and 𝑦 when Criteria 1 and 2 are verified and if π‘’π‘£π‘œ <= internal node. Finally, local merger is applied to internal. Before π‘‘β„Ž_π‘’π‘£π‘œ and π‘šπ‘–π‘›π» (π‘₯, 𝑦) normalized is less than Ο„/πœ– with π‘’π‘£π‘œ the the children node is processed, the π‘Žπ‘π‘‘ function is normalized sum of the squared errors (𝑠𝑠𝑒) of the π‘₯ and 𝑦 data points union. by the maximum length of the parent node hyper-rectangle. This Unlike Criterion 3, to validate Criterion 4 hyper-rectangles normalization allows to have the same distance ratio in the three are not necessary to be contiguous but a maximum distance is nodes (π‘–π‘›π‘‘π‘’π‘Ÿπ‘›π‘Žπ‘™,π‘”π‘Ÿπ‘’π‘Žπ‘‘π‘’π‘Ÿ and 𝑙𝑒𝑠𝑠). On the other hand, the value required. The normalized distance π‘šπ‘–π‘›π» (π‘₯, 𝑦) must be less than of π‘Žπ‘π‘‘ changes from one node to another because it depends Ο„/πœ–. It may be that two clusters are not contiguous but almost only on the clusters of the concerned node. Consequently, if when the distance is as small as it does not allow a cluster to two clusters have not been merged into one of the two children interpose itself between two clusters concerned by the calcu- nodes, they could be merged into the parent node. Note that the lations of the criterion. In addition to this, the criterion adds a processing of the children nodes can even be parallelized because requirement on the compactness of π‘₯ and 𝑦 clusters. To do this, they are independent. it uses the homogeneity measure called sum of square error (𝑠𝑠𝑒). In this criterion we have the choice to limit the quantity of 𝑠𝑠𝑒 for 3.3 CLUSTER REGULARIZATION possible mergers. Indeed, two clusters are merged only if the 𝑠𝑠𝑒 of their data is less than the limit is π‘‘β„Ž_π‘’π‘£π‘œ. Its value is entered by This subsection discusses the strategy we have defined to identify the user. However, π‘‘β„Ž_π‘’π‘£π‘œ is adapted according to the value of Ξ”π‘˜ clusters that should not be as such but rather be part of another provided Equation 10. The latter evaluates the difference between cluster. The algorithms of the previous step could produce small the value of π‘˜π‘šπ‘Žπ‘₯ and the total sum of the values of k specific to (in number of points) but compact clusters. A small cluster is Í each leaf ( π‘˜ Ο‘ ). The purpose of this adaptation is to approach located in three cases: either it is a natural cluster different from the estimated value k on the entire dataset at π‘˜π‘šπ‘Žπ‘₯ and avoid an the others, or it is part of an agglomeration of small clusters overestimation of π‘˜ compared to π‘˜π‘šπ‘Žπ‘₯ . If π›Ώπ‘˜ >= 1 it means π‘˜ Ο‘ Í that represent a single cluster, or it is part of a large cluster. The is at least equal to π‘˜π‘šπ‘Žπ‘₯ and therefore π‘‘β„Ž_π‘’π‘£π‘œ will be unchanged. 
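Criteria 1 to 4 can be summarized as a single merge test, sketched below for two clusters x and y exposing the fields of the earlier HyperRectangle sketch. The defaults for ε and λ follow the values reported in Section 4.1; how the merged sse (evo) is normalized before being compared with th_evo is not fully specified in the text, so the raw value from Equation 6 is used here, which is an assumption.

```python
import numpy as np

def merge_decision(x, y, tau, length, th_evo, eps1=2.0, eps4=10.0, lam=0.75):
    """Sketch of the merge test between two clusters x and y (Criteria 1-4).
    tau is apd(LH)/length for the current node, length the norm of the node's bounds."""
    dG = np.linalg.norm(x.G - y.G)
    # Criterion 1: the two centroids are close relative to the node's data space.
    if dG / length >= tau / eps1:
        return False
    # Criterion 2: centroid distance below the distance between the box centers.
    dC = np.linalg.norm((x.mins + x.maxes) / 2.0 - (y.mins + y.maxes) / 2.0)
    if dG >= dC:
        return False
    gap = np.maximum(0.0, np.maximum(x.mins - y.maxes, y.mins - x.maxes))
    mh = np.linalg.norm(gap)
    # Criterion 3: contiguous or overlapping boxes with a strict centroid constraint.
    if mh == 0.0 and dG < lam * dC:
        return True
    # Criterion 4: nearly contiguous boxes, provided the merged cluster stays compact.
    n1, n2 = len(x.indexes), len(y.indexes)
    g = (n1 * x.G + n2 * y.G) / (n1 + n2)
    evo = x.sse + y.sse + n1 * np.sum((x.G - g) ** 2) + n2 * np.sum((y.G - g) ** 2)
    return evo <= th_evo and (mh / length) <= tau / eps4
```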
definition of a small cluster has no objective purpose. We consider Í Otherwise, if π‘˜π‘šπ‘Žπ‘₯ > π‘˜ Ο‘ then π‘‘β„Ž_π‘’π‘£π‘œ is recalculated according the definition of the small cluster resulting from the work [1] as √ Equation 11. a cluster with a number of points less than 𝑛 with 𝑛 the size of π‘˜π‘šπ‘Žπ‘₯ the cluster. π›Ώπ‘˜ = 1 βˆ’ Í (10) We have developed an algorithm that makes it possible to π‘˜Ο‘ ( match a small cluster to one of the three cases. The goal of the π‘‘β„Ž_π‘’π‘£π‘œ, if π›Ώπ‘˜ >= 1 algorithm is to get as many natural clusters as possible. Knowing π‘‘β„Ž_π‘’π‘£π‘œ = (11) that the final result of our solution is in the root of the tree π‘‘β„Ž_π‘’π‘£π‘œ + π›Ώπ‘˜, otherwise then this algorithm is unrolled only in this one. The algorithm consists of two parts that are executed consecutively once. The Algorithm 3: Global merger first identifies the small clusters and then merges those that are input :π‘–π‘›π‘‘π‘’π‘Ÿπ‘›π‘Žπ‘™,π‘‘β„Ž_π‘’π‘£π‘œ close to each other. The second re-identifies small clusters from output : 𝐿𝐻 the new list of small clusters from the first part and affects those Ð that are close to large clusters. If a small cluster has not undergone 1 𝐿𝐻 ← π‘–π‘›π‘‘π‘’π‘Ÿπ‘›π‘Žπ‘™ .𝑙𝑒𝑠𝑠.𝐿𝐻 π‘–π‘›π‘‘π‘’π‘Ÿπ‘›π‘Žπ‘™ .π‘”π‘Ÿπ‘’π‘Žπ‘‘π‘’π‘Ÿ .𝐿𝐻 a merger operation in both parts then it will be considered as a 2 π‘šπ‘–π‘›π‘ , π‘šπ‘Žπ‘₯𝑒𝑠 ← π‘”π‘’π‘‘π΅π‘œπ‘’π‘›π‘‘π‘  (𝐿𝐻 ) separate natural cluster. 3 π‘™π‘’π‘›π‘”β„Žπ‘‘ ← ||π‘šπ‘Žπ‘₯𝑒𝑠 βˆ’ π‘šπ‘–π‘›π‘  || One of the reasons for the presence of small clusters is due to 4 if card(internal.greater.LH)>1 then Criteria 3 and 4 where the limit of the minimum distance between 5 Ο„ ← π‘Žπ‘π‘‘ (π‘–π‘›π‘‘π‘’π‘Ÿπ‘›π‘Žπ‘™ .π‘”π‘Ÿπ‘’π‘Žπ‘‘π‘’π‘Ÿ .𝐿𝐻 )/π‘™π‘’π‘›π‘”β„Žπ‘‘ hyper-rectangles is very strict. Indeed, if the minimum distance 6 π‘–π‘›π‘‘π‘’π‘Ÿπ‘›π‘Žπ‘™ .π‘”π‘Ÿπ‘’π‘Žπ‘‘π‘’π‘Ÿ .𝐿𝐻 ← between two hyper-rectangles, at least one of which is a small π‘™π‘œπ‘π‘Žπ‘™π‘€π‘’π‘Ÿπ‘”π‘’π‘Ÿ (π‘–π‘›π‘‘π‘’π‘Ÿπ‘›π‘Žπ‘™ .π‘”π‘Ÿπ‘’π‘Žπ‘‘π‘’π‘Ÿ .𝐿𝐻, Ο„ , π‘™π‘’π‘›π‘”β„Žπ‘‘, π‘‘β„Ž_π‘’π‘£π‘œ) cluster, is greater than this limit, then they cannot be merged. This limit is respectively equal to 0 in Criterion 3 and Ο„ /πœ– in 7 if card(internal.less.LH)>1 then Criterion 4. The Ο„ /πœ– limit is more flexible than the first one, it 8 Ο„ ← π‘Žπ‘π‘‘ (π‘–π‘›π‘‘π‘’π‘Ÿπ‘›π‘Žπ‘™ .𝑙𝑒𝑠𝑠.𝐿𝐻 )/π‘™π‘’π‘›π‘”β„Žπ‘‘ was used in both parts of the algorithm to define the boundary 9 π‘–π‘›π‘‘π‘’π‘Ÿπ‘›π‘Žπ‘™ .𝑙𝑒𝑠𝑠.𝐿𝐻 ← between "close" and "distant" regarding the distance between π‘™π‘œπ‘π‘Žπ‘™π‘€π‘’π‘Ÿπ‘”π‘’π‘Ÿ (π‘–π‘›π‘‘π‘’π‘Ÿπ‘›π‘Žπ‘™ .𝑙𝑒𝑠𝑠.𝐿𝐻, Ο„ , π‘™π‘’π‘›π‘”β„Žπ‘‘, π‘‘β„Ž_π‘’π‘£π‘œ) hyper-rectangles. Ð 10 𝐿𝐻 ← π‘–π‘›π‘‘π‘’π‘Ÿπ‘›π‘Žπ‘™ .𝑙𝑒𝑠𝑠.𝐿𝐻 π‘–π‘›π‘‘π‘’π‘Ÿπ‘›π‘Žπ‘™ .π‘”π‘Ÿπ‘’π‘Žπ‘‘π‘’π‘Ÿ .𝐿𝐻 11 Ο„ ← π‘Žπ‘π‘‘ (𝐿𝐻 )/π‘™π‘’π‘›π‘”β„Žπ‘‘ 4 EXPERIMENTAL ASSESSMENTS 12 𝐿𝐻 ← π‘™π‘œπ‘π‘Žπ‘™π‘€π‘’π‘Ÿπ‘”π‘’π‘Ÿ (𝐿𝐻, Ο„ , π‘™π‘’π‘›π‘”β„Žπ‘‘, π‘‘β„Ž_π‘’π‘£π‘œ) In this section, we carried out experiments to study the perfor- mance of our solution, to demonstrate its effectiveness compared The global merger fusion algorithm can be seen as a layer to the algorithms listed in Section 2(x-means and g-means) and that envelops the algorithm local merger. 
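Read this way, Algorithm 3 is a post-order traversal wrapped around Algorithm 2. The sketch below assumes a local_merger(LH, tau, length, th_evo) function implementing Algorithm 2 and nodes exposing lesser/greater children and an LH list; get_bounds and apd follow Definition 3.1 and Equation 5.

```python
import numpy as np
from itertools import combinations

def get_bounds(LH):
    """Definition 3.1: bounds of a set of hyper-rectangles."""
    return (np.min([h.mins for h in LH], axis=0),
            np.max([h.maxes for h in LH], axis=0))

def apd(LH):
    """Equation 5: average pairwise distance between the centroids of LH."""
    pairs = list(combinations(range(len(LH)), 2))
    return sum(np.linalg.norm(LH[u].G - LH[v].G) for u, v in pairs) / len(pairs)

def global_merger(node, th_evo, local_merger):
    """Post-order sketch of Algorithm 3: merge inside each child, concatenate the
    children's cluster lists, then merge again in the parent's data space."""
    if node.lesser is None and node.greater is None:
        return node.LH                       # leaf: clusters come from Algorithm 1
    less_lh = global_merger(node.lesser, th_evo, local_merger)
    greater_lh = global_merger(node.greater, th_evo, local_merger)
    mins, maxes = get_bounds(less_lh + greater_lh)
    length = float(np.linalg.norm(maxes - mins))   # parent-space normalization length
    if len(greater_lh) > 1:
        greater_lh = local_merger(greater_lh, apd(greater_lh) / length, length, th_evo)
    if len(less_lh) > 1:
        less_lh = local_merger(less_lh, apd(less_lh) / length, length, th_evo)
    LH = less_lh + greater_lh
    node.LH = local_merger(LH, apd(LH) / length, length, th_evo)
    return node.LH
```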
Intuitively, for a given to give recommendations on the values that the parameters of internal and parent node, a good part of the clusters of one of its our solution could take (the minimum size of an internal node ψ and limit π‘‘β„Ž_π‘’π‘£π‘œ of Equation 11). In the experiments, real data β€’ clusters do not have the same number of points. The car- and synthetic data were used. dinalities are chosen at random so that their sum is equal to the total number of points 𝑛 defined previously. 4.1 EXPERIMENTAL PROTOCOL 4.1.2 REAL DATA. The real data comes from three known Our solution and the compared algorithms were implemented sources: Openml, UCI and Kaggle. They are all of a numerical in python 3.6. All of them use the Elkan version of k-means [6]. type. These data are used in other research projects to perform It corresponds to an exact version of standard k-means but is benchmarking machine learning methods [20]. The values of real suitable for massive data in execution time. Moreover, k-means++ π‘˜, 𝑛 and 𝑑 are given in Table 2. The data sets used are diverse and [2] is used for initializing the centroids before applying Elkan they can be organized into several groups: algorithm. β€’ black and white or grayscale images; clustering is ap- Different values are assigned to parameters π‘˜,𝑑, 𝑛, π‘‘β„Ž_π‘’π‘£π‘œ and plied in this case directly to the pixels. πΈπ‘šπ‘›π‘–π‘ π‘‘ and πΉπ‘Žπ‘ β„Žπ‘–π‘œπ‘›βˆ’ ψ to study the sensitivity of our algorithm to these parameters. π‘šπ‘›π‘–π‘ π‘‘ contain images respectfully on the first 10 digits and All combinations of the values of these parameters have been clothing and shoes. Each image was flattened to form a tested: single vector. And each vector has been assigned its corre- β€’ π‘˜ ∈ [10, 32, 80] sponding class (label); β€’ 𝑑 ∈ [4, 6, 8] β€’ 4-band multispectral images; they characterize differ- β€’ 𝑛 ∈ [1 Γ— 106, 2 Γ— 106, 4 Γ— 106 ] ent types of soil. The corresponding dataset is π‘ π‘Žπ‘‘π‘–π‘šπ‘Žπ‘”π‘’; β€’ π‘‘β„Ž_π‘’π‘£π‘œ ∈ [0.05, 0.10, 0.15, 0.20, 0.25, 0.30] β€’ sounds; the π½π‘Žπ‘π‘Žπ‘›π‘’π‘ π‘’π‘‰ π‘œπ‘€π‘’π‘™π‘  dataset is a set of digitized β€’ ψ ∈ [4, 8, 10, 12] Japanese vowel sounds. Each sound is associated with a speaker; An experiment is structured as follows: β€’ historization of people’s activities; the π‘™π‘‘π‘π‘Ž dataset is β€’ first of all a combination of the above-mentioned parame- a collection of the positions of sensors present on people ters is defined. Parameter π‘˜ represents the real number of in order to identify the movement they perform at a given clusters in a dataset. It should be noted that parameters π‘˜, time. The π‘€π‘Žπ‘™π‘˜π‘–π‘›π‘” dataset is a set of people’s activities 𝑑 and 𝑛 above refer only to synthetic data. The values of designed to determine the authors; these parameters are defined in the characteristics of the β€’ images represented by characteristics extracted from real datasets, the object to be identified; for example, in the πΉπ‘œπ‘’π‘Ÿπ‘–π‘’π‘Ÿ β€’ then follows the data generation in the case of synthetic dataset each record is a set of coefficients of the charac- data, otherwise the real data is loaded, ter (one of the first 10 digits) shapes; similarly in π‘π‘’π‘Ÿπ‘›π‘–π‘˜π‘’ β€’ finally, the estimation of π‘˜ is performed by the three algo- each instance of a digit is of rotation invariant Zernike rithms. 
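A rough sketch of the synthetic generator described in 4.1.1: gaussian clusters with random means and random semi positive-definite covariances, and random cardinalities summing to n. The value ranges and the absence of the overlap-replacement rule are our simplifications.

```python
import numpy as np

def random_gaussian_clusters(k, d, n, seed=0):
    """Each cluster follows a normal distribution with a random mean and a random
    semi positive-definite covariance (A @ A.T); cluster cardinalities are drawn
    at random so that they sum to n. Overlap handling is not reproduced here."""
    rng = np.random.default_rng(seed)
    sizes = rng.multinomial(n, np.ones(k) / k)
    points, labels = [], []
    for j, size in enumerate(sizes):
        mean = rng.uniform(-10.0, 10.0, size=d)   # assumed range for the cluster means
        A = rng.normal(size=(d, d))
        cov = A @ A.T                             # semi positive-definite covariance
        points.append(rng.multivariate_normal(mean, cov, size=size))
        labels.append(np.full(size, j))
    return np.vstack(points), np.concatenate(labels)

# Example: X, y = random_gaussian_clusters(k=10, d=4, n=1_000_000)
```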
moments of the digit (between 0 and 9); in 𝑃𝑒𝑛𝑑𝑖𝑔𝑖𝑑𝑠 each For each combination of these parameters the experiment is individual is a positions sequence characterizing a digit; conducted five times. in πΏπ‘’π‘‘π‘‘π‘’π‘Ÿ each instance of a letter of the English alpha- We set the values of πœ– and πœ† in the different criteria as follows: bet contains statistical moments and edge counts of the given letter; in 𝑉 π‘’β„Žπ‘–π‘π‘™π‘’ dataset a data point is just a set of β€’ πœ– has been defined as 2 and 10 respectively in Criteria 1 geometric characteristics of a given vehicle. and 4. The value 2 is the least strict while 10 is the most strict; 4.1.3 METRICS FOR EVALUATING EXPERIMENTAL RESULTS. β€’ in the cluster regularization algorithm, πœ– = 8 in the first To evaluate the results of the algorithms we are comparing, three part of the algorithm. If there are still small clusters left, metrics were used. First, the execution time (Δ𝑑) of the algorithm, then the second part is executed. In this case πœ– = 4; the relative difference (Ξ”π‘˜) between the actual value of k and the β€’ in Criterion 3, πœ† = 0.75. As a result, Δ𝐺 must be less than estimated value of k and finally the distance between the actual 75% of Δ𝐢. partitioning and the partitioning proposed by the algorithm. The relative difference is calculated as follows: We allow our algorithm and x-means to search up to π‘˜π‘šπ‘Žπ‘₯ = |π‘˜ βˆ’ π‘˜π‘’π‘ π‘‘π‘–π‘šπ‘Žπ‘‘π‘’π‘‘ | 5π‘˜ centroids. Ξ”π‘˜ = π‘Ÿπ‘’π‘Žπ‘™ (12) π‘˜π‘Ÿπ‘’π‘Žπ‘™ 4.1.1 SYNTHETIC DATA. In order to get closer to the real The variation of the information (𝑣𝑖) [15] was used as a dis- data, natural and overlapping clusters were generated. These tance between two partitionings. It measures the amount of infor- synthetic data have the following properties: mation gained and lost for a dataset passing from a partition A to β€’ a cluster is a set of data that follows a normal distribution. another partition B. It respects the three properties of triangular To generate this set, a semi positive covariance matrix and inequality. So the smaller the 𝑣𝑖 is, the closer the partitionings an average (a vector) are generated at random. Indeed, the are to each other. values of the covariance matrix allow to define a cluster shape that is not isotropic (strictly spherical) and that can 4.2 RUNTIME ANALYSIS have different variances separately on an arbitrary basis of If we consider the performance of the three algorithms according directions and not necessarily on those of the dimensions; to the metric Δ𝑑, our algorithm takes the least time to provide β€’ there are at least two overlapping clusters; a clustering result. This is true in both synthetic and real data β€’ if the centroid of one cluster is in the hyperectangle of and regardless of the value of π‘˜, 𝑑 and 𝑛. Our algorithm is 5.5 to another cluster then we consider that the maximum degree 12.9 times faster than g-means and 15.4 to 75.2 times faster than of overlap has been exceeded. So as a result at least one of x-means in synthetic data, respectively. 
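For reference, the two evaluation measures can be computed as follows: delta_k is Equation 12, and variation_of_information follows the definition of [15] as H(A) + H(B) − 2I(A;B). The implementation is ours.

```python
import numpy as np

def delta_k(k_real, k_estimated):
    """Relative difference between real and estimated number of clusters (Equation 12)."""
    return abs(k_real - k_estimated) / k_real

def variation_of_information(labels_a, labels_b):
    """Variation of information between two partitionings: H(A) + H(B) - 2*I(A;B) [15]."""
    a = np.unique(labels_a, return_inverse=True)[1]
    b = np.unique(labels_b, return_inverse=True)[1]
    n = len(a)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)          # contingency counts of the two labelings
    joint /= n
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    h_a = -np.sum(pa[pa > 0] * np.log(pa[pa > 0]))
    h_b = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    return h_a + h_b - 2.0 * mi
```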
The same trend occurs the clusters concerned is replaced by a new cluster; in real data, but from 16.3 to 1566.6 times compared to g-means KD-means G-means X-means k d n k(Ξ”π‘˜ ) vi Δ𝑑 k(Ξ”π‘˜ ) vi Δ𝑑 k(Ξ”π‘˜ ) vi Δ𝑑 10 4 1 Γ— 106 8.4(0.17) 0.4 39.7 74.0(6.4) 2.2 344.3(8.7) 45.7(3.57) 2.5 816.2(20.6) 2 Γ— 106 8.3(0.21) 0.5 65.4 91.0(8.1) 2.4 774.1(11.8) 45.7(3.57) 2.5 1420.5(21.7) 4 Γ— 106 8.3(0.18) 0.5 117.4 87.0(7.7) 2.3 1517.3(12.9) 44.6(3.46) 2.4 2203.6(18.8) 6 1 Γ— 106 8.6(0.15) 0.4 41.9 69.2(5.92) 2.2 396.4(9.5) 46.0(3.6) 2.3 1101.5(26.3) 2 Γ— 106 8.1(0.2) 0.5 72.9 89.0(7.9) 2.5 926.9(12.7) 46.1(3.61) 2.3 1869.9(25.7) 4 Γ— 106 8.0(0.23) 0.5 133.7 79.1(6.91) 2.2 1628.0(12.2) 45.9(3.59) 2.3 3061.4(22.9) 8 1 Γ— 106 7.4(0.39) 0.8 45.9 63.2(5.32) 2.6 386.0(8.4) 45.9(3.59) 1.5 789.4(17.2) 2 Γ— 106 6.9(0.37) 0.8 81.2 71.3(6.13) 2.8 823.1(10.1) 45.5(3.55) 1.6 1253.5(15.4) 4 Γ— 106 6.0(0.41) 0.9 153.6 78.9(6.89) 2.8 1727.2(11.2) 46.2(3.62) 1.6 2378.8(15.5) 32 4 1 Γ— 106 35.4(0.38) 0.9 77.6 99.3(2.1) 1.9 535.6(6.9) 148.4(3.64) 2.9 2817.2(36.3) 2 Γ— 106 31.4(0.35) 0.8 132.5 114.1(2.57) 2.0 1152.8(8.7) 147.6(3.61) 2.9 4831.0(36.5) 4 Γ— 106 27.3(0.36) 1.0 233.7 132.1(3.13) 2.1 2468.4(10.6) 145.2(3.54) 2.9 7628.1(32.6) 6 1 Γ— 106 40.2(0.47) 0.9 79.2 101.0(2.16) 1.8 586.7(7.4) 150.2(3.69) 2.7 3689.9(46.6) 2 Γ— 106 34.6(0.44) 0.9 146.9 122.5(2.83) 2.1 1375.6(9.4) 149.1(3.66) 2.7 7880.7(53.7) 4 Γ— 106 29.1(0.44) 1.0 260.2 107.9(2.37) 1.7 2485.1(9.5) 135.9(3.25) 2.7 7901.8(30.4) 8 1 Γ— 106 35.7(0.36) 1.0 82.2 115.2(2.6) 2.2 665.7(8.1) 149.0(3.66) 1.9 2491.6(30.3) 2 Γ— 106 37.2(0.42) 1.0 154.9 126.8(2.96) 2.3 1383.3(8.9) 148.1(3.63) 1.8 3740.0(24.2) 4 Γ— 106 36.4(0.46) 0.9 299.5 136.5(3.27) 2.6 2883.7(9.6) 133.8(3.18) 1.8 5037.7(16.8) 80 4 1 Γ— 106 90.5(0.24) 0.9 141.5 144.1(0.8) 1.5 781.9(5.5) 368.2(3.6) 3.3 7807.2(55.2) 2 Γ— 106 88.3(0.29) 1.0 283.2 170.4(1.13) 1.8 1673.6(5.9) 373.1(3.66) 3.3 14147.0(50.0) 4 Γ— 106 83.0(0.3) 1.2 570.8 186.9(1.34) 2.0 3417.8(6.0) 366.6(3.58) 3.4 21887.1(38.3) 6 1 Γ— 106 96.2(0.37) 0.9 148.3 167.0(1.09) 1.6 893.0(6.0) 376.1(3.7) 3.2 10385.8(70.1) 2 Γ— 106 97.4(0.39) 0.9 285.4 186.2(1.33) 1.8 1892.4(6.6) 375.0(3.69) 3.2 22403.6(78.5) 4 Γ— 106 92.8(0.4) 1.1 578.0 200.4(1.51) 1.7 4039.2(7.0) 363.3(3.54) 2.9 37716.6(65.3) 8 1 Γ— 106 57.5(0.28) 1.2 151.5 207.8(1.6) 2.0 1016.0(6.7) 377.2(3.71) 2.2 7834.6(51.7) 2 Γ— 106 98.6(0.43) 0.9 308.2 150.2(0.88) 1.4 1708.6(5.5) 373.6(3.67) 3.2 21824.6(70.8) 4 Γ— 106 99.2(0.47) 1.0 626.4 166.9(1.09) 1.5 3581.8(5.7) 377.7(3.72) 3.2 47130.8(75.2) Table 1: Comparison of the three algorithms (our KD-means algorithm, g-means and x-means) executed on synthetic data. Note that Δ𝑑 is expressed in seconds. In the Δ𝑑 sub-column of the g-means and x-means columns, the number in brackets represents how many times our algorithm is faster than the compared algorithm. 
KD G-means X-means dataname k d n k(Ξ”π‘˜ ) vi Δ𝑑 k(Ξ”π‘˜ ) vi Δ𝑑 k(Ξ”π‘˜ ) vi Δ𝑑 vehicle 4 18 1 Γ— 106 3.0(0.28) 2.9 38.6 2226(555.5) 11.8 60530.8(1566.6) 16.0(3.0) 5.3 281.5(7.3) satimage 6 36 1 Γ— 106 28.9(3.81) 2.7 54.5 1177(195.17) 8.8 34595.7(635.3) 28.7(3.78) 3.8 1151.0(21.1) japaneseVowels 9 14 1 Γ— 106 9.4(0.83) 4.5 75.6 3377(374.22) 12.5 55220.1(730.5) 32.0(2.56) 7.7 534.8(7.1) pendigits 10 16 1 Γ— 106 29.1(1.91) 3.6 31.0 2574(256.4) 9.1 40806.3(1316.9) 32.0(2.2) 4.0 759.2(24.5) fourier 10 76 1 Γ— 106 39.2(2.92) 3.0 91.6 1309(129.9) 8.0 38210.3(417.0) 57.3(4.73) 3.9 2791.2(30.5) fashion-mnist 10 784 70000 10.9(0.11) 3.3 58.6 533(52.3) 7.1 954.1(16.3) 32.0(2.2) 4.0 697.0(11.9) ldpa 11 5 164860 32.5(1.96) 6.3 12.5 1766(159.55) 10.8 5996.6(480.7) 47.2(3.3) 7.1 132.3(10.6) walking 22 4 149332 39.6(1.35) 6.7 14.7 1257(56.14) 10.0 3254.1(220.8) 100.7(3.58) 8.4 93.7(6.4) letter 26 16 999999 59.7(1.3) 7.2 57.8 1581(59.81) 11.1 22502.6(389.5) 117.0(3.5) 8.6 6272.4(108.6) zernike 47 47 1 Γ— 106 86.9(1.08) 5.3 118.1 1157(23.62) 9.0 35710.8(302.3) 128.0(1.72) 6.4 5583.1(47.3) emnist 62 784 697932 45.6(0.26) 6.0 948.6 3073(48.56) 8.6 46224.8(48.7) 256.0(3.13) 6.7 36029.7(38.0) Table 2: Comparison of the three algorithms executed on real data. and from 6.4 to 108.6 times compared to x-means. Maximum presence of many gaussian clusters that are well separated from execution times were reached for all three algorithms at 𝑑 = 8, each other. KD-means has a better estimate of good quality than 𝑛 = and π‘˜ = 80 in the synthetic data. In this combination, the the others. In terms of 𝑣𝑖, it is on average 2.46 and 3.05 better elapsed execution times are 10.43 minutes for our algorithm, than g-means and x-means. Same performance in Ξ”π‘˜, better on 59.41 minutes for g-means and 13h05 for x-means respectively. average by 2.83 (g-means) and 4.29 (x-means). It could be pointed In the real data, the maximums are reached in the configuration out that g-means is better than x-means in terms of quality (𝑣𝑖) 𝑑 = 784, 𝑛 = 697932 and π‘˜ = 62 for x-means and KD-means. That in several combinations of (π‘˜, 𝑑, 𝑛). is 15.48 minutes for KD-means and 10 hours for x-means. In the In real data, the KD maxima do not exceed Ξ”π‘˜ = 3.81 and case of g-means, the maximum is even higher than that of the 𝑣𝑖 = 7.2. They are higher for x-means and g-means, they reach others, it is reached at 16.48 hours (π‘˜ = 4, 𝑑 = 18, 𝑛 = 1 Γ— 106 ). respectively (Ξ”π‘˜ = 4.73, 𝑣𝑖 = 8.6) and (Ξ”π‘˜ = 555.5, 𝑣𝑖 = 12.5). The values of Ξ”π‘˜ and 𝑣𝑖 are higher in real data than in synthetic data. 4.3 CLUSTER QUALITY ANALYSIS (Ξ”π‘˜/𝑣𝑖) Indeed, the distributions of clusters in real data are more complex than synthetic clusters (gaussian). As a result, these complex rep- KD-means is better than other algorithms in estimating k with resentations of clusters are difficult to capture completely point good clustering quality in synthetic and real data. by point. This increase is much higher for g-means because it In the synthetic data, KD-means does not exceed 0.47 in Ξ”π‘˜ tends to identify gaussian clusters in particular. However, clus- and 1.2 in 𝑣𝑖. X-means even reaches Ξ”π‘˜ = 3.72 and 𝑣𝑖 = 3.4. The ters in real data are not necessarily gaussian because they are same for g-means with Ξ”π‘˜ = 8.1 and 𝑣𝑖 = 2.8. The decrease in not constituted by gaussian distribution generators. As a result, Ξ”π‘˜ g-means when π‘˜ increased in the synthetic data is due to the Synthetic data Real data for scaling up and overlapping clusters. 
4.3.1 PARAMETER SENSITIVITY ANALYSIS. We analyze the three metrics (Δt, vi and Δk) according to th_evo and ψ, in both data contexts (real and synthetic).

th_evo | Synthetic data: Δk, vi, Δt | Real data: Δk, vi, Δt
0.05   | 0.3, 0.8, 213.2            | 1.7, 4.9, 132.5
0.10   | 0.3, 0.8, 203.1            | 1.6, 4.8, 144.5
0.15   | 0.3, 0.8, 198.2            | 1.5, 4.7, 134.0
0.20   | 0.3, 0.8, 193.5            | 1.4, 4.7, 141.1
0.25   | 0.3, 0.9, 188.6            | 1.2, 4.5, 136.5
0.30   | 0.4, 1.0, 183.8            | 1.3, 4.6, 130.5
Table 3: Performance of our algorithm as a function of the value of th_evo.

ψ  | Synthetic data: Δk, vi, Δt | Real data: Δk, vi, Δt
4  | 0.5, 1.1, 195.0            | 1.3, 4.2, 131.1
8  | 0.3, 0.8, 201.5            | 1.2, 4.4, 141.3
10 | 0.3, 0.7, 200.0            | 1.4, 4.8, 136.1
12 | 0.3, 0.8, 190.5            | 1.9, 5.3, 137.5
Table 4: Performance of our algorithm as a function of the value of ψ.

In Tables 3 and 4, th_evo and ψ do not have a particular influence on the execution time of KD-means: the range (difference between the maximum and minimum values) of Δt does not exceed 10.2 seconds with respect to ψ, and is about 30 seconds with respect to th_evo. Given the Δt magnitude of the three algorithms, these differences are negligible.

In the synthetic data, concerning ψ, the values of Δk are identical except for ψ = 4, where the value is slightly higher than the others by 0.2. The values of vi have a maximum difference of 0.4; this difference drops to 0.1 when ψ ∈ {8, 10, 12}, for which vi is minimal, while at ψ = 4 vi is slightly higher by 0.3 compared to the rest. In real data, the range is 1.1 for vi but 0.2 if ψ ∈ {4, 8}; it is 0.7 for Δk but drops to 0.2 for ψ ∈ {4, 8, 10}.

Concerning th_evo, in the synthetic data the values of Δk and vi have a range of 0.2. In real data, the respective ranges of Δk and vi are 0.5 and 0.4; they drop to 0.2 when th_evo ∈ {0.20, 0.25, 0.30} for Δk and th_evo ∈ {0.15, 0.20, 0.25, 0.30} for vi.

The observed ranges of vi and Δk are lower than the values of the same metrics obtained by the competitor algorithms on both real and synthetic data. If we are stricter and accept only ranges less than or equal to 0.2, then the optimal values are ψ ∈ {8, 10, 12} together with any tested value of th_evo when the data are gaussian, and ψ ∈ {4, 8} together with th_evo ∈ {0.20, 0.25, 0.30} when the data are real (not necessarily gaussian clusters).
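The parameter selection above amounts to computing, for a candidate subset of parameter values, the range (max - min) of Δk and vi, and keeping the subsets whose ranges stay at or below 0.2. A small sketch of that check on the real-data columns of Table 4 (the values are copied from the table; the names and the tolerance are ours):

```python
# Range-based sensitivity check, using the real-data columns of Table 4.
# A subset of psi values is considered stable if both metric ranges are <= 0.2.
delta_k_by_psi = {4: 1.3, 8: 1.2, 10: 1.4, 12: 1.9}
vi_by_psi      = {4: 4.2, 8: 4.4, 10: 4.8, 12: 5.3}

def metric_range(values):
    return max(values) - min(values)

def stable(psis, threshold=0.2, eps=1e-9):
    # eps absorbs floating-point noise (e.g. 4.4 - 4.2 is slightly above 0.2).
    dk = [delta_k_by_psi[p] for p in psis]
    vi = [vi_by_psi[p] for p in psis]
    return metric_range(dk) <= threshold + eps and metric_range(vi) <= threshold + eps

print(stable([4, 8, 10, 12]))  # False: ranges are 0.7 (delta_k) and 1.1 (vi)
print(stable([4, 8]))          # True: ranges are 0.1 (delta_k) and 0.2 (vi)
```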
5 CONCLUSION AND PERSPECTIVES

Estimating the number of clusters k that k-means must find is a major problem in the literature, and it becomes even more significant when data is massive and clusters overlap. We have provided a solution, based on the kd-tree, that automatically estimates the value of k on massive and high-dimensional data and remains robust even when clusters overlap. Four criteria were defined to guide the procedure for estimating k. Through experiments, we have shown that our solution is very competitive compared to the well-known x-means and g-means algorithms, which have proven to be unsuitable for scaling up and for overlapping clusters. We plan to improve our solution in several areas:

• implement a strategy that manages nested indexing, since the indices of a node are currently also stored in its parent node (see the sketch after this list);
• use randomized and multiple kd-trees when the number of dimensions is even larger, in order to speed up the search for information stored in the kd-tree and to obtain a better data separation; this will help our solution make better decisions on cluster merging.
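As an illustration of the first point, one possible nested-indexing strategy is to keep a single shared index array and let each node reference only a [lo, hi) slice of it, so that a child's indices physically are a sub-range of its parent's rather than a copy. This is a hypothetical sketch, not the authors' implementation:

```python
import numpy as np

class KDNode:
    """A kd-tree node that references a slice [lo, hi) of one shared index array
    instead of owning its own copy of the point indices."""
    __slots__ = ("lo", "hi", "axis", "left", "right")

    def __init__(self, lo, hi, axis=0, left=None, right=None):
        self.lo, self.hi, self.axis = lo, hi, axis
        self.left, self.right = left, right

def build(points, order, lo, hi, depth=0, leaf_size=32):
    """Recursively split the slice order[lo:hi] in place around the median."""
    node = KDNode(lo, hi, axis=depth % points.shape[1])
    if hi - lo <= leaf_size:
        return node
    mid = (lo + hi) // 2
    # Partial sort so that order[mid] holds the median along the split axis;
    # children reuse the same 'order' array, only their [lo, hi) bounds differ.
    slice_idx = order[lo:hi]
    slice_idx[:] = slice_idx[np.argpartition(points[slice_idx, node.axis], mid - lo)]
    node.left = build(points, order, lo, mid, depth + 1, leaf_size)
    node.right = build(points, order, mid, hi, depth + 1, leaf_size)
    return node

# Usage: the indices live once, in 'order'; each node only stores (lo, hi).
pts = np.random.rand(1000, 4)
order = np.arange(len(pts))
root = build(pts, order, 0, len(pts))
print(root.lo, root.hi, root.left.hi, root.right.lo)  # 0 1000 500 500
```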
ACKNOWLEDGMENTS
These studies are supported by ANRT and Capgemini funding under the CIFRE-Capgemini partnership.