Aggregation based on intervals as similarity measure
for hierarchical clustering
Noelia Rico1 , Pedro Huidobro2 , Irene Díaz1 and Susana Montes2
1
    Dept. of Computer Science, University of Oviedo, Spain
2
    Dept. of Statistics and Operational Research, University of Oviedo, Spain


Abstract
Hierarchical clustering algorithms create groups of objects in a data set based on their similarity. This similarity is commonly measured by a distance function, which makes the resulting groups depend both on the distance function used and on the technique employed to merge the clusters. In this work, we propose to transform the objects to a representation where each variable is defined by an interval. Using this representation, we define a new method that measures the similarity of the objects variable by variable, based on the overlap of their intervals, instead of using a distance function. The results obtained for each variable are then passed to an aggregation function, which allows comparing the similarity between pairs of clusters. An example algorithm is proposed in this work, using a binary function to check the overlap and aggregating its results with the average function. Finally, the method is compared with other approaches by means of two examples.

Keywords
hierarchical clustering, interval data, aggregation function




1. Introduction
Clustering methods define how to create groups of objects (also known as clusters) in a data
set. The aim is that the groups contain the most similar objects and, at the same time, they
are as different as possible from the objects in other groups [1]. Hierarchical clustering is an
agglomerative algorithm that starts by assuming that each object belongs to its own individual cluster.
Then, it performs pairwise comparisons between all the clusters of the data set in order to
measure their similarity. This comparison allows us to determine the two most similar ones,
which are grouped together into a single cluster. The pairwise comparison between clusters is
repeated sequentially until all the objects are merged into one single cluster [2]. The result of the
hierarchical clustering algorithm is commonly represented as a tree graph of the groups called
dendrogram. Usually, the number of clusters to be obtained is determined once the complete
dendrogram is built. Therefore, to define the objects in each cluster, the tree is cut at the level
corresponding to the desired number of clusters.

WILF’21: The 13th International Workshop on Fuzzy Logic and Applications
" noeliarico@uniovi.es (N. Rico); huidobropedro@uniovi.es (P. Huidobro); sirene@uniovi.es (I. Díaz);
montes@uniovi.es (S. Montes)
~ https://www.noeliarico.dev (N. Rico)
 0000-0002-6169-4523 (N. Rico); 0000-0002-0170-0426 (P. Huidobro); 0000-0002-3024-6605 (I. Díaz);
0000-0002-4701-2207 (S. Montes)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
   In the process of creating the groups, classical approaches treat the clusters as points and
measure their similarity by using point-based distances such as the Euclidean distance. In this
work, we propose to transform the data in order to represent the clusters as intervals. In the
literature, some methods based on intervals can be found. For instance, Galdino and Maciel [3]
consider some arithmetic operations between intervals in order to define the Euclidean distance
as an interval. On the other hand, Ramos-Guajardo [4] proposes some methods based on the
Jaccard index and on statistical hypothesis tests. In contrast, with this interval representation,
we propose the use of a function that measures the degree of coincidence between two objects
in each variable, together with an aggregation function, in order to determine the clusters that should be
merged together at each level of the tree. These two key points allow modeling uncertainty by
letting intervals represent not only the objects but also their neighborhood, which can
be very useful, for example, when the data set contains noisy data.
   This paper is organized as follows. Section 2 gives an overview of the clustering problem and
the transformation we apply for an interval-based approach. Section 3 details the two key points
introduced into the hierarchical clustering algorithm for creating the clusters using intervals
and also presents the proposed algorithm. A comparison with other approaches is presented in
Section 4 by means of two examples. Conclusions of this work can be found in the final section.


2. Clustering using intervals
Let us consider a data set X where each object 𝑥𝑖 ∈ X is defined by 𝑛 variables such that
𝑥𝑖 = {𝑥1𝑖 , ..., 𝑥𝑛𝑖 } with 𝑥𝑘𝑖 ∈ R, 1 ≤ 𝑘 ≤ 𝑛. The aim of any clustering algorithm is to group
these objects into 𝑚 different clusters 𝒞 such that each object belongs uniquely to one cluster
and 𝒞1 ∪ · · · ∪ 𝒞𝑚 = X. As the number of clusters is usually unknown beforehand, hierarchical
clustering algorithms use a binary tree structure with as many levels as objects in X, which groups the objects from
the bottom of the tree to the top. Therefore, the algorithm defines for each level ℓ of the tree,
with 1 ≤ ℓ ≤ |X|, a set of ℓ clusters Cℓ = {ℓ 𝒞1 , . . . , ℓ 𝒞ℓ }. Following this, the root of the generated
tree C1 (which is the last level reached by the algorithm) contains one single cluster with all
the elements in X. On the other hand, the bottom level (where the algorithm starts) contains
the leaves of the tree, this is, a set of clusters that has as many clusters as objects in the data set,
since each of the clusters contains a single object of X. As the tree is built from the bottom (i.e.
ℓ equal to the number of objects in X) to the top (i.e. ℓ = 1), in each iteration the clusters in
level ℓ are almost the same as the clusters in level ℓ + 1, differing only in the two clusters of
level ℓ + 1 that are replaced by a single cluster in level ℓ, namely the cluster generated after merging the
two most similar clusters of level ℓ + 1. The process is repeated iteratively
until the root of the tree is reached. After the tree is obtained, the clusters for any number 𝑚 of
different clusters can be determined by cutting the tree at the corresponding level.
   Taking into account that the clustering algorithm seeks to create groups of similar objects,
it is necessary to define how the similarity between the objects is measured. This is usually
done by means of a distance function and many variations have been proposed in the literature
[5, 6, 7], although the Euclidean distance is usually the one used in practice. There are three
different cases for which the measure must be defined:

    • Distance between two objects 𝛿(𝑥𝑖 , 𝑥𝑗 ) : 𝑥𝑖 , 𝑥𝑗 ∈ X.
    • Distance between an object and a cluster 𝛿(𝑥𝑖 , ℓ 𝒞𝑗 ) : 𝑥𝑖 ∈ X , ℓ 𝒞𝑗 ∈ Cℓ .
    • Distance between two clusters 𝛿(ℓ 𝒞𝑖 , ℓ 𝒞𝑗 ) : ℓ 𝒞𝑖 , ℓ 𝒞𝑗 ∈ Cℓ .

   However, as hierarchical clustering algorithms consider that each object belongs to its own
cluster in the first iteration (i.e. the bottom of the tree), all three types of distances could
be generalized as 𝛿(ℓ 𝒞𝑖 , ℓ 𝒞𝑗 ). How the distance between two clusters is computed is usually
referred to as linkage method. Depending on the linkage method chosen, the algorithm will
result in different sets of clusters [8]. Many linkage methods have been defined in the literature.
The most popular ones are listed below, accompanied by an explanation of how they define
the similarity between two clusters:

    • single: the distance between the closest points of the clusters.
    • complete: the distance between the furthest points of the clusters.
    • average: computes the distance between all pairs of points of the two clusters and then takes the average.
    • centroid: calculates the centroid of each cluster (i.e. for each variable the average of all
      the points that belong to the cluster) and computes the distance between the two clusters
      using their centroids.

   To sum up, this means that each linkage method represents the cluster by choosing a different
object of the cluster. Then, using this object as a representation of the cluster, the distance is
computed pairwise and the closest clusters are merged together.
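As an illustration, the sketch below (ours, not part of the original formulation; function names are assumptions) shows how these four linkage criteria could be computed in Python for two clusters of points using the Euclidean distance:

```python
# A sketch (not from the paper) of the four classical linkage criteria,
# written for two clusters A and B given as lists of points.
import numpy as np

def _pairwise(A, B):
    """Euclidean distance between every point of A and every point of B."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single(A, B):
    return _pairwise(A, B).min()      # closest pair of points

def complete(A, B):
    return _pairwise(A, B).max()      # furthest pair of points

def average(A, B):
    return _pairwise(A, B).mean()     # average over all pairs of points

def centroid(A, B):
    # distance between the centroids of the two clusters
    return np.linalg.norm(np.mean(np.atleast_2d(A), axis=0) - np.mean(np.atleast_2d(B), axis=0))
```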
   In this work, we propose to identify the clusters using intervals instead of one single point.
For defining a measure of similarity, we keep in mind the original methods and how the distance
between two objects is a particular case of the distance between two clusters. Then, we propose
a new hierarchical clustering method based on intervals that substitutes the linkage method
by an interval-based function and an aggregation function. By using this method, the distance
involving clusters of intervals can be computed component by component, applying a function 𝑓 to each
variable and then using an aggregation function 𝐴 in order to obtain a single value that allows
comparing the clusters pairwise. This comparison is necessary in order to establish which
two of the clusters are the most similar and consequently should be merged together into a
single cluster.
   In order to identify the objects of the data set with an interval-based representation, each
object 𝑥𝑖 ∈ X is turned into an object 𝑦𝑖 ∈ Y such that each component 𝑥𝑘𝑖 of the object is
transformed into a degenerate interval 𝑦𝑖𝑘 = [𝑥𝑘𝑖 , 𝑥𝑘𝑖 ], whose lower and upper endpoints both coincide with 𝑥𝑘𝑖 . Notice that this
is a particular case of a cluster; therefore any cluster can be defined as ℓ 𝒞𝑖 = {ℓ 𝒞𝑖1 , . . . , ℓ 𝒞𝑖𝑛 }
where each ℓ 𝒞𝑖𝑘 is an interval, with 1 ≤ 𝑘 ≤ 𝑛. The data set Y with the transformed representation of
all the objects in X is given as input to the clustering algorithm.
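A minimal sketch of this transformation, assuming that objects are given as lists of real values and intervals are stored as (lower, upper) pairs (names are ours):

```python
# A sketch of the transformation from X to Y: each value becomes a degenerate
# interval (lower endpoint == upper endpoint). The function name is illustrative.
def to_degenerate_intervals(x):
    """x: a sequence of n real values -> list of n intervals (lo, hi) with lo == hi."""
    return [(v, v) for v in x]

X = [[10, 140], [11, 150], [10, 170]]          # toy data, one object per row
Y = [to_degenerate_intervals(x) for x in X]
# Y == [[(10, 10), (140, 140)], [(11, 11), (150, 150)], [(10, 10), (170, 170)]]
```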
   Whenever two clusters are merged together, and therefore contain more than one object
of the data set, the new cluster is created taking from the two clusters merged the lowest
and the greatest endpoints for the interval that represents each variable. Formally, given
two clusters ℓ 𝒞𝑖 and ℓ 𝒞𝑗 , the resulting cluster ℓ−1 𝒞ℎ = {ℓ−1 𝒞ℎ1 , . . . , ℓ−1 𝒞ℎ𝑛 } is built such that

        ℓ−1 𝒞ℎ𝑘 = [min{ℓ 𝒞𝑖𝑘 , ℓ 𝒞𝑗𝑘 }, max{ℓ 𝒞𝑖𝑘 , ℓ 𝒞𝑗𝑘 }],   1 ≤ 𝑘 ≤ 𝑛,

where the minimum is taken over the lower endpoints and the maximum over the upper endpoints of the two intervals.
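The merging rule can be written compactly as below (a sketch under the same interval representation as before; the function name is ours):

```python
# A sketch of the merging rule: the new cluster keeps, variable by variable, the
# lowest lower endpoint and the greatest upper endpoint of the two merged clusters.
def merge_clusters(ci, cj):
    """ci, cj: lists of n intervals (lo, hi) -> merged cluster as a list of n intervals."""
    return [(min(lo_i, lo_j), max(hi_i, hi_j))
            for (lo_i, hi_i), (lo_j, hi_j) in zip(ci, cj)]

print(merge_clusters([(3.26, 6.74)], [(5.26, 8.74)]))   # [(3.26, 8.74)], as in Example 4.1
```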
3. Proposed algorithm
The first step in order to obtain an algorithm based on this interval approach is to define the
function 𝑓 used to measure the similarity between two intervals [𝑎, 𝑏], [𝑐, 𝑑] ⊆ R. Although
this function could be defined in multiple forms, here we consider a binary basic approach so
the following function 𝑓 is used to determine whether there is overlapping between the two
intervals:
                          𝑓 ([𝑎, 𝑏], [𝑐, 𝑑]) = 1   if [𝑎, 𝑏] ∩ [𝑐, 𝑑] ≠ ∅,
                          𝑓 ([𝑎, 𝑏], [𝑐, 𝑑]) = 0   otherwise.
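This binary overlap check can be sketched as follows (two closed intervals intersect exactly when the largest lower endpoint does not exceed the smallest upper endpoint; the function name mirrors the paper's 𝑓 but the code is ours):

```python
# A sketch of the binary coincidence function f: 1 if the two intervals intersect, 0 otherwise.
def f(interval_1, interval_2):
    (a, b), (c, d) = interval_1, interval_2
    return 1 if max(a, c) <= min(b, d) else 0   # [a, b] and [c, d] intersect iff max(a, c) <= min(b, d)
```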
   The algorithm starts by applying the function 𝑓 (ℓ 𝒞𝑖𝑘 , ℓ 𝒞𝑗𝑘 ) to each component of the clusters
ℓ 𝒞𝑖 , ℓ 𝒞𝑗 ∈ Cℓ , where ℓ is equal to the number of objects in X and 1 ≤ 𝑘 ≤ 𝑛. By doing this, it
is possible to obtain the vector ⃗𝑠ℓ 𝒞𝑖 ,ℓ 𝒞𝑗 = (𝑠1ℓ 𝒞𝑖 ,ℓ 𝒞𝑗 , . . . , 𝑠𝑛ℓ 𝒞𝑖 ,ℓ 𝒞𝑗 ). This vector gives an intuition
about what we call the degree of coincidence between the two objects.
   After ⃗𝑠ℓ 𝒞𝑖 ,ℓ 𝒞𝑗 has been calculated for all pairs of objects, these must be compared. To obtain
a single value that eases this pairwise comparison, the elements of ⃗𝑠ℓ 𝒞𝑖 ,ℓ 𝒞𝑗 are aggregated by
means of an aggregation function 𝐴(⃗𝑠ℓ 𝒞𝑖 ,ℓ 𝒞𝑗 ). For this work, as a first approach, we define 𝐴
simply as the average of the elements in ⃗𝑠, such that:

                          𝐴(⃗𝑠ℓ 𝒞𝑖 ,ℓ 𝒞𝑗 ) = (1/𝑛) ∑ 𝑠𝑘ℓ 𝒞𝑖 ,ℓ 𝒞𝑗 ,   where the sum runs over 𝑘 = 1, . . . , 𝑛.

   The two clusters ℓ 𝒞𝑖 , ℓ 𝒞𝑗 with the greatest value of 𝐴(⃗𝑠ℓ 𝒞𝑖 ,ℓ 𝒞𝑗 ) are merged together into a
cluster.
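A sketch of this per-variable comparison followed by the average aggregation (names are ours; f is the binary overlap check defined above):

```python
# A sketch of the coincidence vector s and the aggregation function A used in this work.
def coincidence_vector(ci, cj, f):
    """One value of f per variable of the two clusters ci and cj."""
    return [f(ik, jk) for ik, jk in zip(ci, cj)]

def aggregate_average(s):
    """The aggregation A considered here: the average of the elements of s."""
    return sum(s) / len(s)
```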
   Notice that, using this basic function, in case none of the components of the objects are
equal, the overlap between the initial degenerate intervals will be 0, so none of the objects could be
grouped together. To solve this, the neighborhood of the interval in each component will be
considered, taking into account the range of each variable in the original data set. Formally, we
can define the vector 𝑤  ⃗ of 𝑛 elements, which associates each value 𝑤𝑘 with the 𝑘-th variable of
the data set using the range of the variable in the initial data set X. Prior to the execution of the
clustering algorithm, the percentage of the range that will be considered for this update
is set using 𝑝, which determines the amount by which the interval will be extended in case there
is no overlap, such that 𝑤𝑘 = (max𝑥𝑖 ∈X (𝑥𝑘𝑖 ) − min𝑥𝑖 ∈X (𝑥𝑘𝑖 )) × 𝑝.
   Using the coefficient 𝑤𝑘 , when there is no overlapping between any of the intervals, the
interval ℓ 𝒞𝑖𝑘 is transformed into:

                                         ℓ 𝒞𝑖𝑘 = [ℓ 𝒞𝑖𝑘 − 𝑤𝑘 , ℓ 𝒞𝑖𝑘 + 𝑤𝑘 ],

where the lower endpoint is decreased and the upper endpoint is increased by 𝑤𝑘 .
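The widening step can be sketched as follows (a sketch under the same interval representation; names are ours):

```python
# A sketch of the widening step: when no pair of clusters overlaps, every interval is
# extended by w_k on both sides, where w_k is a fraction p of the range of variable k in X.
def compute_widths(X, p=0.01):
    n = len(X[0])
    return [(max(x[k] for x in X) - min(x[k] for x in X)) * p for k in range(n)]

def widen(cluster, w):
    return [(lo - wk, hi + wk) for (lo, hi), wk in zip(cluster, w)]

w = compute_widths([[5], [7], [46], [81], [92]])   # [0.87], as in Example 4.1
```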
   The resulting algorithm obtained from the considerations explained above is shown in Algorithm 1.


4. Comparison with classical approaches
In order to illustrate the method, an example for a toy data set is presented below.
Algorithm 1 Proposed methodology for hierarchical clustering
 1: Transform X into Y                                            ◁ Define the interval-based objects
 2: Compute the vector 𝑤        ⃗ using X
 3: ℓ = |X|      ◁ Initially each object has its own cluster, the algorithm starts with the leaves
 4: while ℓ > 1 do                                                ◁ Build the tree from bottom to top
 5:    𝑡 = 0                                   ◁ Number of times the intervals have been widened
 6:    for all ℓ 𝒞𝑖 , ℓ 𝒞𝑗 ∈ Cℓ do
 7:        for each variable 𝑘 defined by [𝑎, 𝑏] in ℓ 𝒞𝑖 and [𝑐, 𝑑] in ℓ 𝒞𝑗 do
 8:            𝑠𝑘ℓ 𝒞𝑖 ,ℓ 𝒞𝑗 = 𝑓 ([𝑎, 𝑏], [𝑐, 𝑑])     ◁ Compute the similarity according to the variable
 9:        end for
10:    end for
11:    Compute the global similarity 𝐴(𝑠          ⃗ ℓ 𝒞𝑖 ,ℓ 𝒞𝑗 ) for every pair of clusters
12:    while 𝐴(𝑠   ⃗ ℓ 𝒞𝑖 ,ℓ 𝒞𝑗 ) = 0 , ∀ ℓ 𝒞𝑖 , ℓ 𝒞𝑗 ∈ Cℓ do            ◁ If none of the clusters overlap
13:        Update the intervals using 𝑤         ⃗
14:        Increase the value of 𝑡
15:        Repeat lines 6 to 11
16:    end while
17:    Determine the two most similar clusters and merge them into one
18:    Define the set Cℓ−1 and set ℓ = ℓ − 1
19: end while
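For reference, the following self-contained Python sketch implements our reading of Algorithm 1 (names, data structures and the stopping criterion are ours; 𝑓 is the binary overlap check and 𝐴 is the average):

```python
# A sketch of Algorithm 1: interval-based hierarchical clustering with binary overlap and average aggregation.
from itertools import combinations

def interval_hierarchical_clustering(X, p=0.01):
    n = len(X[0])                                                  # number of variables
    w = [(max(x[k] for x in X) - min(x[k] for x in X)) * p for k in range(n)]

    clusters = [[(v, v) for v in x] for x in X]                    # leaves: degenerate intervals
    members = [[i] for i in range(len(X))]                         # original objects in each cluster
    merges = []                                                    # merge history (the dendrogram)

    def overlap(u, v):
        return 1 if max(u[0], v[0]) <= min(u[1], v[1]) else 0

    def most_similar():
        scored = []
        for i, j in combinations(range(len(clusters)), 2):
            s = [overlap(clusters[i][k], clusters[j][k]) for k in range(n)]
            scored.append((sum(s) / n, i, j))                      # aggregate with the average
        return max(scored)                                         # pair with the greatest A value

    while len(clusters) > 1:
        score, i, j = most_similar()
        while score == 0:                                          # widen until some pair overlaps
            clusters = [[(lo - wk, hi + wk) for (lo, hi), wk in zip(c, w)] for c in clusters]
            score, i, j = most_similar()
        merged = [(min(a[0], b[0]), max(a[1], b[1])) for a, b in zip(clusters[i], clusters[j])]
        merges.append((members[i], members[j]))
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
        members = [m for t, m in enumerate(members) if t not in (i, j)] + [members[i] + members[j]]
    return merges

# On the data of Example 4.1 the merge order matches the one described in the text.
print(interval_hierarchical_clustering([[5], [7], [46], [81], [92]]))
```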


Example 4.1. Let us consider the one-dimensional data set of five objects X = {5, 7, 46, 81, 92}.
    In order to apply the proposed method, first of all, it is necessary to transform the data set X
into an interval data set Y by using degenerate intervals. Therefore, the transformed data set
Y = {[5, 5], [7, 7], [46, 46], [81, 81], [92, 92]} is obtained. Also, it is necessary to compute the
weight that will be used for updating the intervals when there is no overlap. In this case,
as there is only one variable, only one value must be computed. We define 𝑝 = 0.01 so, taking
into account the range [5, 92] of the variable, the weight 𝑤1 can be computed as 𝑤1 =
(92 − 5) × 0.01 = 0.87. This value will be used whenever there is no overlap between the
objects, so the intervals are widened at their endpoints until some overlap is found. The first step
is to compare the objects pairwise by applying the function 𝑓 to compare the intervals associated
with each variable. The results are shown below:

                           𝑦2 = [7, 7] 𝑦3 = [46, 46] 𝑦4 = [81, 81] 𝑦5 = [92, 92]
             𝑦1 = [5, 5]       0             0             0            0
                           𝑦2 = [7, 7]       0             0            0
                                       𝑦3 = [46, 46]       0            0
                                                     𝑦4 = [81, 81]      0
[Figure: the five objects 𝑥1, 𝑥2, 𝑥3, 𝑥4, 𝑥5 represented as points (degenerate intervals) on the real line.]

  Notice that these objects could also have been represented as:

 C5 = {5 𝒞1 = {[5, 5]}, 5 𝒞2 = {[7, 7]}, 5 𝒞3 = {[46, 46]}, 5 𝒞4 = {[81, 81]}, 5 𝒞5 = {[92, 92]}},
as this is the set containing the leaves at the bottom of the tree, which corresponds to level ℓ = 5
for this data set of five objects. Henceforth, this notation will be used in order to be consistent in
all the iterations of the example.
   As there is no overlap between any of the objects, the extremes of the intervals must be
updated using 𝑤1 . The new intervals obtained after subtracting 𝑤1 from the left extreme of the
interval and adding 𝑤1 to the right extreme of the interval are shown below:
                            5 𝒞2 = [6.13, 7.87]   5 𝒞3 = [45.13, 46.87]   5 𝒞4 = [80.13, 81.87]   5 𝒞5 = [91.13, 92.87]
    5 𝒞1 = [4.13, 5.87]             0                      0                       0                       0
    5 𝒞2 = [6.13, 7.87]                                    0                       0                       0
    5 𝒞3 = [45.13, 46.87]                                                          0                       0
    5 𝒞4 = [80.13, 81.87]                                                                                  0

[Figure: the widened intervals represented on the real line; no pair of intervals overlaps yet.]

  As shown in the previous figure, there is still no overlap, so the intervals must be widened further. The
weight 𝑤1 is applied again, such that the following table is obtained:
                            5 𝒞2 = [5.26, 8.74]   5 𝒞3 = [44.26, 47.74]   5 𝒞4 = [79.26, 82.74]   5 𝒞5 = [90.26, 93.74]
    5 𝒞1 = [3.26, 6.74]             1                      0                       0                       0
    5 𝒞2 = [5.26, 8.74]                                    0                       0                       0
    5 𝒞3 = [44.26, 47.74]                                                          0                       0
    5 𝒞4 = [79.26, 82.74]                                                                                  0

[Figure: the intervals after the second widening; the intervals of 𝑥1 and 𝑥2 now overlap.]

  After this update, there is an overlap between the clusters 5 𝒞1 and 5 𝒞2 . Therefore, these two
clusters are merged together in the next iteration, creating in the upper level of the tree ℓ = 4 the
cluster 4 𝒞1 . The interval takes the lowest value of the left limit and the greatest value of the right
limit of the two clusters that are merged, therefore, 4 𝒞1 = {[3.26, 8.74]}.

[Figure: the clusters at level ℓ = 4, where 4 𝒞1 covers 𝑥1 and 𝑥2 on the real line.]

  The representation of the clusters in the next iteration is given by the set of intervals C4 , which
contains all the previous intervals and the new cluster created. This set can be defined as:
  C4 = {4 𝒞1 = {[3.26, 8.74]}, 4 𝒞2 = {[44.26, 47.74]}, 4 𝒞3 = {[79.26, 82.74]}, 4 𝒞4 = {[90.26, 93.74]}}.


  After the creation of this new cluster, it is again necessary to update the intervals to increase
their extremes until they overlap and a new cluster can be created. As shown in the comparison
below, in the first iteration there is no overlap:

                            4 𝒞2 = [44.26, 47.74]   4 𝒞3 = [79.26, 82.74]   4 𝒞4 = [90.26, 93.74]
    4 𝒞1 = [3.26, 8.74]              0                       0                       0
    4 𝒞2 = [44.26, 47.74]                                    0                       0
    4 𝒞3 = [79.26, 82.74]                                                            0


  Therefore, 𝑤1 must be iteratively applied and after five times, the following clusters are obtained:
                            4 𝒞2 = [39.91, 52.09]   4 𝒞3 = [74.91, 87.09]   4 𝒞4 = [85.91, 98.09]
    4 𝒞1 = [−1.09, 13.09]            0                       0                       0
    4 𝒞2 = [39.91, 52.09]                                    0                       0
    4 𝒞3 = [74.91, 87.09]                                                            1

[Figure: the intervals after five further widenings; the intervals of the clusters containing 𝑥4 and 𝑥5 now overlap.]

   According to the values obtained at this point, the clusters 4 𝒞3 and 4 𝒞4 , which contain the
objects 𝑥4 and 𝑥5 respectively, must be merged, which corresponds with the following situation
in ℓ = 3:

[Figure: the clusters at level ℓ = 3, where the objects 𝑥4 and 𝑥5 belong to the same cluster.]

   Following this procedure and increasing the range of the intervals iteratively using 𝑤1 , the
following overlap is obtained:

[Figure: the intervals after further widening; the cluster containing 𝑥3 now overlaps the cluster containing 𝑥4 and 𝑥5.]

  This means that these clusters are merged together, forming the following two clusters in ℓ = 2:

[Figure: the two clusters at level ℓ = 2: one containing 𝑥1 and 𝑥2, and one containing 𝑥3, 𝑥4 and 𝑥5.]

   The last iteration will conclude the method by merging the two remaining clusters together in
the root of the tree representation, with one single cluster that contains all the objects in the data
set.
   On the other hand, for creating the groups using the classical linkage methods, it is necessary to
perform a pairwise comparison of the objects in X to compute their distance. The results obtained
from this comparison are shown below:

                                     𝑥2 = 7 𝑥3 = 46 𝑥4 = 81 𝑥5 = 92
                              𝑥1 = 5    2      41      76      87
                                     𝑥2 = 7    39      74      85
                                            𝑥3 = 46    35      46
                                                    𝑥4 = 81    11
   Using this, when the single linkage method is employed, the clusters obtained are the same as
the ones with our method (left dendrogram). However, when the complete, average or centroid
linkage methods are considered the results are different (right dendrogram).




[Figure: dendrograms. Left: the proposed method and single linkage, where 𝑥3 joins the cluster {𝑥4, 𝑥5}. Right: complete, average and centroid linkage, where 𝑥3 joins the cluster {𝑥1, 𝑥2}.]
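For reference, the classical dendrograms above can be reproduced, for instance, with SciPy (a sketch, not the code used by the authors):

```python
# A sketch reproducing the classical linkage results for X = {5, 7, 46, 81, 92} with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[5.0], [7.0], [46.0], [81.0], [92.0]])
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)          # Euclidean distance by default
    print(method, Z[:, :2].astype(int))    # indices of the clusters merged at each step
```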


 The one-dimensional version of the algorithm yields the same results as the single linkage
method using the Euclidean distance. Intuitively, it is natural that in one dimension the results
coincide with those of this binary function since, the closer the objects are, the sooner their intervals will
overlap. However, the single linkage method is only equivalent to our method in the case of objects
with one variable. A counterexample showing that this is not the case when the number of
variables increases is shown below.

Example 4.2. Here we present a counterexample using a data set where the objects are defined
by two variables: age and height. This shows that our method is different from single linkage
using the Euclidean distance when the number of variables is greater than one. Let us
consider the following data set (left) and the corresponding pairwise distances between the objects
of the data set (right):

                                 Age   Height                       𝑥2           𝑥3
                         𝑥1      10    140                𝑥1    10.04988         30
                         𝑥2      11    150                𝑥2               20.02498
                         𝑥3      10    170


   According to the Euclidean distance, the most similar objects in the bottom level of the tree (i.e.
ℓ = 3) are 𝑥1 and 𝑥2 . This means that these two objects are merged into one single cluster 2 𝒞1 in
the upper level ℓ = 2. Then, the third object 𝑥3 will be merged with this cluster in an upper level of
the tree, creating the single cluster in the root 1 𝒞1 that contains all the objects. On the other hand,
with our method, we obtain a different result. First of all, let us transform the data set into a data
set Y of degenerate intervals (shown in the table below). The individual results of each pairwise
comparison of the objects after applying the function 𝑓 to each variable are shown in the tables
below. Then, the aggregation function is applied to each pairwise comparison, such that the table
Average is obtained. The objects 𝑦1 and 𝑦3 are the most similar according to this approach, so
they would be merged together first.
              Age          Height                               Age     Height     Average
      𝑦1    [10, 10]     [140, 140]              𝑓 (𝑦1 , 𝑦2 )    0        0           0
      𝑦2    [11, 11]     [150, 150]              𝑓 (𝑦1 , 𝑦3 )    1        0          1/2
      𝑦3    [10, 10]     [170, 170]              𝑓 (𝑦2 , 𝑦3 )    0        0           0


So, using this method, contrary to what happened with the single linkage method and the
Euclidean distance, the objects 𝑦1 and 𝑦3 (which represent the original 𝑥1 and 𝑥3 ) form 2 𝒞1 . This
makes the clusters in ℓ = 2 different from the ones obtained before.
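A short sketch checking this counterexample (ours; for degenerate intervals, 𝑓 is 1 exactly when the two values coincide, so the averaged overlap prefers (𝑥1, 𝑥3) while the Euclidean distance prefers (𝑥1, 𝑥2)):

```python
# A sketch comparing the averaged overlap with the Euclidean distance on the data of Example 4.2.
import math

data = {"x1": (10, 140), "x2": (11, 150), "x3": (10, 170)}

def averaged_overlap(a, b):
    # average of f over the variables; for degenerate intervals, f is 1 iff the values are equal
    return sum(1 for u, v in zip(a, b) if u == v) / len(a)

for i, j in [("x1", "x2"), ("x1", "x3"), ("x2", "x3")]:
    print(i, j, "A =", averaged_overlap(data[i], data[j]),
          "| Euclidean =", round(math.dist(data[i], data[j]), 5))
```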

  As illustrated in the previous examples, classical methods based on distances require com-
paring all the pairs of objects in the clusters. On the contrary, with the proposed method, the
comparison matrix is reduced at each step, which benefits the implementation of the algorithm as well as its
suitability to be applied in practice to large problems. Furthermore, using the basic function 𝑓
considered in this work, the comparison matrix is Boolean, so it is more efficient and requires less
memory space to store.
5. Conclusion and future work
In this work, a new hierarchical clustering method based on grouping intervals has been
proposed. This method provides an approach that uses a coincidence function as well as an
aggregation function, which means that it can be extended by changing these components in order to
explore different properties of the created groups. Future work will include the exploration of
different aggregation functions in order to create the groups.


Acknowledgments
This research has been partially supported by Spanish MINECO projects TIN2017-87600-P
(Noelia Rico and Irene Díaz) and PGC2018-098623-B-I00 (Pedro Huidobro and Susana Montes).
Pedro Huidobro and Noelia Rico are also supported by the Severo Ochoa predoctoral grant
program by the Principality of Asturias (PA-20-PF-BP19-169 and PA-20-PF-BP19-167, respec-
tively).


References
[1] L. Rokach, O. Maimon, Clustering methods, in: Data mining and knowledge discovery
    handbook, Springer, 2005, pp. 321–352.
[2] S. C. Johnson, Hierarchical clustering schemes, Psychometrika 32 (1967) 241–254.
[3] S. Galdino, P. Maciel, Hierarchical cluster analysis of interval-valued data using width of
    range euclidean distance, in: 2019 IEEE Latin American Conference on Computational
    Intelligence (LA-CCI), 2019, pp. 1–6. doi:10.1109/LA-CCI47412.2019.9036754.
[4] A. B. Ramos-Guajardo, A hierarchical clustering method for random intervals based on a
    similarity measure, Computational Statistics (2021) 1–33.
[5] C. C. Chang, P. W. Lu, J. Y. Hsiao, A hybrid method for estimating the euclidean distance
    between two vectors, in: First International Symposium on Cyber Worlds, 2002. Proceedings.,
    2002, pp. 183 – 190. doi:10.1109/CW.2002.1180878.
[6] R. Lustig, Angle-average for the powers of the distance between two separated vectors,
    Molecular Physics 65 (1988) 175 – 179. doi:10.1080/00268978800100931.
[7] I. Olkin, F. Pukelsheim, The distance between two random vectors with given dispersion
    matrices, Linear Algebra and its Applications 48 (1982) 257–263. doi:10.1016/0024-3795(82)90112-4.
[8] F. Murtagh, A survey of recent advances in hierarchical clustering algorithms, The computer
    journal 26 (1983) 354–359.