=Paper= {{Paper |id=Vol-2289/paper5 |storemode=property |title=Computer-aided Diagnosis via Hierarchical Density Based Clustering |pdfUrl=https://ceur-ws.org/Vol-2289/paper5.pdf |volume=Vol-2289 |authors=Tom Obry,Louise Travé-Massuyès, Audine Subias |dblpUrl=https://dblp.org/rec/conf/safeprocess/ObryTS18 }} ==Computer-aided Diagnosis via Hierarchical Density Based Clustering== https://ceur-ws.org/Vol-2289/paper5.pdf
          Computer-aided Diagnosis via Hierarchical Density Based Clustering

                      Tom Obry1,2 and Louise Travé-Massuyès 1 and Audine Subias 1
                    1
                      LAAS-CNRS, Université de Toulouse, CNRS, INSA, Toulouse, France
                             2
                               ACTIA, 5 Rue Jorge Semprun, 31432 Toulouse



                          Abstract

      When applying non-supervised clustering, the concepts discovered by the
      clustering algorithm hardly match business concepts. Hierarchical
      clustering then proves to be a useful tool to exhibit sets of clusters
      according to a hierarchy. Data can be analyzed in layers and the user has
      a full spectrum of clusterings to which he can give meaning. This paper
      presents a new hierarchical density-based algorithm that advantageously
      works from compacted data. The algorithm is applied to the monitoring of
      a process benchmark, illustrating its value in identifying different
      types of situations, from normal to highly critical.

1    Introduction

In data-based diagnosis applications, it is often the case that huge amounts
of data are available but the data are not labelled with the corresponding
operating mode, normal or faulty. Clustering algorithms, known as
non-supervised classification methods, can then be used to form clusters that
supposedly gather data corresponding to the same operating mode.
   Clustering is a Machine Learning technique used to group data points
according to some similarity criterion. Given a set of data points, a
clustering algorithm is used to classify each data point into a specific
group. Data points that are in the same group have similar features, while
data points in different groups have highly dissimilar features. Among
well-known clustering algorithms, we can mention K-Means [1], PAM [2],
K-Modes [3] and DBSCAN [4].
   Numerous validity indexes have been proposed to evaluate clusterings [5].
These are generally based on two fundamental concepts:

    • compactness: the members of each cluster should be as close to each
      other as possible. A common measure of compactness is the variance,
      which should be minimized.

    • separation: the clusters themselves should be widely spaced.

   Nevertheless, one must admit that the concepts discovered by even the best
scored clusterings hardly match business concepts [6][7]. One of the reasons
is that databases are often incomplete, in the sense that they do not include
data about all the influential attributes. In particular, business concepts
are highly sensitive to environmental parameters that fall outside the scope
of the considered business domain and that are not recorded, for instance
stock exchange data. In addition, the clusters corresponding to business
concepts may be quite "close" in the data space, and the only way to capture
them would be to guess the right number of clusters so as to initialize the
clustering algorithm correctly. This is obviously quite hard. Hierarchical
clustering then proves to be a useful tool because it exhibits sets of
clusters according to a hierarchy and it modulates the number of clusters.
Data can then be analyzed in layers, with a different number of clusters at
each level, and the user has a full spectrum of clusterings to which he can
give meaning.
   Hierarchical clustering identifies the clusters present in a dataset
according to a hierarchy [8][9][10]. There are two strategies to form
clusters: the agglomerative ("bottom up") strategy, where each observation
starts in its own cluster and pairs of clusters are merged as one moves up
the hierarchy, and the divisive ("top down") strategy, where all observations
start in one cluster and splits are performed recursively as one moves down
the hierarchy. The results of hierarchical clustering are usually presented
in a dendrogram, a tree diagram frequently used to illustrate the arrangement
of the clusters. In order to decide which clusters should be combined or
where a cluster should be split, a measure of dissimilarity between sets of
observations is required. In most methods of hierarchical clustering, splits
or merges of clusters are achieved by use of an appropriate metric such as
the Euclidean, Manhattan or maximum distance.
   Few algorithms propose a density-based hierarchical clustering approach,
like α-unchaining single linkage [11] or HDBSCAN [12]. In this paper, we
present a new hierarchical clustering algorithm, named HDyClee, based on
density, that advantageously works from compacted data in the form of
hypercubes. This contribution is an extension of the clustering algorithm
DyClee [13][14][15]. The purpose of this work is to generate a flat partition
of clusters whose hypercube density level is higher than or equal to a
threshold, and to be able to visualize all existing clusters in the dataset
with a dendrogram by varying the density of the hypercubes present in a
group. The value of the algorithm in a diagnosis context is illustrated with
the monitoring of a Continuous Stirred Tank Heater benchmark, for which it
allows the user to identify different types of situations, from normal to
highly critical.
   This paper is organized as follows. In section 2 the DyClee algorithm is
presented. In section 3 the concepts and principles underlying DyClee, like
the definition of micro

              Figure 1: Global description of DyClee.

gathers a group of data samples close in all dimensions and whose information
is summarized in a characteristic feature vector (CF). For a µ-cluster µCk,
CF has the following form:

          CFk = (nk , LSk , SSk , tlk , tsk , Dk , Classk ) .    (1)

where nk ∈ ℕ is the number of objects in the µ-cluster k, LSk ∈
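To make the role of the CF vector concrete, the following sketch shows how a µ-cluster summary in the spirit of equation (1) can be updated incrementally: each field of CFk = (nk, LSk, SSk, tlk, tsk, Dk, Classk) is maintained in O(d) per sample, without storing the samples themselves. The field names and the density computation (count over hypercube volume) are illustrative assumptions based on the CF definition above, not DyClee's actual implementation:

```python
import numpy as np

class MicroCluster:
    """Illustrative µ-cluster summary, mirroring CF_k of equation (1):
    n  -- number of samples absorbed (n_k)
    LS -- per-dimension linear sum of the samples (LS_k)
    SS -- per-dimension squared sum of the samples (SS_k)
    tl -- time of the last sample added (tl_k)
    ts -- creation time of the µ-cluster (ts_k)
    The density D_k and the label Class_k are derived/assigned by the
    clustering stage; here density is approximated as n divided by the
    hypercube volume (an assumption for illustration).
    """

    def __init__(self, x, t, box_volume=1.0):
        x = np.asarray(x, dtype=float)
        self.n = 1
        self.LS = x.copy()
        self.SS = x ** 2
        self.ts = t          # creation time
        self.tl = t          # last-update time
        self.box_volume = box_volume
        self.label = None    # Class_k, set later by the clustering stage

    def add(self, x, t):
        """Absorb one sample: every CF field updates incrementally."""
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.LS += x
        self.SS += x ** 2
        self.tl = t

    @property
    def center(self):
        # The centroid is recoverable from the summary alone.
        return self.LS / self.n

    @property
    def density(self):
        # D_k: samples per unit volume of the enclosing hypercube.
        return self.n / self.box_volume


mc = MicroCluster([0.9, 2.1], t=0)
mc.add([1.1, 1.9], t=1)
print(mc.n)        # 2
print(mc.center)   # [1. 2.]
print(mc.density)  # 2.0
```

Because sums (rather than raw samples) are kept, two µ-clusters can also be merged by adding their CF fields, which is what makes this compacted representation convenient for hierarchical grouping.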