=Paper=
{{Paper
|id=Vol-2289/paper5
|storemode=property
|title=Computer-aided Diagnosis via Hierarchical Density Based Clustering
|pdfUrl=https://ceur-ws.org/Vol-2289/paper5.pdf
|volume=Vol-2289
|authors=Tom Obry,Louise Travé-Massuyès, Audine Subias
|dblpUrl=https://dblp.org/rec/conf/safeprocess/ObryTS18
}}
==Computer-aided Diagnosis via Hierarchical Density Based Clustering==
Tom Obry¹,² and Louise Travé-Massuyès¹ and Audine Subias¹

¹ LAAS-CNRS, Université de Toulouse, CNRS, INSA, Toulouse, France
² ACTIA, 5 Rue Jorge Semprun, 31432 Toulouse
Abstract

When applying non-supervised clustering, the concepts discovered by the clustering algorithm hardly match business concepts. Hierarchical clustering then proves to be a useful tool to exhibit sets of clusters according to a hierarchy. Data can be analyzed in layers and the user has a full spectrum of clusterings to which he can give meaning. This paper presents a new hierarchical density-based algorithm that advantageously works from compacted data. The algorithm is applied to the monitoring of a process benchmark, illustrating its value in identifying different types of situations, from normal to highly critical.
1 Introduction strategy where each observation starts in its own cluster and
In data-based diagnosis applications, it is often the case that huge amounts of data are available but the data is not labelled with the corresponding operating mode, normal or faulty. Clustering algorithms, known as non-supervised classification methods, can then be used to form clusters that supposedly gather data corresponding to the same operating mode.

Clustering is a Machine Learning technique used to group data points according to some similarity criterion. Given a set of data points, a clustering algorithm is used to classify each data point into a specific group. Data points that are in the same group have similar features, while data points in different groups have highly dissimilar features. Among well-known clustering algorithms, we can mention K-Means [1], PAM [2], K-Modes [3] and DBSCAN [4].
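As a minimal illustration (ours, not the paper's), the two families behave differently with respect to the number of clusters: K-Means requires it up front, while density-based DBSCAN infers it from the data. The dataset and parameter values below are assumptions.

```python
# Minimal sketch (ours, not the paper's): clustering unlabelled data
# with two of the cited algorithms; data and parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
# Two synthetic "operating modes", e.g. normal vs. faulty readings.
data = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
                  rng.normal(2.0, 0.3, size=(100, 2))])

# K-Means must be given the number of clusters a priori.
kmeans_labels = KMeans(n_clusters=2, n_init=10).fit_predict(data)

# DBSCAN instead derives clusters from local density
# (eps = neighbourhood radius, min_samples = density threshold).
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(data)

print(set(kmeans_labels), set(dbscan_labels))
```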
Numerous validity indexes have been proposed to evaluate clusterings [5]. These are generally based on two fundamental concepts, illustrated by the sketch after this list:

• compactness: the members of each cluster should be as close to each other as possible. A common measure of compactness is the variance, which should be minimized.

• separation: the clusters themselves should be widely spaced.
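A possible reading of these two notions in code (our sketch; variance-based compactness and centroid-distance separation are our assumed measures):

```python
# Sketch (our choice of measures): compactness as within-cluster
# variance, separation as the minimum distance between centroids.
import numpy as np

def compactness(clusters):
    # Average squared distance of each point to its cluster centroid;
    # a good clustering should minimize this value.
    sq_err = sum(((pts - pts.mean(axis=0)) ** 2).sum() for pts in clusters)
    n_points = sum(len(pts) for pts in clusters)
    return sq_err / n_points

def separation(clusters):
    # Smallest pairwise distance between cluster centroids;
    # a good clustering should maximize this value.
    cents = [pts.mean(axis=0) for pts in clusters]
    return min(np.linalg.norm(cents[i] - cents[j])
               for i in range(len(cents)) for j in range(i + 1, len(cents)))

rng = np.random.default_rng(1)
clusters = [rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))]
print(compactness(clusters), separation(clusters))
```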
Nevertheless, one must admit that the concepts discovered by even the best scored clusterings hardly match business concepts [6][7]. One of the reasons is that databases are often incomplete in the sense that they do not include data about all the influential attributes. In particular, business concepts are highly sensitive to environmental parameters that fall outside the scope of the considered business domain and that are not recorded, for instance the stock exchange. In addition, the clusters corresponding to business concepts may be quite "close" in the data space and the only way to capture them would be to guess the right number of clusters with which to initialize the clustering algorithm. This is obviously quite hard. Hierarchical clustering then proves to be a useful tool because it exhibits sets of clusters according to a hierarchy and it modulates the number of clusters. Data can then be analyzed in layers, with a different number of clusters at each level, and the user has a full spectrum of clusterings to which he can give meaning.
Hierarchical clustering identifies the clusters present in a dataset according to a hierarchy [8][9][10]. There are two strategies to form clusters: the agglomerative ("bottom up") strategy, where each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy, and the divisive ("top down") strategy, where all observations start in one cluster and splits are performed recursively as one moves down the hierarchy. The results of hierarchical clustering are usually presented in a dendrogram, a tree diagram frequently used to illustrate the arrangement of the clusters. In order to decide which clusters should be combined or where a cluster should be split, a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, splits or merges of clusters are achieved by use of an appropriate metric such as the Euclidean, Manhattan or maximum distance.
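For concreteness, a generic agglomerative clustering and dendrogram sketch with SciPy (our illustration; the metric and linkage method are arbitrary choices, not the paper's algorithm):

```python
# Sketch (ours): agglomerative ("bottom up") hierarchical clustering.
# The metric and linkage method below are arbitrary illustrative choices.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

data = np.random.default_rng(2).random((30, 2))

# Each observation starts in its own cluster; the closest pairs are
# merged repeatedly, here under the Euclidean metric.
Z = linkage(data, method="average", metric="euclidean")

# Cutting the tree at different levels gives a layered view of the
# data, with a different number of clusters at each level.
coarse = fcluster(Z, t=2, criterion="maxclust")   # 2 clusters
fine = fcluster(Z, t=6, criterion="maxclust")     # 6 clusters

dendrogram(Z)   # tree diagram of the merge hierarchy
plt.show()
```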
Few algorithms propose a density-based hierarchical clustering approach, like α-unchaining single linkage [11] or HDBSCAN [12]. In this paper, we present a new hierarchical density-based clustering algorithm, named HDyclee, that advantageously works from compacted data in the form of hypercubes. This contribution is an extension of the clustering algorithm DyClee [13][14][15]. The purpose of this work is to generate a flat partition of clusters whose hypercube density level is higher than or equal to a threshold, and to be able to visualize all clusters existing in the dataset with a dendrogram by varying the density of the hypercubes present in a group. The value of the algorithm in a diagnosis context is illustrated with the monitoring of a Continuous Stirred Tank Heater benchmark, for which it allows the user to identify different types of situations, from normal to highly critical.

This paper is organized as follows. In Section 2 the DyClee algorithm is presented. In Section 3 the concepts and principles underlying DyClee, like the definition of micro-clusters (µ-clusters), are detailed.

[Figure 1: Global description of DyClee.]

A µ-cluster gathers a group of data samples close in all dimensions and whose information is summarized in a characteristic feature vector (CF). For a µ-cluster $\mu C_k$, CF has the following form:

$$CF_k = (n_k, LS_k, SS_k, t_{l_k}, t_{s_k}, D_k, Class_k) . \quad (1)$$
where $n_k \in \mathbb{R}$ is the number of objects in the µ-cluster $k$, $LS_k \in$ …
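To make the CF concrete, here is a minimal sketch of a µ-cluster summary in code. Treating $LS_k$ and $SS_k$ as per-dimension linear and squared sums, and $D_k$ as a density estimate, are assumptions borrowed from classic micro-cluster schemes (e.g. BIRCH), since the excerpt truncates before the formal definitions:

```python
# Sketch of a µ-cluster CF as in eq. (1). LS/SS as linear and squared
# sums and the density form are assumptions (BIRCH-style), not DyClee's
# verbatim definitions, which are truncated in this excerpt.
import numpy as np

class MicroClusterCF:
    def __init__(self, dim, t, label=-1):
        self.n = 0                  # n_k: number of objects
        self.ls = np.zeros(dim)     # LS_k: linear sum (assumed meaning)
        self.ss = np.zeros(dim)     # SS_k: squared sum (assumed meaning)
        self.t_l = t                # t_l_k: a timestamp (assumed meaning)
        self.t_s = t                # t_s_k: a timestamp (assumed meaning)
        self.label = label          # Class_k

    def add(self, x, t):
        # All components can be maintained incrementally, which is what
        # lets the algorithm work on compacted data instead of raw samples.
        self.n += 1
        self.ls += x
        self.ss += x ** 2
        self.t_s = t

    def density(self, volume):
        # D_k: e.g. objects per unit of hypercube volume (assumed form).
        return self.n / volume

cf = MicroClusterCF(dim=2, t=0.0)
cf.add(np.array([0.1, 0.2]), t=1.0)
```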