-

Computer-aided Diagnosis via Hierarchical Density Based Clustering

Tom Obry

0 1

Louise Travé-Massuyès

Audine Subias

1 0 ACTIA , 5 Rue Jorge Semprun, 31432 Toulouse 1 LAAS-CNRS, Université de Toulouse , CNRS, INSA, Toulouse , France

2016

When applying non-supervised clustering, the concepts discovered by the clustering algorithm hardly match business concepts. Hierarchical clustering then proves to be a useful tool to exhibit sets of clusters according to a hierarchy. Data can be analyzed in layers and the user has a full spectrum of clusterings to which he can give meaning. This paper presents a new hierarchical densitybased algorithm that advantageously works from compacted data. The algorithm is applied to the monitoring of a process benchmark, illustrating its value in identifying different types of situations, from normal to highly critical.

In data-based diagnosis applications, it is often the case that huge amounts of data are available but the data is not labelled with the corresponding operating mode, normal or faulty. Clustering algorithms, known as non-supervised classification methods, can then be used to form clusters that supposedly gather data corresponding to the same operating mode.

Clustering is a Machine Learning technique used to group data points according to some similarity criterion. Given a set of data points, a clustering algorithm is used to classify each data point into a specific group. Data points that are in the same group have similar features, while data points in different groups have highly dissimilar features. Among well-known clustering algorithms, we can mention K-Means [ 1 ], PAM [ 2 ], K-Modes [ 3 ], DBSCAN [4].

Numerous validity indexes have been proposed to evaluate clusterings [5]. These are generally based on two fundamental concepts : compactness, the members of each cluster should be as close to each other as possible. A common measure of compactness is the variance, which should be minimized. separation, the clusters themselves should be widely spaced.

Nevertheless, one must admit that the concepts discovered by even the most scored clusterings hardly match business concepts [6] [7]. One of the reasons is that data bases are often incomplete in the sense that they do not include the data about all the influencial attributes. In particular, business concepts are highly sensitive to environmental parameters that fall outside the scope of the considered business domain and that are not recorded, for instance stock exchange. In addition, the clusters corresponding to business concepts may be quite "close" in the data space and the only way to capture them would be to guess the right number of clusters to initialize correctly the clustering algorithm. This is obviously quite hard. Hierarchical clustering then proves to be a useful tool because it exhibits sets of clusters according to a hierarchy and it modulates the number of clusters. Data can then be analyzed in layers, with a different number of clusters at each level, and the user has a full spectrum of clusterings to which he can give meaning.

Hierarchical clustering identifies the clusters present in a dataset according to a hierarchy [ 8 ][9][ 10 ]. There are two strategies to form clusters, the agglomerative ("bottom up") strategy where each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy. The divise method ("top down") where all observations start in one cluster and splits are performed recursively as one moves down the hierarchy. The results of hierarchical clustering are usually presented in a dendrogram. A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters. In order to decide which clusters should be combined or where a cluster should be split, a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, splits or merges of clusters are achieved by use of an appropriate metric like euclidean, manhattan or maximum distance.

Few algorithms propose a density-based hierarchical clustering approach like -unchaining single linkage [ 11 ] or HDBSCAN [ 12 ]. In this paper, we present a new hierarchical clustering algorithm, named HDyclee, based on density that advantageously works from compacted data in the form hypercubes. This contribution is an extension of the clustering algorithm DyClee [ 13 ], [ 14 ], [ 15 ]. The purpose of this work is to generate a flat partition of clusters with a hypercube’s density level higher or equal to a threshold and to be able to visualize all existant clusters in the dataset with a dendogram by varying the density of the hypercubes present in a group. The value of the algorithm in a diagnosis context is illustrated with the monitoring of a Continuous Stirred Tank Heater benchmark, for which it allows the user to identify different types of situations, from normal to highly critical.

This paper is organized as follows. In section 2 the DyClee algorithm is presented. In section 3 the concepts and principles underlying Dyclee, like the definition of micro clusters C, dynamic clusters and the KD-Tree structure, are explained. In the section 4, the hierarchical clustering based-density algorithm is presented. Tests and results are detailed in section 5. The conclusion and perspective for future work end this paper in section 6.1 2

Dyclee: a dynamic clustering algorithm

DyClee is a dynamic clustering algorithm which is able to deal with large amounts of data arriving at fast rates by adopting a two stages strategy similar to [ 16 ], [ 17 ], [ 18 ]. The first stage is a fast scale distance-based algorithm that collects, pre-processes and compresses data samples to form so-called micro-clusters ( -clusters). It operates at the rate of the data stream and creates -clusters putting together data samples that are close, in the sense of a given distance, to each other. -clusters are stored in the form of summarized representations including statistical and temporal information.

The second stage is a slower scale density-based algorithm that groups the -clusters into actual clusters that can be interpreted semantically as classes. It takes place once each tslow seconds and analyses the distribution of -clusters. The density of a -cluster is considered as low, medium or high and is used to create the final clusters by a density based approach, i.e. dense -clusters that are close enough (connected) are said to belong to the same cluster. Similarly to [ 19 ], a cluster is defined as the group of connected -clusters where every inside -cluster presents high density and every outside -cluster exhibits either medium or low density. The above dense -cluster structure allows the algorithm to create clusters of non convex shapes even in high dimensional spaces and it has proved outliers rejection capabilities in evolving environments [ 18 ]. In addition, clusters of similar densities can form clusters of any shape and any size.

In DyClee, both stages work on-line, but operate at different time scales. This multi-density feature allows the detection of novelty behavior in its early stages when only a few objects giving evidence of this evolution are present. Figure 1 gives the global description of DyClee. 3

Main principles of DyClee

All the principles explained in this section are from the core algorithm. 3.1

Notion of micro-clusters C

Considering a d-dimensional object X = [x1, ..., xd] marked with a timestamp tX and qualified by d features, a -cluster 1This work is performed in the framework of a CIFRE Project supported by ACTIA. gathers a group of data samples close in all dimensions and whose information is summarized in a characteristic feature vector (CF). For a -cluster Ck, CF has the following form:

CFk = (nk; LSk; SSk; tlk; tsk; Dk; Classk) : (1) where nk 2 < is the number of objects in the -cluster k, LSk 2 <d is the vector containing the linear sum of each feature over the nk objects, SSk 2 <d is the square sum of features over the nk objects, tlk 2 < is the time when the last object was assigned to that -cluster, tsk 2 < is the time when the -cluster was created, Dk is the -cluster density and Classk is the -cluster label if known. Using LSk, SSk and nk the variance of the group of objects assigned to the -cluster k can be calculated.

The -cluster is shaped as a d-dimensional box since the L1-norm is used as distance measure. The distance between an object X = [x1, ..., xd]T and a -cluster Ck, named as dis(X; Ck), is calculated as the sum of the distances between the Ck vector center ck = [c1k; : : : ; cdk]T and the object value for each feature as shown in equation (2): d dis(X; Ck) = L1(X; ck) = X xi cik :

(2) i=1

The data is normalized according to the data context, i.e. the feature range [mini; maxi] of each feature i, i = 1; :::; d. If no context is available in advance, it may be established online. The size of the hyperboxes Si along each dimension i is set as a fraction of the corresponding feature range. The hyperbox size per feature is hence found according to (3), where i is a user constant parameter in the interval (0; 1), establishing the fraction:

Si = ijmaxi minij; 8i = 1; : : : ; d: (3) Whenever an object X arrives, the algorithm searches for the closest -cluster. Once found, a maximal distance criterion is evaluated to decide whether or not the object fits inside the -cluster hyper-box. If the fitting is sufficient the -cluster feature vector is updated using the object information; if not, a new -cluster is created with the object information using its time-stamp as cluster time of creation.

The density of a -cluster Ck is calculated using the current number of objects nk and the current hyper-volume of the bounding box Vk = Qid=1 Si, as shown in (4): Dk = nk : Vk (4) Let Ck and Ck be two -clusters, then Ck and Ck are said to be directly connected if their hyper-boxes overlap in all but ' dimensions, where ' is an integer. The parameter ', fixed by the user, establishes the feature selectivity.

A -cluster Ck1 is said to be connected to Ckn if there exists a chain of -clusters f Ck1 ; Ck2 ; ; Ckn g such that Cki is directly connected to Cki+1 for i = 1; 2; ; n 1. A set of connected -clusters is said to be a group. Dyclee is a dynamic clustering algorithm, which means that not only the parameters but the classifier structure changes according to input data in an automatic way. It achieves several cluster operations like creation, elimination, drift, merge, and split. For instance, a cluster is splitted into two or more clusters if, with the arrival of new data, high density regions can be distinguished inside the cluster. In that scenario, dense regions are separated by low density regions, making the cluster no longer homogeneous. Even more, the cluster center could be situated in a low density region, loosing its interpretability as prototype of the elements in the cluster. Splitting the cluster creates smaller homogeneous clusters, completely representative of the belonging samples. An illustrative example of this phenomenon is shown in Figure 2. To find groups of connected -clusters, a KD-Tree [ 20 ] is used. A KD-Tree is a binary tree, each of whose nodes represents an axis-aligned hyperrectangle. Each node specifies an axis and splits the set of points based on whether their coordinate along that axis is greater than or less than a particular value. The tree is queried to return only neighbors who are at a maximum distance from a point. A Cj is the neighbor of the Ck if the condition in equation (5) is respected :

L1 = mi=da1x jxik i cj j r: ^ (5) where d is the number of dimension, cji the center of the Cj at the dimension i and r the maximal distance from the Ck. In this context, r is set to i. 4

A new hierarchical clustering density-based algorithm

In this section, a new hierarchical clustering density-based algorithm is presented. Inputs are the connections between -clusters from the KD-Tree and the output is a flat partition of clusters where all -clusters that are in clusters have a minimum density level guaranteed. 4.1

Representation of the -clusters connections

The representation of the connections between all -clusters can be visualized by a weighted Graph. A weighted Graph G = (N ; E ; W) is a triplet where N is a set of nodes. A node ni corresponds to the -cluster Ci. E is the set of edges with eij the edge between the node ni and nj , which are unordered pairs of elements of G. Finally, W is a set of weights on E with wij defined in the equation 6. wi;j = min(Di; Dj ): (6) where the density of a -cluster Di is defined in the equation 4.

An edge eij between -clusters Ci and Cj means those are directly connected in the sense defined in the section 3.1. If two -clusters are not directly connected, there is no edge between them which leads to a Graph that is not full. The Graph of -cluster’s connections is built according to the algorithm 1. Neighbors are searched for each -cluster with respect to the equation 5 (lines 3 to 6). The function Search_neighbors() is detailed in the algorithm 2. For each

Ck, the distance L1 defined in equation 5 is applied where r = i, xik the value of Ck at the ith dimension and cij the center of the Cj at the dimension i. The variable Neighbors_of_k contains all the neighbors of Ck. An edge ekj is added to the Graph G for each neighbor Cj of the cluster studied Ck and the weight wkj is calculated with the equation defined above (lines 7 to 11).

Algorithm 1 Build the Graph of -cluster’s connections Require: KD-Tree 1: G = Graph() 2: Connection = () 3: for k = 1 to Number of -clusters do 4: N = [] 5: N = Search_neighbors(k) 6: Connection[k] = N 7: for j = 1 to Nbre of neighbors of k do 8: Weight = min(Dk, Dj ) 9: G.add_edges(k, j, Weight) 10: end for 11: end for Algorithm 2 Research of a -cluster’s neighbors Require: KD-Tree, -cluster Ck 1: for j = 1 to Number of -clusters do 2: Neighbors_of_k = [] 3: if mi=da1x jxik cij j r^then

4: Neighbors_of_k.add(j)

5: end if 6: end for 7: return Neighbors_of_k Let us consider five -clusters C1, C2, C3, C4 and C5 with their densities D1 = 14, D2 = 12, D3 = 3, D4 = 9, D5 = 13. Figure 3 shows those five -clusters.

The search of neighbors begins with the densest -cluster. In this example, the research starts with the C1 as shown in Figure 4. All -clusters that satisfy the equation 5 are neighbors of C1.

Once neighbors of C1 are found, the neighbors of the other -clusters are searched. The result is shown in table 1: -clusters

Neighbors C1 C2 C3 C4 C5

C2 C1, C3 C2, C4 C3, C5

The weights are calculated for every edge following the equation 6 : 8w12 = min(D1; D2) = min(14; 12) = 12; > ><w23 = min(D2; D3) = min(12; 3) = 3; >w34 = min(D3; D4) = min(3; 9) = 3 > :w45 = min(D4; D5) = min(9; 13) = 9: (7) The Figure 5 shows the weighted Graph which represents the connections between all -clusters in the dataset. The objective of the algorithm proposed in this paper is to represent and visualize all the possible clusters at different levels of density in a dataset. In contrast to most known hierarchical clustering algorithm, links that relies level ln and ln + 1 in our new algorithm’s dendogram are not based on the distance between objects but on their densities. Furthermore, our proposal is not based on the objects but on -clusters that contains the objects to decrease the complexity of calculation. At the root of the tree, there is one cluster composed by all -clusters. Each cut in the tree corresponds to a density threshold, i.e each cluster formed below this cut level is composed by -clusters that have a density higher to the density threshold. At the bottom of the tree, there is one cluster for each -cluster. So root’s density level is 0 and the last cut on the tree is max(w 2 W). Like [ 12 ], a variable " is user parameter evolving in the interval " 2 [0; max(w 2 W)]. For each iteration of ", all the weights w 2 W are checked. if wij "^, the edge eij is removed. The vertical axis of the dendogram is 1=D to keep an ascending direction.

Figure 6 represents the case when " = 3. We can observe two clusters composed by -clusters that have their densities strictly higher to three and one -cluster alone. This method allows to detect a split of clusters case and to isolate the least denses -clusters as early as possible. Once all values in the interval of " studied and all edges removed, the dendogram can be generated (see Figure 7).

At the density threshold D = 3, two clusters composed by C1, C2 and C4, C5 are found and the -cluster

C3 is alone. One cluster composed by C1 and C2 is found and C4, C5. Then for D=9, one cluster composed by C3. Finally, above density D = 12, all -clusters are alone. The dendogram generated, a partition of clusters corresponding to a specific density level can be extracted from the Graph of -cluster’s connections with a density threshold according to the algorithm 2. To find clusters, the edges that have a weight strictly lower than the density threshold are removed (lines 1 to 7). The remaining edges in the Graph have their weights higher or equal to the density threshold hence -clusters forming the clusters are guaranteed to have their densities higher or equal to the threshold. Remaining groups are searched in the Graph. If the size of a group is equal to 1, i.e the group have not connection with other -clusters, it is considered as noise. Else, the group is recognized as a cluster (lines 10 to 16).

Once the density threshold known, all the edges that have their weights strictly lower are removed to left only groups of -cluster to form final clusters. Figure 11 shows the final clusters. The first includes C1 and C2, the second includes C4, C5 and the last contains C6, C7. C2 and

C8 are considered as noise because they do not have any connections with the other -clusters.

An other example is shown in Figure 8 to illustrate this part. Let us consider -clusters Ci, i = 1; :::; 8 with their respective densities D1 = 12, D2 = 3, D3 = 10, D4 = 11, D5 = 13, D6 = 9, D7 = 10 and D8 = 4.

In this example, the density threshold is fixed to " = 8. Every group of -clusters that is below the density threshold is considered as a cluster. If a -cluster does not have connection to the other -clusters, then it is considered as noise. The Figure 10 illustrates how to visualize the clusters that have a density strictly above the threshold.

Process inputs are set-points for the cold water, hot water and steam valves. Process outputs are hot and cold water flow, tank level and temperature. Process inputs and outputs represent electronic signals in the range 4-20 mA. The test is done using three output variables: cold water flow CWflow, tank level T anklevel, and temperature of the water in the tank T anktemperature, in the operation mode OP 1. In this mode, these variables are regulated at the values provided in table 2. The process undergoes several faults and several repairs that are reported in table 3.

Variable

CWflow(mA)

T anklevel(mA) T anktemperature(mA)

The measurements of CWflow, T anklevel, and T anktemperature are shown in Figure 13 and their recorded values across time constitute the data set for our hierarchical clustering experiment. Sudden changes in the value of regulated variables are indicative of the occurence of some fault or of some fault being fixed. The dataset was generated by simulation.

For this experiment, the radius of research for a cluster’s neighbors r is set to 0.06. The parameter ' that defines the number of dimensions that must overlap so that two -clusters are considered directly connected is set to 0. That means two -clusters Ck and Ck are directly connected if their hyperboxes overlap in all dimensions. The parameter " is set to 0 for results shown in Figure 16 that correspond to the root of the dendogram. For the following graphs, the x-axis is the flow CWflow normalized and the y-axis is the tank temperature T anktemperature normalized. The tank level T anklevel is not plotted. Figure 14 shows the graph of connections between -clusters when " = 0. Each red square represents a -cluster. Micro-clusters that are not connected to the others are considered as noise. The cluster C293 is connected to C222 and C232, meaning there are directly connected. C222 is connected to C34 and C232 is connected with C41 and so on. This chain of -clusters forms a cluster. Figure 15 shows a generalized dendogram representing the hierarchy of behaviors found in the dataset.

HDyClee detects the 6 main behaviors as shown the Figure 16. The biggest cluster (green cluster) represents the nominal behavior, the blue cluster represents fault s2, the pink cluster (bottom of the Figure) represents the fault s1, the grey cluster models the fault l3, the brown cluster shows the event l1, and the purple cluster represents l1 and l2 present simultaneously. Black points represent -clusters that have no connection with other -clusters. All the objects inside these -clusters are considered as noise.

To visualize only the nominal behavior, the dendogram must be cut at the density threshold " = 7 because above this value, the other clusters have no -clusters that are connected to each other. This is shown on Figure 19. Figure 18 illustrates the graph of -cluster connections after deletion of the edges with a weight wij 7^. Some -clusters which were part of the biggest cluster are now considered as noise. Indeed, edges that connected them to other -clusters were less than the density threshold.

It is possible to visualize the most frequent behaviors of the system, in our case the normal behavior and fault l3. This is reproted in Figure 17. For this purpose, the density threshold is set to " = 6 by using the dendogram. At this density, the clusters corresponding to other behaviors are considered as noise because their maximal densities are less than 6. The work presented in this paper proposes a new hierarchical density-based algorithm named HDyclee. The purpose of this algorithm is to extract a hierarchy of clusters that are guaranteed to have a level of density at each layer. Branches in the dendogram do not represent distance between objects but minimum density difference. This approach allows one to identify clusters with the poorest densities and then walk up the hierarchy for higher densities. The algorithm is detailed and tested on a well known monitoring benchmark. HDyClee is able to detect all the behaviors of the process and the user can explore more or less frequent behaviors by cutting the dendogram at different densities.

Next step is to develop experimentations in order to compare this new algorithm with other density-based algorithms. Then the comparative study will include hierarchical and distance-based clustering methods [ 22 ], [ 23 ]. Several perspectives have been identified for HDyClee, which follow from DyClee properties. In particular, DyClee has a forgetting function that allows to forget -clusters which do not receive any data or are not significant (not denses clusters). This function will be included in HDyClee, which will allow us to produce a dynamic dendogram and then to visualize the evolution of the different behaviors of a system.

A density-sensitive hierarXiv preprint arXiv:

[1] James

MacQueen

et al. Some methods for classification and analysis of multivariate observations . In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability , volume 1 , pages 281 - 297 . Oakland, CA, USA, 1967 .

[2]

Leonard

Kaufman and

Peter

Rousseeuw . Clustering by means of medoids . North-Holland, 1987 .

[3]

Zhexue

Huang . Extensions to the k-means algorithm for clustering large data sets with categorical values . Data mining and knowledge discovery , 2 ( 3 ): 298 - 304 , 1998 .

Martin

Ester , Hans-Peter Kriegel , Jörg Sander, Xiaowei Xu , et al. A density-based algorithm for discovering clusters in large spatial databases with noise . In Knowledge Discovery Data Mining , volume 96 , pages 226 - 231 , 1996 .

Maria

Halkidi , Yannis Batistakis, and

Michalis

Vazirgiannis . On clustering validation techniques . Journal of intelligent information systems , 17 ( 2-3 ): 107 - 145 , 2001 .

Olatz

Arbelaitz , Ibai Gurrutxaga, Javier Muguerza, Jesùs M Pérez , and Iñigo Perona . An extensive comparative study of cluster validity indices . Pattern Recognition , 46 ( 1 ): 243 - 256 , 2013 .

Yanchi

Liu ,

Zhongmou

Li ,

Hui

Xiong ,

Xuedong

Gao , and

Junjie

Wu . Understanding of internal clustering validation measures . In Data Mining (ICDM) , 2010 IEEE 10th International Conference on, pages 911 - 916 . IEEE, 2010 .

[15] Nathalie

Barbosa

, Louise Travé-Massuyès, and Victor Hugo Grisales. A data-based dynamic classification technique: A two-stage density approach . In SAFEPROCESS 2015, Proceedings of the 9th IFAC Symposium on Fault Detection, Supervision and Safety for Technical Processes , pages 1224 - 1231 . IFAC, 2015 .

[16] Charu

C Aggarwal

, Jiawei Han,

Jianyong

Wang , and Philip S Yu . A framework for clustering evolving data streams . In Proceedings of the 29th international conference on Very large data bases- Volume 29 , pages 81 - 92 . VLDB Endowment, 2003 .

[17] Philipp

Kranen

, Ira Assent, Corinna Baldauf, and

Thomas

Seidl . The ClusTree: indexing micro-clusters for any stream mining . Knowledge and information systems , 29 ( 2 ): 249 - 272 , 2011 .

[18] Feng

Cao

, Martin Ester, Weining Qian, and

Aoying

Zhou . Density-based clustering over an evolving data stream with noise . In Proceedings of the 2006 SIAM international conference on data mining , pages 328 - 339 . SIAM, 2006 .

[19]

Yixin

Chen and

Tu . Density-based clustering for real-time stream data . In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data mining , pages 133 - 142 , 2007 .

[20]

Songrit

Maneewongvatana and David M Mount. It's okay to be skinny, if your friends are fat . In Center for Geometric Computing 4th Annual Workshop on Computational Geometry , volume 2 , pages 1 - 8 , 1999 .

[21] Nina

F Thornhill

, Sachin C Patwardhan , and Sirish L Shah . A continuous stirred tank heater simulation model with applications . Journal of Process Control , 18 ( 3-4 ): 347 - 360 , 2008 .

[22]

Adrián

Rodríguez

Ramos , José Manuel Bernal de Lázaro, Antônio J da Silva

Neto , Carlos Cruz Corona, José Luís Verdegay, and Orestes LlanesSantiago. An approach to fault diagnosis using fuzzy clustering techniques . In Advances in Fuzzy Logic and Technology 2017 , pages 232 - 243 . Springer, 2017 .

[23] Yaguo

Lei

, Zhengjia He, Yanyang Zi , and Xuefeng Chen . New clustering algorithm-based fault diagnosis using compensation distance evaluation technique . Mechanical Systems and Signal Processing , 22 ( 2 ): 419 - 435 , 2008 .

[8] Stephen

Johnson . Hierarchical clustering schemes . Psychometrika , 32 ( 3 ): 241 - 254 , 1967 .

Robin

Sibson . Slink: an optimally efficient algorithm for the single-link cluster method . The computer journal , 16 ( 1 ): 30 - 34 , 1973 .

[10]

Daniel

Defays . An efficient algorithm for a complete link method . The computer journal , 20 ( 4 ): 364 - 366 , 1977 .

[11] Álvaro Martínez-Pérez. archical clustering method. 1210.6292 , 2012 .

[12] Ricardo

JGB Campello

, Davoud Moulavi, and

Jörg

Sander . Density-based clustering based on hierarchical density estimates . In Pacific-Asia conference on Knowledge Discovery and Data mining , pages 160 - 172 . Springer, 2013 .

[13] Nathalie

Barbosa

, Louise Travé-Massuyès, and Victor Hugo Grisales. A novel algorithm for dynamic clustering: Properties and performance . In Machine Learning and Applications (ICMLA) , 2016 15th IEEE International Conference on, pages 565 - 570 . IEEE, 2016 .

[14]

Nathalie

Barbosa . A data-based approach for dynamic classification of functional scenarios oriented