<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cluster Ensemble with Averaged Co-Association Matrix Maximizing the Expected Margin</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vladimir Berikov</string-name>
          <email>berikov@math.nsc.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Novosibirsk State University</institution>
          ,
          <addr-line>Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Sobolev Institute of Mathematics</institution>
          ,
          <addr-line>Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>489</fpage>
      <lpage>500</lpage>
      <abstract>
<p>This paper considers cluster analysis using the ensemble approach. It proposes a method for finding optimal weights for the averaged co-association matrix applied to the construction of the ensemble partition. The main idea is to find the weights for which the expectation of the ensemble margin takes its maximum value. A latent variable pairwise classification model is used to determine margin characteristics dependent on cluster validity indices. To construct the ensemble partition, we apply a minimum spanning tree built on the averaged co-association matrix taken as an adjacency matrix. The efficiency of the method is confirmed by Monte Carlo simulations with artificial data sets.</p>
      </abstract>
      <kwd-group>
        <kwd>cluster ensemble</kwd>
        <kwd>weighted co-association matrix</kwd>
        <kwd>latent variable model</kwd>
        <kwd>cluster validity index</kwd>
        <kwd>margin expectation</kwd>
        <kwd>minimum spanning tree</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>Copyright © by the paper's authors. Copying permitted for private and academic purposes.</p>
      <p>At each step of the algorithm, the current partition is updated to improve the quality functional. As a rule, the process is guided by certain user-specified parameters.</p>
<p>Collective decision making (ensemble clustering) is a comparatively new approach
in cluster analysis [6,7]. Following the ensemble framework, a number of base partitions
are first obtained with a collection of different algorithms (or with a single algorithm
under different parameters or other working settings). In the second step, the given
variants are combined to yield the final consensus partition. As a rule, the ensemble
approach allows one to increase the stability of clustering results in cases of uncertainty
about the data model, or when it is not clear which of an algorithm's parameters are
most appropriate for a particular problem.</p>
<p>A number of ways to find the ensemble solution exist. In the co-association (CA)
matrix based approach, each pair of objects ai, aj ∈ A is associated with a matrix
element indicating how often the objects are assigned to different clusters over all
partition variants. This value is considered an analog of the distance between ai and aj. To
construct the final partition, any procedure based on a pairwise distance matrix can
be used, for example, a hierarchical agglomerative clustering (HAC) algorithm.</p>
      <p>Some authors suggest weighted cluster ensemble aimed at improving its
characteristics. The weights depend on the estimated quality (cluster validity indices, diversity
measures, etc.) of base partitions [8,9].</p>
<p>The work [10] considers a weighted cluster ensemble composed of different
algorithms. For determining the ensemble solution, it is proposed to use a weighted averaged
CA matrix, where the weights are found according to the principle of minimum variance
of the ensemble margin (closely related to the maximization of the expected margin). This
principle is substantiated by the analysis of probabilistic properties of cluster
ensembles in the framework of a latent variable pairwise classification (LVPC) model. The
suggested method (Pairwise Weighted Ensemble Clustering, PWEC) has demonstrated
an ability to find complex data structures under noise conditions; however, it considers
only variations of cluster labels and disregards potentially useful information on other
characteristics of base partitions such as cluster validity indices. Another disadvantage
is that it cannot automatically determine the number of clusters.</p>
      <p>The proposed work aims at eliminating the limitations mentioned above: we suggest
a method that
a) assigns weights taking into account both stability measures of base clusterings and
the obtained quality estimates;
b) finds the optimal number of clusters using minimum spanning tree (MST) algorithm.</p>
<p>The rest of the paper is organized as follows. The second section briefly introduces
the necessary notions. In the third section we describe the concept of ensemble
margin and obtain an expression for the expected margin using the LVPC model. This
section also suggests a solution to the problem of assessing the characteristics of the
ensemble and finding the optimal weights which give the maximum value to the
expected margin. In the fourth section we formulate the algorithm for constructing the
ensemble clustering partition with the MST approach. The fifth section presents the
results of numerical experiments using Monte Carlo simulations with data sampled from
a given probability distribution. The conclusion summarizes the work.</p>
    </sec>
    <sec id="sec-2">
      <title>Basic concepts</title>
<p>This section describes the method of ensemble clustering with a weighted CA matrix and
provides examples of the cluster validity indices used in this work.</p>
      <sec id="sec-2-1">
        <title>Cluster ensemble with usage of weighted co-association matrix</title>
<p>In the framework of the ensemble approach, we consider a collection of different
clustering algorithms μ1, . . . , μM applied to data set A. Suppose that each algorithm μm,
m = 1, . . . , M, may be run a number of times under different conditions such as
initial centroid coordinates, subsets of features, number of clusters, etc. In each lth
run, it generates a partition composed of Kl,m clusters, where l = 1, . . . , Lm, and Lm is a
given number of runs of algorithm μm.</p>
<p>The quality of each variant is evaluated with some criterion (e.g., a cluster validity
index) γl,m. We assume that the indices are properly standardized so
that a) 0 ≤ γl,m ≤ 1, and b) the more compact and distant the found clusters are, the
larger is the index value.</p>
        <p>For a pair of different objects ai, aj, we define the value</p>
<p>hl,m(i, j) = I[μl,m(ai) ≠ μl,m(aj)],
where I[·] is the indicator function: I[true] = 1, I[false] = 0; μl,m(a) is the cluster label
assigned by algorithm μm to object a in the lth run.</p>
<p>The averaged weighted CA matrix H = (h̄(i, j)) is calculated over all found variants,
taking validity indices into account. We set the matrix elements as</p>
<p>
          h̄(i, j) = Σ_{m=1}^{M} Σ_{l=1}^{Lm} wl,m(i, j) hl,m(i, j),
          (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
where wl,m(i, j) ≥ 0 are weight constants. The weights should account for the degree
of confidence in the results of cluster assignments for a given pair. In the suggested
algorithm (see below) the weights depend on validity indices and other characteristics
of the ensemble.
        </p>
<p>It is easy to prove that H defines a pseudo-metric on A; thus the matrix elements can be
considered analogs of distances between pairs of objects. To find the final partition
of A into a predefined number of clusters using matrix H, one can apply the HAC algorithm
with the average linkage rule for the definition of between-group distances.</p>
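<p>As an illustration, the computation of the averaged weighted CA matrix can be sketched as follows (a minimal Python sketch written for this text, assuming for simplicity one scalar weight per run rather than the pairwise weights wl,m(i, j) introduced later; the function name is ours):</p>

```python
import numpy as np

def weighted_coassociation(labelings, weights):
    """Averaged weighted CA matrix H (cf. Eq. (1)), a simplified sketch.

    labelings: list of 1-D integer label arrays, one per base partition run.
    weights:   one non-negative scalar weight per run (simplifying assumption).
    H[i, j] is the weighted frequency of assigning objects i and j to
    *different* clusters; it behaves like a pairwise distance.
    """
    n = len(labelings[0])
    H = np.zeros((n, n))
    for labels, w in zip(labelings, weights):
        labels = np.asarray(labels)
        # h_{l,m}(i, j) = I[the pair is split by this partition]
        H += w * (labels[:, None] != labels[None, :]).astype(float)
    return H

# toy usage: two base partitions of four objects
H = weighted_coassociation([[0, 0, 1, 1], [0, 0, 0, 1]], weights=[0.5, 0.5])
```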
      </sec>
      <sec id="sec-2-2">
        <title>Cluster validity indices</title>
<p>Cluster validity indices make it possible to compare the obtained partition with
some reference ("etalon") variant (external indices), or to assess the result of clustering with
various measures of cluster compactness and separability (internal indices). A majority
of existing indices require quadratic time for their calculation; however, indices of linear
complexity exist.</p>
<p>External indices estimate the degree of correspondence between two partitions
P1 = {C1,1, . . . , Ck,1, . . . , CK1,1} and P2 = {C1,2, . . . , Cl,2, . . . , CK2,2}, where Ck,1 =
{ai1, . . . , aiNk,1}, Cl,2 = {aj1, . . . , ajNl,2}, Nk,1 is the number of objects in the kth cluster of
the first partition, and Nl,2 is the number of objects in the lth cluster of the second partition.</p>
        <p>The Rand index is defined as</p>
<p>ϕR(P1, P2) = (S + D)/G0,
where S is the number of pairs belonging to the same cluster in both P1 and P2, D is the number
of pairs belonging to different clusters in both partitions, and G0 = C(n, 2) = n(n − 1)/2 is the total number of pairs. The index
belongs to the interval [0, 1] and defines the portion of correctly classified pairs; the value
ϕR = 1 indicates perfect agreement between the two partitions.</p>
<p>The adjusted Rand index [11] is defined with the following expression:
ϕAR(P1, P2) = [ Σ_{k,l} C(N^{(k,l)}, 2) − Q1 Q2 / G0 ] / [ (Q1 + Q2)/2 − Q1 Q2 / G0 ],
where Q1 = Σ_k C(Nk,1, 2), Q2 = Σ_l C(Nl,2, 2), N^{(k,l)} = |Ck,1 ∩ Cl,2|, and C(N, 2) = N(N − 1)/2
denotes the number of pairs. The maximum value
of ϕAR is 1; the index may take negative values. A value close to 0 indicates that the
obtained cluster partition is most probably accidental.</p>
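<p>Both indices can be computed directly from two labelings; the following sketch (an illustrative helper of ours, with pair counting done naively in O(n²)) mirrors the formulas above:</p>

```python
from collections import Counter
from itertools import combinations
from math import comb

def rand_indices(labels1, labels2):
    """Plain Rand index by direct pair counting; adjusted Rand index via the
    contingency table. An illustrative O(n^2) sketch, not the paper's code."""
    n = len(labels1)
    g0 = comb(n, 2)
    s = d = 0
    for i, j in combinations(range(n), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        s += same1 and same2              # joined in both partitions
        d += (not same1) and (not same2)  # split in both partitions
    rand = (s + d) / g0
    # adjusted Rand from cluster-size counts and the contingency table
    q1 = sum(comb(c, 2) for c in Counter(labels1).values())
    q2 = sum(comb(c, 2) for c in Counter(labels2).values())
    idx = sum(comb(c, 2) for c in Counter(zip(labels1, labels2)).values())
    expected = q1 * q2 / g0
    ari = (idx - expected) / (0.5 * (q1 + q2) - expected)
    return rand, ari

# the same grouping under relabeled cluster ids gives perfect agreement
r, ar = rand_indices([0, 0, 1, 1], [1, 1, 0, 0])
```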
        <p>Internal indices are determined for a single partition P = {C1, . . . , CK }. This work
uses two examples of internal indices: Hubert Gamma index and cophenetic correlation
coefficient.</p>
<p>The Hubert Gamma index, Γ(P), is specified as the linear correlation coefficient between
the elements of matrices R(i, j) and U(i, j) (i &lt; j), where R is the matrix of pairwise
Euclidean distances between data points, and U is the unweighted co-association matrix: U(i, j) = 0 if
∃Cq : ai, aj ∈ Cq; otherwise U(i, j) = 1. An increase in the index value typically indicates
that the quality of the partition is improving.</p>
<p>The cophenetic correlation coefficient, Coph(P), is defined with a dendrogram of the
clustering partition. The dendrogrammatic distance τ(i, j) between ai and aj is the height of
the node at which these two objects are first joined together. Coph(P) equals the
linear correlation coefficient between the Euclidean distances R(i, j) and dendrogrammatic
distances τ(i, j) over all i, j (i &lt; j).</p>
<p>Both Γ(P) and Coph(P) belong to the interval [−1, 1]; it will be more convenient for
us to replace their negative values with zeros.</p>
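<p>A small sketch of the Hubert Gamma computation (our own illustrative code, not the paper's implementation):</p>

```python
import numpy as np

def hubert_gamma(X, labels):
    """Linear correlation between pairwise Euclidean distances R(i, j) and
    the unweighted co-association values U(i, j) over pairs i < j, where
    U(i, j) = 0 if the pair shares a cluster and 1 otherwise."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)      # indices of pairs i < j
    R = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))[iu]
    U = (labels[:, None] != labels[None, :]).astype(float)[iu]
    return float(np.corrcoef(R, U)[0, 1])

# two compact, well-separated clusters: R and U agree almost perfectly
g = hubert_gamma([[0, 0], [0.1, 0], [5, 5], [5.1, 5]], [0, 0, 1, 1])
```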
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Margin of cluster ensemble</title>
<p>Suppose that the data sample is a mixture of a finite number of components (classes). A
latent ground-truth variable Y defines the class number to which an object belongs.
Matrix Z with elements</p>
      <p>
        Z(i, j) = I[ Y(ai) ≠ Y(aj) ],
        (
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
where ai and aj are arbitrary data objects, defines their true status (i.e., whether ai and
aj indeed belong to separate classes). Let us define the margin matrix dependent on the
variants of cluster partitions. For a pair ai, aj, their margin is
mg(i, j) = {weighted number of votes for Z(i, j)} − {weighted number of votes against Z(i, j)}
and can be written as
      </p>
      <p>
        mg(i, j) = Σ_{m=1}^{M} Σ_{l=1}^{Lm} wl,m(i, j) { I[hl,m(i, j) = Z(i, j)] − I[hl,m(i, j) ≠ Z(i, j)] }.
      </p>
      <p>This value indicates to what extent the number of right decisions for ai, aj exceeds the
number of wrong ones. Evidently, it equals</p>
      <p>
        mg(i, j) = Σ_{m=1}^{M} Σ_{l=1}^{Lm} wl,m(i, j) (2Z(i, j) − 1)(2hl,m(i, j) − 1).
        (
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
      </p>
<p>The margin matrix cannot be calculated if the true partition is unknown. However, it
was shown in [10] that some of the margin's characteristics can be assessed using a few
basic assumptions on the statistical behavior of clustering algorithms. Furthermore,
an upper bound for the error probability (at assigning objects in a pair
to different clusters) was found, and it was proved that an increase in the expected margin and
a decrease in margin variance reduce the error, thus improving the quality of the
cluster ensemble.</p>
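<p>For a single pair with known true status Z, the margin reduces to a weighted sum of agreements; a minimal sketch (the function name and the flat per-run weight vector are illustrative assumptions of ours):</p>

```python
import numpy as np

def ensemble_margin(h_votes, Z, weights):
    """Margin for one object pair: weighted votes agreeing with the true
    status Z minus weighted votes against it, using the identity from the
    text: I[h = Z] - I[h != Z] = (2Z - 1)(2h - 1).

    h_votes: 0/1 decisions h_{l,m} (1 = the pair is split) over all runs;
    weights: the corresponding non-negative weights w_{l,m}.
    """
    h = np.asarray(h_votes, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * (2 * Z - 1) * (2 * h - 1)))

# pair truly split (Z = 1): two runs vote "split", one votes "join"
m = ensemble_margin([1, 1, 0], Z=1, weights=[0.4, 0.4, 0.2])
```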
      <sec id="sec-3-1">
        <title>Conditional expectation of margin</title>
<p>
          Consider a specific form of (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ): let each weight be a function of a validity index and a quantity αm(i, j) ≥ 0:
wl,m(i, j) = αm(i, j) γl,m / Lm,
          (
          <xref ref-type="bibr" rid="ref4">4</xref>
          )
        </p>
        <p>
          where Σ_m αm(i, j) = 1 for all i, j (in (
          <xref ref-type="bibr" rid="ref4">4</xref>
          ), we separate a component dependent on the validity index).
        </p>
<p>
          Below we formulate the basic assumptions which allow us to assess characteristics of the
cluster ensemble:
a) Data table X is arbitrary and fixed; for any pair of data objects ai, aj, chosen at
random from A, their true status is a random value Z(i, j) defined with (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ).
b) Each algorithm μm is randomized, i.e., it depends on a random vector Ωm from a
set of allowable parameters, m = 1, . . . , M. For data table X as input, μm is run Lm
times with i.i.d. parameters Ω1,m, . . . , ΩLm,m being statistical copies of Ωm.
        </p>
        <p>It should be noted that the i.i.d. condition must be ensured through a mechanism
of the ensemble generation.</p>
<p>
          For any algorithm μm, the conditional probabilities of correct decisions (either
separation or joining of ai, aj under fixed Z(i, j)) are denoted as
qm1(i, j) = P[hm(i, j, Ωm) = 1 | Z(i, j) = 1],
          (
          <xref ref-type="bibr" rid="ref5">5</xref>
          )
qm0(i, j) = P[hm(i, j, Ωm) = 0 | Z(i, j) = 0].
          (
          <xref ref-type="bibr" rid="ref6">6</xref>
          )
        </p>
<p>The measure of cluster validity obtained by algorithm μm on Ωm is considered a
random value γm(X, Ωm). Because the quality criterion is determined on the whole
data set, one may regard this value as practically independent of the quantities Z(i, j),
hl,m(i, j), etc., defined for a specific object pair.</p>
<p>Proposition 1. Provided assumptions a) and b) hold, the conditional mathematical
expectation of the ensemble margin for ai, aj under Z(i, j) = z is</p>
        <p>EΩ̄[mg(i, j) | Z(i, j) = z] = Σ_m αm(i, j) E[γm] (2pm(z; i, j) − 1),</p>
        <p>where Ω̄ = (Ω1,1, . . . , ΩLM,M), E[γm] = EΩm[γm(Ωm)], and pm(z; i, j) = (1 − z) qm0(i, j) +
z qm1(i, j).</p>
        <p>Proof. Let us denote by</p>
<p>hl,m(i, j, Ωl,m) = I[μm(i, Ωl,m) ≠ μm(j, Ωl,m)]
the decision for a pair ai, aj, where μm(i, Ωl,m) is the cluster label assigned to object ai
by algorithm μm in its lth run with parameter vector Ωl,m, and also denote by
hm(i, j, Ωm) = I[μm(i, Ωm) ≠ μm(j, Ωm)]
the decision under vector Ωm.</p>
<p>
          Until the end of the proof, let us omit the arguments i, j for short: mg(i, j) = mg, Z(i, j) = Z,
etc. From (
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) and (
          <xref ref-type="bibr" rid="ref4">4</xref>
          ) we have:
        </p>
<p>EΩ̄[mg | Z = z] = Σ_m αm (1/Lm) Σ_{l=1}^{Lm} EΩ̄[ γl,m(Ωl,m) (2z − 1)(2hl,m(Ωl,m) − 1) ].</p>
        <p>Because Ωl,m and Ωm are equally distributed and γl,m is independent of hl,m, it holds
true that</p>
        <p>EΩ̄[mg | Z = z] = Σ_m αm E[γm(Ωm)] (2z − 1)(2EΩm[hm(Ωm)] − 1).</p>
        <p>For z = 0 we have:
(2z − 1)(2EΩm[hm(Ωm)] − 1) = −(2P[hm(Ωm) = 1 | Z = 0] − 1) = 2qm0 − 1;
and for z = 1:</p>
<p>(2z − 1)(2EΩm[hm(Ωm)] − 1) = 2P[hm(Ωm) = 1 | Z = 1] − 1 = 2qm1 − 1.</p>
        <p>Therefore,</p>
        <p>EΩ̄[mg | Z = z] = Σ_m αm E[γm(Ωm)] (2pm(z; i, j) − 1).</p>
        <p>This completes the proof.</p>
        <p>
          In this work we also assume that each algorithm μm creates a partition for which
the probability of a correct decision for any pair ai, aj is greater than 0.5, i.e.,
qm0(i, j) &gt; 0.5, qm1(i, j) &gt; 0.5.
          (
          <xref ref-type="bibr" rid="ref7">7</xref>
          )
This means that the clustering algorithms included in the ensemble possess at least slightly
better classification accuracy than a trivial procedure based on random assignment of
a pair to the same or different clusters.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation of the expected margin and its optimization</title>
<p>The main idea of our method is to control the behavior of the cluster ensemble by changing
the weights α1, . . . , αM, looking for the maximum conditional expectation of the margin. The
expression for the expected margin depends on a number of characteristics (pm(z; i, j),
E[γm]) which should be evaluated from the results of the base clustering algorithms, i.e.,
a number of partitions along with the evaluated validity indices γl,m, l = 1, . . . , Lm, m =
1, . . . , M.</p>
<p>To evaluate pm(z; i, j), one may consider the obtained partitions and calculate
co-association matrices with elements</p>
        <p>hl,m(i, j) = I[μl,m(ai) ≠ μl,m(aj)].</p>
        <p>For each (i, j), the division frequency is P̄m(i, j) = (1/Lm) Σ_{l=1}^{Lm} hl,m(i, j). Each P̄m(i, j) is
an estimate of the conditional probability</p>
<p>
          P[hm(i, j, Ωm) = 1 | Z(i, j) = z],
depending on the realization of Z(i, j). Provided assumption (
          <xref ref-type="bibr" rid="ref7">7</xref>
          ) holds, one may
assess pm(z; i, j) with the quantity
        </p>
        <p>p¯m(i, j) = max(P¯m(i, j), 1 − P¯m(i, j)).</p>
        <p>This value can be interpreted as an estimate of local stability of cluster decisions for a
pair ai, aj .</p>
<p>To evaluate the theoretical means E[γm], m = 1, . . . , M, one may use sample averages
of the validity indices γ̄m = (1/Lm) Σ_{l=1}^{Lm} γl,m.</p>
        <p>Let us consider the problem of maximization of the expected margin:
for each i, j (i &lt; j), find α1(i, j), . . . , αM(i, j) such that
Σ_m αm(i, j) γ̄m (2p̄m(i, j) − 1) → max over α1(i, j), . . . , αM(i, j).</p>
        <p>The solution to this linear programming problem is rather straightforward:
α*m(i, j) = 1 if m = m* = argmax_{m=1,...,M} {γ̄m p̄m(i, j)}, and α*m(i, j) = 0 otherwise.
          (
          <xref ref-type="bibr" rid="ref8">8</xref>
          )
        </p>
<p>As one can see, in the resulting optimization strategy only the best algorithm from
the ensemble (considering quality and stability estimates) is taken into account. For
different pairs ai, aj, different algorithms may come out on top.</p>
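<p>The resulting rule can be sketched in a few lines: estimate the local stability p̄m(i, j) from the division frequencies and, for every pair, place all weight on the algorithm maximizing γ̄m p̄m(i, j) (an illustrative vectorized sketch; the function name and array shapes are our assumptions):</p>

```python
import numpy as np

def optimal_alpha(P_split, gammas):
    """Weight selection sketch.

    P_split: (M, n, n) division frequencies P_m(i, j) of the M algorithms;
    gammas:  length-M sample means of validity indices.
    Returns 0/1 weights alpha of shape (M, n, n): for every pair, all weight
    goes to the algorithm maximizing gamma_m * p_m(i, j).
    """
    P = np.asarray(P_split, dtype=float)
    g = np.asarray(gammas, dtype=float)
    p = np.maximum(P, 1.0 - P)            # local stability p_m(i, j)
    best = np.argmax(g[:, None, None] * p, axis=0)
    M = P.shape[0]
    return (np.arange(M)[:, None, None] == best[None, :, :]).astype(float)

# two algorithms, two objects: the stable first algorithm wins the pair (0, 1)
alpha = optimal_alpha(
    np.array([[[0.0, 1.0], [1.0, 0.0]],
              [[0.0, 0.5], [0.5, 0.0]]]),
    gammas=[0.5, 0.9])
```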
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Finding ensemble partition with minimum spanning tree</title>
<p>
        Let G = (V, E) be an undirected graph with non-negative edge weights w, where V is the
set of vertices (in our case, V = A, i.e., the set of data objects) and E is the set of edges.
Matrix w = (wi,j) defines the weights of edges (ai, aj), where i, j = 1, . . . , n. As a weight
matrix, we consider the averaged CA matrix H with elements h̄(i, j) defined according
to (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ), (
        <xref ref-type="bibr" rid="ref8">8</xref>
        ).
      </p>
<p>A spanning tree is a tree with n − 1 edges, i.e., a tree that connects all the vertices.
The total weight of a spanning tree T is defined as wT = Σ_{e∈T} w(e). A minimum
spanning tree is a spanning tree of minimum total weight.</p>
<p>A large number of methods for MST construction exist [12]. Their running time is
of polynomial order (in some special cases, nearly linear).</p>
<p>In a graph-theoretic framework, computationally efficient MST-based algorithms
are widely used in cluster analysis [13]. Once the MST T is built for a given data
set, the k − 1 edges with the largest weights are removed, which creates a partition Pk
of A into k clusters. The obtained variant is evaluated with an objective function; the
best partition over k ∈ {2, . . . , Kmax} forms the output (Kmax is the algorithm's
parameter). MST clustering algorithms are capable of finding complex non-spherical
clusters; they can automatically determine the number of clusters using a predefined
quality functional.</p>
<p>In this paper, we apply the well-known Kruskal's algorithm [14] for finding the MST for
data set A and matrix H. Using the obtained MST, the final ensemble partition is
created. As a quality criterion, we use the following functional:
F(Pk) = k Π_{c=1}^{k} Nc wc / wT,
        (
        <xref ref-type="bibr" rid="ref9">9</xref>
        )
where Nc is the size of the cth cluster and wc is the total weight of the edges which enter into the cth cluster.</p>
      <p>MST-based ensemble clustering algorithm (MSTEClust) has the following main
steps.</p>
      <sec id="sec-4-1">
        <title>Algorithm MSTEClust.</title>
      </sec>
      <sec id="sec-4-2">
        <title>Input:</title>
<p>
          A = {a1, . . . , an}: dataset described with data table X;
Kmax: maximum number of clusters;
L1, . . . , LM: numbers of runs for base algorithms μ1, . . . , μM;
Ω1, . . . , ΩM: sets of allowable parameters (working conditions) of the algorithms.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Output:</title>
        <p>partition of A into optimal number K of clusters (2 ≤ K ≤ Kmax).</p>
<p>
          Steps:
1. Generate L1, . . . , LM variants of cluster partition with algorithms μ1, . . . , μM for
randomly chosen working parameters; calculate validity indices γl,m, l = 1, . . . , Lm, m =
1, . . . , M.
2. Calculate the averaged co-association matrix H with optimal weights according to (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ),
(
          <xref ref-type="bibr" rid="ref4">4</xref>
          ), (
          <xref ref-type="bibr" rid="ref8">8</xref>
          ).
3. Construct the MST on graph (V, E) = (A, H) using Kruskal's algorithm.
4. Find a partition of A into the optimal number of clusters K using the MST and criterion (
          <xref ref-type="bibr" rid="ref9">9</xref>
          ).
end.
        </p>
<p>The running time of steps 1-2 depends linearly on the complexity of the base clustering
algorithms and validity estimators. Constructing the final partition requires O(n log n)
operations. One may conclude that MSTEClust is more readily applicable to large
amounts of data than PWEC [10], which has time complexity of order O(n²).</p>
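<p>Steps 3-4 of MSTEClust can be illustrated with a self-contained sketch: Kruskal's algorithm with a union-find structure builds the MST on H, and cutting the k − 1 heaviest tree edges yields k clusters (the number of clusters is fixed here for brevity; selecting K by criterion (9) is omitted, and the function name is ours):</p>

```python
import numpy as np

def mst_partition(H, k):
    """Kruskal's MST on the CA matrix H, then remove the k - 1 heaviest
    tree edges; the resulting connected components are the k clusters."""
    n = H.shape[0]
    parent = list(range(n))

    def find(x):
        # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((H[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # edge joins two components
            parent[ri] = rj
            tree.append((w, i, j))
    tree.sort()                           # keep the n - k lightest tree edges
    parent = list(range(n))
    for w, i, j in tree[: len(tree) - (k - 1)]:
        parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

# toy CA matrix: objects {0, 1} and {2, 3} are rarely split within groups
H = np.array([[0.0, 0.1, 0.9, 0.9],
              [0.1, 0.0, 0.9, 0.9],
              [0.9, 0.9, 0.0, 0.1],
              [0.9, 0.9, 0.1, 0.0]])
labels = mst_partition(H, k=2)
```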
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Numerical experiments</title>
<p>This section describes numerical experiments with the MSTEClust algorithm using Monte
Carlo modeling. In the experiments, two base clustering algorithms are included in
the ensemble: k-means (KM) and the hierarchical agglomerative clustering algorithm with
the single linkage rule (HACSL).</p>
<p>We choose the following distribution model. In a 16-dimensional feature space, four
classes are defined. The first and second classes have normal distributions N(ν1, Σ) and
N(ν2, Σ), where ν1 = (0, ..., 0)T, ν2 = (6, ..., 6)T, and Σ = σI is a diagonal covariance matrix
with σ = 1. The coordinates of objects from the other two classes are determined recursively:
x^k_{i+1} = x^k_i + θ1 · 1 + θ2 · ε, where 1 = (1, . . . , 1)T, θ1 = θ2 = 0.25, ε is a Gaussian
random vector N(0, I), and k = 3, 4. For class 3, x^3_1 = (−6, 6, ..., −6, 6)T + 0.1 · ε; for class
4, x^4_1 = (6, −6, ..., 6, −6)T + 0.1 · ε. The number of objects in each class equals 25.
Figure 1 illustrates sampled data.</p>
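<p>Under our reading of the model, sampling can be sketched as follows (a hypothetical generator written for this text; the seed and function name are ours):</p>

```python
import numpy as np

def sample_model(n_per_class=25, dim=16, seed=0):
    """Two spherical Gaussian classes around 0 and 6*(1,...,1), plus two
    'chain' classes generated by x_{i+1} = x_i + 0.25*1 + 0.25*eps,
    started near (-6, 6, ..., -6, 6) and its negation."""
    rng = np.random.default_rng(seed)
    ones = np.ones(dim)
    c1 = rng.normal(0.0, 1.0, size=(n_per_class, dim))
    c2 = rng.normal(6.0, 1.0, size=(n_per_class, dim))
    alt = np.tile([-6.0, 6.0], dim // 2)   # alternating starting point
    chains = []
    for start in (alt, -alt):
        x = start + 0.1 * rng.normal(size=dim)
        pts = [x]
        for _ in range(n_per_class - 1):
            x = x + 0.25 * ones + 0.25 * rng.normal(size=dim)
            pts.append(x)
        chains.append(np.array(pts))
    X = np.vstack([c1, c2] + chains)
    y = np.repeat([0, 1, 2, 3], n_per_class)
    return X, y

X, y = sample_model()
```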
<p>The random subspace method is used for the ensemble construction: each base
cluster partition is built on dens randomly chosen variables (dens = 3 in this experiment).
In addition, for each base algorithm the number of clusters is chosen at random from
the range {2, . . . , Kmax}, Kmax = 5. The obtained clustering results are assessed with
cluster validity indices: the Hubert Gamma index estimates the quality of KM; the results
of HACSL are evaluated with the cophenetic correlation coefficient. The number of ensemble
variants for each algorithm is set to L1 = L2 = 25.</p>
<p>To study the behavior of the ensemble algorithm in the presence of noise, some of
the features, whose indices are chosen at random and whose total number equals the
parameter d0, are replaced with random values uniformly distributed over the feature range.</p>
<p>In the process of Monte Carlo modeling, artificial data sets are repeatedly generated
according to the specified distribution model. For each data set, a number of base
partitions are obtained by KM and HACSL; the ensemble solution is found according
to MSTEClust.</p>
<p>Figure 2 presents the results of the experiments (Adjusted Rand index averaged
over 100 random samples). For comparison purposes, the results of KM and HACSL
(for which the true number of clusters is given) are plotted as well. To illustrate the
effect of weight optimization taking validity indices into account, Figure 2 also displays
evaluations of the performance of a PWEC [10] modification in which the MST is used
to construct the ensemble partition instead of a dendrogram algorithm.</p>
<p>The plots provide positive evidence for the suggested ensemble algorithm with optimized
weights. Beginning with a sufficiently large number of noise features, MSTEClust gives
significantly better clustering quality than the other examined procedures. In contrast,
the non-ensemble algorithms totally fail to recover a heavily distorted data structure.</p>
<p>
        An important problem is the experimental verification of the theoretical suppositions which
lie at the basis of the suggested method. The validity of assumption (
        <xref ref-type="bibr" rid="ref7">7</xref>
        ) is crucial
because it determines the way of assessing the expected margin. In Monte Carlo
simulation, the true status of each object pair is known, thus it is possible to evaluate the
conditional probabilities qm1(i, j), qm0(i, j) in (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ), (
        <xref ref-type="bibr" rid="ref6">6</xref>
        ) using the available frequencies. In this
experiment, all the above-described settings are unchanged. Table 1 presents the results
of the estimations. From this table, one can see that under small and
moderate noise (up to three noise features) the estimates stay above 0.5, thus confirming
assumption (
        <xref ref-type="bibr" rid="ref7">7</xref>
        ). Under increasing noise, violation of the inequalities is seen, evidently
resulting in the decrease of the algorithm's accuracy.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
<p>In this work we have suggested a method of ensemble clustering using a number of
different base algorithms. The ensemble partition is found via a weighted average of
co-association matrices, where the weights depend on cluster validity indices. A
mathematical methodology for determining the optimal weights is proposed. As an optimization
functional, we utilize the estimate of the ensemble's expected margin. To construct the
final partition, it is suggested to use a minimum spanning tree built on the averaged
co-association matrix taken as an adjacency matrix.</p>
      <p>Unlike other existing methods of ensemble clustering, the proposed one takes into
account both stability measures of base partitions and the obtained quality estimates;
it is capable of finding the optimal number of clusters.</p>
      <p>The efficiency of the suggested MSTEClust algorithm is confirmed experimentally
with Monte-Carlo modeling. For a given distribution model, the simulations under
noise distortions have demonstrated significant improvement of clustering quality for
MSTEClust in comparison with other considered algorithms. Monte-Carlo experiments
have confirmed the principal feasibility of theoretical assumptions used in this work for
the estimation of cluster ensemble characteristics.</p>
<p>In future work, the author plans to continue studying the theoretical properties
of clustering ensembles and developing algorithms for real-world applications such as
hyperspectral image analysis and segmentation.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work was partially supported by the Russian Foundation for Basic Research,
project 14-07-00249a.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
<string-name>
            <surname>Falkenauer</surname>
            <given-names>E.</given-names>
          </string-name>
          : Genetic Algorithms and Grouping Problems. Wiley, New York (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Naldi</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campello</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hruschka</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carvalho</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Efficiency of evolutionary k-means</article-title>
          .
          <source>Applied Soft Computing. 11</source>
          ,
          <fpage>1938</fpage>
          -
          <lpage>1952</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Duda</surname>
            <given-names>R.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hart</surname>
            <given-names>P.E.</given-names>
          </string-name>
          ,
<string-name>
            <surname>Stork</surname>
            <given-names>D.G.</given-names>
          </string-name>
          : Pattern Classification. Second Edition. Wiley, New York (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jain</surname>
            <given-names>A.K.</given-names>
          </string-name>
,
          <string-name>
            <surname>Dubes</surname>
            <given-names>R.C.</given-names>
          </string-name>
          :
          <article-title>Algorithms for clustering data</article-title>
          . Prentice Hall, NJ (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jain</surname>
            <given-names>A.K.</given-names>
          </string-name>
          :
          <article-title>Data clustering: 50 years beyond k-means</article-title>
          .
          <source>Pattern Recognition Letters</source>
          .
          <volume>31</volume>
          (
          <issue>8</issue>
          ),
          <fpage>651</fpage>
          -
          <lpage>666</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ghosh</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acharya</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Cluster ensembles</article-title>
          .
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          . Vol.
          <volume>1</volume>
          ,
          <fpage>305</fpage>
          -
          <lpage>315</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Vega-Pons</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruiz-Shulcloper</surname>
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A Survey of Clustering Ensemble Algorithms</article-title>
          .
          <source>IJPRAI</source>
          <volume>25</volume>
          (
          <issue>3</issue>
          ),
          <fpage>337</fpage>
          -
          <lpage>372</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Vega-Pons</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruiz-Shulcloper</surname>
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Clustering Ensemble Method for Heterogeneous Partitions</article-title>
          .
          <source>CIARP'09</source>
          ,
          <fpage>481</fpage>
          -
          <lpage>488</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Naldi</surname>
            <given-names>M. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carvalho</surname>
            <given-names>A. C. P. L. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campello</surname>
            <given-names>R. J. G. B.</given-names>
          </string-name>
          :
          <article-title>Cluster ensemble selection based on relative validity indexes</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          . Vol.
          <volume>27</volume>
          ,
          <fpage>259</fpage>
          -
          <lpage>289</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Berikov</surname>
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Weighted ensemble of algorithms for complex data clustering</article-title>
          .
          <source>Pattern Recognition Letters</source>
          . Vol.
          <volume>38</volume>
          ,
          <fpage>99</fpage>
          -
          <lpage>106</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hubert</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arabie</surname>
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Comparing partitions</article-title>
          .
          <source>Journal of Classification</source>
          . Vol.
          <volume>2</volume>
          ,
          <fpage>193</fpage>
          -
          <lpage>218</lpage>
          (
          <year>1985</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Gabow</surname>
            <given-names>H. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Galil</surname>
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spencer</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarjan</surname>
            <given-names>R. E.</given-names>
          </string-name>
          :
          <article-title>Efficient algorithms for finding minimum spanning trees in undirected and directed graphs</article-title>
          .
          <source>Combinatorica</source>
          . Vol.
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <fpage>109</fpage>
          -
          <lpage>122</lpage>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Gower</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ross</surname>
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Minimum spanning trees and single linkage cluster analysis</article-title>
          .
          <source>Applied Statistics</source>
          . Vol.
          <volume>18</volume>
          ,
          <fpage>54</fpage>
          -
          <lpage>64</lpage>
          (
          <year>1969</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kruskal</surname>
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>On the shortest spanning subtree and the traveling salesman problem</article-title>
          .
          <source>Proceedings of the American Mathematical Society</source>
          , pp.
          <fpage>48</fpage>
          -
          <lpage>50</lpage>
          (
          <year>1956</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>