    Cluster Ensemble with Averaged Co-Association
       Matrix Maximizing the Expected Margin

                                        Vladimir Berikov1,2
1 Sobolev Institute of Mathematics, Novosibirsk, Russia
2 Novosibirsk State University, Novosibirsk, Russia
                                       berikov@math.nsc.ru
                                     http://www.math.nsc.ru



       Abstract. We consider the problem of cluster analysis using the ensemble
       approach. The paper proposes a method for finding optimal weights for the
       averaged co-association matrix applied to the construction of the ensemble
       partition. The main idea is to find the weights for which the expectation of the
       ensemble margin takes its maximum value. A latent variable pairwise classification
       model is used for determining margin characteristics dependent on cluster
       validity indices. To construct the ensemble partition, we apply a minimum spanning
       tree built on the averaged co-association matrix used as an adjacency matrix. The
       efficiency of the method is confirmed by Monte Carlo simulations with artificial
       data sets.

       Keywords: cluster ensemble, weighted co-association matrix, latent variable
       model, cluster validity index, margin expectation, minimum spanning tree


1    Introduction
The problem of cluster analysis consists in finding a partition P = {C1, . . . , CK} of a data
set A = {a1, . . . , an} into a relatively small number of homogeneous clusters. The result
can be either hard (each data object is attributed to one cluster) or fuzzy (for each
object, a cluster membership function is defined); this paper considers hard clustering
only.
    Let the data sample be described by a table X = (xi,m), where xi,m = Xm(ai) ∈ R
is the value of feature Xm for object ai (m ∈ {1, . . . , d}, i ∈ {1, . . . , n}), and d is the feature
space dimensionality. As a criterion of homogeneity, one may use a functional that
depends on the scatter of observations within groups and the distances between
clusters. The number of clusters is either a predefined parameter or has to be found in
some optimal way; in this work it is assumed that the value of K should be determined
automatically.
    It is well known that clustering is essentially an NP-hard problem [1,9]. Because of
the large computational complexity of the search process, the majority of existing algorithms
[3]-[5] apply approximate iterative procedures to find a locally optimal partition. In each

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: A. Kononov et al. (eds.): DOOR 2016, Vladivostok, Russia, published at http://ceur-ws.org

step of the algorithm, the current partition is updated to improve the quality functional.
As a rule, the process is guided by certain user-specified parameters.
    Collective decision making (ensemble clustering) is a comparatively new approach
in cluster analysis [6,7]. Following the ensemble framework, a number of base partitions
are first obtained with a collection of different algorithms (or with a single algorithm
under different parameters or other working settings). At the second step, the obtained
variants are combined to yield the final consensus partition. The ensemble approach, as
a rule, allows one to increase the stability of clustering results in the case of an uncertain
data model or when it is not clear which of an algorithm's parameters are most appropriate
for a particular problem.
    A number of ways to find the ensemble solution exist. In the co-association (CA)
matrix based approach, each pair of objects ai, aj ∈ A is associated with a matrix
element indicating how often the objects are assigned to different clusters over all
partition variants. This value is considered as an analog of the distance between ai and aj. To
construct the final partition, any procedure based on a pairwise distance matrix can
be used, for example, a hierarchical agglomerative clustering (HAC) algorithm.
    Some authors suggest weighted cluster ensembles aimed at improving their characteristics.
The weights depend on the estimated quality (cluster validity indices, diversity
measures, etc.) of the base partitions [8,9].
    The work [10] considers a weighted cluster ensemble composed of different algorithms.
For determining the ensemble solution, it is proposed to use a weighted averaged
CA matrix, where the weights are found according to the principle of minimum variance
of the ensemble margin (closely related to the maximization of the expected margin). This
principle is substantiated by the analysis of probabilistic properties of cluster ensembles
in the framework of a latent variable pairwise classification (LVPC) model. The
suggested method (Pairwise Weighted Ensemble Clustering, PEWEC) has demonstrated
an ability to find complex data structures under noise conditions; however, it considers
only variations of cluster labels and disregards potentially useful information on other
characteristics of base partitions such as cluster validity indices. Another disadvantage
is that it cannot automatically determine the number of clusters.
    The proposed work aims at eliminating the limitations mentioned above: we suggest
a method that
a) assigns weights taking into account both stability measures of base clusterings and
the obtained quality estimates;
b) finds the optimal number of clusters using minimum spanning tree (MST) algorithm.
    The rest of the paper is organized as follows. The second section briefly introduces
the necessary notions. In the third section we describe the concept of ensemble mar-
gin and obtain an expression for the expected margin using the LVPC model. This
section also suggests a solution to the problem of assessing the characteristics of the
ensemble and finding the optimal weights which give the maximum value to the ex-
pected margin. In the fourth section we formulate the algorithm for constructing the
ensemble clustering partition with the MST approach. The fifth section presents the re-
sults of numerical experiments using Monte-Carlo simulations with data sampled from
a given probability distribution. The conclusion summarizes the work.

2     Basic concepts

This section describes the method of ensemble clustering with weighted CA matrix and
provides examples of cluster validity indices used in this work.


2.1   Cluster ensemble with usage of weighted co-association matrix

In the framework of the ensemble approach, we consider a collection of different clus-
tering algorithms µ1, . . . , µM applied to data set A. Suppose that each algorithm µm,
m = 1, . . . , M, is run a number of times under different conditions such as
initial centroid coordinates, subsets of features, number of clusters, etc. In its lth
run, it generates a partition composed of Kl,m clusters, where l = 1, . . . , Lm, and Lm is the
given number of runs for algorithm µm.
    The quality of each variant is evaluated with some criterion (e.g., a cluster validity
index) γl,m. We suppose that the indices are properly standardized so
that a) 0 ≤ γl,m ≤ 1, and b) the more compact and well-separated the found clusters are, the
larger the index value.
    For a pair of different objects ai, aj, we define the value

$$h_{l,m}(i,j) = I[\,\mu_{l,m}(a_i) \neq \mu_{l,m}(a_j)\,],$$

where I[·] is the indicator function: I[true] = 1, I[false] = 0; µl,m(a) is the cluster label
assigned by algorithm µm to object a in the lth run.
    The averaged weighted CA matrix H = (h̄(i, j)) is calculated over all found variants,
taking into account validity indices. We set matrix elements as
$$\bar h(i,j) = \sum_{m=1}^{M} \sum_{l=1}^{L_m} w_{l,m}(i,j)\, h_{l,m}(i,j), \qquad (1)$$

where w_{l,m}(i, j) ≥ 0 are weight constants. The weights should account for the degree
of confidence in the results of cluster assignments for a given pair. In the suggested
algorithm (see below), the weights depend on validity indices and other characteristics
of the ensemble.
    It is easy to prove that H defines a pseudometric on A, so the matrix elements can be
considered as analogs of distances between pairs of objects. To find the final partition
of A into a predefined number of clusters using matrix H, one can apply the HAC algorithm
with the average linkage rule for the definition of between-group distances.
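    As an illustration, the following minimal sketch builds such a partition from a precomputed averaged CA matrix using SciPy's average-linkage HAC. The matrix H and the number of clusters are placeholder inputs, and the helper name consensus_by_hac is ours, not part of the method's specification.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_by_hac(H, n_clusters):
    """Cut an average-linkage dendrogram built on the averaged
    co-association matrix H, interpreted as pairwise distances."""
    # squareform expects a symmetric matrix with a zero diagonal
    condensed = squareform(H, checks=False)
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# toy usage: a symmetric "distance-like" matrix for 4 objects
H = np.array([[0.0, 0.1, 0.9, 0.8],
              [0.1, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.2],
              [0.8, 0.9, 0.2, 0.0]])
print(consensus_by_hac(H, 2))   # e.g. [1 1 2 2]
```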


2.2   Cluster validity indices

Cluster validity indices make it possible to compare the obtained partition with
some reference ("etalon") variant (external indices), or to assess the clustering result with
various measures of cluster compactness and separability (internal indices). The majority
of existing indices require quadratic time for their calculation; however, indices of linear
complexity exist.

    External indices estimate the degree of correspondence between two partitions
$P_1 = \{C_{1,1}, \ldots, C_{k,1}, \ldots, C_{K_1,1}\}$ and $P_2 = \{C_{1,2}, \ldots, C_{l,2}, \ldots, C_{K_2,2}\}$, where $C_{k,1} =
\{a_{i_1}, \ldots, a_{i_{N_{k,1}}}\}$, $C_{l,2} = \{a_{j_1}, \ldots, a_{j_{N_{l,2}}}\}$, $N_{k,1}$ is the number of objects in the kth cluster of
the first partition, and $N_{l,2}$ is the number of objects in the lth cluster of the second partition.
    The Rand index is defined as

$$\varphi_R(P_1, P_2) = (S + D)/G_0,$$

where S is the number of pairs belonging to the same cluster in both P1 and P2, D is the number
of pairs belonging to different clusters in both partitions, and $G_0 = \binom{n}{2}$ is the total number of pairs. The index
belongs to the interval [0, 1] and equals the proportion of correctly classified pairs; the value
$\varphi_R = 1$ indicates perfect agreement between the two partitions.
    The adjusted Rand index [11] is defined by the following expression:

$$\varphi_{AR}(P_1, P_2) = \frac{\sum_{k,l}\binom{N^{(k,l)}}{2} - Q_1 Q_2 / G_0}{\frac{1}{2}(Q_1 + Q_2) - Q_1 Q_2 / G_0},$$

where $Q_1 = \sum_k \binom{N_{k,1}}{2}$, $Q_2 = \sum_l \binom{N_{l,2}}{2}$, and $N^{(k,l)} = |C_{k,1} \cap C_{l,2}|$. The maximum value
of $\varphi_{AR}$ is 1; the index may take negative values. A value close to 0 indicates that the
obtained cluster partition is most probably accidental.
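    As a quick illustration, the adjusted Rand index between two labelings can be computed with scikit-learn; the label vectors below are toy examples, not data from this paper.

```python
from sklearn.metrics import adjusted_rand_score

p1 = [0, 0, 0, 1, 1, 2, 2, 2]   # partition P1 given as cluster labels
p2 = [1, 1, 1, 0, 0, 2, 2, 0]   # partition P2 given as cluster labels
# close to 1 for well-agreeing partitions, close to 0 for random labelings
print(adjusted_rand_score(p1, p2))
```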
    Internal indices are determined for a single partition P = {C1 , . . . , CK }. This work
uses two examples of internal indices: Hubert Gamma index and cophenetic correlation
coefficient.
    The Hubert Gamma index, Γ(P), is specified as the linear correlation coefficient between
the elements of matrices R(i, j) and U(i, j) (i < j), where R is the matrix of pairwise Euclidean
distances between data points, and U is the unweighted co-association matrix: U(i, j) = 0 if
∃Cq : ai, aj ∈ Cq; otherwise U(i, j) = 1. An increase in the index value typically indicates
that the quality of the partition is improving.
    The cophenetic correlation coefficient, Coph(P), is defined with a dendrogram of the clustering
partition. The dendrogrammatic distance τ(i, j) between ai, aj is the height of
the dendrogram node at which these two objects are first joined together. Coph(P) equals the linear
correlation coefficient between the Euclidean distances R(i, j) and the dendrogrammatic
distances τ(i, j) over all i, j (i < j).
    Both Γ(P) and Coph(P) belong to the interval [−1, 1]; it will be more convenient for
us to replace their negative values with zeros.
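    A minimal sketch of both internal indices as defined above is given below; the helper names hubert_gamma and cophenetic_corr are ours, and negative values are truncated to zero as stated.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

def hubert_gamma(X, labels):
    """Correlation between Euclidean distances R(i, j) and the 0/1 split matrix U(i, j)."""
    r = pdist(X)                                   # R(i, j) in condensed form
    li, lj = np.triu_indices(len(labels), k=1)     # same pair ordering as pdist
    u = (np.asarray(labels)[li] != np.asarray(labels)[lj]).astype(float)
    return max(np.corrcoef(r, u)[0, 1], 0.0)       # negative values replaced by zero

def cophenetic_corr(X, method="single"):
    """Correlation between Euclidean and dendrogrammatic distances."""
    r = pdist(X)
    c, _ = cophenet(linkage(r, method=method), r)
    return max(c, 0.0)

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels = [0] * 20 + [1] * 20
print(hubert_gamma(X, labels), cophenetic_corr(X))
```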


3     Margin of cluster ensemble
Suppose that the data sample is a mixture of a finite number of components (classes). A
latent ground-truth variable Y defines the class number to which an object belongs.
Matrix Z with elements

$$Z(i,j) = I[\,Y(a_i) \neq Y(a_j)\,], \qquad (2)$$

where ai and aj are arbitrary data objects, defines their true status (i.e., whether ai and
aj indeed belong to separate classes). Let us define the margin matrix dependent on the

variants of cluster partitions. For a pair ai, aj, their margin is

    mg(i, j) = {weighted number of votes for Z(i, j) − weighted number of votes against Z(i, j)}

and can be written as

$$mg(i,j) = \sum_{m=1}^{M}\sum_{l=1}^{L_m} w_{l,m}(i,j)\,\{I[h_{l,m}(i,j) = Z(i,j)] - I[h_{l,m}(i,j) \neq Z(i,j)]\}.$$

This value indicates to what extent the number of right decisions for ai, aj exceeds the
number of wrong ones. Evidently, it equals

$$mg(i,j) = \sum_{m=1}^{M}\sum_{l=1}^{L_m} w_{l,m}(i,j)\,(2Z(i,j) - 1)(2h_{l,m}(i,j) - 1). \qquad (3)$$

    The margin matrix cannot be calculated if the true partition is unknown. However, it
was shown in [10] that some of the margin's characteristics can be assessed using a few
basic assumptions on the statistical behavior of clustering algorithms. Furthermore,
an upper bound for the error probability (when assigning the objects in a pair
to different clusters) was found, and it was proved that an increase in the expected margin and
a decrease in the margin variance reduce the error, thereby improving the quality of the
cluster ensemble.
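    In a Monte Carlo setting where the true split matrix Z is available, the margin (3) can be computed directly; the sketch below assumes the per-run decision and weight matrices are given as lists of n×n arrays, and the names are illustrative.

```python
import numpy as np

def ensemble_margin(Z, h_runs, w_runs):
    """Margin matrix (3): Z is the 0/1 ground-truth split matrix (2),
    h_runs and w_runs are per-run 0/1 decision matrices h_{l,m}(i, j)
    and weight matrices w_{l,m}(i, j), flattened over (l, m)."""
    mg = np.zeros_like(Z, dtype=float)
    for h, w in zip(h_runs, w_runs):
        mg += w * (2 * Z - 1) * (2 * h - 1)
    return mg
```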

3.1    Conditional expectation of margin
Consider a specific form of (1): let each weight be a function of the validity index and a
quantity αm(i, j) ≥ 0:

$$w_{l,m}(i,j) = \frac{\alpha_m(i,j)\,\gamma_{l,m}}{L_m}, \qquad (4)$$

where $\sum_m \alpha_m(i,j) = 1$ for all i, j (in (4), we separate the component dependent on the validity
index).
    Below we formulate the basic assumptions which allow us to assess the characteristics of
the cluster ensemble:
a) Data table X is arbitrary and fixed; for any pair of data objects ai, aj, chosen at
random from A, their true status is a random value Z(i, j) defined with (2).
b) Each algorithm µm is randomized, i.e., it depends on a random vector Ωm taking values in the
set of admissible parameters, m = 1, . . . , M. For data table X as input, µm is run Lm
times with i.i.d. parameters Ω1,m, . . . , ΩLm,m being statistical copies of Ωm.
    It should be noted that the i.i.d. condition must be ensured through a mechanism
of the ensemble generation.
    For any algorithm µm, the conditional probabilities of correct decisions (either separation
or joining of ai, aj under fixed Z(i, j)) are denoted as

$$q^1_m(i,j) = P[\,h_m(i,j,\Omega_m) = 1 \mid Z(i,j) = 1\,], \qquad (5)$$

$$q^0_m(i,j) = P[\,h_m(i,j,\Omega_m) = 0 \mid Z(i,j) = 0\,]. \qquad (6)$$
    The measure of cluster validity, obtained by algorithm µm on Ωm, is considered as
a random value γm(X, Ωm). Because the quality criterion is determined on the whole
data set, one may regard this value as practically independent of the quantities Z(i, j),
hl,m(i, j), etc., defined for a specific pair of objects.
    Proposition 1. Given that assumptions a) and b) hold, the conditional mathematical
expectation of the ensemble margin for ai, aj under Z(i, j) = z is

$$E_{\bar\Omega}[\,mg(i,j) \mid Z(i,j) = z\,] = \sum_m \alpha_m(i,j)\, E[\gamma_m]\,(2 p_m(z; i,j) - 1),$$

where $\bar\Omega = (\Omega_{1,1}, \ldots, \Omega_{L_M,M})$, $E[\gamma_m] = E_{\Omega_m}[\gamma_m(\Omega_m)]$, and $p_m(z; i,j) = (1 - z)\, q^0_m(i,j) + z\, q^1_m(i,j)$.
    Proof. Let us denote by

$$h_{l,m}(i,j,\Omega_{l,m}) = I[\,\mu_m(i, \Omega_{l,m}) \neq \mu_m(j, \Omega_{l,m})\,]$$

the decision for a pair ai, aj, where µm(i, Ωl,m) is the cluster label assigned to object ai
by algorithm µm in its lth run with parameter vector Ωl,m, and also denote by

$$h_m(i,j,\Omega_m) = I[\,\mu_m(i, \Omega_m) \neq \mu_m(j, \Omega_m)\,]$$

the decision under vector Ωm.
    Until the end of the proof, we skip the arguments i, j for brevity: mg(i, j) = mg, Z(i, j) = Z,
etc. From (3) and (4) we have:
$$E_{\bar\Omega}[\,mg \mid Z = z\,] = \sum_m \frac{\alpha_m}{L_m} \sum_l E_{\bar\Omega}[\,\gamma_{l,m}(\Omega_{l,m})\,(2z - 1)(2 h_{l,m}(\Omega_{l,m}) - 1)\,].$$

Because Ωl,m and Ωm are identically distributed and γl,m is independent of hl,m, it holds
that

$$E_{\bar\Omega}[\,mg \mid Z = z\,] = \sum_m \alpha_m\, E[\gamma_m(\Omega_m)]\,(2z - 1)(2 E_{\Omega_m}[h_m(\Omega_m)] - 1).$$

    For z = 0 we have:

$$(2z - 1)(2 E_{\Omega_m}[h_m(\Omega_m)] - 1) = -(2 P[h_m(\Omega_m) = 1 \mid Z = 0] - 1) = 2 q^0_m - 1;$$

and for z = 1:

$$(2z - 1)(2 E_{\Omega_m}[h_m(\Omega_m)] - 1) = 2 P[h_m(\Omega_m) = 1 \mid Z = 1] - 1 = 2 q^1_m - 1.$$

    Therefore

$$E_{\bar\Omega}[\,mg \mid Z = z\,] = \sum_m \alpha_m\, E[\gamma_m(\Omega_m)]\,(2 p_m(z; i,j) - 1).$$

This completes the proof.

   In this work we also assume that each algorithm µm creates a partition for which
the probability of a correct decision for any pair ai, aj is greater than 0.5, i.e.,

$$q^0_m(i,j) > 0.5, \qquad q^1_m(i,j) > 0.5. \qquad (7)$$

This means that the clustering algorithms included in the ensemble possess at least slightly
better classification accuracy than a trivial procedure based on a random assignment of
a pair to the same or different clusters.


3.2   Evaluation of the expected margin and its optimization

The main idea of our method is to control the behavior of the cluster ensemble by changing
the weights α1, . . . , αM so as to maximize the conditional expectation of the margin. The
expression for the expected margin depends on a number of characteristics (pm(z; i, j),
E[γm]) which should be evaluated from the results of the base clustering algorithms, i.e.,
from a number of partitions along with the evaluated validity indices γl,m, l = 1, . . . , Lm, m =
1, . . . , M.
     To evaluate pm(z; i, j), one may consider the obtained partitions and calculate co-
association matrices with elements

$$\bar h_{l,m}(i,j) = I[\,\mu_{l,m}(a_i) \neq \mu_{l,m}(a_j)\,].$$

For each (i, j), the division frequency is $\bar P_m(i,j) = \frac{1}{L_m}\sum_l \bar h_{l,m}(i,j)$. Each $\bar P_m(i,j)$ is
an estimate of the conditional probability

$$P[\,h_m(i,j,\Omega_m) = 1 \mid Z(i,j) = z\,]$$

depending on the realization of Z(i, j). Given that assumption (7) holds, one may
assess pm(z; i, j) with the quantity

$$\bar p_m(i,j) = \max(\bar P_m(i,j),\, 1 - \bar P_m(i,j)).$$

This value can be interpreted as an estimate of the local stability of cluster decisions for a
pair ai, aj.
    To evaluate the theoretical means E[γm], m = 1, . . . , M, one may use the sample averages
of the validity indices, $\bar\gamma_m = \frac{1}{L_m}\sum_{l=1}^{L_m} \gamma_{l,m}$.
    Let us consider the problem of maximization of the expected margin:

for each i, j (i < j), find $\alpha_1(i,j), \ldots, \alpha_M(i,j)$:

$$\sum_m \alpha_m(i,j)\,\bar\gamma_m\,(2\bar p_m(i,j) - 1) \to \max_{\alpha_1(i,j),\ldots,\alpha_M(i,j)},$$

$$\text{s.t. } \alpha_1(i,j) \ge 0, \ldots, \alpha_M(i,j) \ge 0, \quad \sum_m \alpha_m(i,j) = 1.$$

      The solution to this linear programming problem is rather straightforward:

$$\alpha^*_m(i,j) = \begin{cases} 1, & m = m^* = \arg\max_{m=1,\ldots,M}\{\bar\gamma_m\,\bar p_m(i,j)\};\\ 0, & \text{otherwise}. \end{cases} \qquad (8)$$

As one can see, in the resulting optimization strategy only the best algorithm from
the ensemble (considering quality and stability estimates) is taken into account. For
different pairs ai , aj , different algorithms may come out on top.
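    A sketch of this estimation and weighting scheme is given below: division frequencies P̄m, local stability estimates p̄m, and index averages γ̄m are computed from the base partitions, the best algorithm is selected per pair by rule (8), and the averaged CA matrix (1) with weights (4) is assembled. The data layout (one label array and one index list per algorithm) and the function name are our assumptions.

```python
import numpy as np

def weighted_ca_matrix(labels_per_alg, gammas_per_alg):
    """Averaged CA matrix (1) with weights given by (4) and the rule (8).
    labels_per_alg[m]: array of shape (L_m, n) with cluster labels from algorithm m;
    gammas_per_alg[m]: the corresponding L_m validity index values."""
    M = len(labels_per_alg)
    n = np.asarray(labels_per_alg[0]).shape[1]
    p_bar = np.zeros((M, n, n))        # local stability estimates \bar p_m(i, j)
    gamma_bar = np.zeros(M)            # sample means \bar gamma_m
    h_alg = np.zeros((M, n, n))        # (1/L_m) sum_l gamma_{l,m} h_{l,m}(i, j)
    for m in range(M):
        labels = np.asarray(labels_per_alg[m])
        gam = np.asarray(gammas_per_alg[m], dtype=float)
        split = (labels[:, :, None] != labels[:, None, :]).astype(float)
        P = split.mean(axis=0)                      # division frequency \bar P_m(i, j)
        p_bar[m] = np.maximum(P, 1.0 - P)
        gamma_bar[m] = gam.mean()
        h_alg[m] = (gam[:, None, None] * split).mean(axis=0)
    # rule (8): for every pair, keep only the algorithm maximizing gamma_bar * p_bar
    best = np.argmax(gamma_bar[:, None, None] * p_bar, axis=0)
    H = np.take_along_axis(h_alg, best[None], axis=0)[0]
    return H
```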


4      Finding ensemble partition with minimum spanning tree

Let G = (V, E) be an undirected graph with non-negative edge weights w, where V is the
set of vertices (in our case, V = A, i.e., the set of data objects) and E is the set of edges.
Matrix w = (wi,j) defines the weights of edges (ai, aj), where i, j = 1, . . . , n. As the weight
matrix, we consider the averaged CA matrix H with elements h̄(i, j) defined according
to (1), (4), (8).
    A spanning tree is a tree with n − 1 edges, i.e., a tree that connects all the vertices.
The total weight of a spanning tree T is defined as $w_T = \sum_{e \in T} w(e)$. A minimum
spanning tree is a tree of minimum total weight.
    A large number of methods for MST construction exist [12]. Their running time is
of polynomial order (in some special cases, close to linear).
    In a graph-theoretic framework, computationally efficient MST-based algorithms
are widely used in cluster analysis [13]. Once the MST T is built for a given data
set, the k − 1 edges with the largest weights are removed, which creates a partition Pk
of A into k clusters. The obtained variant is evaluated with an objective function; the
best partition over k ∈ {2, . . . , Kmax} forms the output (Kmax is the algorithm's
parameter). MST clustering algorithms are capable of finding complex non-spherical
clusters; they can automatically determine the number of clusters using a predefined
quality functional.
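    A sketch of this MST-cutting step, here using SciPy's minimum_spanning_tree routine with H treated as a weighted adjacency matrix; the eps lift of exact zeros is a technical detail of the SciPy sparse-graph representation, not part of the method.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_cut_partition(H, k, eps=1e-12):
    """Cut the MST of the weighted graph (A, H) into k components by
    removing its k-1 heaviest edges."""
    # SciPy treats zero entries as missing edges, so lift genuine zero
    # distances slightly and keep the diagonal at zero (no self-loops).
    A = np.where(H > 0, H, eps)
    np.fill_diagonal(A, 0.0)
    mst = minimum_spanning_tree(A).toarray()
    if k > 1:
        edges = np.argwhere(mst > 0)
        order = np.argsort(mst[mst > 0])
        for i, j in edges[order[-(k - 1):]]:
            mst[i, j] = 0.0                 # remove one of the heaviest edges
    _, labels = connected_components(mst, directed=False)
    return labels
```

A partition would be produced this way for each k = 2, . . . , Kmax and scored with the chosen quality functional.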
    In this paper, we apply the well-known Kruskal's algorithm [14] for finding the MST for
data set A and matrix H. Using the obtained MST, the final ensemble partition is
created. As a quality criterion, we use the following functional:
$$F(P_k) = k \prod_{c=1}^{k} N_c^{w_c} / w_T, \qquad (9)$$

where Nc is the size of the cth cluster and wc is the total weight of edges within the cth cluster.
    MST-based ensemble clustering algorithm (MSTEClust) has the following main
steps.

Algorithm MSTEClust.
Input:
A = {a1 , . . . , an }: dataset described with data table X;
Kmax : maximum number of clusters;
L1 , . . . , LM : number of runs for base algorithms µ1 , . . . , µM ;

Ω1 , . . . , ΩM : sets of allowable parameters (working conditions) of algorithms.
Output:
partition of A into optimal number K of clusters (2 ≤ K ≤ Kmax ).
Steps:
1. Generate L1 , . . . , LM variants of cluster partition with algorithms µ1 , . . . , µM for ran-
domly chosen working parameters; calculate validity indices γl,m , l = 1, . . . , Lm , m =
1, . . . , M .
2. Calculate averaged co-association matrix H with optimal weights according to (1),
(4), (8).
3. Construct the MST of the graph G = (A, H) using Kruskal's algorithm.
4. Find a partition of A into the optimal number of clusters K using the MST and criterion (9).
end.
     The running time of steps 1-2 depends linearly on the complexity of the base clustering
algorithms and validity estimators. Construction of the final partition requires O(n log n)
operations. One may conclude that MSTEClust is more readily applicable to large
amounts of data than PEWEC [10], which has time complexity of order O(n²).


5    Numerical experiments

This section describes numerical experiments with the MSTEClust algorithm using Monte
Carlo modeling. In the experiments, two base clustering algorithms are included in
the ensemble: k-means (KM) and the hierarchical agglomerative clustering algorithm with
the single linkage rule (HACSL).
    We choose the following distribution model. In a 16-dimensional feature space, four
classes are defined. The first and second classes follow normal distributions N(ν1, Σ) and
N(ν2, Σ), where ν1 = (0, ..., 0)ᵀ, ν2 = (6, ..., 6)ᵀ, Σ = σI is a diagonal covariance matrix,
and σ = 1. The coordinates of objects from the other two classes are determined recursively:
$x^k_{i+1} = x^k_i + \theta_1\cdot\mathbf{1} + \theta_2\cdot\varepsilon$, where $\mathbf{1} = (1, \ldots, 1)^T$, θ1 = θ2 = 0.25, ε is a Gaussian
random vector N(0, I), and k = 3, 4. For class 3, $x^3_1 = (-6, 6, \ldots, -6, 6)^T + 0.1\cdot\varepsilon$; for class
4, $x^4_1 = (6, -6, \ldots, 6, -6)^T + 0.1\cdot\varepsilon$. The number of objects in each class equals 25.
Figure 1 illustrates sampled data.
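    One possible reading of this generating model as a data-sampling routine (the function name and the random-seed handling are our choices):

```python
import numpy as np

def sample_model(n_per_class=25, d=16, rng=None):
    """Draw one artificial data set from the four-class model described above."""
    rng = np.random.default_rng(rng)
    # classes 1 and 2: spherical Gaussians around 0 and (6, ..., 6)
    c1 = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, d))
    c2 = rng.normal(loc=6.0, scale=1.0, size=(n_per_class, d))
    # classes 3 and 4: noisy chains started near (-6, 6, ...) and (6, -6, ...)
    chains = []
    for start in (np.tile([-6.0, 6.0], d // 2), np.tile([6.0, -6.0], d // 2)):
        x = start + 0.1 * rng.standard_normal(d)
        points = [x]
        for _ in range(n_per_class - 1):
            x = x + 0.25 * np.ones(d) + 0.25 * rng.standard_normal(d)
            points.append(x)
        chains.append(np.array(points))
    X = np.vstack([c1, c2] + chains)
    y = np.repeat([0, 1, 2, 3], n_per_class)
    return X, y
```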
    The random subspace method is used for the ensemble construction: each base cluster
partition is built on dens randomly chosen variables (dens = 3 in this experiment).
Besides that, for each base algorithm the number of clusters is chosen at random from the
range {2, . . . , Kmax}, Kmax = 5. The obtained clustering results are assessed with cluster
validity indices: the Hubert Gamma index estimates the quality of KM; the results
of HACSL are evaluated with the cophenetic correlation index. The number of ensemble
variants for each algorithm is set to L1 = L2 = 25.
    To study the behavior of the ensemble algorithm in the presence of noise, d0 of the
features, with indices chosen at random, are replaced with random values uniformly
distributed over the feature range.
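    The ensemble generation described in the last two paragraphs might look as follows; this is a sketch with scikit-learn's KMeans and single-linkage AgglomerativeClustering standing in for KM and HACSL, and the parameter names follow the text.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def generate_base_partitions(X, n_runs=25, dens=3, k_max=5, d0=0, rng=None):
    """Random-subspace base partitions for KM and HACSL; d0 features are
    replaced with uniform noise over their observed range."""
    rng = np.random.default_rng(rng)
    Xn = X.copy()
    for f in rng.choice(X.shape[1], size=d0, replace=False):
        lo, hi = X[:, f].min(), X[:, f].max()
        Xn[:, f] = rng.uniform(lo, hi, size=X.shape[0])
    km_labels, hac_labels = [], []
    for _ in range(n_runs):
        feats = rng.choice(Xn.shape[1], size=dens, replace=False)  # random subspace
        k = rng.integers(2, k_max + 1)                             # random number of clusters
        sub = Xn[:, feats]
        km_labels.append(KMeans(n_clusters=k, n_init=10).fit_predict(sub))
        hac_labels.append(
            AgglomerativeClustering(n_clusters=k, linkage="single").fit_predict(sub))
    return np.array(km_labels), np.array(hac_labels)
```

The validity index of each run (Hubert Gamma for KM, cophenetic correlation for HACSL) would then be computed on the same feature subset and passed, together with the labels, to the weighting step of Section 3.2.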
    In the process of Monte Carlo modeling, artificial data sets are repeatedly generated
according to the specified distribution model. For each data set, a number of base
partitions are obtained by KM and HACSL; the ensemble solution is found according
to MSTEClust.


Fig. 1. An example of sampled data (4 classes marked with ×, ◦, +, ∗); projection onto axes X1, X2.



Fig. 2. Dependence of the averaged ARI on the number of noise features d0: a) MSTEClust,
b) PEWEC, c) KM, d) HACSL. 95% confidence intervals for the estimated characteristic are also
shown.



    Figure 2 presents the results of the experiments (the adjusted Rand index averaged
over 100 random samples). For comparison purposes, the results of KM and HACSL
(for which the true number of clusters is given) are plotted as well. To illustrate the
effect of weight optimization taking validity indices into account, Figure 2 also displays the
performance of a modification of PEWEC [10] in which the MST is used
to construct the ensemble partition instead of a dendrogram-based algorithm.
    The plots provide positive evidence for the suggested ensemble algorithm with optimized
weights. Starting from a sufficiently large number of noise features, MSTEClust gives
significantly better clustering quality than the other examined procedures. In contrast,
the non-ensemble algorithms totally fail to recover the heavily distorted data structure.
    An important problem is the experimental verification of the theoretical suppositions which
lie at the basis of the suggested method. The validity of assumption (7) is crucial because
it determines the way of assessing the expected margin. In a Monte Carlo simulation,
the true status of each object pair is known, so it is possible to evaluate the
conditional probabilities $q^1_m(i,j)$, $q^0_m(i,j)$ in (5), (6) using the available frequencies. In this
experiment, all above-described settings are unchanged. Table 1 presents the results
of the estimation. From this table, one can see that under small and moderate
noise (up to three noise features) the estimates stay above 0.5, thus confirming
assumption (7). Under increasing noise, a violation of the inequalities is seen, evidently
resulting in the decrease of the algorithm's accuracy.

Table 1. Averaged estimates of $q^1_m(i,j) = P(h_m = 1 \mid Z(i,j) = 1)$ and $q^0_m(i,j) = P(h_m = 0 \mid Z(i,j) = 0)$. Averaging is done over all i < j and 100 generated data samples for the given
number of noise features d0. For KM, m = 1; for HACSL, m = 2. Standard deviations of the
estimated quantities are shown in parentheses.

                   d0 = 0         d0 = 1         d0 = 2         d0 = 3         d0 = 4
  q_1^0(·,·)   0.957 (0.008)  0.862 (0.023)  0.773 (0.029)  0.698 (0.03)   0.630 (0.026)
  q_1^1(·,·)   0.746 (0.017)  0.723 (0.015)  0.703 (0.017)  0.681 (0.015)  0.659 (0.013)
  q_2^0(·,·)   0.99 (0.003)   0.969 (0.01)   0.949 (0.013)  0.936 (0.014)  0.923 (0.013)
  q_2^1(·,·)   0.716 (0.021)  0.642 (0.028)  0.575 (0.035)  0.506 (0.034)  0.433 (0.043)
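    For completeness, a sketch of how such estimates can be obtained for one algorithm when the ground-truth classes are known, averaging over runs and object pairs as in Table 1 (the function name is illustrative):

```python
import numpy as np

def conditional_accuracy(labels_runs, y_true):
    """Estimates of q^1_m and q^0_m in (5), (6) for one algorithm, averaged
    over runs and object pairs, given the ground-truth classes y_true."""
    y = np.asarray(y_true)
    Z = y[:, None] != y[None, :]                    # true split matrix (2)
    iu = np.triu_indices(len(y), k=1)
    runs = np.asarray(labels_runs)                  # shape (L, n)
    split = np.array([(lab[:, None] != lab[None, :])[iu] for lab in runs])
    z = Z[iu]
    q1 = split[:, z].mean()                         # frequency of h_m = 1 given Z = 1
    q0 = (~split[:, ~z]).mean()                     # frequency of h_m = 0 given Z = 0
    return q1, q0
```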




Conclusion
In this work we have suggested a method of ensemble clustering using a number of
different base algorithms. The ensemble partition is found by the weighted average of
co-association matrices, where the weights depend on cluster validity indices. A math-
ematical methodology of determining optimal weights is proposed. As an optimization
functional, we utilize the estimate of the ensemble's expected margin. To construct the
final partition, it is suggested to use a minimum spanning tree built on the averaged
co-association matrix treated as an adjacency matrix.
     Unlike other existing methods of ensemble clustering, the proposed one takes into
account both stability measures of base partitions and the obtained quality estimates;
it is capable of finding the optimal number of clusters.
     The efficiency of the suggested MSTEClust algorithm is confirmed experimentally
with Monte Carlo modeling. For the given distribution model, the simulations under
noise distortions have demonstrated a significant improvement of clustering quality for
MSTEClust in comparison with the other considered algorithms. The Monte Carlo experiments
have also confirmed the principal feasibility of the theoretical assumptions used in this work for
the estimation of cluster ensemble characteristics.
    In future work, the author plans to continue studying the theoretical properties
of clustering ensembles and developing algorithms for real-world applications such as
hyperspectral image analysis and segmentation.


Acknowledgements
This work was partially supported by the Russian Foundation for Basic Research,
project 14-07-00249a.


References
 1. Falkenauer E.: Genetic Algorithms and Grouping Problems. Wiley, New York (1998)
 2. Naldi M., Campello R., Hruschka E., Carvalho A.: Efficiency of evolutionary k-means.
    Applied Soft Computing. 11, 1938–1952 (2011)
 3. Duda R.O., Hart P.E., Stork D.G.: Pattern Classification. Second Edition. Wiley, New
    York (2000)
 4. Jain A.K., Dubes R.C.: Algorithms for clustering data. Prentice Hall, NJ (1988)
 5. Jain A.K.: Data clustering: 50 years beyond k-means. Pattern Recognition Letters. 31 (8),
    651–666 (2010)
 6. Ghosh J., Acharya A.: Cluster ensembles. Wiley Interdisciplinary Reviews: Data Mining
    and Knowledge Discovery. Vol. 1, 305–315 (2011)
 7. Vega-Pons, S., Ruiz-Shulcloper, J.: A Survey of Clustering Ensemble Algorithms. IJPRAI
    25(3), 337–372 (2011)
 8. Vega-Pons, S., Ruiz-Shulcloper, J.: Clustering Ensemble Method for Heterogeneous Par-
    titions. CIARP’09, 481–488 (2009)
 9. Naldi, M. C., Carvalho, A. C. P. L. F., Campello, R. J. G. B.: Cluster ensemble selection
    based on relative validity indexes. Data Mining and Knowledge Discovery. Vol. 27, 259–289 (2013)
10. Berikov V.: Weighted ensemble of algorithms for complex data clustering. Pattern Recog-
    nition Letters. Vol. 38, 99–106 (2014)
11. Hubert L., Arabie P.: Comparing partitions. Journal of Classification. Vol. 2, 193–218
    (1985)
12. Gabow, H. N., Galil, Z., Spencer, T., Tarjan, R. E.: Efficient algorithms for finding min-
    imum spanning trees in undirected and directed graphs. Combinatorica. Vol. 6(2), 109–122 (1986)
13. Gower J.C., Ross G.J.S.: Minimum spanning trees and single linkage cluster analysis. Applied
    Statistics. Vol. 18, 54–64 (1969)
14. Kruskal J.: On the shortest spanning subtree and the traveling salesman problem. Pro-
    ceedings of the American Mathematical Society, pp. 48–50 (1956)