<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cluster Ensemble with Averaged Co-Association Matrix Maximizing the Expected Margin</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vladimir Berikov</string-name>
          <email>berikov@math.nsc.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Novosibirsk State University</institution>
          ,
          <addr-line>Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Sobolev Institute of Mathematics</institution>
          ,
          <addr-line>Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>489</fpage>
      <lpage>500</lpage>
      <abstract>
<p>This paper considers cluster analysis using the ensemble approach. It proposes a method for finding optimal weights for the averaged co-association matrix applied to the construction of the ensemble partition. The main idea is to find the weights for which the expectation of the ensemble margin takes its maximum value. A latent variable pairwise classification model is used to determine margin characteristics dependent on cluster validity indices. To construct the ensemble partition, we apply a minimum spanning tree built on the averaged co-association matrix taken as an adjacency matrix. The efficiency of the method is confirmed by Monte Carlo simulations with artificial data sets.</p>
      </abstract>
      <kwd-group>
        <kwd>cluster ensemble</kwd>
        <kwd>weighted co-association matrix</kwd>
        <kwd>latent variable model</kwd>
        <kwd>cluster validity index</kwd>
        <kwd>margin expectation</kwd>
        <kwd>minimum spanning tree</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>Copyright © by the paper's authors. Copying permitted for private and academic purposes.</p>
      <p>At each step of the algorithm, the current partition is updated to improve the quality functional. As a rule, the process is guided by certain user-specified parameters.</p>
<p>Collective decision making (ensemble clustering) is a comparatively new approach
in cluster analysis [6,7]. Following the ensemble framework, a number of base partitions
are first obtained with a collection of different algorithms (or with a single algorithm
under different parameters or other working settings). In the second step, the given
variants are combined to yield the final consensus partition. As a rule, the ensemble
approach allows one to increase the stability of clustering results in cases of uncertainty
about the data model, or when it is not clear which of an algorithm's parameters are
most appropriate for a particular problem.</p>
<p>A number of ways to find the ensemble solution exist. In the co-association (CA)
matrix based approach, each pair of objects ai, aj ∈ A is associated with a matrix
element indicating how often the objects are assigned to different clusters over all
partition variants. This value is considered an analog of the distance between ai and aj. To
construct the final partition, any procedure based on a pairwise distance matrix can
be used, for example, a hierarchical agglomerative clustering (HAC) algorithm.</p>
      <p>Some authors suggest weighted cluster ensemble aimed at improving its
characteristics. The weights depend on the estimated quality (cluster validity indices, diversity
measures, etc.) of base partitions [8,9].</p>
<p>The work [10] considers a weighted cluster ensemble composed of different
algorithms. For determining the ensemble solution, it is proposed to use a weighted averaged
CA matrix, where the weights are found according to the principle of minimum variance
of the ensemble margin (closely related to the maximization of the expected margin). This
principle is substantiated by the analysis of probabilistic properties of cluster
ensembles in the framework of a latent variable pairwise classification (LVPC) model. The
suggested method (Pairwise Weighted Ensemble Clustering, PWEC) has demonstrated
an ability to find complex data structures under noise conditions; however, it considers
only variations of cluster labels and disregards potentially useful information on other
characteristics of base partitions such as cluster validity indices. Another disadvantage
is that it cannot automatically determine the number of clusters.</p>
      <p>The proposed work aims at eliminating the limitations mentioned above: we suggest
a method that
a) assigns weights taking into account both stability measures of base clusterings and
the obtained quality estimates;
b) finds the optimal number of clusters using minimum spanning tree (MST) algorithm.</p>
<p>The rest of the paper is organized as follows. The second section briefly introduces
the necessary notions. In the third section we describe the concept of ensemble
margin and obtain an expression for the expected margin using the LVPC model. This
section also suggests a solution to the problem of assessing the characteristics of the
ensemble and finding the optimal weights which give the maximum value to the
expected margin. In the fourth section we formulate the algorithm for constructing the
ensemble clustering partition with the MST approach. The fifth section presents the
results of numerical experiments using Monte Carlo simulations with data sampled from
a given probability distribution. The conclusion summarizes the work.</p>
    </sec>
    <sec id="sec-2">
      <title>Basic concepts</title>
<p>This section describes the method of ensemble clustering with a weighted CA matrix and
provides examples of the cluster validity indices used in this work.</p>
      <sec id="sec-2-1">
        <title>Cluster ensemble with usage of weighted co-association matrix</title>
<p>In the framework of the ensemble approach, we consider a collection of different
clustering algorithms μ1, . . . , μM applied to data set A. Suppose that each algorithm μm,
m = 1, . . . , M, may be run a number of times under different conditions such as
initial centroid coordinates, subsets of features, number of clusters, etc. In each lth
run, it generates a partition composed of Kl,m clusters, where l = 1, . . . , Lm, and Lm is a
given number of runs of algorithm μm.</p>
<p>The quality of each variant is evaluated with some criterion (e.g., a cluster validity
index) γl,m. We assume that the indices are properly standardized so
that a) 0 ≤ γl,m ≤ 1, and b) the more compact and distant the found clusters are, the
larger is the index value.</p>
        <p>For a pair of different objects ai, aj, we define the value</p>
<p>hl,m(i, j) = I[μl,m(ai) ≠ μl,m(aj)],
where I[·] is the indicator function: I[true] = 1, I[false] = 0; μl,m(a) is the cluster label
assigned by algorithm μm to object a in the lth run.</p>
<p>The averaged weighted CA matrix H = (h̄(i, j)) is calculated over all found variants,
taking validity indices into account. We set the matrix elements as</p>
<p>
          h̄(i, j) = Σ_{m=1}^{M} Σ_{l=1}^{Lm} wl,m(i, j) hl,m(i, j),
          (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
where wl,m(i, j) ≥ 0 are weight constants. The weights should account for the degree
of confidence in the results of cluster assignments for a given pair. In the suggested
algorithm (see below) the weights depend on validity indices and other characteristics
of the ensemble.
        </p>
<p>It is easy to prove that H defines a pseudo-metric on A; thus the matrix elements can be
considered analogs of distances between pairs of objects. To find the final partition
of A into a predefined number of clusters using matrix H, one can apply the HAC algorithm
with the average linkage rule for the definition of between-group distances.</p>
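<p>As an illustration, the computation of the averaged weighted CA matrix can be sketched as follows (a minimal Python sketch written for this text, assuming for simplicity one scalar weight per run rather than the pairwise weights wl,m(i, j) introduced later; the function name is ours):</p>

```python
import numpy as np

def weighted_coassociation(labelings, weights):
    """Averaged weighted CA matrix H (cf. Eq. (1)), a simplified sketch.

    labelings: list of 1-D integer label arrays, one per base partition run.
    weights:   one non-negative scalar weight per run (simplifying assumption).
    H[i, j] is the weighted frequency of assigning objects i and j to
    *different* clusters; it behaves like a pairwise distance.
    """
    n = len(labelings[0])
    H = np.zeros((n, n))
    for labels, w in zip(labelings, weights):
        labels = np.asarray(labels)
        # h_{l,m}(i, j) = I[the pair is split by this partition]
        H += w * (labels[:, None] != labels[None, :]).astype(float)
    return H

# toy usage: two base partitions of four objects
H = weighted_coassociation([[0, 0, 1, 1], [0, 0, 0, 1]], weights=[0.5, 0.5])
```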
      </sec>
      <sec id="sec-2-2">
        <title>Cluster validity indices</title>
<p>Cluster validity indices make it possible to compare the obtained partition with
some reference ("etalon") variant (external indices), or to assess the result of clustering with
various measures of cluster compactness and separability (internal indices). A majority
of existing indices require quadratic time for their calculation; however, indices of linear
complexity exist.</p>
<p>External indices estimate the degree of correspondence between two partitions
P1 = {C1,1, . . . , Ck,1, . . . , CK1,1} and P2 = {C1,2, . . . , Cl,2, . . . , CK2,2}, where Ck,1 =
{ai1, . . . , aiNk,1}, Cl,2 = {aj1, . . . , ajNl,2}, Nk,1 is the number of objects in the kth cluster of
the first partition, and Nl,2 is the number of objects in the lth cluster of the second partition.</p>
        <p>The Rand index is defined as</p>
<p>ϕR(P1, P2) = (S + D)/G0,
where S is the number of pairs belonging to the same cluster in both P1 and P2, D is the number
of pairs belonging to different clusters in both partitions, and G0 = C(n, 2) = n(n − 1)/2 is the total number of pairs. The index
belongs to the interval [0, 1] and defines the portion of correctly classified pairs; the value
ϕR = 1 indicates perfect agreement between the two partitions.</p>
<p>The adjusted Rand index [11] is defined with the following expression:
ϕAR(P1, P2) = [ Σ_{k,l} C(N^{(k,l)}, 2) − Q1 Q2 / G0 ] / [ (Q1 + Q2)/2 − Q1 Q2 / G0 ],
where Q1 = Σ_k C(Nk,1, 2), Q2 = Σ_l C(Nl,2, 2), N^{(k,l)} = |Ck,1 ∩ Cl,2|, and C(N, 2) = N(N − 1)/2
denotes the number of pairs. The maximum value
of ϕAR is 1; the index may take negative values. A value close to 0 indicates that the
obtained cluster partition is most probably accidental.</p>
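<p>Both indices can be computed directly from two labelings; the following sketch (an illustrative helper of ours, with pair counting done naively in O(n²)) mirrors the formulas above:</p>

```python
from collections import Counter
from itertools import combinations
from math import comb

def rand_indices(labels1, labels2):
    """Plain Rand index by direct pair counting; adjusted Rand index via the
    contingency table. An illustrative O(n^2) sketch, not the paper's code."""
    n = len(labels1)
    g0 = comb(n, 2)
    s = d = 0
    for i, j in combinations(range(n), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        s += same1 and same2              # joined in both partitions
        d += (not same1) and (not same2)  # split in both partitions
    rand = (s + d) / g0
    # adjusted Rand from cluster-size counts and the contingency table
    q1 = sum(comb(c, 2) for c in Counter(labels1).values())
    q2 = sum(comb(c, 2) for c in Counter(labels2).values())
    idx = sum(comb(c, 2) for c in Counter(zip(labels1, labels2)).values())
    expected = q1 * q2 / g0
    ari = (idx - expected) / (0.5 * (q1 + q2) - expected)
    return rand, ari

# the same grouping under relabeled cluster ids gives perfect agreement
r, ar = rand_indices([0, 0, 1, 1], [1, 1, 0, 0])
```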
        <p>Internal indices are determined for a single partition P = {C1, . . . , CK }. This work
uses two examples of internal indices: Hubert Gamma index and cophenetic correlation
coefficient.</p>
<p>The Hubert Gamma index, Γ(P), is specified as the linear correlation coefficient between
the elements of matrices R(i, j) and U(i, j) (i &lt; j), where R is the matrix of pairwise
Euclidean distances between data points, and U is the unweighted co-association matrix: U(i, j) = 0 if
∃Cq : ai, aj ∈ Cq; otherwise U(i, j) = 1. An increase in the index value typically indicates
that the quality of the partition is improving.</p>
<p>The cophenetic correlation coefficient, Coph(P), is defined with a dendrogram of the
clustering partition. The dendrogrammatic distance τ(i, j) between ai and aj is the height of
the node at which these two objects are first joined together. Coph(P) equals the
linear correlation coefficient between the Euclidean distances R(i, j) and dendrogrammatic
distances τ(i, j) over all i, j (i &lt; j).</p>
<p>Both Γ(P) and Coph(P) belong to the interval [−1, 1]; it will be more convenient for
us to replace their negative values with zeros.</p>
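<p>A small sketch of the Hubert Gamma computation (our own illustrative code, not the paper's implementation):</p>

```python
import numpy as np

def hubert_gamma(X, labels):
    """Linear correlation between pairwise Euclidean distances R(i, j) and
    the unweighted co-association values U(i, j) over pairs i < j, where
    U(i, j) = 0 if the pair shares a cluster and 1 otherwise."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)      # indices of pairs i < j
    R = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))[iu]
    U = (labels[:, None] != labels[None, :]).astype(float)[iu]
    return float(np.corrcoef(R, U)[0, 1])

# two compact, well-separated clusters: R and U agree almost perfectly
g = hubert_gamma([[0, 0], [0.1, 0], [5, 5], [5.1, 5]], [0, 0, 1, 1])
```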
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Margin of cluster ensemble</title>
<p>Suppose that the data sample is a mixture of a finite number of components (classes). A
latent ground-truth variable Y defines the class number to which an object belongs.
Matrix Z with elements</p>
      <p>
        Z(i, j) = I[ Y(ai) ≠ Y(aj) ],
        (
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
where ai and aj are arbitrary data objects, defines their true status (i.e., whether ai and
aj indeed belong to separate classes). Let us define the margin matrix dependent on the
variants of cluster partitions. For a pair ai, aj, their margin is
mg(i, j) = {weighted number of votes for Z(i, j)} − {weighted number of votes against Z(i, j)}
and can be written as
      </p>
      <p>
        mg(i, j) = Σ_{m=1}^{M} Σ_{l=1}^{Lm} wl,m(i, j) { I[hl,m(i, j) = Z(i, j)] − I[hl,m(i, j) ≠ Z(i, j)] }.
      </p>
      <p>This value indicates to what extent the number of right decisions for ai, aj exceeds the
number of wrong ones. Evidently, it equals</p>
      <p>
        mg(i, j) = Σ_{m=1}^{M} Σ_{l=1}^{Lm} wl,m(i, j) (2Z(i, j) − 1)(2hl,m(i, j) − 1).
        (
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
      </p>
<p>The margin matrix cannot be calculated if the true partition is unknown. However, it
was shown in [10] that some of the margin's characteristics can be assessed using a few
basic assumptions on the statistical behavior of clustering algorithms. Furthermore,
an upper bound for the error probability (at assigning objects in a pair
to different clusters) was found, and it was proved that an increase in the expected margin and
a decrease in margin variance reduce the error, thus improving the quality of the
cluster ensemble.</p>
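<p>For a single pair with known true status Z, the margin reduces to a weighted sum of agreements; a minimal sketch (the function name and the flat per-run weight vector are illustrative assumptions of ours):</p>

```python
import numpy as np

def ensemble_margin(h_votes, Z, weights):
    """Margin for one object pair: weighted votes agreeing with the true
    status Z minus weighted votes against it, using the identity from the
    text: I[h = Z] - I[h != Z] = (2Z - 1)(2h - 1).

    h_votes: 0/1 decisions h_{l,m} (1 = the pair is split) over all runs;
    weights: the corresponding non-negative weights w_{l,m}.
    """
    h = np.asarray(h_votes, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * (2 * Z - 1) * (2 * h - 1)))

# pair truly split (Z = 1): two runs vote "split", one votes "join"
m = ensemble_margin([1, 1, 0], Z=1, weights=[0.4, 0.4, 0.2])
```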
      <sec id="sec-3-1">
        <title>Conditional expectation of margin</title>
<p>
          Consider a specific form of (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ): let each weight be a function of a validity index and a quantity αm(i, j) ≥ 0:
wl,m(i, j) = αm(i, j) γl,m / Lm,
          (
          <xref ref-type="bibr" rid="ref4">4</xref>
          )
        </p>
        <p>
          where Σ_m αm(i, j) = 1 for all i, j (in (
          <xref ref-type="bibr" rid="ref4">4</xref>
          ), we separate a component dependent on the validity index).
        </p>
<p>
          Below we formulate the basic assumptions which allow us to assess characteristics of the
cluster ensemble:
a) Data table X is arbitrary and fixed; for any pair of data objects ai, aj, chosen at
random from A, their true status is a random value Z(i, j) defined with (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ).
b) Each algorithm μm is randomized, i.e., it depends on a random vector Ωm from a
set of allowable parameters, m = 1, . . . , M. For data table X as input, μm is run Lm
times with i.i.d. parameters Ω1,m, . . . , ΩLm,m being statistical copies of Ωm.
        </p>
        <p>It should be noted that the i.i.d. condition must be ensured through a mechanism
of the ensemble generation.</p>
<p>
          For any algorithm μm, the conditional probabilities of correct decisions (either
separation or joining of ai, aj under fixed Z(i, j)) are denoted as
qm1(i, j) = P[hm(i, j, Ωm) = 1 | Z(i, j) = 1],
          (
          <xref ref-type="bibr" rid="ref5">5</xref>
          )
qm0(i, j) = P[hm(i, j, Ωm) = 0 | Z(i, j) = 0].
          (
          <xref ref-type="bibr" rid="ref6">6</xref>
          )
        </p>
<p>The measure of cluster validity obtained by algorithm μm on Ωm is considered a
random value γm(X, Ωm). Because the quality criterion is determined on the whole
data set, one may regard this value as practically independent of the quantities Z(i, j),
hl,m(i, j), etc., defined for a specific object pair.</p>
<p>Proposition 1. Provided assumptions a) and b) hold, the conditional mathematical
expectation of the ensemble margin for ai, aj under Z(i, j) = z is</p>
        <p>EΩ̄[mg(i, j) | Z(i, j) = z] = Σ_m αm(i, j) E[γm] (2pm(z; i, j) − 1),</p>
        <p>where Ω̄ = (Ω1,1, . . . , ΩLM,M), E[γm] = EΩm[γm(Ωm)], and pm(z; i, j) = (1 − z) qm0(i, j) +
z qm1(i, j).</p>
        <p>Proof. Let us denote by</p>
<p>hl,m(i, j, Ωl,m) = I[μm(i, Ωl,m) ≠ μm(j, Ωl,m)]
the decision for a pair ai, aj, where μm(i, Ωl,m) is the cluster label assigned to object ai
by algorithm μm in its lth run with parameter vector Ωl,m, and also denote by
hm(i, j, Ωm) = I[μm(i, Ωm) ≠ μm(j, Ωm)]
the decision under vector Ωm.</p>
<p>
          Until the end of the proof, let us omit the arguments i, j for short: mg(i, j) = mg, Z(i, j) = Z,
etc. From (
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) and (
          <xref ref-type="bibr" rid="ref4">4</xref>
          ) we have:
        </p>
<p>EΩ̄[mg | Z = z] = Σ_m αm (1/Lm) Σ_{l=1}^{Lm} EΩ̄[ γl,m(Ωl,m) (2z − 1)(2hl,m(Ωl,m) − 1) ].</p>
        <p>Because Ωl,m and Ωm are equally distributed and γl,m is independent of hl,m, it holds
true that</p>
        <p>EΩ̄[mg | Z = z] = Σ_m αm E[γm(Ωm)] (2z − 1)(2EΩm[hm(Ωm)] − 1).</p>
        <p>For z = 0 we have:
(2z − 1)(2EΩm[hm(Ωm)] − 1) = −(2P[hm(Ωm) = 1 | Z = 0] − 1) = 2qm0 − 1;
and for z = 1:</p>
<p>(2z − 1)(2EΩm[hm(Ωm)] − 1) = 2P[hm(Ωm) = 1 | Z = 1] − 1 = 2qm1 − 1.</p>
        <p>Therefore,</p>
        <p>EΩ̄[mg | Z = z] = Σ_m αm E[γm(Ωm)] (2pm(z; i, j) − 1).</p>
        <p>This completes the proof.</p>
        <p>
          In this work we also assume that each algorithm μm creates a partition for which
the probability of a correct decision for any pair ai, aj is greater than 0.5, i.e.,
qm0(i, j) &gt; 0.5, qm1(i, j) &gt; 0.5.
          (
          <xref ref-type="bibr" rid="ref7">7</xref>
          )
This means that the clustering algorithms included in the ensemble possess at least slightly
better classification accuracy than a trivial procedure based on random assignment of
a pair to the same or different clusters.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation of the expected margin and its optimization</title>
<p>The main idea of our method is to control the behavior of the cluster ensemble by changing
the weights α1, . . . , αM, looking for the maximum conditional expectation of the margin. The
expression for the expected margin depends on a number of characteristics (pm(z; i, j),
E[γm]) which should be evaluated from the results of the base clustering algorithms, i.e.,
a number of partitions along with the evaluated validity indices γl,m, l = 1, . . . , Lm, m =
1, . . . , M.</p>
<p>To evaluate pm(z; i, j), one may consider the obtained partitions and calculate
co-association matrices with elements</p>
        <p>hl,m(i, j) = I[μl,m(ai) ≠ μl,m(aj)].</p>
        <p>For each (i, j), the division frequency is P̄m(i, j) = (1/Lm) Σ_{l=1}^{Lm} hl,m(i, j). Each P̄m(i, j) is
an estimate of the conditional probability</p>
<p>
          P[hm(i, j, Ωm) = 1 | Z(i, j) = z],
depending on the realization of Z(i, j). Provided assumption (
          <xref ref-type="bibr" rid="ref7">7</xref>
          ) holds, one may
assess pm(z; i, j) with the quantity
        </p>
        <p>p¯m(i, j) = max(P¯m(i, j), 1 − P¯m(i, j)).</p>
        <p>This value can be interpreted as an estimate of local stability of cluster decisions for a
pair ai, aj .</p>
<p>To evaluate the theoretical means E[γm], m = 1, . . . , M, one may use sample averages
of the validity indices γ̄m = (1/Lm) Σ_{l=1}^{Lm} γl,m.</p>
        <p>Let us consider the problem of maximization of the expected margin:
for each i, j (i &lt; j), find α1(i, j), . . . , αM(i, j) such that
Σ_m αm(i, j) γ̄m (2p̄m(i, j) − 1) → max over α1(i, j), . . . , αM(i, j).</p>
        <p>The solution to this linear programming problem is rather straightforward:
α*m(i, j) = 1 if m = m* = argmax_{m=1,...,M} {γ̄m p̄m(i, j)}, and α*m(i, j) = 0 otherwise.
          (
          <xref ref-type="bibr" rid="ref8">8</xref>
          )
        </p>
<p>As one can see, in the resulting optimization strategy only the best algorithm from
the ensemble (considering quality and stability estimates) is taken into account. For
different pairs ai, aj, different algorithms may come out on top.</p>
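<p>The resulting rule can be sketched in a few lines: estimate the local stability p̄m(i, j) from the division frequencies and, for every pair, place all weight on the algorithm maximizing γ̄m p̄m(i, j) (an illustrative vectorized sketch; the function name and array shapes are our assumptions):</p>

```python
import numpy as np

def optimal_alpha(P_split, gammas):
    """Weight selection sketch.

    P_split: (M, n, n) division frequencies P_m(i, j) of the M algorithms;
    gammas:  length-M sample means of validity indices.
    Returns 0/1 weights alpha of shape (M, n, n): for every pair, all weight
    goes to the algorithm maximizing gamma_m * p_m(i, j).
    """
    P = np.asarray(P_split, dtype=float)
    g = np.asarray(gammas, dtype=float)
    p = np.maximum(P, 1.0 - P)            # local stability p_m(i, j)
    best = np.argmax(g[:, None, None] * p, axis=0)
    M = P.shape[0]
    return (np.arange(M)[:, None, None] == best[None, :, :]).astype(float)

# two algorithms, two objects: the stable first algorithm wins the pair (0, 1)
alpha = optimal_alpha(
    np.array([[[0.0, 1.0], [1.0, 0.0]],
              [[0.0, 0.5], [0.5, 0.0]]]),
    gammas=[0.5, 0.9])
```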
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Finding ensemble partition with minimum spanning tree</title>
<p>
        Let G = (V, E) be an undirected graph with non-negative edge weights w, where V is the
set of vertices (in our case, V = A, i.e., the set of data objects) and E is the set of edges.
Matrix w = (wi,j) defines the weights of edges (ai, aj), where i, j = 1, . . . , n. As a weight
matrix, we consider the averaged CA matrix H with elements h̄(i, j) defined according
to (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ), (
        <xref ref-type="bibr" rid="ref8">8</xref>
        ).
      </p>
<p>A spanning tree is a tree with n − 1 edges, i.e., a tree that connects all the vertices.
The total weight of a spanning tree T is defined as wT = Σ_{e∈T} w(e). A minimum
spanning tree is a spanning tree of minimum total weight.</p>
<p>A large number of methods for MST construction exist [12]. Their running time is
of polynomial order (in some special cases, nearly linear).</p>
<p>In a graph-theoretic framework, computationally efficient MST-based algorithms
are widely used in cluster analysis [13]. Once the MST T is built for a given data
set, the k − 1 edges with the largest weights are removed, which creates a partition Pk
of A into k clusters. The obtained variant is evaluated with an objective function; the
best partition over k ∈ {2, . . . , Kmax} forms the output (Kmax is the algorithm's
parameter). MST clustering algorithms are capable of finding complex non-spherical
clusters; they can automatically determine the number of clusters using a predefined
quality functional.</p>
<p>In this paper, we apply the well-known Kruskal's algorithm [14] for finding the MST for
data set A and matrix H. Using the obtained MST, the final ensemble partition is
created. As a quality criterion, we use the following functional:
F(Pk) = k Π_{c=1}^{k} Nc wc / wT,
        (
        <xref ref-type="bibr" rid="ref9">9</xref>
        )
where Nc is the size of the cth cluster and wc is the total weight of the edges which enter into the cth cluster.</p>
      <p>MST-based ensemble clustering algorithm (MSTEClust) has the following main
steps.</p>
      <sec id="sec-4-1">
        <title>Algorithm MSTEClust.</title>
      </sec>
      <sec id="sec-4-2">
        <title>Input:</title>
<p>
          A = {a1, . . . , an}: dataset described with data table X;
Kmax: maximum number of clusters;
L1, . . . , LM: numbers of runs for base algorithms μ1, . . . , μM;
Ω1, . . . , ΩM: sets of allowable parameters (working conditions) of the algorithms.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Output:</title>
        <p>partition of A into optimal number K of clusters (2 ≤ K ≤ Kmax).</p>
<p>
          Steps:
1. Generate L1, . . . , LM variants of cluster partition with algorithms μ1, . . . , μM for
randomly chosen working parameters; calculate validity indices γl,m, l = 1, . . . , Lm, m =
1, . . . , M.
2. Calculate the averaged co-association matrix H with optimal weights according to (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ),
(
          <xref ref-type="bibr" rid="ref4">4</xref>
          ), (
          <xref ref-type="bibr" rid="ref8">8</xref>
          ).
3. Construct the MST on graph (V, E) = (A, H) using Kruskal's algorithm.
4. Find a partition of A into the optimal number of clusters K using the MST and criterion (
          <xref ref-type="bibr" rid="ref9">9</xref>
          ).
end.
        </p>
<p>The running time of steps 1-2 depends linearly on the complexity of the base clustering
algorithms and validity estimators. Constructing the final partition requires O(n log n)
operations. One may conclude that MSTEClust is more readily applicable to large
amounts of data than PWEC [10], which has time complexity of order O(n²).</p>
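<p>Steps 3-4 of MSTEClust can be illustrated with a self-contained sketch: Kruskal's algorithm with a union-find structure builds the MST on H, and cutting the k − 1 heaviest tree edges yields k clusters (the number of clusters is fixed here for brevity; selecting K by criterion (9) is omitted, and the function name is ours):</p>

```python
import numpy as np

def mst_partition(H, k):
    """Kruskal's MST on the CA matrix H, then remove the k - 1 heaviest
    tree edges; the resulting connected components are the k clusters."""
    n = H.shape[0]
    parent = list(range(n))

    def find(x):
        # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((H[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # edge joins two components
            parent[ri] = rj
            tree.append((w, i, j))
    tree.sort()                           # keep the n - k lightest tree edges
    parent = list(range(n))
    for w, i, j in tree[: len(tree) - (k - 1)]:
        parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

# toy CA matrix: objects {0, 1} and {2, 3} are rarely split within groups
H = np.array([[0.0, 0.1, 0.9, 0.9],
              [0.1, 0.0, 0.9, 0.9],
              [0.9, 0.9, 0.0, 0.1],
              [0.9, 0.9, 0.1, 0.0]])
labels = mst_partition(H, k=2)
```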
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Numerical experiments</title>
<p>This section describes numerical experiments with the MSTEClust algorithm using Monte
Carlo modeling. In the experiments, two base clustering algorithms are included in
the ensemble: k-means (KM) and the hierarchical agglomerative clustering algorithm with
the single linkage rule (HACSL).</p>
<p>We choose the following distribution model. In a 16-dimensional feature space, four
classes are defined. The first and second classes have normal distributions N(ν1, Σ) and
N(ν2, Σ), where ν1 = (0, ..., 0)T, ν2 = (6, ..., 6)T, and Σ = σI is a diagonal covariance matrix
with σ = 1. The coordinates of objects from the other two classes are determined recursively:
x^k_{i+1} = x^k_i + θ1 · 1 + θ2 · ε, where 1 = (1, . . . , 1)T, θ1 = θ2 = 0.25, ε is a Gaussian
random vector N(0, I), and k = 3, 4. For class 3, x^3_1 = (−6, 6, ..., −6, 6)T + 0.1 · ε; for class
4, x^4_1 = (6, −6, ..., 6, −6)T + 0.1 · ε. The number of objects in each class equals 25.
Figure 1 illustrates sampled data.</p>
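<p>Under our reading of the model, sampling can be sketched as follows (a hypothetical generator written for this text; the seed and function name are ours):</p>

```python
import numpy as np

def sample_model(n_per_class=25, dim=16, seed=0):
    """Two spherical Gaussian classes around 0 and 6*(1,...,1), plus two
    'chain' classes generated by x_{i+1} = x_i + 0.25*1 + 0.25*eps,
    started near (-6, 6, ..., -6, 6) and its negation."""
    rng = np.random.default_rng(seed)
    ones = np.ones(dim)
    c1 = rng.normal(0.0, 1.0, size=(n_per_class, dim))
    c2 = rng.normal(6.0, 1.0, size=(n_per_class, dim))
    alt = np.tile([-6.0, 6.0], dim // 2)   # alternating starting point
    chains = []
    for start in (alt, -alt):
        x = start + 0.1 * rng.normal(size=dim)
        pts = [x]
        for _ in range(n_per_class - 1):
            x = x + 0.25 * ones + 0.25 * rng.normal(size=dim)
            pts.append(x)
        chains.append(np.array(pts))
    X = np.vstack([c1, c2] + chains)
    y = np.repeat([0, 1, 2, 3], n_per_class)
    return X, y

X, y = sample_model()
```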
<p>The random subspace method is used for the ensemble construction: each base
cluster partition is built on dens randomly chosen variables (dens = 3 in this experiment).
In addition, for each base algorithm the number of clusters is chosen at random from
the range {2, . . . , Kmax}, Kmax = 5. The obtained clustering results are assessed with
cluster validity indices: the Hubert Gamma index estimates the quality of KM; the results
of HACSL are evaluated with the cophenetic correlation coefficient. The number of ensemble
variants for each algorithm is set to L1 = L2 = 25.</p>
<p>To study the behavior of the ensemble algorithm in the presence of noise, some of
the features, whose indices are chosen at random and whose total number equals the
parameter d0, are replaced with random values uniformly distributed over the feature range.</p>
<p>In the process of Monte Carlo modeling, artificial data sets are repeatedly generated
according to the specified distribution model. For each data set, a number of base
partitions are obtained by KM and HACSL; the ensemble solution is found according
to MSTEClust.</p>
<p>Figure 2 presents the results of the experiments (Adjusted Rand index averaged
over 100 random samples). For comparison purposes, the results of KM and HACSL
(for which the true number of clusters is given) are plotted as well. To illustrate the
effect of weight optimization taking validity indices into account, Figure 2 also displays
evaluations of the performance of a PWEC [10] modification in which the MST is used
to construct the ensemble partition instead of a dendrogram algorithm.</p>
<p>The plots provide positive evidence for the suggested ensemble algorithm with optimized
weights. Beginning with a sufficiently large number of noise features, MSTEClust gives
significantly better clustering quality than the other examined procedures. In contrast,
the non-ensemble algorithms totally fail to recover a heavily distorted data structure.</p>
<p>
        An important problem is the experimental verification of the theoretical suppositions which
lie at the basis of the suggested method. The validity of assumption (
        <xref ref-type="bibr" rid="ref7">7</xref>
        ) is crucial
because it determines the way of assessing the expected margin. In Monte Carlo
simulation, the true status of each object pair is known, thus it is possible to evaluate the
conditional probabilities qm1(i, j), qm0(i, j) in (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ), (
        <xref ref-type="bibr" rid="ref6">6</xref>
        ) using the available frequencies. In this
experiment, all the above-described settings are unchanged. Table 1 presents the results
of the estimations. From this table, one can see that under small and
moderate noise (up to three noise features) the estimates stay above 0.5, thus confirming
assumption (
        <xref ref-type="bibr" rid="ref7">7</xref>
        ). Under increasing noise, violation of the inequalities is seen, evidently
resulting in the decrease of the algorithm's accuracy.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
<p>In this work we have suggested a method of ensemble clustering using a number of
different base algorithms. The ensemble partition is found via a weighted average of
co-association matrices, where the weights depend on cluster validity indices. A
mathematical methodology for determining the optimal weights is proposed. As an optimization
functional, we utilize the estimate of the ensemble's expected margin. To construct the
final partition, it is suggested to use a minimum spanning tree built on the averaged
co-association matrix taken as an adjacency matrix.</p>
      <p>Unlike other existing methods of ensemble clustering, the proposed one takes into
account both stability measures of base partitions and the obtained quality estimates;
it is capable of finding the optimal number of clusters.</p>
      <p>The efficiency of the suggested MSTEClust algorithm is confirmed experimentally
with Monte-Carlo modeling. For a given distribution model, the simulations under
noise distortions have demonstrated significant improvement of clustering quality for
MSTEClust in comparison with other considered algorithms. Monte-Carlo experiments
have confirmed the principal feasibility of theoretical assumptions used in this work for
the estimation of cluster ensemble characteristics.</p>
<p>In future work, the author plans to continue studying the theoretical properties
of clustering ensembles and developing algorithms for real-world applications such as
hyperspectral image analysis and segmentation.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work was partially supported by the Russian Foundation for Basic Research,
project 14-07-00249a.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
<string-name>
            <surname>Falkenauer</surname>
            <given-names>E.</given-names>
          </string-name>
          : Genetic Algorithms and Grouping Problems. Wiley, New York (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Naldi</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campello</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hruschka</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carvalho</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Efficiency of evolutionary k-means</article-title>
          .
          <source>Applied Soft Computing. 11</source>
          ,
          <fpage>1938</fpage>
          -
          <lpage>1952</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Duda</surname>
            <given-names>R.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hart</surname>
            <given-names>P.E.</given-names>
          </string-name>
          ,
<string-name>
            <surname>Stork</surname>
            <given-names>D.G.</given-names>
          </string-name>
          : Pattern Classification. Second Edition. Wiley, New York (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jain</surname>
            <given-names>A.K.</given-names>
          </string-name>
,
          <string-name>
            <surname>Dubes</surname>
            <given-names>R.C.</given-names>
          </string-name>
          :
          <article-title>Algorithms for clustering data</article-title>
          . Prentice Hall, NJ (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jain</surname>
            <given-names>A.K.</given-names>
          </string-name>
          :
          <article-title>Data clustering: 50 years beyond k-means</article-title>
          .
          <source>Pattern Recognition Letters</source>
          .
          <volume>31</volume>
          (
          <issue>8</issue>
          ),
          <fpage>651</fpage>
          -
          <lpage>666</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ghosh</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acharya</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Cluster ensembles</article-title>
          .
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          . Vol.
          <volume>1</volume>
          ,
          <fpage>305</fpage>
          -
          <lpage>315</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Vega-Pons</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruiz-Shulcloper</surname>
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A Survey of Clustering Ensemble Algorithms</article-title>
          .
          <source>IJPRAI</source>
          <volume>25</volume>
          (
          <issue>3</issue>
          ),
          <fpage>337</fpage>
          -
          <lpage>372</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Vega-Pons</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruiz-Shulcloper</surname>
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Clustering Ensemble Method for Heterogeneous Partitions</article-title>
          .
          <source>CIARP'09</source>
          ,
          <fpage>481</fpage>
          -
          <lpage>488</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Naldi</surname>
            <given-names>M. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carvalho</surname>
            <given-names>A. C. P. L. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campello</surname>
            <given-names>R. J. G. B.</given-names>
          </string-name>
          :
          <article-title>Cluster ensemble selection based on relative validity indexes</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          . Vol.
          <volume>27</volume>
          ,
          <fpage>259</fpage>
          -
          <lpage>289</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Berikov</surname>
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Weighted ensemble of algorithms for complex data clustering</article-title>
          .
          <source>Pattern Recognition Letters</source>
          . Vol.
          <volume>38</volume>
          ,
          <fpage>99</fpage>
          -
          <lpage>106</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hubert</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arabie</surname>
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Comparing partitions</article-title>
          .
          <source>Journal of Classification</source>
          . Vol.
          <volume>2</volume>
          ,
          <fpage>193</fpage>
          -
          <lpage>218</lpage>
          (
          <year>1985</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Gabow</surname>
            <given-names>H. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Galil</surname>
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spencer</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarjan</surname>
            <given-names>R. E.</given-names>
          </string-name>
          :
          <article-title>Efficient algorithms for finding minimum spanning trees in undirected and directed graphs</article-title>
          .
          <source>Combinatorica</source>
          . Vol.
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <fpage>109</fpage>
          -
          <lpage>122</lpage>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Gower</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ross</surname>
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Minimum spanning trees and single linkage cluster analysis</article-title>
          .
          <source>Applied Statistics</source>
          . Vol.
          <volume>18</volume>
          ,
          <fpage>54</fpage>
          -
          <lpage>64</lpage>
          (
          <year>1969</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kruskal</surname>
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>On the shortest spanning subtree and the traveling salesman problem</article-title>
          .
          <source>Proceedings of the American Mathematical Society</source>
          , pp.
          <fpage>48</fpage>
          -
          <lpage>50</lpage>
          (
          <year>1956</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>