Searching for Optimal Classifier Using a Combination of Cluster Ensemble and Kernel Method

Vladimir B. Berikov (1,2) and Lyailya Sh. Cherikbayeva (3)

1 Sobolev Institute of Mathematics, Novosibirsk, Russia, berikov@math.nsc.ru
2 Novosibirsk State University, Novosibirsk, Russia
3 Al-Farabi Kazakh National University, Almaty, Kazakhstan, lailash01@gmail.com

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: S. Belim et al. (eds.): OPTA-SCL 2018, Omsk, Russia, published at http://ceur-ws.org

Abstract. This work introduces a supervised classification algorithm based on a combination of ensemble clustering and a kernel method. The main idea behind the algorithm is the expectation that ensemble clustering, used as a preliminary stage, restores metric relations between data objects more accurately under noise distortions and in the presence of complex data structures, eventually raising the overall classification quality. The algorithm consists of two major steps. In the first step, the averaged co-association matrix is calculated using a cluster ensemble. It is proved that this matrix satisfies Mercer's condition, i.e., it defines a symmetric non-negative definite kernel. In the next step, an optimal classifier is found with the obtained kernel matrix as input. The classifier maximizes the width of the hyperplane's separation margin in the space induced by the cluster ensemble kernel. Numerical experiments with artificial examples and a real hyperspectral image have shown that the proposed algorithm possesses classification accuracy comparable with some state-of-the-art methods and in many cases outperforms them, especially under noise conditions.

Keywords: Kernel based learning · Cluster ensemble · Co-association matrix · Support vector machine

1 Introduction

Kernel based learning and collective decision making (the ensemble approach) are among the top trends influencing the progress of machine learning.

A kernel (potential) function defines an implicit non-linear mapping of the initial feature space into a new space of larger or even infinite dimensionality. In the new space, the initial configurations of patterns are transformed (via the so-called "kernel trick") into structures which are often more compact and linearly separable. To calculate distances or average values in the new space, it is not necessary to determine the coordinates of the transformed points; it is enough to know only the values of the kernel function. Such methods as Support Vector Machine (SVM), Kernel Fisher Discriminant (KFD), Kernel K-means, Kernel Principal Component Analysis, etc., are based on this methodology [1]. The decision function is defined by a linear combination of kernels determining distances to a number of sample elements. In the existing algorithms, the kernel matrix is usually calculated using the Euclidean distance between objects in the input feature space.

The ensemble approach exploits the idea of collective decision making by using algorithms working under different settings such as subsets of parameters, subsamples of data, combinations of features, etc. Ensemble based systems usually yield robust and effective solutions, especially in the case of uncertainty in the data model or when it is not clear which of the algorithm's parameters are most appropriate for a particular problem. As a rule, a properly organized ensemble (even one composed of "weak" predictors) significantly improves the overall quality of decisions [2-6].
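As a minimal illustration of the kernel trick mentioned above, the following sketch computes the distance between two points in the induced feature space from kernel values alone, without ever constructing the mapped points. The RBF kernel and the toy vectors are chosen here for illustration only and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian (RBF) kernel K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def induced_distance(x, y, kernel):
    # ||phi(x) - phi(y)||^2 = K(x, x) - 2 K(x, y) + K(y, y): only kernel values are needed
    return np.sqrt(kernel(x, x) - 2 * kernel(x, y) + kernel(y, y))

x = np.array([0.0, 1.0])
y = np.array([1.5, -0.5])
print(induced_distance(x, y, rbf_kernel))
```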
Cluster analysis aims at determining a partition of a dataset into natural clusters using object descriptions and a certain criterion of compactness and remoteness of groups. Ensemble clustering is one of the successful implementations of the collective methodology. There are a number of major techniques for constructing the ensemble decision [7-9]. Following the evidence accumulation approach [10], the clustering partition is found in two steps. In the first step, a number of clustering results are obtained (for example, by running K-means with different numbers of clusters or random initializations of centroids). For each partition variant, a boolean co-association matrix is calculated; its elements correspond to pairs of data objects and indicate whether the pair belongs to the same cluster or not. In the second step, the averaged co-association matrix is calculated over all variants and is used for constructing the resultant partition: the matrix elements are considered as distances or similarity measures between data points, and any clustering algorithm designed for this type of input information is applied to get the final clustering partition.

This paper introduces an algorithm of classifier construction using a combination of ensemble clustering and kernel based learning. The proposed method is based on the hypothesis that preliminary ensemble clustering allows one to restore more accurately the metric relations between objects under noise distortions and in the presence of complex data structures. The obtained kernel matrix depends on the outputs of the clustering algorithms and is less sensitive to noise than a conventional similarity matrix. Clustering with a sufficiently large number of clusters can be viewed as a Learning Vector Quantization method [11], known for lowering the average distortion in data. These reasons, as supposed, eventually result in an increase of the recognition accuracy of the combination.

The outline of the method is as follows. First, a number of variants of the dataset partitioning are obtained with a base clustering algorithm. Then the averaged co-association matrix is calculated, where the averaging is performed with weights dependent on the obtained ensemble's characteristics. The matrix elements play the role of similarity measures between objects in the new feature space induced by an implicit non-linear transformation of the input features. In the second stage, a kernel classifier is found using the obtained co-association matrix as the input kernel matrix (we used SVM in the numerical experiments).

The aim of this paper is to verify the practicability of the suggested methodology with theoretical analysis and experimental evaluation.

There are two main types of cluster ensembles: homogeneous (when a single algorithm partitions data by varying its working settings) and heterogeneous ones (which include a number of different algorithms). A heterogeneous cluster ensemble was considered in [12, 13], where methods for its weight optimization were suggested. A homogeneous cluster ensemble was investigated in [14] with use of a probabilistic model assuming the validity of some key assumptions. In the current work, we follow the scheme of a homogeneous ensemble and perform a theoretical investigation of some of its properties using less restrictive assumptions.

The rest of the paper is organized as follows. Section 2 briefly overviews related works.
Section 3 introduces the necessary notions in the field of kernel based classifiers and ensemble clustering. In the next section we prove that the weighted co-association matrix obtained with a cluster ensemble is a valid kernel matrix; the proposed algorithm of classifier design, KCCE, is also presented, and some details of the optimization procedure are given. Section 5 provides a probabilistic analysis of the ensemble clustering stage. The next section describes the results of numerical experiments with KCCE. The conclusion summarizes the work and describes some of the future plans.

2 Related Works

The idea of combining cluster analysis and pattern recognition methods is rather well known in machine learning [15]. There are several natural reasons for the combination:

– Cluster analysis can be viewed as a tool for data cleaning to eliminate outliers or noisy items from the learning sample [16].
– The joint learning and control (test) sample provides additional information on the data distribution which can be utilized to improve the classifier performance (this way of reasoning is sometimes called transductive learning). For example, the authors of [17] make a partition of the united sample into clusters which are used to design a more accurate decision rule.
– In the semi-supervised learning context [18], the usage of a small amount of labeled data in combination with a large volume of unlabeled examples is useful for constructing a more efficient classifier.

A connection between cluster analysis and kernel based classifiers was established in [19], where cluster kernels were proposed implementing the cluster assumption in the form: "two points are likely to have the same class label if there is a path connecting them passing through regions of high density only". Three types of kernels were presented: kernels from mixture models, random walk kernels and kernels induced by a cluster representation with spectral clustering [20].

The usage of a certain similarity function (which does not necessarily possess the positive semi-definiteness property) instead of a kernel function was proposed in [21]. A classifier is found in two stages. In the first stage, some "supporting" points are chosen. With regard to these points, according to the defined similarity function, the initial observations are mapped into a metric space of small dimensionality. In the second stage, a linear classification rule is constructed in the new space using an SVM-type algorithm to find the classification margin of maximum width.

Following the idea of combining cluster ensembles and supervised classification, the authors of [22] construct a new feature space using the degree of membership of objects to clusters in the obtained variants of data partitioning with a cluster ensemble. The transformed feature matrix is utilized as the input training set for classification using conventional techniques such as Decision Tree, Naive Bayes, K-nearest neighbors, Neural Network. The method showed its effectiveness in comparison with a number of state-of-the-art procedures.

Unlike the above mentioned works, we apply a completely different combination scheme based on the notion of kernel function.

3 Basic Preliminaries

Suppose we are given a data set A = {a_1, ..., a_N} consisting of N objects (examples), A ⊂ Γ, where Γ is a statistical population. Information about the objects is presented in the form of a feature matrix Z = (X, Y) = (x_i, y_i)_{i=1}^N, where
x_i = (x_{i,1}, ..., x_{i,d}) ∈ R^d is the input feature vector (d is the feature space dimensionality), x_{i,m} = X_m(a_i) is the value of feature X_m for object a_i, and y_i is the class label attributed to the ith object, i = 1, ..., N. For the binary classification task we assume y_i ∈ {-1, 1}. In the multi-class classification problem, an arbitrary finite set of unordered class labels is defined.

On the basis of the information about A (training sample), it is required to find a classifier (predictor, decision function) y = f(x), optimal in some sense, e.g. having minimal expected losses for unseen examples. To examine the performance of the classifier, it is possible to use a test sample B = {b_1, ..., b_{N_t}}, B ⊂ Γ, described by a feature matrix X_test. We shall presume that the objects in A and B are independent and identically distributed (iid), that is, the sets are collected on the basis of independent random choice of objects from Γ without replacement following a fixed distribution.

Kernel classifiers [1] make use of the notion of a kernel function K(x_i, x_j) ≥ 0, where K is a kind of similarity measure between two data points. The linear kernel classifier exemplifies a binary decision function introduced within this approach:

f(x) = \mathrm{sign}\Bigl(\sum_{x_i \in X} \alpha_i y_i K(x, x_i)\Bigr),

where sign is the sign function and α_1, ..., α_N are non-negative weights. A number of methods for determining the weights (Support Vector Machine, Kernel Fisher Discriminant, etc.) exist.

For the SVM classifier, the weights are found as a solution to the constrained quadratic optimization problem of maximizing the width of the margin (separation region) between the two classes in the Hilbert space induced by the kernel mapping. KFD is a kernelized version of Fisher's linear discriminant analysis (LDA), which aims at finding a position of a straight line in feature space for which the objects' projections are separated as well as possible according to a functional minimizing the within-class scatter of projections and maximizing the between-class distance.

The general multi-class classification problem can be solved by the application of a series of binary classification tasks for SVM or KFD, e.g., one-against-all, one-against-one or Error Correcting Output Codes (ECOC) schemes [23]. The kernel k-NN classifier assigns data points according to the k nearest neighbor rule, where the neighboring points are determined with respect to the similarity measure defined by the kernel function.

Consider a scheme of a homogeneous cluster ensemble. Let a clustering algorithm µ be run a number of times under different conditions such as initial cluster centroid coordinates, subsets of features, number of clusters or other parameters. The joint data set A ∪ B is the input for the algorithm (if the test sample is unavailable at the moment of classifier design, then the set A is the input). In the lth trial, algorithm µ creates a partition of the given dataset composed of K_l clusters, where l = 1, ..., L, and L is the given number of runs. For each variant of clustering, we define an evaluation function γ_l (a cluster validity index or diversity measure). We suppose that the values are scaled to non-negative quantities and that the better the found variant is according to the chosen criterion, the larger the quantity. For a pair of different data objects (a_i, a_j) ∈ A ∪ B, we define the value h_l(i, j) = I[µ_l(a_i) = µ_l(a_j)], where I[·] is the indicator function: I[true] = 1, I[false] = 0; µ_l(a) is the cluster label assigned by algorithm µ to object a in the lth run.
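A sketch of this homogeneous ensemble scheme, assuming K-means as the base algorithm and a random number of clusters per run, is given below; it is not the authors' implementation, and the helper names and parameter ranges are ours. It produces exactly the kind of label matrix and per-run indicators h_l(i, j) used in the sequel.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_ensemble(X, L=10, k_range=(2, 10), rng=None):
    """Run K-means L times with random settings; return the (L, n) ensemble label matrix M."""
    rng = np.random.default_rng(rng)
    labels = np.empty((L, X.shape[0]), dtype=int)
    for l in range(L):
        k = int(rng.integers(k_range[0], k_range[1] + 1))       # random number of clusters
        km = KMeans(n_clusters=k, n_init=1,
                    random_state=int(rng.integers(1_000_000)))  # random initialization
        labels[l] = km.fit_predict(X)
    return labels

def coassociation(labels_l):
    """h_l(i, j) = I[mu_l(a_i) = mu_l(a_j)] for one partition, as a 0/1 matrix."""
    return (labels_l[:, None] == labels_l[None, :]).astype(float)
```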
The ensemble matrix M stores the results of the clusterings: M = (µ_l(a_i)), l = 1, ..., L, i = 1, ..., N + N_t.

The averaged co-association matrix H = (\bar h(i,j)) is defined over all generated variants:

\bar h(i,j) = \sum_{l=1}^{L} u_l\, h_l(i,j),

where the standardized weights u_1, ..., u_L indicate the quality of clustering for the given variants: u_l = \gamma_l / \sum_{l'} \gamma_{l'}, l = 1, ..., L.

4 Kernel Classification with Averaged Co-association Matrix

Let K(x, x'): D × D → R be a symmetric function, either continuous or having a finite domain, where D is a closed subset of R^d. According to Mercer's theorem, K(x, x') is a kernel function (i.e., it defines an inner product in some metric space) if and only if for any finite set of m points {x_i}_{i=1}^m in D and real numbers {c_i}_{i=1}^m, the matrix K = (K(i,j)) = (K(x_i, x_j))_{i,j=1}^m is non-negative definite: \sum_{i,j=1}^{m} c_i c_j K(i,j) \ge 0. Let us prove the following

Proposition 1. The averaged co-association matrix satisfies Mercer's condition.

Proof. The symmetry of H is obvious. The domain of H is the finite set A ∪ B. Let I_k^{(l)} be the set of indices of data points belonging to the kth cluster in the lth variant of partitioning. Then for any {c_i}_{i=1}^m it holds that

\sum_{i,j=1}^{m} c_i c_j \bar h(i,j) = \sum_{i,j=1}^{m} c_i c_j \sum_{l=1}^{L} u_l h_l(i,j) = \sum_{l=1}^{L} u_l \sum_{i,j=1}^{m} c_i c_j h_l(i,j) = \sum_{l=1}^{L} u_l \sum_{k=1}^{K_l} \sum_{i,j \in I_k^{(l)}} c_i c_j = \sum_{l=1}^{L} u_l \sum_{k=1}^{K_l} \Bigl(\sum_{i \in I_k^{(l)}} c_i\Bigr)^2 \ge 0.

From this property, it follows that the averaged co-association matrix is a valid kernel matrix and can be used in kernel based classification methods.

Let us describe the main steps of the proposed algorithm KCCE (Kernel Classification with Cluster Ensemble).

Algorithm KCCE.

Input:
training data set Z = (X, Y) = (x_i, y_i), i = 1, ..., N;
test data set X_test;
L: number of runs of the base clustering algorithm µ;
Ω: set of allowable parameters (working conditions) of µ.

Output: decision function y = f(x); class labels attributed to X_test.

Steps:
1. Generate L variants of the clustering partition of X ∪ X_test using algorithm µ with randomly chosen working parameters; calculate the evaluation functions and weights;
2. For each pair (x_i, x_j) ∈ X ∪ X_test (i ≠ j), if the pair is assigned to the same group in the lth variant, then h_l(i,j) := 1, otherwise h_l(i,j) := 0;
3. Calculate the averaged co-association matrix H;
4. Find the decision function with the preset type of kernel classifier and matrix H;
5. Classify the test sample X_test using the found decision function and matrix H;
end.

In this paper, we use K-means as the base clustering algorithm; however, it is possible to apply any other clustering technique. As the kernel classifier, we utilize the soft margin version of SVM, which aims at optimizing the following objective function:

\frac{1}{2}\|w\|^2 + C \sum_i \xi_i \to \min_{w, b, \xi}

subject to: y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, ..., N,

where w is the normal vector to the separating hyperplane in the space induced by the kernel, b is the hyperplane's bias, ξ_i is a penalty imposed on the ith example violating the separation margin, and C ≥ 0 is the soft margin parameter. By solving the Lagrangian dual, one obtains the quadratic optimization problem:

W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \to \max

subject to: \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, ..., N,

where K(·, ·) is the kernel function. One may substitute this kernel by the cluster ensemble kernel H and get:

W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \sum_l u_l h_l(i,j) = \sum_i \alpha_i - \frac{1}{2} \sum_l u_l \sum_{k=1}^{K_l} \sum_{i,j \in I_k^{(l)}} \alpha_i \alpha_j y_i y_j = \sum_i \alpha_i - \frac{1}{2} \sum_l u_l \sum_{k=1}^{K_l} \Bigl(\sum_{i \in I_k^{(l)}} \alpha_i y_i\Bigr)^2.
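Before turning to the optimization details, here is a sketch of Steps 2-5 of KCCE under simplifying assumptions: equal weights u_l = 1/L (the paper allows validity-based weights), and scikit-learn's SVC with a precomputed kernel standing in for the authors' own SVM-SMO implementation. It reuses the generate_ensemble helper from the previous sketch.

```python
import numpy as np
from sklearn.svm import SVC

def averaged_coassociation(labels, weights=None):
    """H(i, j) = sum_l u_l * I[mu_l(a_i) = mu_l(a_j)], built from the (L, n) label matrix."""
    L, n = labels.shape
    u = np.full(L, 1.0 / L) if weights is None else np.asarray(weights) / np.sum(weights)
    H = np.zeros((n, n))
    for u_l, lab in zip(u, labels):
        H += u_l * (lab[:, None] == lab[None, :])
    return H

def kcce_fit_predict(X_train, y_train, X_test, L=10, C=10.0, rng=0):
    # Cluster the joint sample, as in Step 1 of KCCE.
    X_all = np.vstack([X_train, X_test])
    labels = generate_ensemble(X_all, L=L, rng=rng)   # helper from the previous sketch
    H = averaged_coassociation(labels)                # Steps 2-3
    n = len(X_train)
    clf = SVC(C=C, kernel="precomputed").fit(H[:n, :n], y_train)   # Step 4
    return clf.predict(H[n:, :n])                                  # Step 5: test-vs-train kernel
```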
We search for the optimal solution using the Sequential Minimal Optimization (SMO) method [24]. A point x_i for which α_i > 0 is called a support vector. The bias term b is determined from any support vector x_{i*}:

b = y_{i*} - \sum_i y_i \alpha_i H(x_i, x_{i*}).

The decision for x_j ∈ X_test is calculated using the found multipliers:

f(x_j) = \mathrm{sign}\Bigl(\sum_i y_i \alpha_i H(x_i, x_j) + b\Bigr).

One can see that there is no need to store the kernel matrix: the objective function is computed using the ensemble matrix M kept in memory. Therefore KCCE has linear storage complexity with respect to the data size. The time complexity depends on the type of the utilized clustering and kernel algorithms and is linear with respect to the data matrix dimensionality in the case of K-means and SVM-SMO.

In some classification tasks, test data are unavailable at the moment of classifier design. To find the decision function f(x) for a new feature vector x, this observation should be attributed to clusters according to the obtained partition variants. It is possible to make this assignment using the cluster centroid coordinates, which can be stored in memory during Step 1 of KCCE. For each clustering variant, the observation is assigned to the label of the nearest centroid.

5 On the Reliability of Ensemble Clustering

The suggested combined method includes two main phases: ensemble clustering and kernel classifier design. An important question is the reliability of the first step: do the elements of the obtained similarity matrix H fit the true relationships between object pairs (e.g., belonging to the same or different classes)? To study this problem, we use the methodology described in [12-14].

In the clustering process, true class labels are unavailable. Following a probabilistic approach, one may suppose that the data sample is composed of a finite number of components. A latent ground-truth variable Y' defines the class number to which an object belongs. Denote

v(i,j) = I[\, Y'(a_i) = Y'(a_j) \,],   (1)

where a_i and a_j are arbitrary objects from the input sample. This quantity determines the true status of the pair (i.e., whether a_i and a_j indeed belong to the same class).

The function

c(i,j) = I\Bigl[\sum_{l:\, h_l(i,j)=1} u_l > \sum_{l:\, h_l(i,j)=0} u_l\Bigr]

will be called the ensemble decision for a_i and a_j following the weighted voting procedure. For a pair (a_i, a_j), their ensemble margin is defined as

mg(i,j) = \sum_{l:\, h_l(i,j)=v(i,j)} u_l - \sum_{l:\, h_l(i,j)\ne v(i,j)} u_l

and can be rewritten in the form

mg(i,j) = \sum_{l=1}^{L} u_l \bigl\{ I[h_l(i,j) = v(i,j)] - I[h_l(i,j) \ne v(i,j)] \bigr\}.

This value indicates to what extent the number of right decisions for (a_i, a_j) exceeds the number of wrong ones. Evidently, it equals

mg(i,j) = \sum_{l=1}^{L} u_l (2v(i,j) - 1)(2h_l(i,j) - 1).   (2)

The margin cannot be calculated if the true partition is unknown. However, it was shown in [12, 14] that some of the margin's characteristics can be evaluated using a number of assumptions on the behavior of the clustering algorithms:

A1) Under the condition that the input data matrix is fixed, for any pair of objects (a_i, a_j) their true status is a random value V(i,j) with values defined in (1).

A2) Algorithm µ is randomized, i.e., it depends on the parameter vector Ω chosen at random from the set of allowable parameters. For fixed input data, µ is run L times with i.i.d. parameters Ω_1, ..., Ω_L being statistical copies of Ω.
The conditional probabilities of correct decisions (separation or union of a_i and a_j under fixed V(i,j)) are denoted by

q_0(i,j) = P[h(i,j,\Omega) = 0 \mid V(i,j) = 0], \quad q_1(i,j) = P[h(i,j,\Omega) = 1 \mid V(i,j) = 1],

where h(i,j,Ω) is the decision for (a_i, a_j) made by algorithm µ with random parameters Ω. It should be noted that the work [14] makes use of a more restrictive assumption: q_0(i,j) = q_1(i,j), i.e., it presumes that the conditional probabilities of correct assignment of both kinds coincide. This assumption could be used for a qualitative analysis of the cluster ensemble's behavior; however, the numerical results of [13] demonstrate that it can be violated.

The measure of clustering validity estimated with algorithm µ is represented as a random value γ(X, Ω). Because the quality criterion is determined on the whole data set, one may consider this value practically independent of V(i,j) and h(i,j,Ω). The weights u_1, ..., u_L are random values following an identical distribution defined by the distribution of γ(Ω) and its statistical copies γ_1(Ω_1), ..., γ_L(Ω_L). The weights are dependent on each other, and for any pair (u_{l_1}, u_{l_2}) the degree of their dependence is characterized by the covariance coefficient σ = cov[u_{l_1}, u_{l_2}].

From the i.i.d. assumption, and because E[\sum_l u_l] = \sum_l E[u_l] = 1, it follows that E[u_l] = 1/L. Let us denote by s = Var[u_l] the variance of the weights. From

0 = Var\Bigl[\sum_l u_l\Bigr] = \sum_l Var[u_l] + \sum_{l_1, l_2\, (l_1 \ne l_2)} cov[u_{l_1}, u_{l_2}],

one may conclude that L\,Var[u_l] + L(L-1)\,cov[u_{l_1}, u_{l_2}] = 0. Thus we get

\sigma = -\frac{s}{L-1}.   (3)

Proposition 2. Given that assumptions A1), A2) are valid, the conditional mathematical expectation of the ensemble margin for (a_i, a_j) under V(i,j) = v equals

E_{\bar\Omega}[mg(i,j) \mid V(i,j) = v] = 2Q(v; i,j) - 1,

where \bar\Omega = (\Omega_1, ..., \Omega_L), Q(v; i,j) = (1-v)\, q_0(i,j) + v\, q_1(i,j), v ∈ {0, 1}.

Proof. Let us denote by h_l(i,j,\Omega_l) = I[\mu(i, \Omega_l) = \mu(j, \Omega_l)] the decision for (a_i, a_j), where µ(i, Ω_l) is the cluster label assigned to object a_i by algorithm µ in its lth run with the parameter vector Ω_l. Until the end of the proof, the arguments i, j are skipped for short: mg(i,j) = mg, V(i,j) = V, etc. From (2) we have:

E_{\bar\Omega}[mg \mid V = v] = \sum_l E\bigl[u_l(\Omega_l)(2v - 1)(2h_l(\Omega_l) - 1)\bigr].

Because Ω_l and Ω are equally distributed and u_l(Ω_l) is independent of h_l(Ω_l), it holds that

E_{\bar\Omega}[mg \mid V = v] = \sum_l E[u_l(\Omega)]\,(2v - 1)(2E_\Omega[h(\Omega)] - 1) = (2v - 1)(2E_\Omega[h(\Omega)] - 1).

For v = 0 we have: (2v - 1)(2E_\Omega[h(\Omega)] - 1) = -(2P[h(\Omega) = 1 \mid V = 0] - 1) = 2q_0 - 1; and for v = 1: (2v - 1)(2E_\Omega[h(\Omega)] - 1) = 2P[h(\Omega) = 1 \mid V = 1] - 1 = 2q_1 - 1. Therefore E_{\bar\Omega}[mg \mid V = v] = 2Q(v; i,j) - 1. This completes the proof.

Proposition 3. Given that assumptions A1), A2) are valid, the conditional variance of the ensemble margin for (a_i, a_j) under V(i,j) = v equals

Var_{\bar\Omega}[mg(i,j) \mid V(i,j) = v] = 4Q(v; i,j)\bigl(1 - Q(v; i,j)\bigr)\Bigl(L s + \frac{1}{L}\Bigr).

Proof. Let us again skip the indices i, j for simplicity; also let h_l denote h_l(Ω_l), h = h(Ω), u_l = u_l(Ω_l). From the properties of variance, it follows that under the condition V = v it holds that

Var_{\bar\Omega}[mg] = (2v-1)^2\, Var\Bigl[\sum_l u_l (2h_l - 1)\Bigr] = Var\Bigl[\sum_l 2 u_l h_l - 1\Bigr] = 4\, Var\Bigl[\sum_l u_l h_l\Bigr]
= 4 \sum_l Var[u_l h_l] + 4 \sum_{l_1, l_2\, (l_1 \ne l_2)} cov[u_{l_1} h_{l_1}, u_{l_2} h_{l_2}]
= 4 \sum_l \bigl( Var[u_l] Var[h_l] + (E[u_l])^2 Var[h_l] + (E[h_l])^2 Var[u_l] \bigr) + 4 \sum_{l_1, l_2\, (l_1 \ne l_2)} cov[u_{l_1} h_{l_1}, u_{l_2} h_{l_2}]
= 4 L s\, Var[h] + \frac{4\, Var[h]}{L} + 4 L s\,(E[h])^2 + 4 \sum_{l_1, l_2\, (l_1 \ne l_2)} cov[u_{l_1} h_{l_1}, u_{l_2} h_{l_2}].   (4)
Evidently, Var[h \mid V = v] = Q(v)(1 - Q(v)). From the independence of all pairs h_{l_1} and h_{l_2}, we have:

\sum_{l_1, l_2\, (l_1 \ne l_2)} cov[u_{l_1} h_{l_1}, u_{l_2} h_{l_2}] = \sum_{l_1, l_2\, (l_1 \ne l_2)} \bigl( E[u_{l_1} u_{l_2}]\, E[h_{l_1} h_{l_2}] - E[u_{l_1} h_{l_1}]\, E[u_{l_2} h_{l_2}] \bigr)
= \sum_{l_1, l_2\, (l_1 \ne l_2)} \bigl( E[u_{l_1} u_{l_2}] (E[h])^2 - (E[h])^2 E[u_{l_1}] E[u_{l_2}] \bigr) = (E[h])^2 \sum_{l_1, l_2\, (l_1 \ne l_2)} \bigl( E[u_{l_1} u_{l_2}] - E[u_{l_1}] E[u_{l_2}] \bigr) = (E[h])^2 L(L-1)\,\sigma.

Using (3) and (4), we see that

Var[mg] = 4Q(1-Q) L s + \frac{4}{L} Q(1-Q) + 4(E[h])^2 L s - 4(E[h])^2 L s = 4Q(1-Q)\Bigl(L s + \frac{1}{L}\Bigr).

Now we consider an upper bound for the weights' variance.

Proposition 4. Under the validity of assumptions A1) and A2), the variance s = Var[u_l] is upper bounded by the expression

s \le \frac{1}{L^2}\bigl(E[\gamma(\Omega)^{-2}] - 1\bigr).

The proof technically repeats the one for the analogous statement in [14], p. 430.

Our main objective is finding dependencies between the observed ensemble characteristics and the probability of the directly unobserved classification error for a pair a_i, a_j:

P_{err}(i,j) = P_{\Omega_1, ..., \Omega_L, V(i,j)}[\, c(i,j) \ne V(i,j) \,].

It is clear that the probability of error in the ensemble classification of a pair equals P_{err}(i,j) = P[mg(i,j) < 0]. Now let us consider the conditional probability of error under a given state of the objects:

P_{err}(v; i,j) = P_{\Omega_1, ..., \Omega_L, V(i,j)}[\, c(i,j) \ne V(i,j) \mid V(i,j) = v \,], \quad v \in \{0, 1\}.

Proposition 5. Let the above introduced model assumptions A1), A2) be valid, and also let E_{\bar\Omega}[mg(i,j) \mid V(i,j) = v] > 0 for each (i, j). Then the conditional probability of error in the classification of a given pair is upper bounded by the expression

P_{err}(v; i,j) < \frac{Var_{\bar\Omega}[mg(i,j) \mid V(i,j) = v]}{\bigl(E_{\bar\Omega}[mg(i,j) \mid V(i,j) = v]\bigr)^2},

where the conditional mathematical expectation and variance of the margin are given by Propositions 2 and 3.

This property directly follows from Chebyshev's inequality. The proof is similar to the one given in [14], p. 433, and is skipped in this work for the sake of brevity.

From Proposition 2, it is clear that the margin expectation takes a positive value if the following assumption is valid:

A3) For all i, j (i ≠ j): 0.5 < q_0(i,j) ≤ 1 and 0.5 < q_1(i,j) ≤ 1.

It means that the base clustering algorithm has at least slightly better classification quality than a random assignment of a pair to the same or different clusters. In machine learning theory, similar conditions are known as weak learnability of an algorithm.

Let us formulate an important consequence of the obtained results.

Proposition 6. Holding all other factors constant, under the validity of assumptions A1), A2) and A3), the conditional probability of error converges to zero as the ensemble size increases.

This property follows from Propositions 2-5, taking into account that the weights' variance s is of order O(L^{-2}).

6 Numerical Experiments

This section describes numerical experiments with KCCE. In the first experiment, we used the Monte Carlo statistical modeling technique to evaluate the quality of classification. We consider a simple case of two spherical Gaussian classes N(m_1, Σ) and N(m_2, Σ), each having diagonal covariance matrix Σ = σI, where σ > 0 is a parameter, and m_1 and m_2 are the population means (in our experiments, m_1 = 0 and m_2 = 1). The classes have equal prior probabilities P_1 = P_2 = 0.5. In this simple case, it is possible to derive the Bayes probability of error P_B = Φ(-δ/2), where δ = \sqrt{(m_1 - m_2)^T \Sigma^{-1} (m_1 - m_2)} is the Mahalanobis distance and Φ is the standard Gaussian cumulative distribution function.
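As a quick numerical check of this Bayes error formula, the sketch below evaluates δ and P_B for the setting m_1 = 0, m_2 = 1 (a vector of ones) and Σ = σI; the particular values of d and σ are illustrative and are not taken from the paper's experiments.

```python
import numpy as np
from scipy.stats import norm

d, sigma = 10, 2.0                      # illustrative values, not the paper's
m1, m2 = np.zeros(d), np.ones(d)
Sigma_inv = np.eye(d) / sigma           # Sigma = sigma * I
delta = np.sqrt((m2 - m1) @ Sigma_inv @ (m2 - m1))   # Mahalanobis distance
P_B = norm.cdf(-delta / 2)              # Bayes error P_B = Phi(-delta / 2)
print(f"delta = {delta:.3f}, Bayes error = {P_B:.3f}")
```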
Moreover, the expected generalization error of the optimal sample-based linear classifier [25] asymptotically equals

P_N = \Phi\left( -\frac{\delta}{2}\, \frac{1}{\sqrt{1 + \frac{1}{N}\bigl(1 + \frac{2d}{\delta^2}\bigr) + \frac{d}{N^2 \delta^2}}} \right).

In the Monte Carlo modeling, we repeatedly generate data sets according to the given distributions. For each class, the training sample size equals N. Each data set is analyzed in the Matlab environment with SVM, KFD and KCCE. For SVM and KFD, the radial basis function

\varphi(x, x') = \exp\Bigl(-\frac{\|x - x'\|^2}{2\sigma_r^2}\Bigr)

with σ_r = 5 is used as a kernel. The soft margin constant is C = 10. The cluster ensemble for KCCE is generated by random initialization of the centroids in K-means (the number of clusters equals 2). The ensemble size equals 10. The weights of the partition variants are constant values.

The accuracy (probability of correct classification) of each algorithm is estimated on an independent test sample of size N_t = 1000. To make the results more statistically sound, we averaged the accuracy estimates over 100 Monte Carlo repetitions. The 95% confidence intervals for the probability of correct classification are evaluated for each algorithm.

Figure 1 presents the results of modeling. The plots display the dependencies between the algorithms' accuracy and the standard deviation σ for different combinations of feature space dimensionality and training sample size. The results show that there exist examples of data distribution for which the proposed method demonstrates substantially more accurate classification than SVM or KFD and approaches the optimal Bayes classifier.

In the second experiment we consider a real hyperspectral satellite image "Indian Pines" taken from [26]. This scene was gathered by the AVIRIS sensor over the Indian Pines test site in North-western Indiana. The image size is 145 × 145 pixels; each pixel is characterized by a vector of 224 spectral intensities in the 400-2500 nm range. The image includes 16 classes describing different vegetation types, as one can see in Figure 2. There are unlabeled pixels not assigned to any of the classes; these pixels are excluded from the analysis. To study the effect of noise on the performance of the algorithms, a randomly selected 100r% of the spectral intensity values have experienced a distorting effect: the corresponding value x is replaced by a quantity generated from the interval [x(1-p), x(1+p)], where r, p are preset parameters. The dataset has been randomly divided into training and test samples in proportion 1:3.

We use multiclass SVM following the "one-against-one" strategy. The cluster ensemble size is L = 200. For the construction of each variant, three hyperspectral channels are randomly chosen. To obtain more diverse results of K-means, the number of its iterations is limited to 1, and the initial centroids are randomly sampled from the data. In the ensemble generation, the data matrix X_test is not used. The number of clusters in each variant equals ⌈√N⌉. The weights of the clusterings are constant values.

Fig. 1. Results of Monte Carlo experiments for a number of combinations of feature space dimensionality d and training sample size N: (a) d = 10, N = 50; (b) d = 10, N = 200; (c) d = 20, N = 50; (d) d = 20, N = 200. "svmEns": averaged accuracy of the proposed algorithm KCCE; "svm": averaged accuracy of SVM; "KernFish": averaged accuracy of the KFD algorithm; "SampleBayes": accuracy of the optimal sample-based linear classifier.

Fig. 2. Indian Pines hyperspectral image: (a) composite image of hyperspectral data; (b) ground-truth map.
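For reference, the noise-distortion procedure used in this experiment can be sketched as follows. The uniform draw within [x(1 - p), x(1 + p)] is our assumption; the paper only states that the replacement value is generated from this interval.

```python
import numpy as np

def distort(X, r=0.1, p=0.1, rng=None):
    """Replace a random fraction r of the values x by a draw from [x(1 - p), x(1 + p)]."""
    rng = np.random.default_rng(rng)
    X = X.astype(float).copy()
    mask = rng.random(X.shape) < r                     # ~100*r% of the entries
    low, high = X[mask] * (1 - p), X[mask] * (1 + p)
    X[mask] = rng.uniform(low, high)
    return X
```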
We compare the proposed algorithm with conventional SVM using the Euclidean metric under similar conditions (the parameters are chosen as the recommended default values in the Matlab environment; an RBF kernel with σ = 10 gives the best results). Table 1 shows the accuracy of classification (the rate of correctly predicted class labels) on the test sample for some of the noise parameters. The running time on a dual-core Intel Core i5 processor with a clock frequency of 2.8 GHz and 4 GB RAM is about 50 sec on average for KCCE and 14 sec for SVM (note that unoptimized code is used in the KCCE implementation, in contrast with the efficient implementation of SVM). One can see that KCCE has revealed itself as more noise resistant than SVM, especially under large distortion rates.

Table 1. Accuracy of KCCE and SVM on the Indian Pines hyperspectral image

Noise parameters (r, p)   0.05, 0.05   0.1, 0.1   0.15, 0.15   0.2, 0.2
KCCE accuracy             0.777        0.742      0.720        0.686
SVM accuracy              0.767        0.618      0.543        0.503

7 Conclusion

In this work, we have introduced a supervised classification algorithm using a combination of ensemble clustering and kernel based classification. In the clustering ensemble, we used a scheme in which a single clustering algorithm constructs base partitions with parameters taken at random. It was verified that the weighted co-association matrix obtained with a clustering ensemble is a valid kernel matrix. The proposed combined approach has experimentally proven to be successful in comparison with Support Vector Machine and Kernel Fisher Discriminant. Monte Carlo experiments demonstrated that there exist examples of data distribution for which the proposed method gets significantly more accurate predictions. The experiment with a real hyperspectral satellite image has shown that the suggested algorithm is more accurate than SVM under noise distortion.

In the future, we plan to continue working on improving the performance of the suggested approach. For example, it will be useful to filter out points with unstable clusterings before using the kernel classifier. We expect that applying a cluster ensemble with optimized weights [12] will further improve the accuracy. It will be interesting to apply the introduced approach to the solution of other types of machine learning problems such as regression or transfer learning.

Acknowledgement. The work was carried out according to the scientific research program "Mathematical methods of pattern recognition and prediction" in the Sobolev Institute of Mathematics SB RAS. The research was partly supported by RFBR grant 18-07-00600 and partly by the Russian Ministry of Science and Education under the 5-100 Excellence Programme.

References

1. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)
2. Zhuravlev, Y.I.: Principles of construction of justification of algorithms for the solution of badly formalized problems. Mathematical Notes of the Academy of Sciences of the USSR 23(6), 493–501 (1978)
3. Schapire, R., Freund, Y., Bartlett, P., Lee, W.: Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics 26(5), 1651–1686 (1998)
4. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
5. Kuncheva, L.: Combining Pattern Classifiers. Methods and Algorithms. Wiley, NJ (2004)
6. Ajdarkhanov, M.B., Amirgaliev, E.N., La, L.L.: Correctness of algebraic extensions of models of classification algorithms.
Kibernetika i Sistemnyj Analiz 5, 180–186 (2001)
7. Jain, A.K.: Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8), 651–666 (2010)
8. Ghosh, J., Acharya, A.: Cluster ensembles. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(5), 305–315 (2011)
9. Vega-Pons, S., Ruiz-Shulcloper, J.: A survey of clustering ensemble algorithms. IJPRAI 25(3), 337–372 (2011)
10. Fred, A., Jain, A.: Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 835–850 (2005)
11. Gray, R.M.: Vector quantization. IEEE ASSP Magazine 1(2), 4–29 (1984)
12. Berikov, V.: Weighted ensemble of algorithms for complex data clustering. Pattern Recognition Letters 38, 99–106 (2014)
13. Berikov, V.: Cluster ensemble with averaged co-association matrix maximizing the expected margin. In: 9th International Conference on Discrete Optimization and Operations Research and Scientific School (DOOR 2016), pp. 489–500 (2016)
14. Berikov, V., Pestunov, I.: Ensemble clustering based on weighted co-association matrices: Error bound and convergence properties. Pattern Recognition 63, 427–436 (2017)
15. Zhuravlev, Y.I., Yunusov, R.: A method of improving the taxonomy algorithm by pattern recognition methods of the voting type. USSR Computational Mathematics and Mathematical Physics 11(5), 327–333 (1971)
16. Jiang, M., Tseng, S., Su, C.: Two-phase clustering process for outlier detection. Pattern Recognition Letters 22(6-7), 691–700 (2001)
17. Rahman, A., Verma, B.: Cluster-based ensemble of classifiers. Expert Systems 30, 270–282 (2013)
18. Chapelle, O., Zien, A., Scholkopf, B. (eds.): Semi-Supervised Learning. MIT Press (2006)
19. Chapelle, O., Weston, J., Scholkopf, B.: Cluster kernels for semi-supervised learning. Advances in Neural Information Processing Systems 15, 601–608 (2002)
20. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems 14 (2001)
21. Balcan, M.F., Blum, A., Srebro, N.: A theory of learning with similarity functions. Machine Learning 72, 89–112 (2008)
22. Iam-On, N., Boongoen, T.: Diversity-driven generation of link-based cluster ensemble and application to data classification. Expert Systems with Applications 42(21), 8259–8273 (2015)
23. Dietterich, T., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2, 263–282 (1995)
24. Platt, J.: Sequential minimal optimization: A fast algorithm for training support vector machines. In: Advances in Kernel Methods - Support Vector Learning (1998)
25. Raudys, S.: Statistical and Neural Classifiers: An Integrated Approach to Design. Springer-Verlag, London (2001)
26. Hyperspectral Remote Sensing Scenes: http://www.ehu.es/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes [Online; accessed 09-February-2018]