Searching for Optimal Classifier Using a Combination of Cluster Ensemble and Kernel Method

Vladimir B. Berikov (1,2) and Lyailya Sh. Cherikbayeva (3)

1 Sobolev Institute of Mathematics, Novosibirsk, Russia, berikov@math.nsc.ru
2 Novosibirsk State University, Novosibirsk, Russia
3 Al-Farabi Kazakh National University, Almaty, Kazakhstan, lailash01@gmail.com

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: S. Belim et al. (eds.): OPTA-SCL 2018, Omsk, Russia, published at http://ceur-ws.org

Abstract. This work introduces a supervised classification algorithm based on a combination of ensemble clustering and a kernel method. The main idea behind the algorithm is the expectation that ensemble clustering, used as a preliminary stage, restores metric relations between data objects more accurately under noise distortions and in the presence of complex data structures, eventually raising the overall classification quality. The algorithm consists of two major steps. In the first step, the averaged co-association matrix is calculated using a cluster ensemble. It is proved that this matrix satisfies Mercer's condition, i.e., it defines a symmetric non-negative definite kernel. In the next step, an optimal classifier is found with the obtained kernel matrix as input. The classifier maximizes the width of the hyperplane's separation margin in the space induced by the cluster ensemble kernel. Numerical experiments with artificial examples and a real hyperspectral image have shown that the proposed algorithm possesses classification accuracy comparable with some state-of-the-art methods and in many cases outperforms them, especially under noise conditions.

Keywords: Kernel based learning · Cluster ensemble · Co-association matrix · Support vector machine

1 Introduction

Kernel based learning and collective decision making (the ensemble approach) are among the top trends influencing the progress of machine learning.

A kernel (potential) function defines an implicit non-linear mapping of the initial feature space into a new space of larger or even infinite dimensionality. In the new space, the initial configurations of patterns are transformed (via the so-called "kernel trick") into structures which are often more compact and linearly separable. To calculate distances or average values in the new space, it is not necessary to determine the coordinates of the transformed points; it is enough to know only the values of the kernel function. Such methods as Support Vector Machine (SVM), Kernel Fisher Discriminant (KFD), Kernel K-means, Kernel Principal Component Analysis, etc., are based on this methodology [1]. The decision function is defined by a linear combination of kernels determining distances to a number of sample elements. In the existing algorithms, the kernel matrix is usually calculated using the Euclidean distance between objects in the input feature space.

The ensemble approach exploits the idea of collective decision making by using algorithms working under different settings such as subsets of parameters, subsamples of data, combinations of features, etc. Ensemble based systems usually yield robust and effective solutions, especially in the case of uncertainty in the data model or when it is not clear which of the algorithm's parameters are most appropriate for a particular problem. As a rule, a properly organized ensemble (even one composed of "weak" predictors) significantly improves the overall quality of decisions [2-6].
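As a minimal illustration of the kernel trick mentioned above, the following sketch computes the distance between two points in the induced feature space from kernel values alone, without ever constructing the mapped points. The RBF kernel and the toy vectors are chosen here for illustration only and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian (RBF) kernel K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def induced_distance(x, y, kernel):
    # ||phi(x) - phi(y)||^2 = K(x, x) - 2 K(x, y) + K(y, y): only kernel values are needed
    return np.sqrt(kernel(x, x) - 2 * kernel(x, y) + kernel(y, y))

x = np.array([0.0, 1.0])
y = np.array([1.5, -0.5])
print(induced_distance(x, y, rbf_kernel))
```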
Cluster analysis aims at determining a partition of a dataset into natural clusters using object descriptions and a certain criterion of compactness and remoteness of groups. Ensemble clustering is one of the successful implementations of the collective methodology. There are a number of major techniques for constructing the ensemble decision [7-9]. Following the evidence accumulation approach [10], the clustering partition is found in two steps. In the first step, a number of clustering results are obtained (for example, by running K-means with different numbers of clusters or random initializations of centroids). For each partition variant, a boolean co-association matrix is calculated; its elements correspond to pairs of data objects and indicate whether the pair belongs to the same cluster or not. In the second step, the averaged co-association matrix is calculated over all variants and is used for constructing the resultant partition: the matrix elements are considered as distances or similarity measures between data points, and any clustering algorithm designed for this type of input information is applied to get the final clustering partition.

This paper introduces an algorithm of classifier construction using a combination of ensemble clustering and kernel based learning. The proposed method is based on the hypothesis that preliminary ensemble clustering allows one to restore more accurately the metric relations between objects under noise distortions and in the presence of complex data structures. The obtained kernel matrix depends on the outputs of the clustering algorithms and is less sensitive to noise than a conventional similarity matrix. Clustering with a sufficiently large number of clusters can be viewed as a Learning Vector Quantization method [11], known for lowering the average distortion in data. These reasons, as supposed, eventually result in an increase of the recognition accuracy of the combination.

The outline of the method is as follows. First, a number of variants of the dataset partitioning are obtained with a base clustering algorithm. Then the averaged co-association matrix is calculated, where the averaging is performed with weights dependent on the obtained ensemble's characteristics. The matrix elements play the role of similarity measures between objects in the new feature space induced by an implicit non-linear transformation of the input features. In the second stage, a kernel classifier is found using the obtained co-association matrix as the input kernel matrix (we used SVM in the numerical experiments).

The aim of this paper is to verify the practicability of the suggested methodology with theoretical analysis and experimental evaluation.

There are two main types of cluster ensembles: homogeneous (when a single algorithm partitions data by varying its working settings) and heterogeneous ones (which include a number of different algorithms). A heterogeneous cluster ensemble was considered in [12, 13], where methods for its weight optimization were suggested. A homogeneous cluster ensemble was investigated in [14] with use of a probabilistic model assuming the validity of some key assumptions. In the current work, we follow the scheme of a homogeneous ensemble and perform a theoretical investigation of some of its properties using less restrictive assumptions.

The rest of the paper is organized as follows. Section 2 briefly overviews related works.
Section 3 introduces the necessary notions in the field of kernel based classifiers and ensemble clustering. In the next section we prove that the weighted co-association matrix obtained with a cluster ensemble is a valid kernel matrix; the proposed algorithm of classifier design, KCCE, is also presented, and some details of the optimization procedure are given. Section 5 provides a probabilistic analysis of the ensemble clustering stage. The next section describes the results of numerical experiments with KCCE. The conclusion summarizes the work and describes some of the future plans.

2 Related Works

The idea of combining cluster analysis and pattern recognition methods is rather well known in machine learning [15]. There are several natural reasons for the combination:

– Cluster analysis can be viewed as a tool for data cleaning to eliminate outliers or noisy items from the learning sample [16].
– The joint learning and control (test) sample provides additional information on the data distribution which can be utilized to improve the classifier performance (this way of reasoning is sometimes called transductive learning). For example, the authors of [17] make a partition of the united sample into clusters which are used to design a more accurate decision rule.
– In the semi-supervised learning context [18], the usage of a small amount of labeled data in combination with a large volume of unlabeled examples is useful for constructing a more efficient classifier.

A connection between cluster analysis and kernel based classifiers was established in [19], where cluster kernels were proposed implementing the cluster assumption in the form: "two points are likely to have the same class label if there is a path connecting them passing through regions of high density only". Three types of kernels were presented: kernels from mixture models, random walk kernels and kernels induced by a cluster representation with spectral clustering [20].

The usage of a certain similarity function (which does not necessarily possess the positive semi-definiteness property) instead of a kernel function was proposed in [21]. A classifier is found in two stages. In the first stage, some "supporting" points are chosen. With regard to these points, according to the defined similarity function, the initial observations are mapped into a metric space of small dimensionality. In the second stage, a linear classification rule is constructed in the new space using an SVM-type algorithm to find the classification margin of maximum width.

Following the idea of combining cluster ensembles and supervised classification, the authors of [22] construct a new feature space using the degree of membership of objects to clusters in the obtained variants of data partitioning with a cluster ensemble. The transformed feature matrix is utilized as the input training set for classification using conventional techniques such as Decision Tree, Naive Bayes, K-nearest neighbors, Neural Network. The method showed its effectiveness in comparison with a number of state-of-the-art procedures.

Unlike the above mentioned works, we apply a completely different combination scheme based on the notion of kernel function.

3 Basic Preliminaries

Suppose we are given a data set A = {a_1, ..., a_N} consisting of N objects (examples), A ⊂ Γ, where Γ is a statistical population. Information about the objects is presented in the form of a feature matrix Z = (X, Y) = (x_i, y_i)_{i=1}^N, where
x_i = (x_{i,1}, ..., x_{i,d}) ∈ R^d is the input feature vector (d is the feature space dimensionality), x_{i,m} = X_m(a_i) is the value of feature X_m for object a_i, and y_i is the class label attributed to the ith object, i = 1, ..., N. For the binary classification task we assume y_i ∈ {-1, 1}. In the multi-class classification problem, an arbitrary finite set of unordered class labels is defined.

On the basis of the information about A (training sample), it is required to find a classifier (predictor, decision function) y = f(x), optimal in some sense, e.g. having minimal expected losses for unseen examples. To examine the performance of the classifier, it is possible to use a test sample B = {b_1, ..., b_{N_t}}, B ⊂ Γ, described by a feature matrix X_test. We shall presume that the objects in A and B are independent and identically distributed (iid), that is, the sets are collected on the basis of independent random choice of objects from Γ without replacement following a fixed distribution.

Kernel classifiers [1] make use of the notion of a kernel function K(x_i, x_j) ≥ 0, where K is a kind of similarity measure between two data points. The linear kernel classifier exemplifies a binary decision function introduced within this approach:

f(x) = \mathrm{sign}\Bigl(\sum_{x_i \in X} \alpha_i y_i K(x, x_i)\Bigr),

where sign is the sign function and α_1, ..., α_N are non-negative weights. A number of methods for determining the weights (Support Vector Machine, Kernel Fisher Discriminant, etc.) exist.

For the SVM classifier, the weights are found as a solution to the constrained quadratic optimization problem of maximizing the width of the margin (separation region) between the two classes in the Hilbert space induced by the kernel mapping. KFD is a kernelized version of Fisher's linear discriminant analysis (LDA), which aims at finding a position of a straight line in feature space for which the objects' projections are separated as well as possible according to a functional minimizing the within-class scatter of projections and maximizing the between-class distance.

The general multi-class classification problem can be solved by the application of a series of binary classification tasks for SVM or KFD, e.g., one-against-all, one-against-one or Error Correcting Output Codes (ECOC) schemes [23]. The kernel k-NN classifier assigns data points according to the k nearest neighbor rule, where the neighboring points are determined with respect to the similarity measure defined by the kernel function.

Consider a scheme of a homogeneous cluster ensemble. Let a clustering algorithm µ be run a number of times under different conditions such as initial cluster centroid coordinates, subsets of features, number of clusters or other parameters. The joint data set A ∪ B is the input for the algorithm (if the test sample is unavailable at the moment of classifier design, then the set A is the input). In the lth trial, algorithm µ creates a partition of the given dataset composed of K_l clusters, where l = 1, ..., L, and L is the given number of runs. For each variant of clustering, we define an evaluation function γ_l (a cluster validity index or diversity measure). We suppose that the values are scaled to non-negative quantities and that the better the found variant is according to the chosen criterion, the larger the quantity. For a pair of different data objects (a_i, a_j) ∈ A ∪ B, we define the value h_l(i, j) = I[µ_l(a_i) = µ_l(a_j)], where I[·] is the indicator function: I[true] = 1, I[false] = 0; µ_l(a) is the cluster label assigned by algorithm µ to object a in the lth run.
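A sketch of this homogeneous ensemble scheme, assuming K-means as the base algorithm and a random number of clusters per run, is given below; it is not the authors' implementation, and the helper names and parameter ranges are ours. It produces exactly the kind of label matrix and per-run indicators h_l(i, j) used in the sequel.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_ensemble(X, L=10, k_range=(2, 10), rng=None):
    """Run K-means L times with random settings; return the (L, n) ensemble label matrix M."""
    rng = np.random.default_rng(rng)
    labels = np.empty((L, X.shape[0]), dtype=int)
    for l in range(L):
        k = int(rng.integers(k_range[0], k_range[1] + 1))       # random number of clusters
        km = KMeans(n_clusters=k, n_init=1,
                    random_state=int(rng.integers(1_000_000)))  # random initialization
        labels[l] = km.fit_predict(X)
    return labels

def coassociation(labels_l):
    """h_l(i, j) = I[mu_l(a_i) = mu_l(a_j)] for one partition, as a 0/1 matrix."""
    return (labels_l[:, None] == labels_l[None, :]).astype(float)
```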
The ensemble matrix M stores the results of the clusterings: M = (µ_l(a_i)), l = 1, ..., L, i = 1, ..., N + N_t.

The averaged co-association matrix H = (\bar h(i,j)) is defined over all generated variants:

\bar h(i,j) = \sum_{l=1}^{L} u_l\, h_l(i,j),

where the standardized weights u_1, ..., u_L indicate the quality of clustering for the given variants: u_l = \gamma_l / \sum_{l'} \gamma_{l'}, l = 1, ..., L.

4 Kernel Classification with Averaged Co-association Matrix

Let K(x, x'): D × D → R be a symmetric function, either continuous or having a finite domain, where D is a closed subset of R^d. According to Mercer's theorem, K(x, x') is a kernel function (i.e., it defines an inner product in some metric space) if and only if for any finite set of m points {x_i}_{i=1}^m in D and real numbers {c_i}_{i=1}^m, the matrix K = (K(i,j)) = (K(x_i, x_j))_{i,j=1}^m is non-negative definite: \sum_{i,j=1}^{m} c_i c_j K(i,j) \ge 0. Let us prove the following

Proposition 1. The averaged co-association matrix satisfies Mercer's condition.

Proof. The symmetry of H is obvious. The domain of H is the finite set A ∪ B. Let I_k^{(l)} be the set of indices of data points belonging to the kth cluster in the lth variant of partitioning. Then for any {c_i}_{i=1}^m it holds that

\sum_{i,j=1}^{m} c_i c_j \bar h(i,j) = \sum_{i,j=1}^{m} c_i c_j \sum_{l=1}^{L} u_l h_l(i,j) = \sum_{l=1}^{L} u_l \sum_{i,j=1}^{m} c_i c_j h_l(i,j) = \sum_{l=1}^{L} u_l \sum_{k=1}^{K_l} \sum_{i,j \in I_k^{(l)}} c_i c_j = \sum_{l=1}^{L} u_l \sum_{k=1}^{K_l} \Bigl(\sum_{i \in I_k^{(l)}} c_i\Bigr)^2 \ge 0.

From this property, it follows that the averaged co-association matrix is a valid kernel matrix and can be used in kernel based classification methods.

Let us describe the main steps of the proposed algorithm KCCE (Kernel Classification with Cluster Ensemble).

Algorithm KCCE.

Input:
training data set Z = (X, Y) = (x_i, y_i), i = 1, ..., N;
test data set X_test;
L: number of runs of the base clustering algorithm µ;
Ω: set of allowable parameters (working conditions) of µ.

Output: decision function y = f(x); class labels attributed to X_test.

Steps:
1. Generate L variants of the clustering partition of X ∪ X_test using algorithm µ with randomly chosen working parameters; calculate the evaluation functions and weights;
2. For each pair (x_i, x_j) ∈ X ∪ X_test (i ≠ j), if the pair is assigned to the same group in the lth variant, then h_l(i,j) := 1, otherwise h_l(i,j) := 0;
3. Calculate the averaged co-association matrix H;
4. Find the decision function with the preset type of kernel classifier and matrix H;
5. Classify the test sample X_test using the found decision function and matrix H;
end.

In this paper, we use K-means as the base clustering algorithm; however, it is possible to apply any other clustering technique. As the kernel classifier, we utilize the soft margin version of SVM, which aims at optimizing the following objective function:

\frac{1}{2}\|w\|^2 + C \sum_i \xi_i \to \min_{w, b, \xi}

subject to: y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, ..., N,

where w is the normal vector to the separating hyperplane in the space induced by the kernel, b is the hyperplane's bias, ξ_i is a penalty imposed on the ith example violating the separation margin, and C ≥ 0 is the soft margin parameter. By solving the Lagrangian dual, one obtains the quadratic optimization problem:

W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \to \max

subject to: \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, ..., N,

where K(·, ·) is the kernel function. One may substitute this kernel by the cluster ensemble kernel H and get:

W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \sum_l u_l h_l(i,j) = \sum_i \alpha_i - \frac{1}{2} \sum_l u_l \sum_{k=1}^{K_l} \sum_{i,j \in I_k^{(l)}} \alpha_i \alpha_j y_i y_j = \sum_i \alpha_i - \frac{1}{2} \sum_l u_l \sum_{k=1}^{K_l} \Bigl(\sum_{i \in I_k^{(l)}} \alpha_i y_i\Bigr)^2.
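Before turning to the optimization details, here is a sketch of Steps 2-5 of KCCE under simplifying assumptions: equal weights u_l = 1/L (the paper allows validity-based weights), and scikit-learn's SVC with a precomputed kernel standing in for the authors' own SVM-SMO implementation. It reuses the generate_ensemble helper from the previous sketch.

```python
import numpy as np
from sklearn.svm import SVC

def averaged_coassociation(labels, weights=None):
    """H(i, j) = sum_l u_l * I[mu_l(a_i) = mu_l(a_j)], built from the (L, n) label matrix."""
    L, n = labels.shape
    u = np.full(L, 1.0 / L) if weights is None else np.asarray(weights) / np.sum(weights)
    H = np.zeros((n, n))
    for u_l, lab in zip(u, labels):
        H += u_l * (lab[:, None] == lab[None, :])
    return H

def kcce_fit_predict(X_train, y_train, X_test, L=10, C=10.0, rng=0):
    # Cluster the joint sample, as in Step 1 of KCCE.
    X_all = np.vstack([X_train, X_test])
    labels = generate_ensemble(X_all, L=L, rng=rng)   # helper from the previous sketch
    H = averaged_coassociation(labels)                # Steps 2-3
    n = len(X_train)
    clf = SVC(C=C, kernel="precomputed").fit(H[:n, :n], y_train)   # Step 4
    return clf.predict(H[n:, :n])                                  # Step 5: test-vs-train kernel
```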
We search for the optimal solution using the Sequential Minimal Optimization (SMO) method [24]. A point x_i for which α_i > 0 is called a support vector. The bias term b is determined from any support vector x_{i*}:

b = y_{i*} - \sum_i y_i \alpha_i H(x_i, x_{i*}).

The decision for x_j ∈ X_test is calculated using the found multipliers:

f(x_j) = \mathrm{sign}\Bigl(\sum_i y_i \alpha_i H(x_i, x_j) + b\Bigr).

One can see that there is no need to store the kernel matrix: the objective function is computed using the ensemble matrix M kept in memory. Therefore KCCE has linear storage complexity with respect to the data size. The time complexity depends on the type of the utilized clustering and kernel algorithms and is linear with respect to the data matrix dimensionality in the case of K-means and SVM-SMO.

In some classification tasks, test data are unavailable at the moment of classifier design. To find the decision function f(x) for a new feature vector x, this observation should be attributed to clusters according to the obtained partition variants. It is possible to make this assignment using the cluster centroid coordinates, which can be stored in memory during Step 1 of KCCE. For each clustering variant, the observation is assigned to the label of the nearest centroid.

5 On the Reliability of Ensemble Clustering

The suggested combined method includes two main phases: ensemble clustering and kernel classifier design. An important question is the reliability of the first step: do the elements of the obtained similarity matrix H fit the true relationships between object pairs (e.g., belonging to the same or different classes)? To study this problem, we use the methodology described in [12-14].

In the clustering process, true class labels are unavailable. Following a probabilistic approach, one may suppose that the data sample is composed of a finite number of components. A latent ground-truth variable Y' defines the class number to which an object belongs. Denote

v(i,j) = I[\, Y'(a_i) = Y'(a_j) \,],   (1)

where a_i and a_j are arbitrary objects from the input sample. This quantity determines the true status of the pair (i.e., whether a_i and a_j indeed belong to the same class).

The function

c(i,j) = I\Bigl[\sum_{l:\, h_l(i,j)=1} u_l > \sum_{l:\, h_l(i,j)=0} u_l\Bigr]

will be called the ensemble decision for a_i and a_j following the weighted voting procedure. For a pair (a_i, a_j), their ensemble margin is defined as

mg(i,j) = \sum_{l:\, h_l(i,j)=v(i,j)} u_l - \sum_{l:\, h_l(i,j)\ne v(i,j)} u_l

and can be rewritten in the form

mg(i,j) = \sum_{l=1}^{L} u_l \bigl\{ I[h_l(i,j) = v(i,j)] - I[h_l(i,j) \ne v(i,j)] \bigr\}.

This value indicates to what extent the number of right decisions for (a_i, a_j) exceeds the number of wrong ones. Evidently, it equals

mg(i,j) = \sum_{l=1}^{L} u_l (2v(i,j) - 1)(2h_l(i,j) - 1).   (2)

The margin cannot be calculated if the true partition is unknown. However, it was shown in [12, 14] that some of the margin's characteristics can be evaluated using a number of assumptions on the behavior of the clustering algorithms:

A1) Under the condition that the input data matrix is fixed, for any pair of objects (a_i, a_j) their true status is a random value V(i,j) with values defined in (1).

A2) Algorithm µ is randomized, i.e., it depends on the parameter vector Ω chosen at random from the set of allowable parameters. For fixed input data, µ is run L times with i.i.d. parameters Ω_1, ..., Ω_L being statistical copies of Ω.
The conditional probabilities of correct decisions (separation or union of a_i and a_j under fixed V(i,j)) are denoted by

q_0(i,j) = P[h(i,j,\Omega) = 0 \mid V(i,j) = 0], \quad q_1(i,j) = P[h(i,j,\Omega) = 1 \mid V(i,j) = 1],

where h(i,j,Ω) is the decision for (a_i, a_j) made by algorithm µ with random parameters Ω. It should be noted that the work [14] makes use of a more restrictive assumption: q_0(i,j) = q_1(i,j), i.e., it presumes that the conditional probabilities of correct assignment of both kinds coincide. This assumption could be used for a qualitative analysis of the cluster ensemble's behavior; however, the numerical results of [13] demonstrate that it can be violated.

The measure of clustering validity estimated with algorithm µ is represented as a random value γ(X, Ω). Because the quality criterion is determined on the whole data set, one may consider this value practically independent of V(i,j) and h(i,j,Ω). The weights u_1, ..., u_L are random values following an identical distribution defined by the distribution of γ(Ω) and its statistical copies γ_1(Ω_1), ..., γ_L(Ω_L). The weights are dependent on each other, and for any pair (u_{l_1}, u_{l_2}) the degree of their dependence is characterized by the covariance coefficient σ = cov[u_{l_1}, u_{l_2}].

From the i.i.d. assumption, and because E[\sum_l u_l] = \sum_l E[u_l] = 1, it follows that E[u_l] = 1/L. Let us denote by s = Var[u_l] the variance of the weights. From

0 = Var\Bigl[\sum_l u_l\Bigr] = \sum_l Var[u_l] + \sum_{l_1, l_2\, (l_1 \ne l_2)} cov[u_{l_1}, u_{l_2}],

one may conclude that L\,Var[u_l] + L(L-1)\,cov[u_{l_1}, u_{l_2}] = 0. Thus we get

\sigma = -\frac{s}{L-1}.   (3)

Proposition 2. Given that assumptions A1), A2) are valid, the conditional mathematical expectation of the ensemble margin for (a_i, a_j) under V(i,j) = v equals

E_{\bar\Omega}[mg(i,j) \mid V(i,j) = v] = 2Q(v; i,j) - 1,

where \bar\Omega = (\Omega_1, ..., \Omega_L), Q(v; i,j) = (1-v)\, q_0(i,j) + v\, q_1(i,j), v ∈ {0, 1}.

Proof. Let us denote by h_l(i,j,\Omega_l) = I[\mu(i, \Omega_l) = \mu(j, \Omega_l)] the decision for (a_i, a_j), where µ(i, Ω_l) is the cluster label assigned to object a_i by algorithm µ in its lth run with the parameter vector Ω_l. Until the end of the proof, the arguments i, j are skipped for short: mg(i,j) = mg, V(i,j) = V, etc. From (2) we have:

E_{\bar\Omega}[mg \mid V = v] = \sum_l E\bigl[u_l(\Omega_l)(2v - 1)(2h_l(\Omega_l) - 1)\bigr].

Because Ω_l and Ω are equally distributed and u_l(Ω_l) is independent of h_l(Ω_l), it holds that

E_{\bar\Omega}[mg \mid V = v] = \sum_l E[u_l(\Omega)]\,(2v - 1)(2E_\Omega[h(\Omega)] - 1) = (2v - 1)(2E_\Omega[h(\Omega)] - 1).

For v = 0 we have: (2v - 1)(2E_\Omega[h(\Omega)] - 1) = -(2P[h(\Omega) = 1 \mid V = 0] - 1) = 2q_0 - 1; and for v = 1: (2v - 1)(2E_\Omega[h(\Omega)] - 1) = 2P[h(\Omega) = 1 \mid V = 1] - 1 = 2q_1 - 1. Therefore E_{\bar\Omega}[mg \mid V = v] = 2Q(v; i,j) - 1. This completes the proof.

Proposition 3. Given that assumptions A1), A2) are valid, the conditional variance of the ensemble margin for (a_i, a_j) under V(i,j) = v equals

Var_{\bar\Omega}[mg(i,j) \mid V(i,j) = v] = 4Q(v; i,j)\bigl(1 - Q(v; i,j)\bigr)\Bigl(L s + \frac{1}{L}\Bigr).

Proof. Let us again skip the indices i, j for simplicity; also let h_l denote h_l(Ω_l), h = h(Ω), u_l = u_l(Ω_l). From the properties of variance, it follows that under the condition V = v it holds that

Var_{\bar\Omega}[mg] = (2v-1)^2\, Var\Bigl[\sum_l u_l (2h_l - 1)\Bigr] = Var\Bigl[\sum_l 2 u_l h_l - 1\Bigr] = 4\, Var\Bigl[\sum_l u_l h_l\Bigr]
= 4 \sum_l Var[u_l h_l] + 4 \sum_{l_1, l_2\, (l_1 \ne l_2)} cov[u_{l_1} h_{l_1}, u_{l_2} h_{l_2}]
= 4 \sum_l \bigl( Var[u_l] Var[h_l] + (E[u_l])^2 Var[h_l] + (E[h_l])^2 Var[u_l] \bigr) + 4 \sum_{l_1, l_2\, (l_1 \ne l_2)} cov[u_{l_1} h_{l_1}, u_{l_2} h_{l_2}]
= 4 L s\, Var[h] + \frac{4\, Var[h]}{L} + 4 L s\,(E[h])^2 + 4 \sum_{l_1, l_2\, (l_1 \ne l_2)} cov[u_{l_1} h_{l_1}, u_{l_2} h_{l_2}].   (4)
Evidently, Var[h \mid V = v] = Q(v)(1 - Q(v)). From the independence of all pairs h_{l_1} and h_{l_2}, we have:

\sum_{l_1, l_2\, (l_1 \ne l_2)} cov[u_{l_1} h_{l_1}, u_{l_2} h_{l_2}] = \sum_{l_1, l_2\, (l_1 \ne l_2)} \bigl( E[u_{l_1} u_{l_2}]\, E[h_{l_1} h_{l_2}] - E[u_{l_1} h_{l_1}]\, E[u_{l_2} h_{l_2}] \bigr)
= \sum_{l_1, l_2\, (l_1 \ne l_2)} \bigl( E[u_{l_1} u_{l_2}] (E[h])^2 - (E[h])^2 E[u_{l_1}] E[u_{l_2}] \bigr) = (E[h])^2 \sum_{l_1, l_2\, (l_1 \ne l_2)} \bigl( E[u_{l_1} u_{l_2}] - E[u_{l_1}] E[u_{l_2}] \bigr) = (E[h])^2 L(L-1)\,\sigma.

Using (3) and (4), we see that

Var[mg] = 4Q(1-Q) L s + \frac{4}{L} Q(1-Q) + 4(E[h])^2 L s - 4(E[h])^2 L s = 4Q(1-Q)\Bigl(L s + \frac{1}{L}\Bigr).

Now we consider an upper bound for the weights' variance.

Proposition 4. Under the validity of assumptions A1) and A2), the variance s = Var[u_l] is upper bounded by the expression

s \le \frac{1}{L^2}\bigl(E[\gamma(\Omega)^{-2}] - 1\bigr).

The proof technically repeats the one for the analogous statement in [14], p. 430.

Our main objective is finding dependencies between the observed ensemble characteristics and the probability of the directly unobserved classification error for a pair a_i, a_j:

P_{err}(i,j) = P_{\Omega_1, ..., \Omega_L, V(i,j)}[\, c(i,j) \ne V(i,j) \,].

It is clear that the probability of error in the ensemble classification of a pair equals P_{err}(i,j) = P[mg(i,j) < 0]. Now let us consider the conditional probability of error under a given state of the objects:

P_{err}(v; i,j) = P_{\Omega_1, ..., \Omega_L, V(i,j)}[\, c(i,j) \ne V(i,j) \mid V(i,j) = v \,], \quad v \in \{0, 1\}.

Proposition 5. Let the above introduced model assumptions A1), A2) be valid, and also let E_{\bar\Omega}[mg(i,j) \mid V(i,j) = v] > 0 for each (i, j). Then the conditional probability of error in the classification of a given pair is upper bounded by the expression

P_{err}(v; i,j) < \frac{Var_{\bar\Omega}[mg(i,j) \mid V(i,j) = v]}{\bigl(E_{\bar\Omega}[mg(i,j) \mid V(i,j) = v]\bigr)^2},

where the conditional mathematical expectation and variance of the margin are given by Propositions 2 and 3.

This property directly follows from Chebyshev's inequality. The proof is similar to the one given in [14], p. 433, and is skipped in this work for the sake of brevity.

From Proposition 2, it is clear that the margin expectation takes a positive value if the following assumption is valid:

A3) For all i, j (i ≠ j): 0.5 < q_0(i,j) ≤ 1 and 0.5 < q_1(i,j) ≤ 1.

It means that the base clustering algorithm has at least slightly better classification quality than a random assignment of a pair to the same or different clusters. In machine learning theory, similar conditions are known as weak learnability of an algorithm.

Let us formulate an important consequence of the obtained results.

Proposition 6. Holding all other factors constant, under the validity of assumptions A1), A2) and A3), the conditional probability of error converges to zero as the ensemble size increases.

This property follows from Propositions 2-5, taking into account that the weights' variance s is of order O(L^{-2}).

6 Numerical Experiments

This section describes numerical experiments with KCCE. In the first experiment, we used the Monte Carlo statistical modeling technique to evaluate the quality of classification. We consider a simple case of two spherical Gaussian classes N(m_1, Σ) and N(m_2, Σ), each having diagonal covariance matrix Σ = σI, where σ > 0 is a parameter, and m_1 and m_2 are the population means (in our experiments, m_1 = 0 and m_2 = 1). The classes have equal prior probabilities P_1 = P_2 = 0.5. In this simple case, it is possible to derive the Bayes probability of error P_B = Φ(-δ/2), where δ = \sqrt{(m_1 - m_2)^T \Sigma^{-1} (m_1 - m_2)} is the Mahalanobis distance and Φ is the standard Gaussian cumulative distribution function.
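As a quick numerical check of this Bayes error formula, the sketch below evaluates δ and P_B for the setting m_1 = 0, m_2 = 1 (a vector of ones) and Σ = σI; the particular values of d and σ are illustrative and are not taken from the paper's experiments.

```python
import numpy as np
from scipy.stats import norm

d, sigma = 10, 2.0                      # illustrative values, not the paper's
m1, m2 = np.zeros(d), np.ones(d)
Sigma_inv = np.eye(d) / sigma           # Sigma = sigma * I
delta = np.sqrt((m2 - m1) @ Sigma_inv @ (m2 - m1))   # Mahalanobis distance
P_B = norm.cdf(-delta / 2)              # Bayes error P_B = Phi(-delta / 2)
print(f"delta = {delta:.3f}, Bayes error = {P_B:.3f}")
```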
Moreover, the expected generalization error of the optimal sample-based linear classifier [25] asymptotically equals

P_N = \Phi\left( -\frac{\delta}{2}\, \frac{1}{\sqrt{1 + \frac{1}{N}\bigl(1 + \frac{2d}{\delta^2}\bigr) + \frac{d}{N^2 \delta^2}}} \right).

In the Monte Carlo modeling, we repeatedly generate data sets according to the given distributions. For each class, the training sample size equals N. Each data set is analyzed in the Matlab environment with SVM, KFD and KCCE. For SVM and KFD, the radial basis function

\varphi(x, x') = \exp\Bigl(-\frac{\|x - x'\|^2}{2\sigma_r^2}\Bigr)

with σ_r = 5 is used as a kernel. The soft margin constant is C = 10. The cluster ensemble for KCCE is generated by random initialization of the centroids in K-means (the number of clusters equals 2). The ensemble size equals 10. The weights of the partition variants are constant values.

The accuracy (probability of correct classification) of each algorithm is estimated on an independent test sample of size N_t = 1000. To make the results more statistically sound, we averaged the accuracy estimates over 100 Monte Carlo repetitions. The 95% confidence intervals for the probability of correct classification are evaluated for each algorithm.

Figure 1 presents the results of modeling. The plots display the dependencies between the algorithms' accuracy and the standard deviation σ for different combinations of feature space dimensionality and training sample size. The results show that there exist examples of data distribution for which the proposed method demonstrates substantially more accurate classification than SVM or KFD and approaches the optimal Bayes classifier.

In the second experiment we consider a real hyperspectral satellite image "Indian Pines" taken from [26]. This scene was gathered by the AVIRIS sensor over the Indian Pines test site in North-western Indiana. The image size is 145 × 145 pixels; each pixel is characterized by a vector of 224 spectral intensities in the 400-2500 nm range. The image includes 16 classes describing different vegetation types, as one can see in Figure 2. There are unlabeled pixels not assigned to any of the classes; these pixels are excluded from the analysis. To study the effect of noise on the performance of the algorithms, a randomly selected 100r% of the spectral intensity values have experienced a distorting effect: the corresponding value x is replaced by a quantity generated from the interval [x(1-p), x(1+p)], where r, p are preset parameters. The dataset has been randomly divided into training and test samples in proportion 1:3.

We use multiclass SVM following the "one-against-one" strategy. The cluster ensemble size is L = 200. For the construction of each variant, three hyperspectral channels are randomly chosen. To obtain more diverse results of K-means, the number of its iterations is limited to 1, and the initial centroids are randomly sampled from the data. In the ensemble generation, the data matrix X_test is not used. The number of clusters in each variant equals ⌈√N⌉. The weights of the clusterings are constant values.

Fig. 1. Results of Monte Carlo experiments for a number of combinations of feature space dimensionality d and training sample size N: (a) d = 10, N = 50; (b) d = 10, N = 200; (c) d = 20, N = 50; (d) d = 20, N = 200. "svmEns": averaged accuracy of the proposed algorithm KCCE; "svm": averaged accuracy of SVM; "KernFish": averaged accuracy of the KFD algorithm; "SampleBayes": accuracy of the optimal sample-based linear classifier.

Fig. 2. Indian Pines hyperspectral image: (a) composite image of hyperspectral data; (b) ground-truth map.
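For reference, the noise-distortion procedure used in this experiment can be sketched as follows. The uniform draw within [x(1 - p), x(1 + p)] is our assumption; the paper only states that the replacement value is generated from this interval.

```python
import numpy as np

def distort(X, r=0.1, p=0.1, rng=None):
    """Replace a random fraction r of the values x by a draw from [x(1 - p), x(1 + p)]."""
    rng = np.random.default_rng(rng)
    X = X.astype(float).copy()
    mask = rng.random(X.shape) < r                     # ~100*r% of the entries
    low, high = X[mask] * (1 - p), X[mask] * (1 + p)
    X[mask] = rng.uniform(low, high)
    return X
```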
We compare the proposed algorithm with conventional SVM using the Euclidean metric under similar conditions (the parameters are chosen as the recommended default values in the Matlab environment; an RBF kernel with σ = 10 gives the best results). Table 1 shows the accuracy of classification (the rate of correctly predicted class labels) on the test sample for some of the noise parameters. The running time on a dual-core Intel Core i5 processor with a clock frequency of 2.8 GHz and 4 GB RAM is about 50 sec on average for KCCE and 14 sec for SVM (note that unoptimized code is used in the KCCE implementation, in contrast with the efficient implementation of SVM). One can see that KCCE has revealed itself as more noise resistant than SVM, especially under large distortion rates.

Table 1. Accuracy of KCCE and SVM on the Indian Pines hyperspectral image

Noise parameters (r, p)   0.05, 0.05   0.1, 0.1   0.15, 0.15   0.2, 0.2
KCCE accuracy             0.777        0.742      0.720        0.686
SVM accuracy              0.767        0.618      0.543        0.503

7 Conclusion

In this work, we have introduced a supervised classification algorithm using a combination of ensemble clustering and kernel based classification. In the clustering ensemble, we used a scheme in which a single clustering algorithm constructs base partitions with parameters taken at random. It was verified that the weighted co-association matrix obtained with a clustering ensemble is a valid kernel matrix. The proposed combined approach has experimentally proven to be successful in comparison with Support Vector Machine and Kernel Fisher Discriminant. Monte Carlo experiments demonstrated that there exist examples of data distribution for which the proposed method gets significantly more accurate predictions. The experiment with a real hyperspectral satellite image has shown that the suggested algorithm is more accurate than SVM under noise distortion.

In the future, we plan to continue working on improving the performance of the suggested approach. For example, it will be useful to filter out points with unstable clusterings before using the kernel classifier. We expect that applying a cluster ensemble with optimized weights [12] will further improve the accuracy. It will be interesting to apply the introduced approach to the solution of other types of machine learning problems such as regression or transfer learning.

Acknowledgement. The work was carried out according to the scientific research program "Mathematical methods of pattern recognition and prediction" in the Sobolev Institute of Mathematics SB RAS. The research was partly supported by RFBR grant 18-07-00600 and partly by the Russian Ministry of Science and Education under the 5-100 Excellence Programme.

References

1. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)
2. Zhuravlev, Y.I.: Principles of construction of justification of algorithms for the solution of badly formalized problems. Mathematical Notes of the Academy of Sciences of the USSR 23(6), 493–501 (1978)
3. Schapire, R., Freund, Y., Bartlett, P., Lee, W.: Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics 26(5), 1651–1686 (1998)
4. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
5. Kuncheva, L.: Combining Pattern Classifiers. Methods and Algorithms. Wiley, NJ (2004)
6. Ajdarkhanov, M.B., Amirgaliev, E.N., La, L.L.: Correctness of algebraic extensions of models of classification algorithms.
Kibernetika i Sistemnyj Analiz 5, 180–186 (2001)
7. Jain, A.K.: Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8), 651–666 (2010)
8. Ghosh, J., Acharya, A.: Cluster ensembles. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(5), 305–315 (2011)
9. Vega-Pons, S., Ruiz-Shulcloper, J.: A survey of clustering ensemble algorithms. IJPRAI 25(3), 337–372 (2011)
10. Fred, A., Jain, A.: Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 835–850 (2005)
11. Gray, R.M.: Vector quantization. IEEE ASSP Magazine 1(2), 4–29 (1984)
12. Berikov, V.: Weighted ensemble of algorithms for complex data clustering. Pattern Recognition Letters 38, 99–106 (2014)
13. Berikov, V.: Cluster ensemble with averaged co-association matrix maximizing the expected margin. In: 9th International Conference on Discrete Optimization and Operations Research and Scientific School (DOOR 2016), pp. 489–500 (2016)
14. Berikov, V., Pestunov, I.: Ensemble clustering based on weighted co-association matrices: Error bound and convergence properties. Pattern Recognition 63, 427–436 (2017)
15. Zhuravlev, Y.I., Yunusov, R.: A method of improving the taxonomy algorithm by pattern recognition methods of the voting type. USSR Computational Mathematics and Mathematical Physics 11(5), 327–333 (1971)
16. Jiang, M., Tseng, S., Su, C.: Two-phase clustering process for outlier detection. Pattern Recognition Letters 22(6-7), 691–700 (2001)
17. Rahman, A., Verma, B.: Cluster-based ensemble of classifiers. Expert Systems 30, 270–282 (2013)
18. Chapelle, O., Zien, A., Scholkopf, B. (eds.): Semi-Supervised Learning. MIT Press (2006)
19. Chapelle, O., Weston, J., Scholkopf, B.: Cluster kernels for semi-supervised learning. Advances in Neural Information Processing Systems 15, 601–608 (2002)
20. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems 14 (2001)
21. Balcan, M.F., Blum, A., Srebro, N.: A theory of learning with similarity functions. Machine Learning 72, 89–112 (2008)
22. Iam-On, N., Boongoen, T.: Diversity-driven generation of link-based cluster ensemble and application to data classification. Expert Systems with Applications 42(21), 8259–8273 (2015)
23. Dietterich, T., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2, 263–282 (1995)
24. Platt, J.: Sequential minimal optimization: A fast algorithm for training support vector machines. In: Advances in Kernel Methods - Support Vector Learning (1998)
25. Raudys, S.: Statistical and Neural Classifiers: An Integrated Approach to Design. Springer-Verlag, London (2001)
26. Hyperspectral Remote Sensing Scenes: http://www.ehu.es/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes [Online; accessed 09-February-2018]