Binary Classification: Ensemble Methods Utilizing Decision Theory Tools

Oksana Pichugina (a,b), Lyudmyla Kirichenko (c,d) and Tamara Radivilova (c)

a National Aerospace University "Kharkiv Aviation Institute", 17 Chkalova Street, Kharkiv, 61070, Ukraine
b University of Toronto, 27 King's College Circle, Toronto, M5S 1A1, Canada
c Kharkiv National University of Radio Electronics, 14 Nauki Avenue, Kharkiv, 61166, Ukraine
d Wroclaw University of Science and Technology, 27 Wyspianskiego, Wroclaw, 50-370, Poland

Abstract
Several ensemble methods of binary classification are presented. They are based on the use of decision theory tools at the stage of aggregating the results of binary classification and obtaining refined solutions to classification problems. Results of a software implementation and of computational experiments on benchmark instances are presented. The experiment, conducted on imbalanced benchmark instances from the KEEL dataset repository, demonstrates a noticeable improvement in quality characteristics of binary classification such as accuracy and balanced accuracy. The presented approach is expected to be promising for complex classification instances such as imbalanced ones.

Keywords
Binary classification, Ensemble method, Decision Theory, priority vector, aggregation, accuracy

Introduction

The process of predicting classes for a given set of points is called classification. These classes are also called categories or labels. For instance, detecting spam in e-mail correspondence can be considered a classification problem: determining whether a given message is spam. This is a so-called binary classification problem, in which only two classes participate: class 0 – regular messages, and class 1 – spam. Similarly, the problem of detecting fraudulent financial transactions can be seen as a binary classification problem, where class 0 is normal transactions and class 1 is fraudulent ones. A distinctive feature of these two problems is that the number of elements of class 1 is much smaller than the number of other elements and the total number of observations. This imbalance greatly complicates qualitative classification and requires the development of classification theory in order to derive new approaches to solving complex classification problems.

Classification is traditionally regarded as a supervised machine learning task. Here, a machine learns from a training set whose features and labels are known. After that, elements of the instance for which labels are unknown are fed to the machine as input, and labels playing the role of predictions are obtained as output. There are many practical applications of classification in finance, economics, medicine, engineering, management, and so on [1, 2, 3, 4, 5, 6, 7].

Despite the huge amount of data involved in machine learning, the capabilities of modern computers make it possible in some cases to solve the same problem repeatedly in order to obtain a more

2nd International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2022), December 2-4, 2022, Łódź, Poland
Emails: o.pichugina@khai.edu (O. Pichugina); lyudmyla.kirichenko@nure.ua (L. Kirichenko); tamara.radivilova@nure.ua (T. Radivilova)
ORCID: 0000-0002-7099-8967 (O. Pichugina); 0000-0002-2780-7993 (L. Kirichenko); 0000-0001-5975-0269 (T. Radivilova)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
accurate solution to the problem. That is, in this case, the accuracy of the solution has a higher priority than the computing resources used. In particular, in supervised machine learning, which deals with two main problems – classification and regression – the direction of so-called aggregation or ensemble methods has been developed in recent years, where the same problem is solved in several ways, after which the results obtained are aggregated and the final solution is formed. In this paper, we present several ensemble methods for solving the binary classification problem based on the local priority vector traditionally used in the Analytic Hierarchy Process [8, 9, 10]. It will be shown that these approaches to aggregation are promising and, in some cases, give a significant improvement in the quality of classification compared to the results of applying standard classification methods.

1. Prerequisites

1.1. Classification problems typology

A classification problem (CP) [11, 12] is a problem of identifying to which class a new instance belongs based on the available class membership of samples in a training set (TS). Instances to which classification is applied form a test set (TeS). A TS-instance is given by a feature tuple $z$ and a class label $y$. A TeS-instance is represented by a feature tuple only, while its class label is unknown and needs to be found. $z$ can be a numeric vector, but not necessarily, since the features presented in $z$-components characterize certain instance properties and can be categorical, ordinal, integer-valued, or real-valued. If categorical or ordinal features are present, a preprocessing stage is required to map $z$ into Euclidean space:

$\phi: z \to x \in \mathbb{X} \subset \mathbb{R}^m$. (1)

Let us assume that the mapping (1) is done, and we deal with the training and test sets

TS: $\{\langle x_i, y_i \rangle\}_{i \in J_n}$; (2)

TeS: $\{x'_i\}_{i \in J_{n'}}$, (3)

where $J_n = \{1, ..., n\}$, $x_i \in \mathbb{R}^m$, $i \in J_n$, and $x'_i \in \mathbb{R}^m$, $i \in J_{n'}$.

A classification algorithm (CA) is intended to train a classifier, which is a function mapping an instance space $\mathbb{X}$ into a class label space $\mathbb{Y}$:

$f: \mathbb{X} \to \mathbb{Y}$. (4)

Let

$\mathcal{C} = \{C_0, ..., C_l\}$ (5)

be a set of classes. We will refer to $\mathbb{X}$ as an instance space and to the set

$\mathbb{Y} = J_l^0$ (6)

as a class label space (hereafter, $J_l^0 = \{0, ..., l\}$).

The following division of classification problems (CPs) is common [11, 12, 13, 14]:

1. Depending on the number of classes:
   a) Binary classification problems (Binary CPs, BCPs) if the number of classes is two, i.e., $l = 1$;
   b) Multi-class classification problems (Multiclass CPs, MCPs) if $l > 1$;
2. Depending on the proportions of the classes' sizes:
   a) Balanced classification problems (BaCPs) if the difference in the classes' sizes is statistically insignificant;
   b) Imbalanced classification problems (class imbalance problems, ICPs) if this difference is significant.

1.2. Classification metrics

Quality of classification is assessed by different metrics [11, 15, 16] such as the accuracy, balanced accuracy, recall, precision, $F$-score, $G$-mean, etc., which utilize a confusion matrix. A confusion matrix (CM) is a square matrix of order $s + 1$, where $s + 1$ is the number of classes. Each of its rows represents the instances of an actual class (with an actual label), while each column represents the instances of a predicted class (with a predicted label). Namely, $CM = (n_{ij})_{i,j \in J_s^0}$, where $n_{ij}$ is the number of instances from an actual class $C_i$ whose predicted class is $C_j$ (with an actual label $y_i$ and a predicted label $y_j$), as the sketch below illustrates.
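The entries $n_{ij}$ can be computed directly from the label vectors. Below is a minimal Python sketch (the experiments in this paper are implemented in R; this re-implementation, its function name, and the toy labels are ours):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Build CM = (n_ij): row i = actual class, column j = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        cm[actual, predicted] += 1
    return cm

# Toy binary example with hypothetical labels.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 0])
print(confusion_matrix(y_true, y_pred, 2))
# [[3 1]
#  [1 1]]
```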
The number $n_{ii}$ is called the true prediction count of the class $C_i$ ($i \in J_s^0$). Let

$n'_i = \sum_{j=0}^{s} n_{ij}$, $i \in J_s^0$,

be the distribution of the classes, where $n'_i$ is the size of class $C_i$. Then the number $n$ of all instances satisfies the relation

$n = \sum_{i=0}^{s} n'_i = \sum_{i,j=0}^{s} n_{ij}$.

In these notations, the classification metrics listed above are represented as follows. Classification accuracy (accuracy) is calculated as the sum of all true predictions divided by $n$:

$AC = \frac{1}{n} \sum_{i=0}^{s} n_{ii}$. (7)

$AC$ is inadequate in reflecting the classifier's performance on each single class, especially on small classes [17, 16]. That is why, in ICPs, other metrics are more commonly used. Among them are the following two:

$R_i = \frac{n_{ii}}{n'_i}$, $i \in J_s^0$; (8)

$P_j = \frac{n_{jj}}{\sum_i n_{ij}}$, $j \in J_s^0$. (9)

After replacing $j$ by $i$, the latter becomes

$P_i = \frac{n_{ii}}{\sum_j n_{ji}}$, $i \in J_s^0$. (10)

$R_i$, $P_i$ are called the recall (the true positive rate, TPR) and the precision of a class $C_i$, respectively. By themselves, neither of these two characteristics is adequate in reflecting the performance of a classifier on the class $C_i$. Therefore, they are commonly integrated into a metric called the F-score (F-measure) of a class $C_i$ [18]:

$F_i = \frac{2 \cdot P_i \cdot R_i}{P_i + R_i}$, $i \in J_s^0$.

A peculiarity of the F-score is that it is high only if both the recall and the precision are high. If we are interested in the performance on all classes, the classification performance on each class should be equally represented in the evaluation metric. G-mean [19] is such a measure; it is the geometric mean of the recall values of all classes:

$G_{mean} = \left( \prod_{i=0}^{s} R_i \right)^{1/(s+1)}$.

Another such metric is the balanced classification accuracy:

$BAC = \frac{1}{s+1} \sum_{i=0}^{s} R_i$.

The formula (8) can be rewritten as follows:

$R_i = \frac{TP_i}{TP_i + FN_i}$, $i \in J_s^0$, (11)

where

$TP_i = n_{ii}$, $FN_i = n'_i - TP_i$ (12)

are called the true positives and the false negatives of a class $C_i$, respectively ($i \in J_s^0$). Note that if $i \neq j$, then the value $n_{ij}$ is the error of predicting for a $C_i$-instance that it belongs to a class $C_j$. In these notations, the sum across the whole row $i$ is $n'_i = TP_i + FN_i$. The false positives of $C_i$, denoted by $FP_i$, are the sum of the column $i$ excluding $TP_i$. The true negatives of $C_i$ are $TN_i = n - (TP_i + FN_i + FP_i)$, i.e., the sum of the elements of a CM except for the row $i$ and the column $i$ [20]. Using this terminology, formulas (7), (8), and (10) become:

$AC = \frac{1}{n} \sum_{i=0}^{s} TP_i$; (13)

$R_i = \frac{TP_i}{TP_i + FN_i}$, $i \in J_s^0$; (14)

$P_i = \frac{TP_i}{TP_i + FP_i}$, $i \in J_s^0$. (15)

If BCPs are solved, the positive class is considered as the main class $C = C_1$, and the subindex $i$ is omitted in formulas (11)-(15), yielding the most common expressions for the accuracy, recall, and precision:

$AC = \frac{1}{n}(TP + TN)$, (16)

$R = \frac{TP}{TP + FN}$, (17)

$P = \frac{TP}{TP + FP}$, (18)

where $TN = n_{00}$, $FP = n_{01}$, $FN = n_{10}$, $TP = n_{11}$. Respectively, in these notations, the expressions for $BAC$, the $F$-score, and $G_{mean}$ are as follows:

$BAC = \frac{1}{2}\left( \frac{TN}{TN + FP} + \frac{TP}{FN + TP} \right)$; (19)

$F = \frac{2 \cdot P \cdot R}{P + R}$; (20)

$G_{mean} = \left( \frac{TN}{TN + FP} \cdot \frac{TP}{FN + TP} \right)^{1/2}$. (21)

One more frequently used method for evaluating classification performance in BCPs is Receiver Operating Characteristics (ROC) analysis [21, 15, 11]. A ROC graph (ROC curve) depicts the relative trade-offs between true positives and false positives. It is plotted with $R$ on the y-axis against the false positive rate

$FPR = \frac{FP}{FP + TN}$ (22)

on the x-axis. AUC (Area Under the Curve) is the area under the ROC curve [21].
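All the metrics above follow mechanically from a CM. The following Python sketch (an illustrative re-implementation, under the assumption that every class occurs and is predicted at least once, so no denominator vanishes) computes (7)-(10), BAC, and G-mean for an arbitrary number of classes:

```python
import numpy as np

def classification_metrics(cm):
    """Per-class and aggregate metrics from a confusion matrix CM = (n_ij)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()                           # total number of instances
    tp = np.diag(cm)                       # true prediction counts n_ii
    recall = tp / cm.sum(axis=1)           # R_i = n_ii / n'_i, formula (8)
    precision = tp / cm.sum(axis=0)        # P_i = n_ii / sum_j n_ji, formula (10)
    return {
        "AC": tp.sum() / n,                            # formula (7)
        "BAC": recall.mean(),                          # mean of class recalls
        "G_mean": recall.prod() ** (1 / len(recall)),  # geometric mean of recalls
        "R": recall,
        "P": precision,
        "F": 2 * precision * recall / (precision + recall),
    }

print(classification_metrics([[3, 1], [1, 1]]))
```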
1.3. Ensemble techniques

Combining classifiers and, as a consequence, combining the results of classification is one of the commonly accepted methods for improving the quality of classification and increasing the reliability of its results [22, 23, 24]. Combining methods are widely used in regression [22] and classification [23, 24] problems. These methods are called ensemble algorithms (EAs). Two subclasses in this group of methods can be singled out – classification EAs (ECAs) and regression EAs (ERAs).

Collections of EAs used together are called ensemble systems (ESs) [23]. Among them are classification ESs (ECSs) and regression ESs (ERSs). ESs are designed not only for forming a collection of predictions obtained by different combining methods, but also for analyzing and comparing these results in order to form a single solution to the prediction problem under consideration.

EAs utilize the idea that "the more diverse the training set, base classifiers, and feature set, the better the performance of the ES" [23]. Based on that, six strategies for designing EAs were described in [25]:

1. different initialization;
2. different parameter choice;
3. different architecture;
4. different classifiers;
5. different training sets;
6. different feature sets.

Two of the listed strategies are most commonly used, namely, different training sets (also known as the Homogeneity Scenario [26]) and different classifiers (known as the Heterogeneity Scenario). Respectively, ESs are divided into homogeneous ESs (HoESs) and heterogeneous ESs (HeESs). Thus, ECSs can be either homogeneous (HoECSs) or heterogeneous (HeECSs). Further, we will focus on ECSs. Elements of HoECSs and HeECSs we will call homogeneous ECAs (HoECAs) and heterogeneous ECAs (HeECAs), respectively.

In a HoECA, base classifiers are generated by applying a base CA to different training sets formed from the original training set. The predicted labels obtained by these classifiers are then combined into final predicted labels of the test set. This category of ensemble methods includes Random Forest, Adaptive Boosting, Bagging, Random Subspace (see [23] and references therein), etc.

In contrast to HoECAs, in a HeECA, different CAs are applied to the same training set, thus producing a set of different base classifiers, whose outputs are called meta-data [27]. The meta-data are then combined into the final predicted labels obtained as a result of applying this HeECA.

Remark 1. Note that HoECAs involve a large number of different "weak" classifiers. Therefore, a base CA should not be fine-tuned. At the same time, HeECAs commonly utilize a small number of "strong" classifiers; hence the involved base CAs need preliminary tuning.

Both groups of ECAs typically utilize probabilities, i.e., they deal with probabilistic CAs [13], thus precluding the involvement of deterministic (non-probabilistic) CAs such as SVM, SGB, Kernel Logistic Regression, Logistic Model Trees, etc. [28].

HeECAs are also divided into two groups – fixed HeECAs and trainable HeECAs [23]. The main difference between the two is that, when combining, fixed ones do not take into consideration the label information in the meta-data of the training set, while trainable methods do. Fixed HeECAs combine the base classifiers' results by the sum, product, min, max, median, majority vote rules, etc. [29] (see the sketch after this paragraph). Among trainable HeECAs are the Stacking Algorithm, Inference-based Combiner, Multiple Response Linear Regression, SCANN, Decision Template (see [23] and references therein), and so on.
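For concreteness, here is a minimal sketch of one fixed combining rule, the majority vote over binary meta-data (the prediction matrix is hypothetical):

```python
import numpy as np

# Meta-data: one row per instance, one column per base classifier's
# predicted binary label; m = 3 hypothetical base classifiers.
meta = np.array([[1, 0, 1],
                 [0, 0, 1],
                 [1, 1, 1]])

# Majority vote: label 1 iff more than half of the classifiers vote 1.
majority = (meta.sum(axis=1) > meta.shape[1] / 2).astype(int)
print(majority)  # [1 0 1]
```

The sum, product, min, max, and median rules are analogous, with the vote count replaced by the corresponding aggregation of the base classifiers' posterior probabilities [29].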
In this research, we set the task of constructing a universal HeECA consisting of fixed and trainable HeECAs and involving both deterministic and probabilistic CAs. We expect that using well-known HoECAs along with standard CAs will result in a highly capable classifier that outperforms existing ones.

2. The proposed heterogeneous ensemble classification system

2.1. Denotations

Let us introduce some notations:

• $J_n = \{1, ..., n\}$, $J_n^0 = J_n \cup \{0\}$;
• $CA = \{CA_j\}_{j \in J_m}$ – a set of base classification algorithms (CAs);
• $ECA = \{ECA_{j'}\}_{j' \in J_{m'}}$ – a set of our ensemble classification algorithms (ECAs), which together form a HeECS (further referred to as an expert assessment based heterogeneous ensemble classification system, EA-HeECS);
• $\{O_i\}_{i \in J_n}$ – a set of instances (observations, examples, samples, statistical units);
• $\langle X | Y \rangle$ – a tuple of observations on samples and their labels:
  – $X = (x_i)_{i \in J_n} = (x_{il})_{i \in J_n, l \in J_L}$ – a real-valued matrix of observations on $L$ independent (explanatory) variables;
  – $Y = (y_i)_{i \in J_n}$ – a vector of actual labels, thus
    $y_i \in \mathbb{Y}$, $i \in J_n$; (23)
  – $\hat{Y} = (\hat{y}_i)_{i \in J_n}$ – a vector of predicted labels such that
    $\hat{y}_i = f(x_i) \in \mathbb{Y}$, $i \in J_n$; (24)
• $I$ – a K-fold split of $J_n$, i.e., a partition of $J_n$ into $K$ subsets of nearly equal size:
  $I = \{I_k\}_{k \in J_K}: \bigcup_{k=1}^{K} I_k = J_n$, $I_k \cap I_{k'} = \emptyset$, $n_k \approx n_{k'}$, $\forall k \neq k'$, (25)
  where $n_k = |I_k|$, $k \in J_K$.

The partition (25) induces $K$ pairs of test sets

$TeS_k$, $k \in J_K$, (26)

and training sets

$TS_k$, $k \in J_K$, (27)

such that

$TeS_k = \{x_i, y_i\}_{i \in I_k}$, $TS_k = \{x_i, y_i\}_{i \notin I_k}$, $k \in J_K$. (28)

The collection (26) defines a partition of the dataset $\langle X, Y \rangle$. At the same time, the sets (27) define a decomposition of the tuple in which each sample $\langle x, y \rangle \in \langle X, Y \rangle$ is presented $K - 1$ times.

CAs yield predictions of labels; namely,

$\hat{y}_{ij}$, $\hat{y}'_{ij'}$, $i \in J_n$, $j \in J_m$, $j' \in J_{m'}$,

denote the predicted labels of $O_i$ obtained by $CA_j \in CA$ and $ECA_{j'} \in$ EA-HeECS, respectively. Note that, since in CAs classification models are built and optimized on training sets, the values

$\hat{y}_{ij}$, $\hat{y}'_{ij'}$, $i \in I_k$, $j \in J_m$, $j' \in J_{m'}$, $k \in J_K$, (29)

are real predictions, while

$\hat{y}_{ij}$, $\hat{y}'_{ij'}$, $i \notin I_k$, $j \in J_m$, $j' \in J_{m'}$, $k \in J_K$, (30)

are label predictions that can be compared with the actual labels

$y_i$, $i \notin I_k$, $k \in J_K$. (31)

Therefore, they can be used for evaluating the quality of classification on training sets and for tuning the CAs if necessary. Let us collect the predictions (29) into matrices

$Y^k = (\hat{y}_{ij})_{i \in I_k, j \in J_m}$, $Y'^k = (\hat{y}'_{ij'})_{i \in I_k, j' \in J_{m'}}$, $k \in J_K$, (32)

obtained by CAs and ECAs, respectively. For the training sets (27), the imbalance ratios are

$IR_k = \frac{|TS_k| - n'_{k1}}{n'_{k1}} = \frac{n - n_k - n'_{k1}}{n'_{k1}}$, where $n'_{k1} = \sum_{i \notin I_k} y_i$ (33)

is the size of $C_1$ in $TS_k$, $k \in J_K$. Note that, in ICPs, an imbalance ratio $IR$ can be given a priori. In this case, it is reasonable to use $IR_k = IR$, $k \in J_K$, instead of (33).

Also, in some ECAs, weights of the CAs, which play the role of experts in solving CPs, are used. The weights can also be assigned a priori, but we will use posterior information obtained as a result of comparing (30) with (31). Namely, we introduce vectors of relative weights of the CAs for a fold $k$:

$W^k = (w_j^k)_{j \in J_m}$, $k \in J_K$, (34)

where $w_j^k$ is the weight of $CA_j$, which depends on the quality of classification achieved by this method on a training set $TS_k$.
The vectors (34) must satisfy the following conditions:

$W^k \geq 0$, $\|W^k\|_1 = 1$, $k \in J_K$. (35)

They can be found in different ways depending on our preferences – whether it is the accuracy, balanced accuracy, recall, precision, G-mean, F-score, AUC, or another classification measure $M$ that we aim to improve. All these approaches can be combined in the following way. Let

$AC^k = (AC_j^k)_j$, $BAC^k = (BAC_j^k)_j$, $R^k = (R_j^k)_j$, $P^k = (P_j^k)_j$, $G^k = (G_j^k)_j$, $F^k = (F_j^k)_j$, $AUC^k = (AUC_j^k)_j$, $M^k = (M_j^k)_j$, $k \in J_K$, (36)

where $AC_j^k$, $BAC_j^k$, $R_j^k$, $P_j^k$, $G_j^k$, $F_j^k$, $AUC_j^k$, $M_j^k$, $k \in J_K$, $j \in J_m$, are the values of AC, BAC, R, P, G-mean, F-score, AUC, and the metric $M \in [0, 1]$ achieved by $CA_j$ applied on $TS_k$; let

$\alpha = (\alpha_r)_{r \in J_8} \in \mathbb{R}_+^8: \|\alpha\|_1 = 1$ (37)

be a vector of weights of the listed metrics. Then the vector (34) is defined as follows:

$W^k = \frac{1}{A^k} \left( \alpha_1 AC^k + \alpha_2 BAC^k + \alpha_3 R^k + \alpha_4 P^k + \alpha_5 G^k + \alpha_6 F^k + \alpha_7 AUC^k + \alpha_8 M^k \right)$, (38)

where

$A^k = \left\| \alpha_1 AC^k + \alpha_2 BAC^k + \alpha_3 R^k + \alpha_4 P^k + \alpha_5 G^k + \alpha_6 F^k + \alpha_7 AUC^k + \alpha_8 M^k \right\|_1$, $k \in J_K$.

For instance, a choice of $\alpha = \alpha^1 = (1, 0^7)$ means that we are interested in increasing AC only; $\alpha = \alpha^2 = (0, 1, 0^6)$ implies that we focus on a higher BAC first of all; $\alpha = \alpha^3 = (0^2, 1, 0^5)$ – that we are interested in a higher $R$; $\alpha = \alpha^4 = (0^4, 1, 0^3)$ – that we attempt to increase G-mean; $\alpha = \alpha^5 = (0^6, 1, 0)$ – that we wish to maximize AUC; $\alpha = \alpha^6 = ((1/6)^3, 0, (1/6)^3, 0)$ – that all the listed standard classification metrics, except for $P$ and $M$, are involved with equal weights, etc.

2.2. New ECAs description

Fix $k \in J_K$ and $j \in J_m$. To the vector

$\hat{y}_j^k = (\hat{y}_{ij})_{i \in I_k}$, (39)

a local priority vector (LPV) [9, 10, 30]

$p_j^k = (p_{ij})_{i \in I_k}$ (40)

is associated such that

$p_{ij}/p_{i'j} = \begin{cases} IR_k, & \text{if } O_i \in C_1,\ O_{i'} \in C_0; \\ (IR_k)^{-1}, & \text{if } O_i \in C_0,\ O_{i'} \in C_1; \\ 1, & \text{otherwise}; \end{cases} \quad (i, i' \in I_k)$ (41)

$\|p_j^k\|_1 = 1$. (42)

Remark 2. A vector $p_j^k$ has at most two different coordinates; therefore, in order to find it, an auxiliary vector can be formed with unit coordinates for the instances with predicted label 0 and with the remaining coordinates equal to $IR_k$. Then this vector is normalized, yielding a vector satisfying (41) and (42).

Based on the assumption that $IR_k$ is also the imbalance ratio of $TeS_k$, i.e., that the imbalance ratios are the same for the training and test sets, we split the test set $TeS_k$ into classes $C_0$, $C_1$ such that

$|C_0| = n_{k0}$, $|C_1| = n_{k1}$: $n_{k1} = \left\lceil \frac{n_k}{1 + IR_k} \right\rceil$, $n_{k0} = n_k - n_{k1}$. (43)

2.2.1. ECA1 (based on utilizing the geometric mean of expert estimates)

The LPVs (40) are combined in the following way:

• Find an auxiliary vector

$z_1'^k = (z'_{i1})_{i \in I_k}: z'_{i1} = \left( \prod_{j=1}^{m} p_{ij} \right)^{1/m}$, $i \in I_k$. (44)

• Find a threshold value

$thr_1^k: z'_{i_1 1} \geq z'_{i_2 1} \geq ... \geq z'_{i_{n_{k1}} 1} = thr_1^k > z'_{i_{n_{k1}+1} 1} \geq ... \geq z'_{i_{n_k} 1}$. (45)

• Assign

$\hat{y}'_{i1} = \begin{cases} 1, & \text{if } z'_{i1} \geq thr_1^k; \\ 0, & \text{otherwise} \end{cases} \quad (i \in I_k)$. (46)

2.2.2. ECA2 (based on using the weighted geometric mean of expert estimates)

First, we generalize (45), (46) in the following way:

$thr_{j'}^k: z'_{i_1 j'} \geq z'_{i_2 j'} \geq ... \geq z'_{i_{n_{k1}} j'} = thr_{j'}^k > z'_{i_{n_{k1}+1} j'} \geq ... \geq z'_{i_{n_k} j'}$; (47)

$\hat{y}'^k_{ij'} = \begin{cases} 1, & \text{if } z'_{ij'} \geq thr_{j'}^k; \\ 0, & \text{otherwise} \end{cases} \quad (i \in I_k)$, (48)

where $j' \in J_{m'}$, thus replacing the subindex 1 in the formulas by $j'$. Now, for ECA2, the LPVs (40) are combined using the weights (34), formulas (47), (48) are applied with $j' = 2$, and the auxiliary vector is

$z_2'^k = (z'_{i2})_{i \in I_k}: z'_{i2} = \prod_{j=1}^{m} p_{ij}^{w_j^k}$, $i \in I_k$. (49)

Both constructions are illustrated in the sketch below.
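This is a compact Python sketch of ECA1 under the stated assumptions (Remark 2 for building the LPVs, (43)-(46) for the threshold); the names `lpv` and `eca1` are ours, and ties at the threshold may yield slightly more than $n_{k1}$ positive labels:

```python
import numpy as np

def lpv(pred_labels, ir):
    """Local priority vector (Remark 2): coordinate ir for predicted label 1,
    coordinate 1 for predicted label 0, then normalization to satisfy (42)."""
    v = np.where(pred_labels == 1, ir, 1.0)
    return v / v.sum()

def eca1(pred_matrix, ir):
    """ECA1: geometric mean (44) of the LPVs, threshold by (43), (45), (46).
    pred_matrix: |I_k| x m matrix of the base classifiers' predicted labels."""
    p = np.column_stack([lpv(col, ir) for col in pred_matrix.T])
    z = p.prod(axis=1) ** (1.0 / p.shape[1])    # auxiliary vector (44)
    n1 = int(np.ceil(len(z) / (1.0 + ir)))      # class C_1 size by (43)
    thr = np.sort(z)[::-1][n1 - 1]              # threshold value (45)
    return (z >= thr).astype(int)               # assignment rule (46)

# Hypothetical fold: 6 instances, 3 base classifiers, imbalance ratio IR_k = 2.
preds = np.array([[1, 1, 0], [0, 0, 0], [1, 0, 1],
                  [0, 0, 0], [0, 1, 0], [1, 1, 1]])
print(eca1(preds, ir=2.0))
```

ECA2 differs only in the auxiliary vector: the plain geometric mean in (44) is replaced by the weighted product (49), e.g. `z = np.prod(p ** w, axis=1)` for a weight vector `w` satisfying (35).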
2.2.3. ECA3 (the majority voting)

Here, we first set $j' = 3$; then we find a vector of votes and a threshold:

$z_3^k = (z'_{i3})_{i \in I_k}: z'_{i3} = \sum_{j=1}^{m} \hat{y}_{ij}^k$, $i \in I_k$; (50)

$thr_3^k = \frac{m}{2}$. (51)

Finally, the formula (48) is applied.

2.2.4. ECA4 (the weighted majority voting)

In this case, we follow the scheme:

• set $j' = 4$;
• find a vector of weighted votes and a threshold:

$z_4^k = (z'_{i4})_{i \in I_k}: z'_{i4} = \sum_{j=1}^{m} w_j^k \hat{y}_{ij}^k$, $i \in I_k$; (52)

$thr_4^k = \frac{1}{2}$ (53)

(the analogue of (51) for the weights normalized by (35));
• finally, (48) is applied.

Remark 3. ECA3 and ECA4 can be implemented in a slightly different way, in the manner of ECA1, ECA2, if the formula (47) is used for deriving $thr_3^k$, $thr_4^k$ instead of (51), (53).

2.2.5. ECA5 (the iterative method of finding expert estimates)

Set $j' = 5$. Let $J_T^0$ be a set of iterations, $t \in J_T^0$ be an iteration index, and

$W_t^k = (w_{jt}^k)_{j \in J_m}$ (54)

be a vector of the weights of the CAs for a fold $k$ on iteration $t$ ($t \in J_T^0$, $k \in J_K$). For the vector (54), constraints similar to (35) hold, namely,

$W_t^k \geq 0$, $\|W_t^k\|_1 = 1$, $k \in J_K$, $t \in J_T^0$. (55)

2.2.6. ECA5 outline

• Input: $k \in J_K$, $\varepsilon > 0$, a matrix $P^k = (p_j^k)_{j \in J_m}$, where the vectors $p_j^k$, $j \in J_m$, are found by (40).
• Step 0. Initialization: $t = 0$, all the weights (55) are equal, thus
  $W_0^k = \left( \frac{1}{m}, ..., \frac{1}{m} \right) \in \mathbb{R}^m$.
• Step 1. Set $t = t + 1$.
• Step 2. Find an estimate of $y^k$ on iteration $t$, denoted by
  $\hat{y}^{k(t)} = (\hat{y}_i^{(t)})_{i \in I_k}$, (56)
  in the following way:
  $\hat{y}^{k(t)} = P^k W_{t-1}^k$. (57)
• Step 3. A vector $W_t^k$ is evaluated based on $W_{t-1}^k$ in such a way that its components increase for those $j \in J_m$ whose vectors of estimates $\hat{y}_j^k$ are closer to (57) and decrease for the rest of the CAs. For that, an auxiliary parameter
  $\lambda^{k(t)} = (\hat{y}^{k(t)})^T (P^k \mathbf{1})$, (58)
  where $\mathbf{1} \in \mathbb{R}^m$ is the all-ones vector, is calculated first. Then $W_t^k$ is found with the help of (57), (58):
  $W_t^k = (\lambda^{k(t)})^{-1} (P^k)^T \hat{y}^{k(t)}$. (59)
• Step 4. If the given accuracy $\varepsilon$ is not achieved yet, i.e.,
  $\frac{\|\hat{y}^{k(t)} - \hat{y}^{k(t-1)}\|_1}{\|\hat{y}^{k(t-1)}\|_1} > \varepsilon$, (60)
  then go to Step 1. Otherwise, set $T = t$ and terminate.
• Outputs:
  $W^k = W_T^k$, $z_5'^k = (z'_{i5})_{i \in I_k} = \hat{y}^{k(T)}$.

Then (47), (48) are applied.
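A minimal sketch of the ECA5 loop (57)-(60) follows; the function name, the `max_iter` safeguard, and the default tolerance are ours:

```python
import numpy as np

def eca5(p, eps=1e-6, max_iter=100):
    """Iterate the label estimate (57) and the CA weights (58)-(59)
    until the relative change (60) drops to eps or below.
    p: |I_k| x m matrix whose columns are the LPVs (40)."""
    m = p.shape[1]
    w = np.full(m, 1.0 / m)               # Step 0: equal initial weights
    y = p @ w                             # first estimate by (57)
    for _ in range(max_iter):
        lam = y @ (p @ np.ones(m))        # auxiliary parameter (58)
        w = (p.T @ y) / lam               # weight update (59)
        y_new = p @ w                     # next estimate by (57)
        if np.abs(y_new - y).sum() / np.abs(y).sum() <= eps:   # criterion (60)
            return w, y_new
        y = y_new
    return w, y
```

The returned estimate plays the role of $z_5'^k$ and is then thresholded via (47), (48). Note that (59) preserves $\|W_t^k\|_1 = 1$ automatically, since all entries of $P^k$ and $\hat{y}^{k(t)}$ are nonnegative.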
2.3. DS-ECS modification and generalization

We will refer to the approach of assigning the CAs' weights described earlier in this section, where the whole training set $TS_k$ is used for that purpose, as Approach 1 (A1). Another approach to assigning the weights of CAs that can be used in ECAs (further referred to as Approach 2, A2) is based on the observation that, to be more realistic, it might be beneficial to additionally split the training sets (27) into auxiliary test and training sets:

$TeS'_k$, $TS'_k$: $TeS'_k \cup TS'_k = TS_k$, $TeS'_k \cap TS'_k = \emptyset$, $k \in J_K$, (61)

for instance, making a 10%/90% split, to additionally perform training on $TS'_k$, and then to assign the weights depending on the quality of classification achieved on $TeS'_k$. In this case, (34) is replaced by

$W'^k = (w_j'^k)_{j \in J_m}$, $k \in J_K$, (62)

where $w_j'^k$ is the weight of $CA_j$, which reflects the quality of classification reached on the auxiliary test set $TeS'_k$ by applying $CA_j$ trained on the auxiliary training set $TS'_k$. For Approach 2, the formula (36) becomes:

$AC'^k = (AC_j'^k)_j$, $BAC'^k = (BAC_j'^k)_j$, $R'^k = (R_j'^k)_j$, $P'^k = (P_j'^k)_j$, $G'^k = (G_j'^k)_j$, $F'^k = (F_j'^k)_j$, $AUC'^k = (AUC_j'^k)_j$, $M'^k = (M_j'^k)_j$, $k \in J_K$, (63)

where $AC_j'^k$, $BAC_j'^k$, $R_j'^k$, $P_j'^k$, $G_j'^k$, $F_j'^k$, $AUC_j'^k$, $M_j'^k$, $k \in J_K$, are AC, BAC, R, P, G-mean, F-score, AUC, and $M$ evaluated on $TeS'_k$ after applying $CA_j$ ($j \in J_m$). Respectively, (38) becomes:

$W'^k = \frac{1}{A'^k} \left( \alpha_1 AC'^k + \alpha_2 BAC'^k + \alpha_3 R'^k + \alpha_4 P'^k + \alpha_5 G'^k + \alpha_6 F'^k + \alpha_7 AUC'^k + \alpha_8 M'^k \right)$,
$A'^k = \left\| \alpha_1 AC'^k + \alpha_2 BAC'^k + \alpha_3 R'^k + \alpha_4 P'^k + \alpha_5 G'^k + \alpha_6 F'^k + \alpha_7 AUC'^k + \alpha_8 M'^k \right\|_1$, (64)

where $k \in J_K$. Besides, $ECA_2$ and $ECA_4$, which use the weights of CAs, can be adapted to the usage of (64); namely, (49) and (52) become:

$z_2'^k = (z'_{i2})_{i \in I_k}: z'_{i2} = \prod_{j=1}^{m} p_{ij}^{w_j'^k}$, $i \in I_k$; (65)

$z_4'^k = (z'_{i4})_{i \in I_k}: z'_{i4} = \sum_{j=1}^{m} w_j'^k \hat{y}_{ij}^k$, $i \in I_k$. (66)

Remark 4. We will refer to these modifications as $ECA_2'$ and $ECA_4'$, respectively.

3. Computational Experiment

To validate the proposed ensemble approaches, seven conventional classification algorithms implemented in R were selected:

1. Linear Logistic Regression (GLM) [31];
2. k-Nearest Neighbors (KNN) [32];
3. Linear Support Vector Machine (SVM) [33];
4. Kernel SVM (KSVM) [34];
5. Naive Bayes (NB) [35];
6. Decision Tree (DT) – C4.5 [36], CART [2];
7. Random Forest (RF) [2],

and two imbalanced benchmark instances from the KEEL dataset repository [37] were used – Pima Indians Diabetes (Pima) and Haberman Breast Cancer (Haberman) – for which standard classification algorithms give rather low quality, with an accuracy of around 70%.

As a result of applying our ensemble approaches, the accuracy and balanced accuracy were improved by 1% and 6%, respectively. The best method is ECA2.

Conclusion

In this paper, we offer new ensemble methods for binary classification that use Decision Theory tools for combining conventional classifiers' results. Our aggregative techniques work better on datasets where standard methods are weak, such as Haberman and Pima. Thus, the proposed approaches are promising for application to complex real-world classification problems.

Acknowledgments

The work was partially supported by Beethoven Grant No. DFG NCN 2016/23/G/ST1/04083.

References

[1] L. Kirichenko, T. Radivilova, V. Bulakh, Machine Learning in Classification Time Series with Fractal Properties, Data 4 (2019) 5. doi:10.3390/data4010005.
[2] L. Breiman, J. Friedman, C. J. Stone, R. A. Olshen, Classification and Regression Trees, 1st ed., Chapman and Hall/CRC, Boca Raton, Fla., 1984.
[3] D. Forsyth, Applied Machine Learning, Springer International Publishing, 2019. doi:10.1007/978-3-030-18114-7.
[4] P. Vuttipittayamongkol, E. Elyan, A. Petrovski, On the class overlap problem in imbalanced data classification, Knowledge-Based Systems 212 (2021) 106631. doi:10.1016/j.knosys.2020.106631.
[5] V. A. Perepelitsa, N. K. Maksishko, I. V. Kozin, Using a model of cellular automata and classification methods for prediction of time series with memory, Cybernetics and Systems Analysis 42 (2006) 807–816. doi:10.1007/s10559-006-0121-4.
[6] L. Kirichenko, O. Pichugina, T. Radivilova, K. Pavlenko, Application of wavelet transform for machine learning classification of time series, in: S. Babichev, V. Lytvynenko (Eds.), Lecture Notes in Data Engineering, Computational Intelligence, and Decision Making, Lecture Notes on Data Engineering and Communications Technologies, Springer International Publishing, 2023, pp. 547–563. doi:10.1007/978-3-031-16203-9_31.
[7] L. Kirichenko, O. Pichugina, H. Zinchenko, Clustering time series of complex dynamics by features, in: Selected Papers of the VIII International Scientific Conference "Information Technology and Implementation" (IT&I-2021). Conference Proceedings, volume 3132 of CEUR Workshop Proceedings, 2021, pp. 83–93. ISSN: 1613-0073.
[8] O. Pichugina, Decision Making Tools For Choice Software Development Environment, in: 2020 IEEE KhPI Week on Advanced Technology (KhPIWeek), 2020, pp. 450–454. doi:10.1109/KhPIWeek51551.2020.9250109.
[9] T. L. Saaty, J. M. Alexander, Conflict Resolution: The Analytic Hierarchy Approach, Praeger, New York, 1989.
[10] T. L. Saaty, Analytic Hierarchy Process, in: S. I. Gass, M. C. Fu (Eds.), Encyclopedia of Operations Research and Management Science, Springer US, 2013, pp. 52–64. doi:10.1007/978-1-4419-1153-7_31.
[11] C. Drummond, Classification, in: C. Sammut, G. I. Webb (Eds.), Encyclopedia of Machine Learning, Springer US, Boston, MA, 2010, pp. 168–171. doi:10.1007/978-0-387-30164-8_111.
[12] E. Alpaydin, Introduction to Machine Learning, Adaptive Computation and Machine Learning, 2nd ed., MIT Press, Cambridge, Mass., 2010.
[13] C. C. Aggarwal, Data Classification: Algorithms and Applications, 1st ed., Chapman & Hall/CRC, 2014.
[14] N. Japkowicz, S. Stephen, The class imbalance problem: A systematic study, Intelligent Data Analysis 6 (2002) 429–449.
[15] I. H. Witten, E. Frank, M. A. Hall, C. J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, 4th ed., Morgan Kaufmann, Amsterdam, 2016.
[16] Y. Sun, M. S. Kamel, Y. Wang, Boosting for Learning Multiple Classes with Imbalanced Class Distribution, in: Sixth International Conference on Data Mining (ICDM'06), 2006, pp. 592–602. doi:10.1109/ICDM.2006.29.
[17] C. Drummond, R. C. Holte, Severe Class Imbalance: Why Better Algorithms Aren't the Answer, in: J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, L. Torgo (Eds.), Machine Learning: ECML 2005, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2005, pp. 539–546. doi:10.1007/11564096_52.
[18] D. Lewis, W. A. Gale, Training text classifiers by uncertainty sampling, in: Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, 1994, pp. 73–79.
[19] M. Kubat, R. C. Holte, S. Matwin, Machine Learning for the Detection of Oil Spills in Satellite Radar Images, Machine Learning 30 (1998) 195–215. doi:10.1023/A:1007452223027.
[20] L. Abdi, S. Hashemi, To combat multi-class imbalanced problems by means of over-sampling and boosting techniques, Soft Computing 19 (2015) 3369–3385. doi:10.1007/s00500-014-1291-z.
[21] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition 30 (1997) 1145–1159. doi:10.1016/S0031-3203(96)00142-2.
[22] L. Abdi, S. Hashemi, To Combat Multi-Class Imbalanced Problems by Means of Over-Sampling Techniques, IEEE Transactions on Knowledge and Data Engineering 28 (2016) 238–251. doi:10.1109/TKDE.2015.2458858.
[23] T. T. Nguyen, X. C. Pham, A. W.-C. Liew, W. Pedrycz, Aggregation of Classifiers: A Justifiable Information Granularity Approach, IEEE Transactions on Cybernetics 49 (2019) 2168–2177. doi:10.1109/TCYB.2018.2821679.
[24] D. Ndirangu, W. Mwangi, L. Nderu, A Hybrid Ensemble Method for Multiclass Classification and Outlier Detection, International Journal of Sciences: Basic and Applied Research (IJSBAR) 45 (2019) 192–213.
[25] R. Duin, The combining classifier: to train or not to train?, in: Object Recognition Supported by User Interaction for Service Robots, volume 2, 2002, pp. 765–770. doi:10.1109/ICPR.2002.1048415.
[26] T. T. Nguyen, T. T. T. Nguyen, X. C. Pham, A. W.-C. Liew, A novel combining classifier method based on Variational Inference, Pattern Recognition 49 (2016) 198–212. doi:10.1016/j.patcog.2015.06.016.
[27] S. Džeroski, B. Ženko, Is Combining Classifiers with Stacking Better than Selecting the Best One?, Machine Learning 54 (2004) 255–273. doi:10.1023/B:MACH.0000015881.36452.6e.
[28] V. Rodrigues, S. Deusdado, Deterministic Classifiers Accuracy Optimization for Cancer Microarray Data, in: F. Fdez-Riverola, M. Rocha, M. S. Mohamad, N. Zaki, J. A. Castellanos-Garzón (Eds.), Practical Applications of Computational Biology and Bioinformatics, 13th International Conference, volume 1005, Springer International Publishing, Cham, 2020, pp. 154–163. doi:10.1007/978-3-030-23873-5_19.
[29] J. Kittler, M. Hatef, R. Duin, J. Matas, On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 226–239. doi:10.1109/34.667881.
[30] M. Brunelli, Introduction to the Analytic Hierarchy Process, SpringerBriefs in Operations Research, Springer International Publishing, 2015. doi:10.1007/978-3-319-12502-2.
[31] P. McCullagh, J. A. Nelder, Generalized Linear Models, 2nd ed., Chapman and Hall/CRC, Boca Raton, 1989.
[32] N. S. Altman, An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression, The American Statistician 46 (1992) 175–185. doi:10.2307/2685209.
[33] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297. doi:10.1007/BF00994018.
[34] I. W. Tsang, J. T. Kwok, P.-M. Cheung, Core Vector Machines: Fast SVM Training on Very Large Data Sets, Journal of Machine Learning Research 6 (2005) 363–392. URL: http://jmlr.org/papers/v6/tsang05a.html.
[35] P. Domingos, M. Pazzani, On the Optimality of the Simple Bayesian Classifier under Zero-One Loss, Machine Learning 29 (1997) 103–130. doi:10.1023/A:1007413511361.
[36] J. R. Quinlan, C4.5: Programs for Machine Learning, 1st ed., Morgan Kaufmann, San Mateo, Calif., 1992.
[37] I. Triguero, S. González, J. M. Moyano, S. García, J. Alcalá-Fdez, J. Luengo, A. Fernández, M. J. del Jesus, L. Sánchez, F. Herrera, KEEL 3.0: An Open Source Software for Multi-Stage Analysis in Data Mining, International Journal of Computational Intelligence Systems 10 (2017) 1238–1249. doi:10.2991/ijcis.10.1.82.