=Paper=
{{Paper
|id=Vol-2972/paper10
|storemode=property
|title=Ensemble Techniques for Lazy Classification Based on Pattern Structures
|pdfUrl=https://ceur-ws.org/Vol-2972/paper10.pdf
|volume=Vol-2972
|authors=Ilya Semenkov,Sergei O. Kuznetsov
|dblpUrl=https://dblp.org/rec/conf/ijcai/SemenkovK21
}}
==Ensemble Techniques for Lazy Classification Based on Pattern Structures==
Ensemble Techniques for Lazy Classification Based on Pattern Structures

Ilya Semenkov and Sergei O. Kuznetsov
Higher School of Economics – National Research University, Moscow, Myasnitskaya street 20, 101000, Russia

Abstract. This paper presents different versions of classification ensemble methods based on pattern structures. Each of these methods is described and tested on multiple datasets (including datasets with exclusively numerical and exclusively nominal features). Random Forest is used as the baseline model. For some classification tasks the classification algorithms based on pattern structures showed better performance than Random Forest. The quality of the algorithms depends noticeably on the ensemble aggregation function and on the boosting weighting scheme.

Keywords: Formal Concept Analysis (FCA) · pattern structures · boosting algorithms · ensemble algorithms

1 Introduction

Pattern structures were introduced in [1] for the analysis of data with complex structure. In [6] [5] a model of lazy (query-based) classification using pattern structures was proposed. The model shows quite good performance, which, however, is lower than that of ensemble classifiers that employ boosting techniques. The main goal of this paper is to study various ensemble approaches based on pattern structures, boosting, and different aggregation functions. The model is known for good interpretability, so we try to use ensemble techniques to improve its prediction quality.

2 Model description

2.1 Pattern structures

The main idea of pattern structures is the use of an intersection (similarity) operation with the properties of a lower semilattice, defined on object descriptions, which avoids binarization (discretization) prone to creating artifacts [1].
An operation of this kind allows one to define a Galois connection and a closure operator, which can be used to extend standard FCA-based tools of knowledge discovery to non-binary data without scaling (binarizing) them.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

At the first step of the model we transform the features in the following way: consider x ∈ R^n to be an observation; we transform it using the transformation T: (x_1, x_2, ..., x_n) → ((x_1, x_1), (x_2, x_2), ..., (x_n, x_n)). Basically, for each feature i, its value in the observation x gets transformed from a number x_i into a 2-dimensional vector (x_i, x_i) with the same value repeated twice. This is used later in the definition of the similarity operation. After the transformation each observation x has 2 values for each feature i, namely x_{i,1} and x_{i,2}.

In this paper, we define the similarity of two transformed observations x, y in the standard way of interval pattern structures [4] [5] [6]:

x ⊓ y = ((min(x_{1,1}, y_{1,1}), max(x_{1,2}, y_{1,2})), ..., (min(x_{n,1}, y_{n,1}), max(x_{n,2}, y_{n,2}))),

where in x_{i,j} and y_{i,j} the index i indicates a feature and j indicates one of the two values for that feature. In other words, for each feature i we have x_i = (x_{i,1}, x_{i,2}), y_i = (y_{i,1}, y_{i,2}), and after the similarity operation (x ⊓ y)_i = (min(x_{i,1}, y_{i,1}), max(x_{i,2}, y_{i,2})). Below, for simplicity, this operation is called intersection. The subsumption relation is defined as the natural order of the semilattice: x ⊑ y ≡ x ⊓ y = x.

Given a description x, a hypothesis for the class C_i is a description x_h ∈ C_i such that ∄ y ∈ C_j (j ≠ i): (x ⊓ x_h) ⊑ y, where C_i stands for the set of elements from class i. So, if a description x_h does not fit any observation from the classes C_j (j ≠ i), but fits at least one observation of the class C_i, then it is considered to be a hypothesis for the class C_i.
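As a minimal sketch of the operations defined above (the function names and the toy data are ours, not from the paper), the transformation T, the intersection ⊓ and the subsumption test ⊑ can be written as:

```python
# Sketch of interval pattern structure operations.
# A description is a tuple of per-feature intervals (low, high).

def transform(x):
    """T: map a numeric vector (x1, ..., xn) to ((x1, x1), ..., (xn, xn))."""
    return tuple((v, v) for v in x)

def intersect(a, b):
    """Similarity x ⊓ y: per feature, (min of lows, max of highs)."""
    return tuple((min(al, bl), max(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))

def subsumes(a, b):
    """Subsumption a ⊑ b, defined as a ⊓ b == a."""
    return intersect(a, b) == a

# Intersecting two observations yields the smallest interval box covering both:
print(intersect(transform((1.0, 5.0)), transform((3.0, 2.0))))
# ((1.0, 3.0), (2.0, 5.0))
```

Note that the intersection of two descriptions is a wider (more general) description, so `subsumes(a, b)` holds exactly when `a` already covers `b` feature-wise.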
The aggregation function is applied either to the whole set of all intersections between a test observation and every element of the training sample, or to the set of hypotheses (extracted from the training set).

Aggregation functions. There are many reasonable aggregation functions. In our experiments we have used the following ones:

1. avglengths: C = argmin_{C_i} (1 / Σ_{k ∈ A_{C_i}} w_k) · Σ_{k ∈ A_{C_i}} w_k · dist(k), where dist(k) = Σ_{j=1}^{n} (k_{j,2} − k_{j,1}) and w_k is the weight of the k-th observation in the training set;

2. k per class closest avg: C = argmin_{C} (1 / Σ_{k ∈ L_{m,C}} w_k) · Σ_{k ∈ L_{m,C}} w_k · dist(k), where L_{m,C} is the set of m elements from class C which are closest to the prediction observation, and m is a hyperparameter;

3. k closest: C = argmax_{C_i} Σ_{k ∈ L_m} w_k · 1(c(k) = C_i), where L_m is the set of m elements which are closest to the prediction observation regardless of their class, and m is a hyperparameter;

4. count with threshold t: C = argmax_{C_i} Σ_{k ∈ A_{C_i}} 1(dist(k) / w_k < t), where t is a threshold.

Note: in each case A_{C_i} is the set of intersections of the test observation with each observation in a class C_i: A_{C_i} = {x_test ⊓ x : (x ∈ X_train) ∧ (c(x) = C_i)} if hypotheses are not used, and A_{C_i} = {x_test ⊓ x_h : (x_h ∈ X_train) ∧ (c(x_h) = C_i)} if hypotheses are used (specified in the tables with results when they are used); c(x) is a function which returns the class of a training object. In all cases dist(k) = Σ_{j=1}^{n} (k_{j,2} − k_{j,1}), the total interval length of the intersection k.

2.2 SAMME pattern structures

It is not possible to use gradient boosting techniques directly, as there is no loss function which gets directly optimized. Thus, an analog of AdaBoost is used, since we know how to implement weighted datasets in pattern structures. The general scheme, called Stagewise Additive Modeling using a Multi-class Exponential loss function, or SAMME, is presented in [3].
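The aggregation functions of Section 2.1 can be illustrated with a small sketch of "avglengths" (the function names, unit default weights and toy data are our illustration, not the paper's code):

```python
# Sketch of the 'avglengths' aggregation: pick the class whose weighted mean
# intersection length with the test observation is smallest.
# Each training object is (interval description, class label).

def transform(x):
    return tuple((v, v) for v in x)

def intersect(a, b):
    return tuple((min(al, bl), max(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))

def dist(k):
    """Total interval length of a description k."""
    return sum(high - low for low, high in k)

def avglengths(x_test, train, weights=None):
    weights = weights or [1.0] * len(train)
    num, den = {}, {}   # per-class weighted length sums and weight totals
    for (desc, label), w in zip(train, weights):
        num[label] = num.get(label, 0.0) + w * dist(intersect(x_test, desc))
        den[label] = den.get(label, 0.0) + w
    return min(num, key=lambda c: num[c] / den[c])

train = [(transform((1.0,)), "A"), (transform((0.8,)), "A"), (transform((5.0,)), "B")]
print(avglengths(transform((0.9,)), train))  # "A": its intersections are much shorter
```

A short intersection means the test observation is described almost exactly by the objects of that class, which is why the class with the smallest average length wins.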
After adapting it to pattern structures it looks as follows:

Algorithm 1: SAMME for pattern structures, training
Input: (X, y), the training dataset; M, the number of models in the ensemble.
Initialize w_i = 1/n, where n is the number of objects in X.
For m ∈ 1, ..., M:
  initialize model T^(m);
  classify each observation in X with the new model using the weighted dataset;
  calculate the error:
    err_m = (1 / Σ_{i=1}^{n} w_i) · Σ_{i=1}^{n} w_i · 1(y_i ≠ T^(m)(x_i));
  calculate the model weight:
    α^m = log((1 − err_m) / err_m) + log(K − 1),
  where K is the number of classes;
  recalculate the object weights:
    w_i ← w_i · e^{α^m · 1(y_i ≠ T^(m)(x_i))}, i ∈ {1, ..., n}.
Prediction of the ensemble:
  C(x) = argmax_k Σ_{m=1}^{M} α^m · 1(T^(m)(x) = k).

Methods of weighting models. In addition to the original method of calculating α, the following ones were used:

1. Uniform: α_m = 1/M;
2. Linear: α_m = (1 / Σ_{i=1}^{n} w_i) · Σ_{i=1}^{n} w_i · 1(y_i = T^(m)(x_i));
3. Exponential: α_m = e^{Σ_{i=1}^{n} w_i · 1(y_i = T^(m)(x_i))} / Σ_{i=1}^{n} e^{w_i};
4. Logarithmic: α_m = log(1 + (1 − err_m)) = log(2 − err_m);
5. Iternum: α_m = 1/m.

3 Datasets description

3.1 Cardiotocography Data Set

The dataset consists of 2126 preprocessed fetal cardiotocograms (CTGs). It contains 23 numerical attributes with no missing values. It can be used for 3-class prediction as well as for 10-class classification (the classes being more specific). The dataset is available at [7]. The dataset is unbalanced. Before classification, the data was standardized.

3.2 Male Fertility dataset

The dataset [9] consists of 9 features about patient health, injuries, lifestyle and bad habits. All features are nominal. There are 100 observations in the dataset. The final goal is binary classification: normal semen sample or sample with deviations.

3.3 Divorces

The dataset [8] consists of 54 features which are the answers to questions about relationship experience. All features are nominal. There are 170 observations in the dataset. The final goal is binary classification: got divorced or not.
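The training loop of Algorithm 1 above, with the original α weighting, can be sketched as follows. This is a minimal illustration: the `fit_1nn` base learner is a plain 1-nearest-neighbour stand-in used only to exercise the loop, not the paper's pattern-structure classifier, and all names are ours.

```python
# Sketch of SAMME training (Algorithm 1) with a pluggable weighted base learner.
import math

def samme_train(X, y, fit, M, K):
    """fit(X, y, w) must return a predict(x) function trained on weighted data."""
    n = len(X)
    w = [1.0 / n] * n
    models, alphas = [], []
    for _ in range(M):
        T = fit(X, y, w)
        miss = [1.0 if T(x) != t else 0.0 for x, t in zip(X, y)]
        err = sum(wi * mi for wi, mi in zip(w, miss)) / sum(w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)          # numerical guard
        alpha = math.log((1.0 - err) / err) + math.log(K - 1)
        w = [wi * math.exp(alpha * mi) for wi, mi in zip(w, miss)]
        models.append(T)
        alphas.append(alpha)

    def predict(x):
        votes = {}
        for T, a in zip(models, alphas):
            votes[T(x)] = votes.get(T(x), 0.0) + a       # alpha-weighted vote
        return max(votes, key=votes.get)
    return predict

def fit_1nn(X, y, w):   # demo base learner; ignores w for simplicity
    return lambda x: y[min(range(len(X)), key=lambda i: abs(X[i] - x))]

clf = samme_train([0.0, 1.0, 10.0, 11.0], ["a", "a", "b", "b"], fit_1nn, M=3, K=2)
print(clf(0.2), clf(10.4))  # a b
```

Note that α^m is positive only while err_m < (K − 1)/K, i.e. while the base model beats random guessing; the clamp on err_m is just a guard against log(0).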
4 Random Forest

For each dataset Random Forest was chosen as the baseline, since this algorithm is an efficient ensemble of decision trees (which are also good at explanation) [2]. In this algorithm an ensemble of decision trees is built, where each tree is tuned using a random subsample of the training sample (bagging) and a random subsample of the features. Grid search with cross-validation was used to tune it.

5 Results

Accuracy, precision, recall and F1-score were chosen as quality metrics. For datasets with more than 2 target classes, precision, recall and F1-score were calculated for each class separately and averaged afterwards. Metrics are measured on a randomly chosen test set. SAMME was also run with the aggregation function k per class closest avg, since it has shown good performance. Due to space limitations, we present only the results of the most interesting experiments, together with the baseline model Random Forest. In the tables, pattern structures are referred to as FCA, and SAMME pattern structures as SAMME FCA.

Table 1. Cardiotocography Data Set (3 classes)

| Algorithm | Accuracy | Precision | Recall | F1-score | Ensemble size |
|---|---|---|---|---|---|
| SAMME FCA (‘k per class closest avg’, original method) | 0.934 | 0.892 | 0.833 | 0.858 | 5 |
| FCA (‘k per class closest avg’) | 0.934 | 0.892 | 0.833 | 0.858 | 1 |
| Random Forest | 0.925 | 0.867 | 0.839 | 0.852 | 100 |
| SAMME FCA (‘k per class closest avg’, uniform method) | 0.925 | 0.877 | 0.829 | 0.848 | 5 |
| SAMME FCA (‘k per class closest avg’, linear method) | 0.906 | 0.838 | 0.804 | 0.819 | 5 |
| SAMME FCA (‘k per class closest avg’, exponential method) | 0.761 | 0.582 | 0.630 | 0.602 | 5 |
| SAMME FCA (‘k per class closest avg’, logarithmic method) | 0.864 | 0.760 | 0.735 | 0.747 | 5 |
| SAMME FCA (‘k per class closest avg’, iternum method) | 0.897 | 0.828 | 0.784 | 0.803 | 5 |

Table 2.
Cardiotocography Data Set (10 classes)

| Algorithm | Accuracy | Precision | Recall | F1-score | Ensemble size |
|---|---|---|---|---|---|
| SAMME FCA (‘k per class closest avg’, original method) | 0.784 | 0.803 | 0.680 | 0.712 | 5 |
| FCA (‘k per class closest avg’, original method) | 0.784 | 0.803 | 0.680 | 0.712 | 1 |
| Random Forest | 0.793 | 0.720 | 0.723 | 0.704 | 100 |
| SAMME FCA (‘k per class closest avg’, uniform method) | 0.765 | 0.727 | 0.636 | 0.66 | 5 |
| SAMME FCA (‘k per class closest avg’, linear method) | 0.681 | 0.609 | 0.591 | 0.594 | 5 |
| SAMME FCA (‘k per class closest avg’, exponential method) | 0.465 | 0.423 | 0.412 | 0.409 | 5 |
| SAMME FCA (‘k per class closest avg’, logarithmic method) | 0.728 | 0.652 | 0.623 | 0.629 | 5 |
| SAMME FCA (‘k per class closest avg’, iternum method) | 0.643 | 0.602 | 0.575 | 0.582 | 5 |

As can be seen in Table 1 and Table 2, with the original weighting the metrics are identical to those of the single FCA model. This happens because the first model has a significantly bigger α₁ and dominates the others in the ensemble. That is why the other model weightings are tested. However, in both cases the best SAMME FCA and FCA models have a higher average F1-score than the tuned baseline, and in the first case they also win in terms of accuracy. Comparing the SAMME FCA models, it can be seen that the original weighting is better in terms of the presented metrics, even though it effectively uses the first model only. In both SAMME and FCA the ensemble size is smaller than in the Random Forest.

On the Divorces dataset (Table 3), due to its size and simplicity, Random Forest manages to come out as the absolute winner, scoring 100% on every metric. Many SAMME FCA and FCA models show the same relatively high scores. SAMME again uses only the first model with the original weights. However, many other weighting techniques have similar metric values on this dataset. Because of that, even though Random Forest uses 5 models, FCA can use fewer models. On the Fertility dataset (Table 4) the tuned Random Forest wins again. The difference is especially significant in F1-score, precision and recall.
The behaviour of the SAMME FCA metrics is similar to that on the previous dataset: different model weighting methods give similar results. In both SAMME and FCA the ensemble size is smaller than in Random Forest.

Table 3. Divorce Predictors Data Set

| Algorithm | Accuracy | Precision | Recall | F1-score | Ensemble size |
|---|---|---|---|---|---|
| FCA ”avglengths” | 0.953 | 0.958 | 0.952 | 0.953 | 1 |
| FCA ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 1 |
| FCA ”k closest” | 0.953 | 0.958 | 0.952 | 0.953 | 1 |
| FCA ”count with threshold t” | 0.674 | 0.806 | 0.667 | 0.629 | 1 |
| FCA ”avglengths” (with hypotheses) | 0.953 | 0.958 | 0.952 | 0.953 | 1 |
| FCA ”k per class closest avg” (with hypotheses) | 0.953 | 0.958 | 0.952 | 0.953 | 1 |
| FCA ”k closest” (with hypotheses) | 0.953 | 0.958 | 0.952 | 0.953 | 1 |
| FCA ”count with threshold t” (with hypotheses) | 0.512 | 0.256 | 0.500 | 0.338 | 1 |
| SAMME FCA ”avglengths” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA original ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA ”k closest” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA ”count with threshold t” | 0.674 | 0.806 | 0.667 | 0.629 | 5 |
| SAMME FCA ”avglengths” (with hypotheses) | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA ”k per class closest avg” (with hypotheses) | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA ”k closest” (with hypotheses) | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA ”count with threshold t” (with hypotheses) | 0.512 | 0.256 | 0.500 | 0.338 | 5 |
| Random Forest | 1.0 | 1.0 | 1.0 | 1.0 | 5 |
| SAMME FCA uniform ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA linear ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA exponential ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA logarithmic ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA iternum ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |

6 Conclusion

Even though the model itself seems promising, right now it has multiple issues. First, the SAMME boosting technique does not give additional quality.
The original SAMME method of calculating α effectively uses only the first classifier, while the other weighting methods do not give stronger metric values than the original method. The problem seems to be that the classifiers with indices > 1 in the ensemble are simply not good enough. So, there is a reason why the original method sticks to the first classifier instead of using all of them. While the other weightings may use several parts of the ensemble, this does not improve the metrics, because the classifiers built after the first one have bad metrics most of the time. A potential direction for improvement is to make the subsequent classifiers produce better-quality results, possibly by changing the way the dataset gets weighted.

Secondly, we can see that this algorithm performed worse on several datasets. Even though it was better than Random Forest on the complex ones, there is still room for improvement on the simpler ones. This again can be done by improving the quality of the whole ensemble, which Random Forest seems to do efficiently.

SAMME FCA works slower than Random Forest. However, SAMME FCA has much better explainability: it generates only 5 classifiers (and in some cases effectively uses only one of them), while Random Forest consists of 5 trees only on the Divorces dataset and of 100 trees in every other case.

Table 4. Fertility Data Set

| Algorithm | Accuracy | Precision | Recall | F1-score | Ensemble size |
|---|---|---|---|---|---|
| FCA ”avglengths” | 0.680 | 0.600 | 0.826 | 0.561 | 1 |
| FCA ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 1 |
| FCA ”k closest” | 0.920 | 0.460 | 0.500 | 0.479 | 1 |
| FCA ”count with threshold t” | 0.560 | 0.577 | 0.761 | 0.476 | 1 |
| FCA ”avglengths” (with hypotheses) | 0.760 | 0.557 | 0.641 | 0.554 | 1 |
| FCA ”k per class closest avg” (with hypotheses) | 0.080 | 0.272 | 0.272 | 0.080 | 1 |
| FCA ”k closest” (with hypotheses) | 0.920 | 0.460 | 0.500 | 0.479 | 1 |
| FCA ”count with threshold t” (with hypotheses) | 0.520 | 0.571 | 0.739 | 0.449 | 1 |
| SAMME FCA ”avglengths” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA ”k closest” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA ”count with threshold t” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA ”avglengths” (with hypotheses) | 0.760 | 0.557 | 0.641 | 0.554 | 5 |
| SAMME FCA ”k per class closest avg” (with hypotheses) | 0.800 | 0.455 | 0.435 | 0.444 | 5 |
| SAMME FCA ”k closest” (with hypotheses) | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA ”count with threshold t” (with hypotheses) | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| Random Forest | 0.960 | 0.979 | 0.750 | 0.823 | 100 |
| SAMME FCA uniform ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA linear ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA exponential ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA logarithmic ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA iternum ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |

References

1. Bernhard Ganter, Sergei O. Kuznetsov. Pattern Structures and Their Projections. In: Delugach H.S., Stumme G. (eds) Conceptual Structures: Broadening the Base. ICCS 2001. Lecture Notes in Computer Science, Volume 2120 (2001). Springer, Berlin, Heidelberg.
2. Leo Breiman. Random Forests. Machine Learning, Volume 45 (2001), 5–32.
3. Ji Zhu, Hui Zou, Saharon Rosset, Trevor Hastie.
Multi-class AdaBoost. Statistics and Its Interface, Volume 2 (2009), 349–360.
4. Mehdi Kaytoue, Sergei O. Kuznetsov, Amedeo Napoli, Sébastien Duplessis. Mining gene expression data with pattern structures in formal concept analysis. Information Sciences, Volume 181 (2011), Issue 10, 1989–2001.
5. Sergei O. Kuznetsov. Fitting Pattern Structures to Knowledge Discovery in Big Data. In: Cellier P., Distel F., Ganter B. (eds) Formal Concept Analysis. ICFCA 2013. Lecture Notes in Computer Science, Volume 7880 (2013). Springer, Berlin, Heidelberg.
6. Sergei O. Kuznetsov. Scalable Knowledge Discovery in Complex Data with Pattern Structures. In: Maji P., Ghosh A., Murty M.N., Ghosh K., Pal S.K. (eds) Pattern Recognition and Machine Intelligence. PReMI 2013. Lecture Notes in Computer Science, Volume 8251 (2013). Springer, Berlin, Heidelberg.
7. UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Cardiotocography [date of access: 27.08.2020].
8. UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set [date of access: 21.03.2021].
9. UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Fertility [date of access: 21.03.2021].