<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ensemble Techniques for Lazy Classification Based on Pattern Structures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilya Semenkov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergei O. Kuznetsov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Higher School of Economics - National Research University</institution>
          ,
          <addr-line>Moscow, Myasnitskaya street 20, 101000</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents different versions of classification ensemble methods based on pattern structures. Each of these methods is described and tested on multiple datasets (including datasets with exclusively numerical and exclusively nominal features). Random Forest is used as a baseline model. For some classification tasks the classification algorithms based on pattern structures showed better performance than Random Forest. The quality of the algorithms depends noticeably on the ensemble aggregation function and on the boosting weighting scheme.</p>
      </abstract>
      <kwd-group>
        <kwd>Formal Concept Analysis (FCA)</kwd>
        <kwd>pattern structures</kwd>
        <kwd>boosting algorithms</kwd>
        <kwd>ensemble algorithms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Pattern structures were introduced in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for the analysis of data with complex
structure. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        , 
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] a model of lazy (query-based) classification using pattern
structures was proposed. The model shows quite good performance, which,
however, is lower than that of ensemble classifiers that employ boosting techniques.
The main goal of this paper is to study various ensemble approaches based on
pattern structures, boosting, and different aggregation functions. The model is
known for good interpretability, so we try to use ensemble techniques to
improve its prediction quality.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Model description</title>
      <sec id="sec-2-1">
        <title>Pattern structures</title>
        <p>
          The main idea of pattern structures is the use of an intersection (similarity)
operation, with the properties of a lower semilattice, defined on object descriptions
in order to avoid binarization (discretization), which is prone to creating artifacts [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. An
operation of this kind allows one to define a Galois connection and a closure operator,
which can be used to extend standard FCA-based tools of knowledge discovery
to non-binary data without scaling (binarizing) them.
        </p>
        <p>At the first step of the model we transform features in the following way:
consider x ∈ Rn to be an observation; we transform it using the
transformation T : (x1, x2, ..., xn) → ((x1, x1), (x2, x2), ..., (xn, xn)). Basically,
for each feature i, its value in the observation x gets transformed from a number
xi into a 2-dimensional interval (xi, xi) with the same value repeated twice. This
representation is used later in the definition of the similarity operation.</p>
        <p>
          After the transformation each observation x has 2 values for each feature i,
namely xi,1 and xi,2. In this paper, we define the similarity of two transformed
observations x, y in the standard way of interval pattern structures [
          <xref ref-type="bibr" rid="ref4">4</xref>
          , 
          <xref ref-type="bibr" rid="ref5">5</xref>
          , 
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]:
x ⊓ y = ((min(x1,1, y1,1), max(x1,2, y1,2)), ..., (min(xn,1, yn,1), max(xn,2, yn,2))),
where in xi,j and yi,j the index i indicates a feature and j indicates one of the two
values of a feature. In other words, for each feature i we have xi = (xi,1, xi,2), yi =
(yi,1, yi,2), and after the similarity operation (x ⊓ y)i = (min(xi,1, yi,1), max(xi,2, yi,2)).
Below, for simplicity, this operation will be called intersection.
        </p>
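        <p>To make these operations concrete, here is a minimal Python sketch (assuming NumPy
and dense numerical features; the function names are ours, for illustration only, and
are not taken from the original implementation):</p>
        <preformat>
import numpy as np

def transform(x):
    # T: (x1, ..., xn) -> ((x1, x1), ..., (xn, xn)); x is a 1-d array of
    # n numerical features, the result is an (n, 2) array of degenerate intervals
    x = np.asarray(x, dtype=float)
    return np.stack([x, x], axis=1)

def intersect(a, b):
    # similarity x ⊓ y: per feature, (min of left endpoints, max of right endpoints)
    return np.stack([np.minimum(a[:, 0], b[:, 0]),
                     np.maximum(a[:, 1], b[:, 1])], axis=1)

def dist(k):
    # total interval length of a description, used by the aggregation functions below
    return float(np.sum(k[:, 1] - k[:, 0]))
        </preformat>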
        <p>The subsumption relation is defined as the natural order of the semilattice:
x ⊑ y ≡ x ⊓ y = x.</p>
        <p>A hypothesis for a description x is a description xh ∈ Ci : ∄y ∈ Cj (j ≠
i) : (x ⊓ xh) ⊑ y, where Ci stands for the set of elements of class i. So, if a
description xh does not fit any observation from the classes Cj (j ≠ i), but fits at
least one observation of the class Ci, then it is considered to be a hypothesis for
the class Ci.</p>
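        <p>Continuing the sketch above, subsumption and the hypothesis condition can be
checked as follows (a hedged illustration; descriptions and labels stand for the
transformed training set and its class labels):</p>
        <preformat>
def is_subsumed(x, y):
    # x ⊑ y  iff  x ⊓ y = x
    return np.array_equal(intersect(x, y), x)

def is_hypothesis(x, xh, descriptions, labels, target_class):
    # xh (of class target_class) yields a hypothesis for x when the
    # intersection x ⊓ xh is subsumed by no observation of another class
    h = intersect(x, xh)
    for d, c in zip(descriptions, labels):
        if c != target_class and is_subsumed(h, d):
            return False
    return True
        </preformat>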
        <p>The aggregation function is applied either to the whole set of all intersections
between a test observation and every element of the training sample, or to the
set of hypotheses (extracted from the training set).</p>
        <p>Aggregation functions. There are many reasonable aggregation functions. In
our experiments we have used the following ones, where dist(k) = Σj=1..n (kj,2 − kj,1)
is the total interval length of a description k and wk is the weight of the k-th
observation in the training set:
1. avglengths: C = arg minCi (Σk∈ACi dist(k)) / (Σk∈ACi wk);
2. k per class closest avg: C = arg minCi (Σk∈Lm,Ci dist(k)) / (Σk∈Lm,Ci wk), where
Lm,C is the set of m elements from class C which are closest to the test
observation and m is a hyperparameter;
3. k closest: C = arg maxCi Σk∈Lm wk · 1(c(k) = Ci), where Lm is the set of m
elements which are closest to the test observation regardless of their class
and m is a hyperparameter;
4. count with threshold t: C = arg maxCi Σk∈ACi 1(dist(k)/wk &lt; t), where t is a
threshold.</p>
        <p>Note: in each case ACi is the set of intersections of the test observation with
each observation in class Ci, ACi = {xtest ⊓ x : (x ∈ Xtrain) ∧ (c(x) = Ci)} if
hypotheses are not used, and ACi = {xtest ⊓ xh : (xh ∈ Xtrain) ∧ (c(xh) = Ci)}
if hypotheses are used (specified in the tables with results), where c(x) is
the function which returns the class of a training object.</p>
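        <p>As an illustration, a sketch of lazy classification with the avglengths
aggregation function, reusing the helpers above (this is our reading of the formula,
not the authors' code):</p>
        <preformat>
def predict_avglengths(x_test, descriptions, labels, weights):
    # for each class: summed interval length of the intersections of x_test
    # with the class's descriptions, normalized by the total class weight;
    # the class with the smallest score is predicted
    scores = {}
    for c in set(labels):
        total_len, total_w = 0.0, 0.0
        for d, y, w in zip(descriptions, labels, weights):
            if y == c:
                total_len += dist(intersect(x_test, d))
                total_w += w
        scores[c] = total_len / total_w
    return min(scores, key=scores.get)
        </preformat>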
      </sec>
      <sec id="sec-2-2">
        <title>SAMME pattern structures</title>
        <p>It is not possible to directly use gradient boosting techniques, as there is no
loss function that is directly optimized. Thus, an analog of AdaBoost is used,
since we know how to implement weighted datasets in pattern structures.</p>
        <p>
          The general scheme called Stagewise Additive Modeling using a Multi-Class
Exponential loss function or SAMME is presented in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. After adapting it to
pattern structures it looks as follows:
        </p>
        <p>Algorithm 1: SAMME for pattern structures, training.
Input: (X, y), the training dataset; M, the number of models in the ensemble;
initialize wi = 1/n, where n is the number of objects in X;
for m ∈ 1, ..., M do
  initialize model T(m);
  classify each observation in X with the new model, using the weighted dataset;
  calculate the error:
    errm = (Σi=1..n wi · 1(yi ≠ T(m)(xi))) / (Σi=1..n wi);
  calculate the model weight:
    αm = log((1 − errm)/errm) + log(K − 1), where K is the number of classes;
  recalculate the object weights:
    wi ← wi · exp(αm · 1(yi ≠ T(m)(xi))), i ∈ {1, ..., n};
end</p>
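        <p>A compact Python sketch of Algorithm 1 (our illustration; base_fit is a
hypothetical factory that trains one pattern-structure classifier, such as the lazy
classifier above, on a weighted dataset):</p>
        <preformat>
import numpy as np

def samme_train(X, y, base_fit, M, K):
    # returns the trained models and their weights alpha_m
    n = len(X)
    w = np.full(n, 1.0 / n)                        # w_i = 1/n
    models, alphas = [], []
    for m in range(M):
        model = base_fit(X, y, w)                  # initialize T(m) on the weighted dataset
        pred = np.asarray(model.predict(X))        # classify the training observations
        miss = (pred != np.asarray(y)).astype(float)   # 1(y_i != T(m)(x_i))
        err = np.sum(w * miss) / np.sum(w)         # err_m
        alpha = np.log((1.0 - err) / err) + np.log(K - 1.0)
        w = w * np.exp(alpha * miss)               # reweight the objects
        models.append(model)
        alphas.append(alpha)
    return models, alphas
        </preformat>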
      </sec>
      <sec id="sec-2-3">
        <title>Prediction of the ensemble</title>
        <p>C(x) = arg maxk Σm=1..M αm · 1(T(m)(x) = k)</p>
        <p>Methods of weighting models. In addition to the original method of
calculating α, the following schemes were also used:
1. Uniform: αm = 1/M;
2. Linear: αm = (Σi=1..n wi · 1(yi = T(m)(xi))) / (Σi=1..n wi);
3. Exponential: αm = (Σi=1..n exp(wi · 1(yi = T(m)(xi)))) / (Σi=1..n exp(wi));
4. Logarithmic: αm = log(1 + (1 − errm)) = log(2 − errm);
5. Iternum: αm = 1/m.</p>
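        <p>The ensemble prediction and two of the alternative weighting schemes can be
sketched as follows (again an illustration under our naming, continuing the sketch
above):</p>
        <preformat>
def samme_predict(x, models, alphas, classes):
    # C(x) = arg max_k sum_m alpha_m * 1(T(m)(x) = k)
    votes = {k: 0.0 for k in classes}
    for model, a in zip(models, alphas):
        votes[model.predict([x])[0]] += a
    return max(votes, key=votes.get)

def alpha_linear(w, y, pred):
    # linear weighting: weighted training accuracy of the m-th model
    hit = (np.asarray(pred) == np.asarray(y)).astype(float)
    return np.sum(w * hit) / np.sum(w)

def alpha_exponential(w, y, pred):
    # exponential weighting
    hit = (np.asarray(pred) == np.asarray(y)).astype(float)
    return np.sum(np.exp(w * hit)) / np.sum(np.exp(w))
        </preformat>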
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Datasets description</title>
      <sec id="sec-3-1">
        <title>Cardiotocography Data Set</title>
        <p>
          The dataset consists of 2126 preprocessed fetal cardiotocograms (CTGs). It
contains 23 numerical attributes with no missing values. It can be used for 3-class
prediction as well as for 10-class classification (with more specific classes). The
dataset is available at [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The dataset is unbalanced. Before the
classification, the data was standardized.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Male Fertility dataset</title>
        <p>
          The dataset [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] consists of 9 features describing patient health, injuries, lifestyle,
and bad habits. All features are nominal. There are 100 observations in the
dataset. The final goal is binary classification: normal semen sample or sample
with deviations.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Divorces</title>
        <p>
          The dataset [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] consists of 54 features which are the answers to questions about
relationship experience. All features are nominal. There are 170 observations in
the dataset. The final goal is binary classification: got divorced or not.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Random Forest</title>
      <p>
        For each dataset, Random Forest was chosen as a baseline, since this algorithm is
an efficient ensemble of decision trees (which also have good interpretability) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>In this algorithm an ensemble of decision trees is built, where each tree is
trained on a random subsample of the training set (bagging) and a random
subsample of the features.</p>
      <p>Grid search with cross-validation was used to tune it.</p>
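      <p>A minimal sketch of such tuning with scikit-learn (the parameter grid is
illustrative, not the one used in our experiments):</p>
      <preformat>
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [5, 100],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", 0.5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",   # macro-averaged F1, as in the Results section
    cv=5,
)
# search.fit(X_train, y_train); search.best_estimator_ is the tuned baseline
      </preformat>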
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>Accuracy, precision, recall and F1-score were chosen as quality metrics. For
datasets with more than 2 target classes, precision, recall and F1-score were
calculated for each class separately and averaged afterwards.</p>
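      <p>For example, with scikit-learn the macro-averaged variants can be computed as
follows (an illustrative snippet with hypothetical labels):</p>
      <preformat>
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 1, 2, 2, 1, 0]   # hypothetical true labels
y_pred = [0, 1, 2, 1, 1, 0]   # hypothetical predictions

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
      </preformat>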
      <p>Metrics are measured on a randomly chosen test set.</p>
      <p>SAMME was also run with the aggregation function ‘k per class closest avg’,
since it had shown good performance.</p>
      <p>Due to space limitations, we present only results of the most interesting
experiments, together with the baseline model Random Forest.</p>
      <p>In the tables, pattern structures are referred to as FCA and SAMME pattern
structures as SAMME FCA.</p>
      <table-wrap id="table-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Algorithm</th><th>Accuracy</th><th>Precision</th><th>Recall</th><th>F1-score</th><th>Ensemble size</th></tr>
          </thead>
          <tbody>
            <tr><td>SAMME FCA (‘k per class closest avg’, original method)</td><td>0.934</td><td>0.892</td><td>0.833</td><td>0.858</td><td>5</td></tr>
            <tr><td>FCA (‘k per class closest avg’)</td><td>0.934</td><td>0.892</td><td>0.833</td><td>0.858</td><td>1</td></tr>
            <tr><td>Random Forest</td><td>0.925</td><td>0.867</td><td>0.839</td><td>0.852</td><td>100</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, uniform method)</td><td>0.925</td><td>0.877</td><td>0.829</td><td>0.848</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, linear method)</td><td>0.906</td><td>0.838</td><td>0.804</td><td>0.819</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, exponential method)</td><td>0.761</td><td>0.582</td><td>0.630</td><td>0.602</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, logarithmic method)</td><td>0.864</td><td>0.760</td><td>0.735</td><td>0.747</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, iternum method)</td><td>0.897</td><td>0.828</td><td>0.784</td><td>0.803</td><td>5</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="table-2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Algorithm</th><th>Accuracy</th><th>Precision</th><th>Recall</th><th>F1-score</th><th>Ensemble size</th></tr>
          </thead>
          <tbody>
            <tr><td>SAMME FCA (‘k per class closest avg’, original method)</td><td>0.784</td><td>0.803</td><td>0.680</td><td>0.712</td><td>5</td></tr>
            <tr><td>FCA (‘k per class closest avg’, original method)</td><td>0.784</td><td>0.803</td><td>0.680</td><td>0.712</td><td>1</td></tr>
            <tr><td>Random Forest</td><td>0.793</td><td>0.720</td><td>0.723</td><td>0.704</td><td>100</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, uniform method)</td><td>0.765</td><td>0.727</td><td>0.636</td><td>0.660</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, linear method)</td><td>0.681</td><td>0.609</td><td>0.591</td><td>0.594</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, exponential method)</td><td>0.465</td><td>0.423</td><td>0.412</td><td>0.409</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, logarithmic method)</td><td>0.728</td><td>0.652</td><td>0.623</td><td>0.629</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, iternum method)</td><td>0.643</td><td>0.602</td><td>0.575</td><td>0.582</td><td>5</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>As can be seen in Tables 1 and 2, with the original weighting the metrics
are identical to those of the single FCA model. This happens because the first model
has a significantly larger α1 and dominates the others in the ensemble. That is why
other model weightings are tested.</p>
      <p>However, in both cases the best SAMME FCA and FCA models have a higher
average F1-score than the tuned baseline, and in the first case they also win
in terms of accuracy. Comparing SAMME FCA models, it can be seen that the
original weighting is better in terms of the presented metrics even though it
effectively uses only the first model. In both SAMME and FCA the ensemble
size is smaller than in the random forest.</p>
      <p>On the Divorces dataset (Table 3), due to its size and simplicity, Random
Forest manages to come out as the absolute winner, scoring 100% on every metric.
Many SAMME FCA and FCA models show the same relatively high scores.
SAMME again uses only the first model with the original weights. However, many
other weighting techniques have similar metric values on this dataset. Because of
that, even though Random Forest uses 5 models here, FCA can use fewer models.</p>
      <p>On the Fertility dataset (Table 4) the tuned Random Forest wins again.
The difference is especially significant in F1-score, precision, and recall. The
behaviour of the SAMME FCA metrics is similar to the previous dataset: different
model weighting methods give similar results. In both SAMME and FCA the ensemble
size is smaller than in Random Forest.
Even though the model itself seems promising, it currently has multiple issues.</p>
      <p>First, the SAMME boosting technique does not add quality. The original
SAMME method of calculating α effectively uses only the first classifier, while
the other weighting methods do not give stronger metric values than the original
method. The problem seems to be that the classifiers with indices &gt; 1 in the
ensemble are simply not good enough, so there is a reason why the original method
sticks to the first classifier instead of using all of them. While other weightings
may use several members of the ensemble, this does not improve the metrics, because
the classifiers built after the first one have bad metrics most of the time. A
potential improvement is to make subsequent classifiers produce better results,
possibly by changing the way the dataset gets reweighted.</p>
      <p>Second, we can see that this algorithm performed worse on several datasets.
Even though it was better than Random Forest on the complex ones, there is still
room for improvement on the simpler ones. This, again, can be achieved by improving
the quality of the whole ensemble, which Random Forest seems to do
efficiently.</p>
      <p>SAMME FCA is slower than Random Forest. However, SAMME FCA
has much better explainability: it generates only 5 classifiers (and in some cases
effectively uses only one of them), while Random Forest consists of 5 trees only
on the Divorces dataset and of 100 trees in every other case.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Ganter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sergei O.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          .
          <article-title>Pattern Structures and Their Projections</article-title>
          . In: Delugach H.S.,
          <string-name>
            <surname>Stumme</surname>
            <given-names>G</given-names>
          </string-name>
          . (
          <article-title>eds) Conceptual Structures: Broadening the Base</article-title>
          .
          <source>ICCS 2001. Lecture Notes in Computer Science</source>
          , Volume
          <volume>2120</volume>
          (
          <year>2001</year>
          ). Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Leo</given-names>
            <surname>Breiman</surname>
          </string-name>
          .
          <source>Random Forests. Machine Learning</source>
          , Volume
          <volume>45</volume>
          (
          <year>2001</year>
          ),
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Ji</given-names>
            <surname>Zhu</surname>
          </string-name>
          , Hui Zou, Saharon Rosset,
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Hastie</surname>
          </string-name>
          .
          <source>Multi-class AdaBoost. Statistics and Its Interface</source>
          , Volume
          <volume>2</volume>
          (
          <year>2009</year>
          ),
          <fpage>349</fpage>
          -
          <lpage>360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Mehdi</given-names>
            <surname>Kaytoue</surname>
          </string-name>
          ,
          <string-name>
            <surname>Sergei O. Kuznetsov</surname>
          </string-name>
          , Amedeo Napoli, Sébastien Duplessis.
          <article-title>Mining gene expression data with pattern structures in formal concept analysis</article-title>
          .
          <source>Information Sciences</source>
          , Volume
          <volume>181</volume>
          (
          <year>2011</year>
          ), Issue 10,
          <fpage>1989</fpage>
          -
          <lpage>2001</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sergei</surname>
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Kuznetsov</surname>
          </string-name>
          .
          <article-title>Fitting Pattern Structures to Knowledge Discovery in Big Data</article-title>
          . In: Cellier P.,
          <string-name>
            <surname>Distel</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganter</surname>
            <given-names>B</given-names>
          </string-name>
          . (
          <article-title>eds) Formal Concept Analysis</article-title>
          .
          <source>ICFCA 2013. Lecture Notes in Computer Science</source>
          , Volume
          <volume>7880</volume>
          (
          <year>2013</year>
          ). Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sergei</surname>
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Kuznetsov</surname>
          </string-name>
          .
          <article-title>Scalable Knowledge Discovery in Complex Data with Pattern Structures</article-title>
          . In: Maji P.,
          <string-name>
            <surname>Ghosh</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murty</surname>
            <given-names>M.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pal S</surname>
          </string-name>
          .K. (eds)
          <article-title>Pattern Recognition and Machine Intelligence</article-title>
          .
          <source>PReMI 2013. Lecture Notes in Computer Science</source>
          , Volume
          <volume>8251</volume>
          (
          <year>2013</year>
          ). Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Cardiotocography [date of access: 27.08.2020].
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set [date of access: 21.03.2021].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Fertility [date of access: 21.03.2021].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>