=Paper=
{{Paper
|id=Vol-2972/paper10
|storemode=property
|title=Ensemble Techniques for Lazy Classification Based on Pattern Structures
|pdfUrl=https://ceur-ws.org/Vol-2972/paper10.pdf
|volume=Vol-2972
|authors=Ilya Semenkov,Sergei O. Kuznetsov
|dblpUrl=https://dblp.org/rec/conf/ijcai/SemenkovK21
}}
==Ensemble Techniques for Lazy Classification Based on Pattern Structures==
Ensemble Techniques for Lazy Classification Based on Pattern Structures

Ilya Semenkov and Sergei O. Kuznetsov
Higher School of Economics – National Research University, Moscow, Myasnitskaya street 20, 101000, Russia

Abstract. This paper presents different versions of classification ensemble methods based on pattern structures. Each of these methods is described and tested on multiple datasets (including datasets with exclusively numerical and exclusively nominal features). Random Forest is used as the baseline model. For some classification tasks the classification algorithms based on pattern structures showed better performance than Random Forest. The quality of the algorithms depends noticeably on the ensemble aggregation function and on the boosting weighting scheme.

Keywords: Formal Concept Analysis (FCA) · pattern structures · boosting algorithms · ensemble algorithms

1 Introduction

Pattern structures were introduced in [1] for the analysis of data with complex structure. In [6] [5] a model of lazy (query-based) classification using pattern structures was proposed. The model shows quite good performance, which, however, is lower than that of ensemble classifiers that employ boosting techniques. The main goal of this paper is to study various ensemble approaches based on pattern structures, boosting, and different aggregation functions. The model is known for good interpretability, so we try to use ensemble techniques to improve its prediction quality.

2 Model description

2.1 Pattern structures

The main idea of pattern structures is the use of an intersection (similarity) operation with the properties of a lower semilattice, defined on object descriptions, which avoids binarization (discretization) prone to creating artifacts [1].
An operation of this kind allows one to define a Galois connection and a closure operator, which can be used to extend standard FCA-based tools of knowledge discovery to non-binary data without scaling (binarizing) them.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

At the first step of the model we transform the features in the following way: consider x ∈ R^n to be an observation; we transform it using the transformation T: (x_1, x_2, ..., x_n) → ((x_1, x_1), (x_2, x_2), ..., (x_n, x_n)). Basically, for each feature i, its value in the observation x gets transformed from a number x_i into a 2-dimensional vector (x_i, x_i) with the same value repeated twice. This is used later in the definition of the similarity operation. After the transformation each observation x has 2 values for each feature i, namely x_{i,1} and x_{i,2}.

In this paper, we define the similarity of two transformed observations x, y in the standard way of interval pattern structures [4] [5] [6]:

x ⊓ y = ((min(x_{1,1}, y_{1,1}), max(x_{1,2}, y_{1,2})), ..., (min(x_{n,1}, y_{n,1}), max(x_{n,2}, y_{n,2}))),

where in x_{i,j} and y_{i,j} the index i indicates a feature and j indicates one of the two values for that feature. In other words, for each feature i we have x_i = (x_{i,1}, x_{i,2}), y_i = (y_{i,1}, y_{i,2}), and after the similarity operation (x ⊓ y)_i = (min(x_{i,1}, y_{i,1}), max(x_{i,2}, y_{i,2})). Below, for simplicity, this operation is called intersection. The subsumption relation is defined as the natural order of the semilattice: x ⊑ y ≡ x ⊓ y = x.

Given a description x, a hypothesis for the class C_i is a description x_h ∈ C_i such that ∄ y ∈ C_j (j ≠ i): (x ⊓ x_h) ⊑ y, where C_i stands for the set of elements from class i. So, if a description x_h does not fit any observation from the classes C_j (j ≠ i), but fits at least one observation of the class C_i, then it is considered to be a hypothesis for the class C_i.
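As a minimal sketch of the operations defined above (the function names and the toy data are ours, not from the paper), the transformation T, the intersection ⊓ and the subsumption test ⊑ can be written as:

```python
# Sketch of interval pattern structure operations.
# A description is a tuple of per-feature intervals (low, high).

def transform(x):
    """T: map a numeric vector (x1, ..., xn) to ((x1, x1), ..., (xn, xn))."""
    return tuple((v, v) for v in x)

def intersect(a, b):
    """Similarity x ⊓ y: per feature, (min of lows, max of highs)."""
    return tuple((min(al, bl), max(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))

def subsumes(a, b):
    """Subsumption a ⊑ b, defined as a ⊓ b == a."""
    return intersect(a, b) == a

# Intersecting two observations yields the smallest interval box covering both:
print(intersect(transform((1.0, 5.0)), transform((3.0, 2.0))))
# ((1.0, 3.0), (2.0, 5.0))
```

Note that the intersection of two descriptions is a wider (more general) description, so `subsumes(a, b)` holds exactly when `a` already covers `b` feature-wise.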
The aggregation function is applied either to the whole set of all intersections between a test observation and every element of the training sample, or to the set of hypotheses (extracted from the training set).

Aggregation functions. There are many reasonable aggregation functions. In our experiments we have used the following ones:

1. avglengths: C = argmin_{C_i} (1 / Σ_{k ∈ A_{C_i}} w_k) · Σ_{k ∈ A_{C_i}} w_k · dist(k), where dist(k) = Σ_{j=1}^{n} (k_{j,2} − k_{j,1}) and w_k is the weight of the k-th observation in the training set;

2. k per class closest avg: C = argmin_{C} (1 / Σ_{k ∈ L_{m,C}} w_k) · Σ_{k ∈ L_{m,C}} w_k · dist(k), where L_{m,C} is the set of m elements from class C which are closest to the prediction observation, and m is a hyperparameter;

3. k closest: C = argmax_{C_i} Σ_{k ∈ L_m} w_k · 1(c(k) = C_i), where L_m is the set of m elements which are closest to the prediction observation regardless of their class, and m is a hyperparameter;

4. count with threshold t: C = argmax_{C_i} Σ_{k ∈ A_{C_i}} 1(dist(k) / w_k < t), where t is a threshold.

Note: in each case A_{C_i} is the set of intersections of the test observation with each observation in a class C_i: A_{C_i} = {x_test ⊓ x : (x ∈ X_train) ∧ (c(x) = C_i)} if hypotheses are not used, and A_{C_i} = {x_test ⊓ x_h : (x_h ∈ X_train) ∧ (c(x_h) = C_i)} if hypotheses are used (specified in the tables with results when they are used); c(x) is a function which returns the class of a training object. In all cases dist(k) = Σ_{j=1}^{n} (k_{j,2} − k_{j,1}), the total interval length of the intersection k.

2.2 SAMME pattern structures

It is not possible to use gradient boosting techniques directly, as there is no loss function which gets directly optimized. Thus, an analog of AdaBoost is used, since we know how to implement weighted datasets in pattern structures. The general scheme, called Stagewise Additive Modeling using a Multi-class Exponential loss function, or SAMME, is presented in [3].
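The aggregation functions of Section 2.1 can be illustrated with a small sketch of "avglengths" (the function names, unit default weights and toy data are our illustration, not the paper's code):

```python
# Sketch of the 'avglengths' aggregation: pick the class whose weighted mean
# intersection length with the test observation is smallest.
# Each training object is (interval description, class label).

def transform(x):
    return tuple((v, v) for v in x)

def intersect(a, b):
    return tuple((min(al, bl), max(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))

def dist(k):
    """Total interval length of a description k."""
    return sum(high - low for low, high in k)

def avglengths(x_test, train, weights=None):
    weights = weights or [1.0] * len(train)
    num, den = {}, {}   # per-class weighted length sums and weight totals
    for (desc, label), w in zip(train, weights):
        num[label] = num.get(label, 0.0) + w * dist(intersect(x_test, desc))
        den[label] = den.get(label, 0.0) + w
    return min(num, key=lambda c: num[c] / den[c])

train = [(transform((1.0,)), "A"), (transform((0.8,)), "A"), (transform((5.0,)), "B")]
print(avglengths(transform((0.9,)), train))  # "A": its intersections are much shorter
```

A short intersection means the test observation is described almost exactly by the objects of that class, which is why the class with the smallest average length wins.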
After adapting it to pattern structures it looks as follows:

Algorithm 1: SAMME for pattern structures, training
Input: (X, y), the training dataset; M, the number of models in the ensemble.
Initialize w_i = 1/n, where n is the number of objects in X.
For m ∈ 1, ..., M:
  initialize model T^(m);
  classify each observation in X with the new model using the weighted dataset;
  calculate the error:
    err_m = (1 / Σ_{i=1}^{n} w_i) · Σ_{i=1}^{n} w_i · 1(y_i ≠ T^(m)(x_i));
  calculate the model weight:
    α^m = log((1 − err_m) / err_m) + log(K − 1),
  where K is the number of classes;
  recalculate the object weights:
    w_i ← w_i · e^{α^m · 1(y_i ≠ T^(m)(x_i))}, i ∈ {1, ..., n}.
Prediction of the ensemble:
  C(x) = argmax_k Σ_{m=1}^{M} α^m · 1(T^(m)(x) = k).

Methods of weighting models. In addition to the original method of calculating α, the following ones were used:

1. Uniform: α_m = 1/M;
2. Linear: α_m = (1 / Σ_{i=1}^{n} w_i) · Σ_{i=1}^{n} w_i · 1(y_i = T^(m)(x_i));
3. Exponential: α_m = e^{Σ_{i=1}^{n} w_i · 1(y_i = T^(m)(x_i))} / Σ_{i=1}^{n} e^{w_i};
4. Logarithmic: α_m = log(1 + (1 − err_m)) = log(2 − err_m);
5. Iternum: α_m = 1/m.

3 Datasets description

3.1 Cardiotocography Data Set

The dataset consists of 2126 preprocessed fetal cardiotocograms (CTGs). It contains 23 numerical attributes with no missing values. It can be used for 3-class prediction as well as for 10-class classification (the classes being more specific). The dataset is available at [7]. The dataset is unbalanced. Before classification, the data was standardized.

3.2 Male Fertility dataset

The dataset [9] consists of 9 features about patient health, injuries, lifestyle and bad habits. All features are nominal. There are 100 observations in the dataset. The final goal is binary classification: normal semen sample or sample with deviations.

3.3 Divorces

The dataset [8] consists of 54 features which are the answers to questions about relationship experience. All features are nominal. There are 170 observations in the dataset. The final goal is binary classification: got divorced or not.
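The training loop of Algorithm 1 above, with the original α weighting, can be sketched as follows. This is a minimal illustration: the `fit_1nn` base learner is a plain 1-nearest-neighbour stand-in used only to exercise the loop, not the paper's pattern-structure classifier, and all names are ours.

```python
# Sketch of SAMME training (Algorithm 1) with a pluggable weighted base learner.
import math

def samme_train(X, y, fit, M, K):
    """fit(X, y, w) must return a predict(x) function trained on weighted data."""
    n = len(X)
    w = [1.0 / n] * n
    models, alphas = [], []
    for _ in range(M):
        T = fit(X, y, w)
        miss = [1.0 if T(x) != t else 0.0 for x, t in zip(X, y)]
        err = sum(wi * mi for wi, mi in zip(w, miss)) / sum(w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)          # numerical guard
        alpha = math.log((1.0 - err) / err) + math.log(K - 1)
        w = [wi * math.exp(alpha * mi) for wi, mi in zip(w, miss)]
        models.append(T)
        alphas.append(alpha)

    def predict(x):
        votes = {}
        for T, a in zip(models, alphas):
            votes[T(x)] = votes.get(T(x), 0.0) + a       # alpha-weighted vote
        return max(votes, key=votes.get)
    return predict

def fit_1nn(X, y, w):   # demo base learner; ignores w for simplicity
    return lambda x: y[min(range(len(X)), key=lambda i: abs(X[i] - x))]

clf = samme_train([0.0, 1.0, 10.0, 11.0], ["a", "a", "b", "b"], fit_1nn, M=3, K=2)
print(clf(0.2), clf(10.4))  # a b
```

Note that α^m is positive only while err_m < (K − 1)/K, i.e. while the base model beats random guessing; the clamp on err_m is just a guard against log(0).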
4 Random Forest

For each dataset Random Forest was chosen as the baseline, since this algorithm is an efficient ensemble of decision trees (which are also good at explanation) [2]. In this algorithm an ensemble of decision trees is built, where each tree is tuned using a random subsample of the training sample (bagging) and a random subsample of the features. Grid search with cross-validation was used to tune it.

5 Results

Accuracy, precision, recall and F1-score were chosen as quality metrics. For datasets with more than 2 target classes, precision, recall and F1-score were calculated for each class separately and averaged afterwards. Metrics are measured on a randomly chosen test set. SAMME was also run with the aggregation function k per class closest avg, since it has shown good performance. Due to space limitations, we present only the results of the most interesting experiments, together with the baseline model Random Forest. In the tables, pattern structures are referred to as FCA, and SAMME pattern structures as SAMME FCA.

Table 1. Cardiotocography Data Set (3 classes)

| Algorithm | Accuracy | Precision | Recall | F1-score | Ensemble size |
|---|---|---|---|---|---|
| SAMME FCA (‘k per class closest avg’, original method) | 0.934 | 0.892 | 0.833 | 0.858 | 5 |
| FCA (‘k per class closest avg’) | 0.934 | 0.892 | 0.833 | 0.858 | 1 |
| Random Forest | 0.925 | 0.867 | 0.839 | 0.852 | 100 |
| SAMME FCA (‘k per class closest avg’, uniform method) | 0.925 | 0.877 | 0.829 | 0.848 | 5 |
| SAMME FCA (‘k per class closest avg’, linear method) | 0.906 | 0.838 | 0.804 | 0.819 | 5 |
| SAMME FCA (‘k per class closest avg’, exponential method) | 0.761 | 0.582 | 0.630 | 0.602 | 5 |
| SAMME FCA (‘k per class closest avg’, logarithmic method) | 0.864 | 0.760 | 0.735 | 0.747 | 5 |
| SAMME FCA (‘k per class closest avg’, iternum method) | 0.897 | 0.828 | 0.784 | 0.803 | 5 |

Table 2.
Cardiotocography Data Set (10 classes)

| Algorithm | Accuracy | Precision | Recall | F1-score | Ensemble size |
|---|---|---|---|---|---|
| SAMME FCA (‘k per class closest avg’, original method) | 0.784 | 0.803 | 0.680 | 0.712 | 5 |
| FCA (‘k per class closest avg’, original method) | 0.784 | 0.803 | 0.680 | 0.712 | 1 |
| Random Forest | 0.793 | 0.720 | 0.723 | 0.704 | 100 |
| SAMME FCA (‘k per class closest avg’, uniform method) | 0.765 | 0.727 | 0.636 | 0.66 | 5 |
| SAMME FCA (‘k per class closest avg’, linear method) | 0.681 | 0.609 | 0.591 | 0.594 | 5 |
| SAMME FCA (‘k per class closest avg’, exponential method) | 0.465 | 0.423 | 0.412 | 0.409 | 5 |
| SAMME FCA (‘k per class closest avg’, logarithmic method) | 0.728 | 0.652 | 0.623 | 0.629 | 5 |
| SAMME FCA (‘k per class closest avg’, iternum method) | 0.643 | 0.602 | 0.575 | 0.582 | 5 |

As can be seen in Table 1 and Table 2, with the original weighting the metrics are identical to those of the single FCA model. This happens because the first model has a significantly bigger α₁ and dominates the others in the ensemble. That is why the other model weightings are tested. However, in both cases the best SAMME FCA and FCA models have a higher average F1-score than the tuned baseline, and in the first case they also win in terms of accuracy. Comparing the SAMME FCA models, it can be seen that the original weighting is better in terms of the presented metrics, even though it effectively uses the first model only. In both SAMME and FCA the ensemble size is smaller than in the Random Forest.

On the Divorces dataset (Table 3), due to its size and simplicity, Random Forest manages to come out as the absolute winner, scoring 100% on every metric. Many SAMME FCA and FCA models show the same relatively high scores. SAMME again uses only the first model with the original weights. However, many other weighting techniques have similar metric values on this dataset. Because of that, even though Random Forest uses 5 models, FCA can use fewer models. On the Fertility dataset (Table 4) the tuned Random Forest wins again. The difference is especially significant in F1-score, precision and recall.
The behaviour of the SAMME FCA metrics is similar to that on the previous dataset: different model weighting methods give similar results. In both SAMME and FCA the ensemble size is smaller than in Random Forest.

Table 3. Divorce Predictors Data Set

| Algorithm | Accuracy | Precision | Recall | F1-score | Ensemble size |
|---|---|---|---|---|---|
| FCA ”avglengths” | 0.953 | 0.958 | 0.952 | 0.953 | 1 |
| FCA ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 1 |
| FCA ”k closest” | 0.953 | 0.958 | 0.952 | 0.953 | 1 |
| FCA ”count with threshold t” | 0.674 | 0.806 | 0.667 | 0.629 | 1 |
| FCA ”avglengths” (with hypotheses) | 0.953 | 0.958 | 0.952 | 0.953 | 1 |
| FCA ”k per class closest avg” (with hypotheses) | 0.953 | 0.958 | 0.952 | 0.953 | 1 |
| FCA ”k closest” (with hypotheses) | 0.953 | 0.958 | 0.952 | 0.953 | 1 |
| FCA ”count with threshold t” (with hypotheses) | 0.512 | 0.256 | 0.500 | 0.338 | 1 |
| SAMME FCA ”avglengths” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA original ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA ”k closest” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA ”count with threshold t” | 0.674 | 0.806 | 0.667 | 0.629 | 5 |
| SAMME FCA ”avglengths” (with hypotheses) | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA ”k per class closest avg” (with hypotheses) | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA ”k closest” (with hypotheses) | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA ”count with threshold t” (with hypotheses) | 0.512 | 0.256 | 0.500 | 0.338 | 5 |
| Random Forest | 1.0 | 1.0 | 1.0 | 1.0 | 5 |
| SAMME FCA uniform ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA linear ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA exponential ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA logarithmic ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |
| SAMME FCA iternum ”k per class closest avg” | 0.953 | 0.958 | 0.952 | 0.953 | 5 |

6 Conclusion

Even though the model itself seems promising, right now it has multiple issues. First, the SAMME boosting technique does not give additional quality.
The original SAMME method of calculating α effectively uses only the first classifier, while the other weighting methods do not give stronger metric values than the original method. The problem seems to be that the classifiers with indices > 1 in the ensemble are simply not good enough. So, there is a reason why the original method sticks to the first classifier instead of using all of them. While the other weightings may use several parts of the ensemble, this does not improve the metrics, because the classifiers built after the first one have bad metrics most of the time. A potential direction for improvement is to make the subsequent classifiers produce better-quality results, possibly by changing the way the dataset gets weighted.

Secondly, we can see that this algorithm performed worse on several datasets. Even though it was better than Random Forest on the complex ones, there is still room for improvement on the simpler ones. This again can be done by improving the quality of the whole ensemble, which Random Forest seems to do efficiently.

SAMME FCA works slower than Random Forest. However, SAMME FCA has much better explainability: it generates only 5 classifiers (and in some cases effectively uses only one of them), while Random Forest consists of 5 trees only on the Divorces dataset and of 100 trees in every other case.

Table 4. Fertility Data Set

| Algorithm | Accuracy | Precision | Recall | F1-score | Ensemble size |
|---|---|---|---|---|---|
| FCA ”avglengths” | 0.680 | 0.600 | 0.826 | 0.561 | 1 |
| FCA ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 1 |
| FCA ”k closest” | 0.920 | 0.460 | 0.500 | 0.479 | 1 |
| FCA ”count with threshold t” | 0.560 | 0.577 | 0.761 | 0.476 | 1 |
| FCA ”avglengths” (with hypotheses) | 0.760 | 0.557 | 0.641 | 0.554 | 1 |
| FCA ”k per class closest avg” (with hypotheses) | 0.080 | 0.272 | 0.272 | 0.080 | 1 |
| FCA ”k closest” (with hypotheses) | 0.920 | 0.460 | 0.500 | 0.479 | 1 |
| FCA ”count with threshold t” (with hypotheses) | 0.520 | 0.571 | 0.739 | 0.449 | 1 |
| SAMME FCA ”avglengths” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA ”k closest” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA ”count with threshold t” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA ”avglengths” (with hypotheses) | 0.760 | 0.557 | 0.641 | 0.554 | 5 |
| SAMME FCA ”k per class closest avg” (with hypotheses) | 0.800 | 0.455 | 0.435 | 0.444 | 5 |
| SAMME FCA ”k closest” (with hypotheses) | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA ”count with threshold t” (with hypotheses) | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| Random Forest | 0.960 | 0.979 | 0.750 | 0.823 | 100 |
| SAMME FCA uniform ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA linear ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA exponential ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA logarithmic ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |
| SAMME FCA iternum ”k per class closest avg” | 0.920 | 0.460 | 0.500 | 0.479 | 5 |

References

1. Bernhard Ganter, Sergei O. Kuznetsov. Pattern Structures and Their Projections. In: Delugach H.S., Stumme G. (eds) Conceptual Structures: Broadening the Base. ICCS 2001. Lecture Notes in Computer Science, Volume 2120 (2001). Springer, Berlin, Heidelberg.
2. Leo Breiman. Random Forests. Machine Learning, Volume 45 (2001), 5–32.
3. Ji Zhu, Hui Zou, Saharon Rosset, Trevor Hastie.
Multi-class AdaBoost. Statistics and Its Interface, Volume 2 (2009), 349–360.
4. Mehdi Kaytoue, Sergei O. Kuznetsov, Amedeo Napoli, Sébastien Duplessis. Mining gene expression data with pattern structures in formal concept analysis. Information Sciences, Volume 181 (2011), Issue 10, 1989–2001.
5. Sergei O. Kuznetsov. Fitting Pattern Structures to Knowledge Discovery in Big Data. In: Cellier P., Distel F., Ganter B. (eds) Formal Concept Analysis. ICFCA 2013. Lecture Notes in Computer Science, Volume 7880 (2013). Springer, Berlin, Heidelberg.
6. Sergei O. Kuznetsov. Scalable Knowledge Discovery in Complex Data with Pattern Structures. In: Maji P., Ghosh A., Murty M.N., Ghosh K., Pal S.K. (eds) Pattern Recognition and Machine Intelligence. PReMI 2013. Lecture Notes in Computer Science, Volume 8251 (2013). Springer, Berlin, Heidelberg.
7. UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Cardiotocography [date of access: 27.08.2020].
8. UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set [date of access: 21.03.2021].
9. UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Fertility [date of access: 21.03.2021].