<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ensemble Techniques for Lazy Classification Based on Pattern Structures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilya Semenkov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergei O. Kuznetsov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Higher School of Economics - National Research University</institution>
          ,
          <addr-line>Moscow, Myasnitskaya street 20, 101000</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents different versions of classification ensemble methods based on pattern structures. Each of these methods is described and tested on multiple datasets (including datasets with exclusively numerical and exclusively nominal features). Random Forest is used as a baseline model. For some classification tasks the classification algorithms based on pattern structures showed better performance than Random Forest. The quality of the algorithms depends noticeably on the ensemble aggregation function and on the boosting weighting scheme.</p>
      </abstract>
      <kwd-group>
        <kwd>Formal Concept Analysis (FCA)</kwd>
        <kwd>pattern structures</kwd>
        <kwd>boosting algorithms</kwd>
        <kwd>ensemble algorithms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Pattern structures were introduced in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for the analysis of data with complex
structure. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        , 
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] a model of lazy (query-based) classification using pattern
structures was proposed. The model shows quite good performance, which,
however, is lower than that of ensemble classifiers that employ boosting techniques.
The main goal of this paper is to study various ensemble approaches based on
pattern structures, boosting, and different aggregation functions. The model is
known for good interpretability, so we try to use ensemble techniques to
improve its prediction quality.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Model description</title>
      <sec id="sec-2-1">
        <title>Pattern structures</title>
        <p>
          The main idea of pattern structures is the use of an intersection (similarity)
operation, with the properties of a lower semilattice, defined on object descriptions
in order to avoid binarization (discretization), which is prone to creating artifacts [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. An
operation of this kind allows one to define a Galois connection and a closure operator,
which can be used to extend standard FCA-based tools of knowledge discovery
to non-binary data without scaling (binarizing) them.
        </p>
        <p>At the first step of the model we transform features in the following way:
consider x ∈ Rn to be an observation; we transform it using the
transformation T : (x1, x2, ..., xn) → ((x1, x1), (x2, x2), ..., (xn, xn)). Basically,
for each feature i, its value in the observation x gets transformed from a number
xi into a 2-dimensional interval (xi, xi) with the same value repeated twice. This
representation is used later in the definition of the similarity operation.</p>
        <p>
          After the transformation each observation x has 2 values for each feature i,
namely xi,1 and xi,2. In this paper, we define the similarity of two transformed
observations x, y in the standard way of interval pattern structures [
          <xref ref-type="bibr" rid="ref4">4</xref>
          , 
          <xref ref-type="bibr" rid="ref5">5</xref>
          , 
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]:
x ⊓ y = ((min(x1,1, y1,1), max(x1,2, y1,2)), ..., (min(xn,1, yn,1), max(xn,2, yn,2))),
where in xi,j and yi,j the index i indicates a feature and j indicates one of the two
values of a feature. In other words, for each feature i we have xi = (xi,1, xi,2), yi =
(yi,1, yi,2), and after the similarity operation (x ⊓ y)i = (min(xi,1, yi,1), max(xi,2, yi,2)).
Below, for simplicity, this operation will be called intersection.
        </p>
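        <p>To make these operations concrete, here is a minimal Python sketch (assuming NumPy
and dense numerical features; the function names are ours, for illustration only, and
are not taken from the original implementation):</p>
        <preformat>
import numpy as np

def transform(x):
    # T: (x1, ..., xn) -> ((x1, x1), ..., (xn, xn)); x is a 1-d array of
    # n numerical features, the result is an (n, 2) array of degenerate intervals
    x = np.asarray(x, dtype=float)
    return np.stack([x, x], axis=1)

def intersect(a, b):
    # similarity x ⊓ y: per feature, (min of left endpoints, max of right endpoints)
    return np.stack([np.minimum(a[:, 0], b[:, 0]),
                     np.maximum(a[:, 1], b[:, 1])], axis=1)

def dist(k):
    # total interval length of a description, used by the aggregation functions below
    return float(np.sum(k[:, 1] - k[:, 0]))
        </preformat>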
        <p>The subsumption relation is defined as the natural order of the semilattice:
x ⊑ y ≡ x ⊓ y = x.</p>
        <p>A hypothesis for a description x is a description xh ∈ Ci : ∄y ∈ Cj (j ≠
i) : (x ⊓ xh) ⊑ y, where Ci stands for the set of elements of class i. So, if a
description xh does not fit any observation from the classes Cj (j ≠ i), but fits at
least one observation of the class Ci, then it is considered to be a hypothesis for
the class Ci.</p>
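        <p>Continuing the sketch above, subsumption and the hypothesis condition can be
checked as follows (a hedged illustration; descriptions and labels stand for the
transformed training set and its class labels):</p>
        <preformat>
def is_subsumed(x, y):
    # x ⊑ y  iff  x ⊓ y = x
    return np.array_equal(intersect(x, y), x)

def is_hypothesis(x, xh, descriptions, labels, target_class):
    # xh (of class target_class) yields a hypothesis for x when the
    # intersection x ⊓ xh is subsumed by no observation of another class
    h = intersect(x, xh)
    for d, c in zip(descriptions, labels):
        if c != target_class and is_subsumed(h, d):
            return False
    return True
        </preformat>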
        <p>The aggregation function is applied either to the whole set of all intersections
between a test observation and every element of the training sample, or to the
set of hypotheses (extracted from the training set).</p>
        <p>Aggregation functions. There are many reasonable aggregation functions. In
our experiments we have used the following ones, where dist(k) = Σj=1..n (kj,2 − kj,1)
is the total interval length of a description k and wk is the weight of the k-th
observation in the training set:
1. avglengths: C = arg minCi (Σk∈ACi dist(k)) / (Σk∈ACi wk);
2. k per class closest avg: C = arg minCi (Σk∈Lm,Ci dist(k)) / (Σk∈Lm,Ci wk), where
Lm,C is the set of m elements from class C which are closest to the test
observation and m is a hyperparameter;
3. k closest: C = arg maxCi Σk∈Lm wk · 1(c(k) = Ci), where Lm is the set of m
elements which are closest to the test observation regardless of their class
and m is a hyperparameter;
4. count with threshold t: C = arg maxCi Σk∈ACi 1(dist(k)/wk &lt; t), where t is a
threshold.</p>
        <p>Note: in each case ACi is the set of intersections of the test observation with
each observation in class Ci, ACi = {xtest ⊓ x : (x ∈ Xtrain) ∧ (c(x) = Ci)} if
hypotheses are not used, and ACi = {xtest ⊓ xh : (xh ∈ Xtrain) ∧ (c(xh) = Ci)}
if hypotheses are used (specified in the tables with results), where c(x) is
the function which returns the class of a training object.</p>
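        <p>As an illustration, a sketch of lazy classification with the avglengths
aggregation function, reusing the helpers above (this is our reading of the formula,
not the authors' code):</p>
        <preformat>
def predict_avglengths(x_test, descriptions, labels, weights):
    # for each class: summed interval length of the intersections of x_test
    # with the class's descriptions, normalized by the total class weight;
    # the class with the smallest score is predicted
    scores = {}
    for c in set(labels):
        total_len, total_w = 0.0, 0.0
        for d, y, w in zip(descriptions, labels, weights):
            if y == c:
                total_len += dist(intersect(x_test, d))
                total_w += w
        scores[c] = total_len / total_w
    return min(scores, key=scores.get)
        </preformat>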
      </sec>
      <sec id="sec-2-2">
        <title>SAMME pattern structures</title>
        <p>It is not possible to directly use gradient boosting techniques, as there is no
loss function that is directly optimized. Thus, an analog of AdaBoost is used,
since we know how to implement weighted datasets in pattern structures.</p>
        <p>
          The general scheme called Stagewise Additive Modeling using a Multi-Class
Exponential loss function or SAMME is presented in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. After adapting it to
pattern structures it looks as follows:
        </p>
        <p>Algorithm 1: SAMME for pattern structures, training.
Input: (X, y), the training dataset; M, the number of models in the ensemble;
initialize wi = 1/n, where n is the number of objects in X;
for m ∈ 1, ..., M do
  initialize model T(m);
  classify each observation in X with the new model, using the weighted dataset;
  calculate the error:
    errm = (Σi=1..n wi · 1(yi ≠ T(m)(xi))) / (Σi=1..n wi);
  calculate the model weight:
    αm = log((1 − errm)/errm) + log(K − 1), where K is the number of classes;
  recalculate the object weights:
    wi ← wi · exp(αm · 1(yi ≠ T(m)(xi))), i ∈ {1, ..., n};
end</p>
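        <p>A compact Python sketch of Algorithm 1 (our illustration; base_fit is a
hypothetical factory that trains one pattern-structure classifier, such as the lazy
classifier above, on a weighted dataset):</p>
        <preformat>
import numpy as np

def samme_train(X, y, base_fit, M, K):
    # returns the trained models and their weights alpha_m
    n = len(X)
    w = np.full(n, 1.0 / n)                        # w_i = 1/n
    models, alphas = [], []
    for m in range(M):
        model = base_fit(X, y, w)                  # initialize T(m) on the weighted dataset
        pred = np.asarray(model.predict(X))        # classify the training observations
        miss = (pred != np.asarray(y)).astype(float)   # 1(y_i != T(m)(x_i))
        err = np.sum(w * miss) / np.sum(w)         # err_m
        alpha = np.log((1.0 - err) / err) + np.log(K - 1.0)
        w = w * np.exp(alpha * miss)               # reweight the objects
        models.append(model)
        alphas.append(alpha)
    return models, alphas
        </preformat>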
      </sec>
      <sec id="sec-2-3">
        <title>Prediction of the ensemble</title>
        <p>C(x) = arg maxk Σm=1..M αm · 1(T(m)(x) = k)</p>
        <p>Methods of weighting models. In addition to the original method of
calculating α, the following schemes were also used:
1. Uniform: αm = 1/M;
2. Linear: αm = (Σi=1..n wi · 1(yi = T(m)(xi))) / (Σi=1..n wi);
3. Exponential: αm = (Σi=1..n exp(wi · 1(yi = T(m)(xi)))) / (Σi=1..n exp(wi));
4. Logarithmic: αm = log(1 + (1 − errm)) = log(2 − errm);
5. Iternum: αm = 1/m.</p>
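        <p>The ensemble prediction and two of the alternative weighting schemes can be
sketched as follows (again an illustration under our naming, continuing the sketch
above):</p>
        <preformat>
def samme_predict(x, models, alphas, classes):
    # C(x) = arg max_k sum_m alpha_m * 1(T(m)(x) = k)
    votes = {k: 0.0 for k in classes}
    for model, a in zip(models, alphas):
        votes[model.predict([x])[0]] += a
    return max(votes, key=votes.get)

def alpha_linear(w, y, pred):
    # linear weighting: weighted training accuracy of the m-th model
    hit = (np.asarray(pred) == np.asarray(y)).astype(float)
    return np.sum(w * hit) / np.sum(w)

def alpha_exponential(w, y, pred):
    # exponential weighting
    hit = (np.asarray(pred) == np.asarray(y)).astype(float)
    return np.sum(np.exp(w * hit)) / np.sum(np.exp(w))
        </preformat>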
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Datasets description</title>
      <sec id="sec-3-1">
        <title>Cardiotocography Data Set</title>
        <p>
          The dataset consists of 2126 preprocessed fetal cardiotocograms (CTGs). It
contains 23 numerical attributes with no missing values. It can be used for 3-class
prediction as well as for 10-class classification (with more specific classes). The
dataset is available at [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The dataset is unbalanced. Before the
classification, the data was standardized.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Male Fertility dataset</title>
        <p>
          The dataset [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] consists of 9 features describing patient health, injuries, lifestyle,
and bad habits. All features are nominal. There are 100 observations in the
dataset. The final goal is binary classification: normal semen sample or sample
with deviations.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Divorces</title>
        <p>
          The dataset [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] consists of 54 features which are the answers to questions about
relationship experience. All features are nominal. There are 170 observations in
the dataset. The final goal is binary classification: got divorced or not.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Random Forest</title>
      <p>
        For each dataset, Random Forest was chosen as a baseline, since this algorithm is
an efficient ensemble of decision trees (which also have good interpretability) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>In this algorithm an ensemble of decision trees is built, where each tree is
trained on a random subsample of the training set (bagging) and a random
subsample of the features.</p>
      <p>Grid search with cross-validation was used to tune it.</p>
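      <p>A minimal sketch of such tuning with scikit-learn (the parameter grid is
illustrative, not the one used in our experiments):</p>
      <preformat>
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [5, 100],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", 0.5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",   # macro-averaged F1, as in the Results section
    cv=5,
)
# search.fit(X_train, y_train); search.best_estimator_ is the tuned baseline
      </preformat>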
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>Accuracy, precision, recall and F1-score were chosen as quality metrics. For
datasets with more than 2 target classes, precision, recall and F1-score were
calculated for each class separately and averaged afterwards.</p>
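      <p>For example, with scikit-learn the macro-averaged variants can be computed as
follows (an illustrative snippet with hypothetical labels):</p>
      <preformat>
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 1, 2, 2, 1, 0]   # hypothetical true labels
y_pred = [0, 1, 2, 1, 1, 0]   # hypothetical predictions

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
      </preformat>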
      <p>Metrics are measured on a randomly chosen test set.</p>
      <p>SAMME was also run with the aggregation function ‘k per class closest avg’,
since it had shown good performance.</p>
      <p>Due to space limitations, we present only results of the most interesting
experiments, together with the baseline model Random Forest.</p>
      <p>In the tables, pattern structures are referred to as FCA and SAMME pattern
structures as SAMME FCA.</p>
      <table-wrap id="table-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Algorithm</th><th>Accuracy</th><th>Precision</th><th>Recall</th><th>F1-score</th><th>Ensemble size</th></tr>
          </thead>
          <tbody>
            <tr><td>SAMME FCA (‘k per class closest avg’, original method)</td><td>0.934</td><td>0.892</td><td>0.833</td><td>0.858</td><td>5</td></tr>
            <tr><td>FCA (‘k per class closest avg’)</td><td>0.934</td><td>0.892</td><td>0.833</td><td>0.858</td><td>1</td></tr>
            <tr><td>Random Forest</td><td>0.925</td><td>0.867</td><td>0.839</td><td>0.852</td><td>100</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, uniform method)</td><td>0.925</td><td>0.877</td><td>0.829</td><td>0.848</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, linear method)</td><td>0.906</td><td>0.838</td><td>0.804</td><td>0.819</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, exponential method)</td><td>0.761</td><td>0.582</td><td>0.630</td><td>0.602</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, logarithmic method)</td><td>0.864</td><td>0.760</td><td>0.735</td><td>0.747</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, iternum method)</td><td>0.897</td><td>0.828</td><td>0.784</td><td>0.803</td><td>5</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="table-2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Algorithm</th><th>Accuracy</th><th>Precision</th><th>Recall</th><th>F1-score</th><th>Ensemble size</th></tr>
          </thead>
          <tbody>
            <tr><td>SAMME FCA (‘k per class closest avg’, original method)</td><td>0.784</td><td>0.803</td><td>0.680</td><td>0.712</td><td>5</td></tr>
            <tr><td>FCA (‘k per class closest avg’, original method)</td><td>0.784</td><td>0.803</td><td>0.680</td><td>0.712</td><td>1</td></tr>
            <tr><td>Random Forest</td><td>0.793</td><td>0.720</td><td>0.723</td><td>0.704</td><td>100</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, uniform method)</td><td>0.765</td><td>0.727</td><td>0.636</td><td>0.660</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, linear method)</td><td>0.681</td><td>0.609</td><td>0.591</td><td>0.594</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, exponential method)</td><td>0.465</td><td>0.423</td><td>0.412</td><td>0.409</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, logarithmic method)</td><td>0.728</td><td>0.652</td><td>0.623</td><td>0.629</td><td>5</td></tr>
            <tr><td>SAMME FCA (‘k per class closest avg’, iternum method)</td><td>0.643</td><td>0.602</td><td>0.575</td><td>0.582</td><td>5</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>As can be seen in Tables 1 and 2, with the original weighting the metrics
are identical to those of the single FCA model. This happens because the first model
has a significantly larger α1 and dominates the others in the ensemble. That is why
other model weightings are tested.</p>
      <p>However, in both cases the best SAMME FCA and FCA models have a higher
average F1-score than the tuned baseline, and in the first case they also win
in terms of accuracy. Comparing SAMME FCA models, it can be seen that the
original weighting is better in terms of the presented metrics even though it
effectively uses only the first model. In both SAMME and FCA the ensemble
size is smaller than in the random forest.</p>
      <p>On the Divorces dataset (Table 3), due to its size and simplicity, Random
Forest manages to come out as the absolute winner, scoring 100% on every metric.
Many SAMME FCA and FCA models show the same relatively high scores.
SAMME again uses only the first model with the original weights. However, many
other weighting techniques have similar metric values on this dataset. Because of
that, even though Random Forest uses 5 models here, FCA can use fewer models.</p>
      <p>On the Fertility dataset (Table 4) the tuned Random Forest wins again.
The difference is especially significant in F1-score, precision, and recall. The
behaviour of the SAMME FCA metrics is similar to the previous dataset: different
model weighting methods give similar results. In both SAMME and FCA the ensemble
size is smaller than in Random Forest.
Even though the model itself seems promising, it currently has multiple issues.</p>
      <p>First, the SAMME boosting technique does not add quality. The original
SAMME method of calculating α effectively uses only the first classifier, while
the other weighting methods do not give stronger metric values than the original
method. The problem seems to be that the classifiers with indices &gt; 1 in the
ensemble are simply not good enough, so there is a reason why the original method
sticks to the first classifier instead of using all of them. While other weightings
may use several members of the ensemble, this does not improve the metrics, because
the classifiers built after the first one have bad metrics most of the time. A
potential improvement is to make subsequent classifiers produce better results,
possibly by changing the way the dataset gets reweighted.</p>
      <p>Second, we can see that this algorithm performed worse on several datasets.
Even though it was better than Random Forest on the complex ones, there is still
room for improvement on the simpler ones. This, again, can be achieved by improving
the quality of the whole ensemble, which Random Forest seems to do
efficiently.</p>
      <p>SAMME FCA is slower than Random Forest. However, SAMME FCA
has much better explainability: it generates only 5 classifiers (and in some cases
effectively uses only one of them), while Random Forest consists of 5 trees only
on the Divorces dataset and of 100 trees in every other case.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Ganter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sergei O.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          .
          <article-title>Pattern Structures and Their Projections</article-title>
          . In: Delugach H.S.,
          <string-name>
            <surname>Stumme</surname>
            <given-names>G</given-names>
          </string-name>
          . (
          <article-title>eds) Conceptual Structures: Broadening the Base</article-title>
          .
          <source>ICCS 2001. Lecture Notes in Computer Science</source>
          , Volume
          <volume>2120</volume>
          (
          <year>2001</year>
          ). Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Leo</given-names>
            <surname>Breiman</surname>
          </string-name>
          .
          <source>Random Forests. Machine Learning</source>
          , Volume
          <volume>45</volume>
          (
          <year>2001</year>
          ),
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Ji</given-names>
            <surname>Zhu</surname>
          </string-name>
          , Hui Zou, Saharon Rosset,
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Hastie</surname>
          </string-name>
          .
          <source>Multi-class AdaBoost. Statistics and Its Interface</source>
          , Volume
          <volume>2</volume>
          (
          <year>2009</year>
          ),
          <fpage>349</fpage>
          -
          <lpage>360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Mehdi</given-names>
            <surname>Kaytoue</surname>
          </string-name>
          ,
          <string-name>
            <surname>Sergei O. Kuznetsov</surname>
          </string-name>
          , Amedeo Napoli, Sébastien Duplessis.
          <article-title>Mining gene expression data with pattern structures in formal concept analysis</article-title>
          .
          <source>Information Sciences</source>
          , Volume
          <volume>181</volume>
          (
          <year>2011</year>
          ), Issue 10,
          <fpage>1989</fpage>
          -
          <lpage>2001</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sergei</surname>
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Kuznetsov</surname>
          </string-name>
          .
          <article-title>Fitting Pattern Structures to Knowledge Discovery in Big Data</article-title>
          . In: Cellier P.,
          <string-name>
            <surname>Distel</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganter</surname>
            <given-names>B</given-names>
          </string-name>
          . (
          <article-title>eds) Formal Concept Analysis</article-title>
          .
          <source>ICFCA 2013. Lecture Notes in Computer Science</source>
          , Volume
          <volume>7880</volume>
          (
          <year>2013</year>
          ). Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sergei</surname>
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Kuznetsov</surname>
          </string-name>
          .
          <article-title>Scalable Knowledge Discovery in Complex Data with Pattern Structures</article-title>
          . In: Maji P.,
          <string-name>
            <surname>Ghosh</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murty</surname>
            <given-names>M.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pal S</surname>
          </string-name>
          .K. (eds)
          <article-title>Pattern Recognition and Machine Intelligence</article-title>
          .
          <source>PReMI 2013. Lecture Notes in Computer Science</source>
          , Volume
          <volume>8251</volume>
          (
          <year>2013</year>
          ). Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Cardiotocography [date of access: 27.08.2020].
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set [date of access: 21.03.2021].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Fertility [date of access: 21.03.2021].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>