A First Study on What MDL Can Do for FCA

        Tatiana Makhalova1,2 , Sergei O. Kuznetsov1 , and Amedeo Napoli2
              1
                National Research University Higher School of Economics,
                        3 Kochnovsky Proezd, Moscow, Russia
                 2
                    LORIA, (CNRS – Inria – U. of Lorraine), BP 239
                            Vandœuvre-lès-Nancy, France
          tpmakhalova@hse.ru, skuznetsov@hse.ru, amedeo.napoil@loria.fr


         Abstract. Formal Concept Analysis can be considered as a classifica-
         tion engine able to build classes of objects with a description or concepts
         and to organize these concepts within a concept lattice. The concept lat-
         tice can be navigated for selecting significant concepts. Then the problem
         of finding significant concepts among the potential exponential number
         of concepts arises. Some measures exist that can be used for focusing on
         interesting concepts such as support, stability, and other. MDL (mini-
         mum description length) is also a good candidate that was rarely used
         in FCA by now for such objective. In this paper we argue that MDL
         can give FCA practitioners a good measure for selecting significant and
         representative concepts.


  1    Introduction

  Formal concept analysis (FCA) plays an important role in Data Mining and Ma-
  chine Learning. Concept lattices support mainly unsupervised settings, improv-
  ing tasks such as building taxonomies and ontologies, computing implications
  and association rules, clustering and solving classification tasks. These tasks in
  practice are coupled with the problem of exponential explosion of the number of
  formal concepts.
      By now, to tackle this issue, a lot of different approaches have been proposed,
  including data pre- and postprocessing, background knowledge incorporation,
  computing approximate concepts (see [9] for an overview). As a result, one ex-
  pects to get a small set of interesting, meaningful, non-redundant concepts [3].
      In this paper, we focus on the characterization of such a small set of concepts.
  Instead of using an interesting measure in postprocessing step, we propose to
  rely on the minimum description length (MDL) principle [7], which allows one
  to select small sets of diverse and interpretable concepts. Providing the best
  lossless compression of the data, the MDL optimal sets of patterns (itemsets)
  automatically provides a balance between the quality of fit of the data and the
  complexity of the model without any user-defined parameters to be set [1].
      In this paper we propose a first study on the application of the MDL principle
  in FCA settings. To the best of our knowledge, this is one of the first papers to
  study the effective use of MDL in the framework of FCA.

c paper author(s), 2018. Proceedings volume published and copyrighted by its editors.
  Paper published in Dmitry I. Ignatov, Lhouari Nourine (Eds.): CLA 2018, pp. 25–36,
  Department of Computer Science, Palacký University Olomouc, 2018. Copying
  permitted only for private and academic purposes.
26     Tatiana Makhalova, Sergei O. Kuznetsov, and Amedeo Napoli


    The rest of the paper has the following structure. In Section 2 we remind
the basic definition used in FCA and discuss how FCA can be used to solve
classification problem. Section 3 introduces the MDL principle. In Section 4 we
discuss what MDL can bring to FCA. Section 5 gives conclusion and directions
for future work.


2     Preliminaries
In this section we recall the main notions that are used in the paper. We study
attribute sets of formal concepts, which have several alternative names, such as
itemset, intent, or pattern. We discuss the application of FCA in unsupervised
and supervised settings.

2.1   FCA. Basic Notions
Here we briefly recall FCA terminology [5]. A formal context is a triple (G, M, I),
where G = {g1 , g2 , ..., gn } is called a set objects, M = {m1 , m2 , . . . , mk } is
called a set attributes and I ⊆ G × M is a relation called incidence relation, i.e.
                                                                                0
(g, m) ∈ I if the object g has the attribute m. The derivation operators (·) are
defined for A ⊆ G and B ⊆ M as follows:
                          A0 = {m ∈ M | ∀g ∈ A : gIm}
                          B 0 = {g ∈ G | ∀m ∈ B : gIm}

A0 is the set of attributes common to all objects of A and B 0 is the set of objects
sharing all attributes of B. An object g is said to contain a pattern (set of items)
                                                    0                              00
B ⊆ M if B ⊆ g 0 . The double application of (·) is a closure operator, i.e. (·)
is extensive, idempotent and monotone. Sets A ⊆ G, B ⊆ M , such that A = A00
and B = B 00 , are said to be closed.
    A (formal) concept is a pair (A, B), where A ⊆ G, B ⊆ M and A0 = B,
  0
B = A. A is called the (formal) extent and B is called the (formal) intent of the
concept (A, B). A partial order 6 is defined on the set of concepts as follows:
(A, B) ≤ (C, D) iff A ⊆ C (D ⊆ B), a pair (A, B) is a subconcept of (C, D),
while (C, D) is a superconcept of (A, B).
    The number of formal concepts grows exponentially w.r.t. the size of a formal
context, i.e. the number of objects in G and attributes in M . Thus, it becomes
almost impossible to analyze and interpret the whole set of generated concepts.
Pattern discovery techniques are designed to solve this problem. The goal of
pattern discovery within the framework of FCA is to find a non-redundant set of
concepts that are interesting w.r.t. specified constrains/interestingness criterion.
The criterion can be applied to both intent and extent, whereas pattern discovery
in general is related solely to the itemset assessment.

Example. Let us consider the toy example given in Table 1. We will use either D1
or D2 to compute classifiers and the remaining objects will be used to assess the
quality of the classifiers. The set of attributes M includes columns m1 , . . . , m9 .
                                        A First Study on What MDL Can Do for FCA                                                              27

                                  Table 1. An example of dataset.


                                                                                                                             m7 : yellow-
                                                                                                 release CO2
                                                                      m3 : change


                                                                                                  m5 :do not

                                                                                                               m6 : black-
                                           m1 : 4 legs


                                                                                                                             m8 : green
                  partitions


                                                         m2 : hairs


                                                                                    m4 : cold-
                                                                                    resistant


                                                                                                                             target w:
                                                                                                                                brown

                                                                                                                             m9 : gray
                                                                                                                 white


                                                                                                                              animal
                                                                          size
                    data
                                 objects
                               g1 dog       ×             ×              ×                                        ×                       +
                               g2 cat       ×             ×              ×                                                    ×           +
                               g3 frog      ×                            ×                                                        ×       +


                       D1
                               g4 car                                                  ×                                              ×   –

                  D2
                               g5 ball                                   ×             ×            ×             ×                       –
                               g6 chair     ×                                          ×            ×                             ×       –
                               g7 fur coat                ×                            ×            ×                                 ×   –
                               g8 sunflower                              ×                                                    ×           –
                               g9 fish                                   ×             ×                                              ×   +
                               g10 leopard ×              ×              ×                                                    ×           +
                               g11 table    ×                                          ×            ×                             ×       –


The additional attribute “target”, i.e., class labels, is not taken into account
under unsupervised settings.
    Filtering concepts based on their extent and/or intent belongs to class of
unsupervised problems, since any supplemental information is unavailable. In
the next subsection we consider the problem of concept selection in supervised
settings.


2.2   FCA under Supervised Settings: Concept-based Classifiers

In supervised settings along with the objects and their descriptions an additional
attribute w is given. It specifies the class of an object. We denote the set of its
values by ε.
    We shall confine ourselves to two-class classification and study the simplest
case, where each object belongs to a single class. In the defined settings ε =
{+, −}. We consider the case where a set of objects G = G+ ∪ G− is divided
into 2 disjoint subsets, i.e., a set of positive examples G+ , negative examples
G− . In practice, a set Gτ of unlabeled examples appears as well. The objects
are described by attributes from M and the target attribute w is defined as
follows: gIw = “ + ” for g ∈ G+ and gIw = “ − ” for g ∈ G− . The objects from
G+ ∪ G− compose training and test sets, which are used to generate concepts
and to estimate quality of classifiers, respectively.
    It is assumed that there exists an unknown function that maps each object
g ∈ G (or its description g 0 ⊆ M ) to an element in ε. The goal is to reconstruct
as accurately as possible the unknown function using an observable subset of
labeled objects. To do that, one builds classifiers, which can be constructed by
means of FCA as follows.
    For each concept (A, B) a class label e ∈ ε is defined by majority of labeled
objects in the extent, i.e. class((A, B)) = arg maxe∈ε |Ae |/|A|, where Ae = Ge ∩
A. To classify an unlabeled object g w.r.t. a pair (B, e) we set the following
classification principle:               (
                                          e, if B ⊆ g 0
                            (B, e)(g) =                                        (1)
                                          ∅, otherwise.
28       Tatiana Makhalova, Sergei O. Kuznetsov, and Amedeo Napoli

Table 2. The values of quality measures (Formulas 2-5) for classifiers
{“cold-resistant”}− and {“4 legs”, “change size”}+ from the running example in Ta-
ble 1. The attribute sets of the classifiers are intents of formal concepts computed on
D1 . The set {g8 , g9 , g10 , g11 } is used to assess classifiers.
                     classifier/
                                      {“cold-resistant”}− {“4 legs”, “change size”}+
                     measure
                     prec        |{g11 }|/|{g9 , g11 }| = 1/2        |{g10 }|/|{g10 }| = 1
                     recall      |{g11 }|/|{g8 , g11 }| = 1/2 |{g10 }|/|{g9 , g10 }| = 1/2
                                                     ∗
                     sup            |{g9 , g11 }|/|G | = 1/2         |{g10 }|/|G∗ | = 1/4
                     acc           |{g10 , g11 }|/|G∗ | = 1/2        |{g10 }|/|G∗ | = 1/4


     According to Formula 1, we get a non-empty response e from (B, e) if an
object description g 0 contains attribute set B. To simplify notation we will write
B̂ e instead of (B, e). In general, B could be any itemset, not necessarily closed.
     The details on classification problem in terms of FCA can be found in [6, 8].
     To identify the best classifier a test set G∗ ⊆ G+ ∪ G− is used, we will write
G+ and G∗− for subsets of positive and negative objects in the test set G∗ , i.e., for
   ∗

G∗ ∩G+ and G∗ ∩G− , respectively. In our paper we estimate classifiers using the
measures listed below and provide small examples of their usage (see Table 2).
Precision measures how  n many correct answers    o are
                                                    n given by the classifier:
                                                                             o
      prec(B̂ e , G∗ ) = g | B̂ e (g) = e, g ∈ G∗e / g | B̂ e (g) = e, g ∈ G∗ .    (2)

   Recall measures how many objects from a target class are characterized by
the classifier (i.e. whether a classifier is specific or general):
                                      n                         o
                  recall(B̂ e , G∗ ) = g | B̂ e (g) = e, g ∈ G∗e /|G∗e |. (3)

      Support measures how many objects can be classified (correctly or not):
                                  n                        o
                 sup(B̂ e , G∗ ) = g | B̂ e (g) = e, g ∈ G∗ /|G∗ |.           (4)

   The accuracy takes into account examples from the remaining classes ε \ {e}
unclassified by B̂ e :
                  n                         o n                                        o
                   g | B̂ e (g) = e, g ∈ G∗e ∪ g | B̂ e (g) = ∅, g ∈ G∗c , c ∈ ε \ {e}
acc(B̂ e , G∗ ) =                                                                        .
                                                |G∗ |
                                                                                       (5)
   However, other measures can also be examined [2], e.g., F1 score, AUC, etc.
   In the next section we consider an ensemble of classifiers based on single
concept-based classifiers defined in Formula 1.

2.3   Concept-based Classifiers
                            n o
A set of the classifiers S = B̂je , e ∈ ε with a rule for aggregation of their
                                             j∈J
responses constitute an ensemble of classifiers. We call an ensemble concept-
based classifiers (CBC) if the itemsets B are intents of formal concepts. As an
aggregation rule the following principles might be chosen: class of the majority
                               A First Study on What MDL Can Do for FCA            29


of labels, classification if responses are the same class labels, the highest prior-
ity class from the set of responses.The best CBCs are those that ensure high
accuracy and have a small size of S.

Example. Let us turn back to the running example from Table 1. The supple-
mentary information on class labels (the target attribute w) is given in column
“target”. We consider the following set of classifiers:

        S = {({“cold-resistant”} , −), ({“4 legs”, “change size”} , +)} .

     As a rule for aggregation of responses we use the priority principle, let us
suppose that “–” class has higher priority than “+”, thus we can apply classifiers
from “–” class and, if the object remains unclassified, we try to classify it with
“+” class classifiers.
     CBC works for g10 as follows. A classifiers with the highest priority (“–” class)
                                                  0
are firstly applied. Since {“cold-resistant”} 6⊆ g10 , we get an empty response and
turn to the classifiers of lower priority (“+” class). {“4 legs”, “change size”}⊆
  0
g10 , we get “+” response and classify g10 as an member of “+” class. The object
                                                         0
g11 is classified as “–”, since {“cold-resistant”}⊆ g11    . g8 remains unclassified,
since we get empty responses from all classifiers in S, g9 is misclassified by
({“cold-resistant”} , –).
     Thus, a formal concept is a well-interpreted, quite intuitive and handy tool
for describing subsets of objects both un- and supervised settings. However,
as it was mentioned earlier, the huge number of generated concepts hampers
interpretation of the results as well as its practical application.
     In the next section we provide an approach that can be used to select a small
set of diverse concepts.


3     Minimal Description Length Principle
3.1   Describing Data with MDL . Unsupervised Settings
The MDL principle in the context of pattern mining is formulated as follows:
the best set of patterns is the set that best compresses the database [10].
    The main element of this approach is a code table (CT), which is composed
of “some” itemsets with their length. The best code table minimizes the total
length in bits L(D, CT ) = L(D | CT ) + L(CT | D), where L(D | CT ) is the
length of the dataset encoded with the code table CT and L(CT | D) is the
length of the code table CT computed w.r.t. D. To encode an object g in a
dataset one needs to select a subset of disjoint itemsets that cover all attributes
of g. By u(B) = | {t ∈ D | B ∈ cover(t)} | we denote the usage of itemset B
in dataset
     P      D, i.e., how many times B is used to cover objects in D, where
U = B∈CT u(B) is the total usage of itemsets. The principles of building code
tables will be discussed further.
    To define the length of an itemset we use an optimal prefix code given by
Shannon entropy, i.e., l(B) = −logP r(B), where probability is defined as follows:
30     Tatiana Makhalova, Sergei O. Kuznetsov, and Amedeo Napoli


P r(B) = u(B)/U . Thus, the itemsets with higher frequency have smaller code
lengths. The details on itemset encoding and memory that is needed to store
itemsets of code table are out of scope of this paper. Since we are interested in the
length of the description rather than in the encoding itself (materialized codes),
we consider simplified version of L(CT, D) where only terms that characterize
“specificity” of itemsets for the dataset are taken into account:
                          X      X                    X                 u(B)
           L(D | CT ) =                    l(B) = −          u(B) log        ,
                                                                         U
                          g∈D B∈cover(g)              B∈CT
                                       X
                       L(CT | D) =           code(B) + l(B),
                                      B∈CT

where code(B) is the length in bits to store itemset B in a code table.

Principles of computing a code table. A code table is computed in an incremental
manner: starting from a set of single-attribute patterns, i.e. {{m} | m ∈ M }.
The optimization procedure is based on adding a new itemset to the code table,
correcting the information about usage of the other itemsets in the code table,
recomputing the itemset lengths and re-encoding the data. At each iteration a
new pattern can be added or not to the code table.
   A set of candidates might be composed of any kind of patterns: arbitrary
itemsets, closed ones, δ-itemsets, etc. The items in a set of candidates are sorted
w.r.t. chosen interestingness measures, in particular, in Krimp [10] patterns are
ordered by the length of itemset (its size of attribute set) and frequency.
   Note that the problem of computing optimal code table implies exhaustive
search for the best combination of patterns, and in practice some heuristics are
used.

Example. Let us consider the principle of computing a CT using the running
example. The iterative process of updating a CT for class “animal” and recom-
puting the cover of D2 is described in Table 3. As candidates we use the set of
concepts sorted by area, i.e., frequency × length, the ordered list of candidates
with their areas is given in columns “Candidate set, area”. Columns “CT” cor-
respond to the CT at each iteration. The CT contains itemsets and the number
of times each itemset is used to cover objects. The covering is given in columns
“Data with covering”. At the beginning it consists of single-attribute itemsets.
The first object m1 m2 m3 m6 is covered by four single-attribute itemsets. The first
candidate is m1 m2 m3 with an area equal to 6. At the next step this candidate
is used to cover attributes in the dataset covered by single-attribute itemsets.
With the chosen candidate, the covering for two objects is changed (compare
the first two objects in columns “Data with covering” in “Step 0” and “Step
1”). Step by step a new itemset with the maximal area is added to the CT. If
a new CT compresses the data better then the old one, the itemset is accepted
to the CT, otherwise, it is removed from both the CT and the candidate set. If
the CT is changed, the usage of itemsets in the CT and area for candidates are
recomputed.
                                        A First Study on What MDL Can Do for FCA                             31

Table 3. An iterative procedure of computing a code table for class “animal” and the
cover of D2 . The names “i” and “u” stand for an itemset and its usage in covering
         Step 0                                                   Step 1
                CT              Data with          Candidate          CT       Data with Candidate
         i                 u      covering          set, area i            u    covering       set, area
         m3                4 (m1 )(m2 )(m3 )(m6 ) m1 m2 m3 , 6    m1 m2 m3 2 (m1 m2 m3 )(m6 ) m1 m3 m8 , 3
         m1                3 (m1 )(m2 )(m3 )(m7 ) m1 m3 , 6       m3       2 (m1 m2 m3 )(m7 ) m3 m4 m9 , 3
         m2                2 (m1 )(m3 )(m8 )      m1 m2 m3 m6 , 4 m1 , m4 1 (m1 )(m3 )(m8 ) m1 m3 , 2
         m4                1 (m3 )(m4 )(m9 )      m1 m2 m3 m7 , 4 m6 -m9 1 (m3 )(m4 )(m9 )
         m6 -m9            1                      m1 m3 m8 , 3    m2 , m5 0
         m5                0                      m3 m4 m9 , 3
         Step 2                                                   Step 3
                CT              Data with          Candidate          CT       Data with Candidate
         i                 u      covering          set, area i            u    covering       set, area
         m1 m2 m3          2 (m1 m2 m3 )(m6 )     m3 m4 m9 , 3    m1 m2 m3 2 (m1 m2 m3 )(m6 )
         m1 m3 m8          1 (m1 m2 m3 )(m7 )                     m1 m3 m8 1 (m1 m2 m3 )(m7 )
         m3 , m4           1 (m1 m3 m8 )                          m3 m4 m9 1 (m1 m3 m8 )
         m6 , m7 , m9      1 (m3 )(m4 )(m9 )                      m6 , m7 1 (m3 m4 m9 )
         m1 , m2 , m5 , m8 0                                      m1 -m5 0
                                                                  m8 -m9 0


3.2   MDL under Supervised Settings
Being purely unsupervised, the MDL principle can be adapted for usage in su-
pervised settings. The idea is to find a compressed representation for objects
using code tables of each target class separately. Classes have their own code
tables. A code table consists of typical patterns and their lengths (the more typ-
ical patterns have shorter lengths). To classify a new object, its set of attributes
is covered by itemsets from code tables of each class. Then, the encoding lengths
for each class are computed and the class that corresponds to the minimal en-
coding length is assigned to the object. The length reflects typicality of an object
for a particular class (code table).

Example. Consider classification with code tables “CT1 ” and “CT2 ” from Ta-
ble 4 that have been computed on sets D1 and D2 , respectively. The details on
the computing of the code tables is out of the scope of this paper (see the Krimp
algorithm [10]). Each column“CTi ” contains code tables CTA and CTN A for
“animal” and “not animal” classes, respectively. Let us consider how an object
g9 is classified with the code tables from “CT1 ”. The main steps of the covering
of g90 are given in Table 5. To find a covering we use a greedy strategy, i.e., we
start from the first itemset in a code table and then stop iterating over itemsets
when all attributes of the object are covered. To cover g90 with CTN A we take the
first itemset m1 m4 m5 m8 (see Step 1 in Table 5), it does not cover g90 , at the next
step we take the second itemset m4 m9 , it covers a subset of g90 and m3 remains
uncovered (see Step 2). The iterations over itemsets from CTN A continues until
all the attributes of g9 will be covered.
    Classification with code tables from Table 4 are given in Table 6. The ob-
jects are covered by itemsets from tables of “animal” and “not animal” as it
is described in Table 5). The class where the object has the shortest length is
assigned to this object.
    In this section we considered how the MDL principle is used to select patterns
(itemsets) in un- and supervised settings. In the next section we study how MDL
works in the FCA framework.
32          Tatiana Makhalova, Sergei O. Kuznetsov, and Amedeo Napoli

Table 4. Code tables CT1,A and CT1,N A , CT2,A and CT2,N A computed on datasets
D1 and D2 , respectively. The lengths of itemsets are given with their relative size. A
shorter itemset is more typical, i.e., more often used to cover the data on which they
had been computed.
                                            CT1                                                            CT1
                                                    code table                     code table                      code table
                    code table
                                                   “not animal”                     “animal”                      “not animal”
                  “animal”, CTA
                                                      CTN A                           CTA                            CTN A

                    itemsets


                                                                                   itemsets


                                                                                                                   itemsets
                                                    itemsets
                                       length


                                                                       length


                                                                                                      length


                                                                                                                                      length
                                       in CT


                                                                       in CT


                                                                                                      in CT


                                                                                                                                      in CT
                               usage


                                                               usage


                                                                                              usage


                                                                                                                              usage
                                                                                m1 m2 m3 2
                 m1 m2 m3 1                     m1 m4 m5 m8 1                   m3 m4 m9 1                     m1 m4 m5 m8 1
                 m1 m3 m8 1                     m4 m9       2                   m1 m3 m8 1                     m4 m9       2
                 m1       0                     m1          0                   m1       0                     m1          0
                 m2       0                     m2          1                   m2       0                     m2          1
                 m3       0                     m3          1                   m3       0                     m3          2
                 m4       0                     m4          1                   m4       0                     m4          1
                 m5       0                     m5          2                   m5       0                     m5          2
                 m6       1                     m6          1                   m6       1                     m6          1
                 m7       1                     m7          0                   m7       1                     m7          1
                 m8       0                     m8          0                   m8       0                     m8          0
                 m9       0                     m9          0                   m9       0                     m9          0


Table 5. The steps of the covering process of object g9 by itemsets from the code
tables of classes “animal” and “not animal”, CT1,A and CT1,N A , respectively. To cover
g90 with CT1,A we try to use every itemset from the top, i.e., m1 m2 m3 , m1 m3 m8 , m1 ,
etc. We stop when all attributes are covered. The covering procedure for CT1,A and
CT1,N A stops after m9 and m3 , respectively, is being considered.
              Covering with CT1 for “animals”                 Covering with CT1 for “not animals”
             Used itemset        Remaining attributes          Used itemset       Remaining attributes
     Step                                             Step
         (an attempt to cover)       in g90 to cover       (an attempt to cover)       in g90 to cover
       0 -                       m3 m4 m9               0                         m3 m4 m9
       1 m1 m2 m3                m3 m4 m9               1 m1 m4 m5 m8             m3 m4 m9
       2 m1 m3 m8                m3 m4 m9               2 m4 m9                   m3
       3 m1                      m3 m4 m9               3 m1                      m3
       4 m2                      m3 m4 m9               4 m2                      m3
       5 m3                      m4 m9                  5 m3                      {∅}
       6 m4                      m9
                             ...
      11 m9                      {∅}


4      MDL in FCA: First Steps

To show that MDL can improve the practical application of FCA, in this section
we discuss the results of experiments on the embedding of MDL within FCA.
We used the discretized datasets from LUCS-KDD repository [4]. We split the
data into 10 parts and in each of 10 experiments we use 9 of them as a training
set and one as a test set. The average performance is reported in the paper. To
compute code tables and to cover objects the Krimp algorithm [10] is used.


4.1         Descriptive Patterns. FCA in Unsupervised Settings

In unsupervised learning (where the target attribute is not given), one is inter-
ested in a small number of meaningful patterns. In our experiments we compute
the set of closed itemsets and apply the MDL principle to them. For MDL we
                                              A First Study on What MDL Can Do for FCA                  33

Table 6. Classification with code tables (MDL-based approach) from Table 4. The
class of the code table where an object has the shortest encoding length is assigned to
the object
    Encoding with itemsets                     animal       Encoding with itemsets not animal decision
    g80 = (m3 )(m7 )                      ,                 g80 = (m3 )(m7 )            ,       not animal
    g90 = (m3 )(m4 )(m9 )                 ,      ,          g90 = (m3 )(m4 m9 )         , ,     not animal
 CT1 0                                                        0
    g10 = (m1 m2 m3 )(m7 )            ,                     g10 = (m1 )(m2 )(m3 )(m7 )    , , , animal
      0                                                       0
    g11  = (m1 )(m4 )(m5 )(m8 )           ,      ,      ,   g11 = (m1 m4 m5 m8 )                not animal
      0                                                       0
    g10  = (m1 m2 m3 )(m7 )       ,                         g10  = (m1 )(m2 )(m3 )(m7 )   , ,,  animal
 CT2 0                                                        0
    g11 = (m1 )(m4 )(m5 )(m8 )            ,      ,      ,   g11  = (m1 m4 m5 m8 )               not animal


 Table 7. The parameters of sets of formal concepts and their proper MDL-subsets.
                                 nmb. of   avg. length                                nmb. of   avg. length
               nmb. nmb.                                            nmb. nmb.
  dataset                       concepts    of intent    dataset                     concepts    of intent
              of obj. of attr.                                     of obj. of attr.
                               total MDL total MDL                                  total MDL total MDL
auto             205      135 67 557     57 8.83 19.26 horse colic     368      83 173 808 101 6.96 3.92
breast           699       16     642    24 7.36 9.04 iris             150      19     107    13 3.08 3.92
car            1 728       25 12 617     94 5.12 3.47 led7           3 200      24   1 937 152 4.60 6.80
chess         28 056       58 152 753 1 675 4.85 4.32 mushroom       8 124      90 181 945 211 15.23 19.53
dermatology      366       49 16 324     47 6.98 5.70 nursery       12 960      30 176 536 392 6.53 5.56
ecoli            336       29     694    25 5.49 6.08 page blocks    5 473      44     715    45 5.79 10.27
flare          1 389       38 16 303 106 6.82 8.64 pima indians        768      38   1 609    50 4.99 5.86
glass            214       46   4 704    50 5.06 4.32 ticTacToe        958      29 42 685 160 5.44 4.02
heart            303       50 36 708     54 7.14 5.09 wine             178      68 13 170     52 5.14 3.90
hepatitis        155       52 199 954    44 8.14 5.59 zoo              101      42   4 563   17 7.34 12.24


sort itemsets in the candidate set by “length” (the cardinality of intent) and
“frequency” (the cardinality of extent).
    MDL-optimal concepts are concepts whose intents are included in a code
table with a non-empty usage. The results of the experiments are given in Ta-
ble 7. For instance, the “auto” dataset consists of 205 objects and 135 binary
attributes. The total number of formal concepts is 67 557, 57 of them are MDL-
optimal.
    Our experiments show that the selected itemsets might be shorter or longer
on average than the itemsets in the whole set of closed concepts (see column
“avg. length of intent” in Table 7). However, around 2% (12 % at most) of the
concepts are selected with the MDL-principle. Thus, MDL can be considered as
a threshold-free alternative for selection of interesting itemsets. It is important
to notice that the subset of MDL-optimal itemsets is composed of diverse pat-
terns and expert assumptions on interestingness of concepts can be embedded
by ordering candidates w.r.t. particular interestingness measures. Since a greedy
strategy is used to make a code table, one gets a set of diverse itemsets that are
in agreement with interestingness.

4.2    Classifier Comparison. FCA in Supervised Settings
In this section we study both the accuracy of ensembles of classifiers S and
their basic elements B̂. We also compare the accuracy of the ensembles with
commonly used classification methods, e.g., random forest, multilayer perceptron
and support vector machine.
Comparison of single classifiers. Here we consider formal concepts as single
classifiers. We study precision, recall (Formulas 2 and 3, respectively) and pre-
cision loss (Formula 6). Formula 6 is also used as a measure of overfitting, i.e.,
34     Tatiana Makhalova, Sergei O. Kuznetsov, and Amedeo Napoli


cases where accuracy on a test set is worse than on a training one. If precision
loss is close to 1, we call it overfitting, the classifier makes much more errors on
a test set than on a training set. If precision loss is close to 0, we get “expected”
precision on a test set.
       prloss(B̂ e , G+ ∪ G− ) = precision(B̂ e , G \ G∗ ) − precision(B̂ e , G∗ )   (6)

     In Figure 1 we show the performance of single concepts in the following 2-
dimensional spaces: “precision” and “precision loss” (blue shapes), “recall” and
“precision loss” (green shapes). Due to lack of space we provide two typical
kinds of distributions using results for “Breast cancer” and “Wine” datasets.
Precision loss is shown on the vertical axis, recall and precision are combined in
the horizontal axis. The pictures are density plots for the (sub)set of concepts
in the chosen dimensions. Intense-colored regions correspond to the regions with
high concentration of concepts.
     Let us consider the classifiers of “Breast cancer” dataset (the first line in
Figure 1). The first figure shows that most concepts have precision loss close to
0, i.e., they have similar precision both on training and test set. Classifiers with
precision loss close to 0 are preferable, since they have the “expected” precision
even on unobservable data. Let us consider the position on the horizontal axis.
The blue shape (“precision” axis) is located close to 1, it means that most
concepts have high precision on a training set. A long stretch of the green shape
along the axis means that the set of classifiers consists of both specific (having
relatively big intents) and general concepts (recall is from 0 to 1).
     The second plot corresponds to MDL-optimal classifiers. In both spaces,
i.e., “precision” and “precision loss” (blue shapes), “recall” and “precision loss”
(green shapes), classifiers are concentrated in two points on the “precision loss”
axis, the bottom shape is brighter than the upper one. It means that a lot of
classifiers have similar precision on training and test sets, and there are several
classifiers that have much smaller precision on a test set (they are overfitting
classifiers). Since the blue shapes are located close to 0 on the horizontal axis,
we conclude that the classifiers are very precise on a training set. The concen-
tration of the green shapes around 0 on the horizontal axis means that most
classifiers have recall close to 0, thus, the classifiers are very specific.
     Classifiers of “Wine” dataset demonstrate another typical distribution of the
classifiers. We can read the plots as it is done above. Here we discuss the key
difference between two kinds of datasets.
     In our experiments, the set of classifiers on the whole set of formal concepts
was comprised of either mostly one type of classifiers with precision loss close
to 0 or two types of classifiers: with precision loss close to 0 and to 1. Thus, the
set of concept-classifiers contains either mostly non-overfitting (good) classifiers
or non- and overfitting ones. MDL-based subset usually includes both non- and
overfitted classifiers, all these classifiers are quite specific (have big intents).
     A reasonable question arises: what is the accuracy of ensembles of classifiers
built on such different types of concepts? In the next paragraph we examine
the accuracy of ensembles of classifiers that are constituted by MLD-optimal
subsets.
                                A First Study on What MDL Can Do for FCA             35


Fig. 1. Precision loss of the whole set of concepts (CBC), MDL-optimal subsets of
classifiers for “Breast cancer” (top row) and “Wine” (bottom row) datasets

Comparison of ensembles of classifiers In this section we compare ensem-
bles built on a set of formal concepts (“CBC”) and its MDL-optimal subset
(“MDL”) with commonly used classification methods like random forests (RFs),
multilayer perceptrons3 (MLP) and support vector machines (SVM). We study
the average accuracy on a test set, and the results are summarized in Table 8.
As it was noticed above, we split our dataset into 10 folds (9:1 for train and test
sets). The maximal accuracy over the 10 folds is also reported in the table. We
pay our attention to the number of classifiers that constitute an ensemble. The
smaller the number of classifiers the faster one can obtain the response and the
better this response can be interpreted.
    Our experiments show that among itemset-based classifiers (CBC, MDL,
RF) MDL-based approach demonstrate quite good performances and have a
much smaller set of classifiers. An ensemble with a small number of classifiers
performs faster and is better interpretable. Thus, it is easier to classify with
MDL-ensembles and understand the obtained results as well.


5     Conclusion
In this paper we have addressed the problem of selecting meaningful, non-
redundant sets of formal concepts. We have proposed to use the MDL principle
and to show how the expert understanding of interestingness might be incorpo-
rated into it.
   The MDL principle ensures a good compression even when the set of formal
concepts is huge (for example, 24 MDL-optimal among 6 432 concepts for “Breast
cancer”, 392 among 176 537 concepts for “nursery” dataset).
3
    We select the configuration that ensures the best accuracy among the following ones:
    (100:50:50),(100:50:25),(50:50:50)
36     Tatiana Makhalova, Sergei O. Kuznetsov, and Amedeo Napoli

                    Table 8. Performance of ensembles of classifiers
             Dataset name Classifier Avg. accuracy Max. acc. Avg. rule numb. Max. rule numb
                           CBC         0,86 ± 0,04      0,90    355,00 ± 7,10            361
                           MDL        0,94 ± 0,03       0,99    27,80 ± 0,98              30
                           RF          0,93 ± 0,04      0,99     30,60 ± 4,48             37
             breast cancer
                           MLP        0,94 ± 0,03       0,99                –              –
                           SVM         0,93 ± 0,03      0,99                –              –
                           CBC         0,84 ± 0,05      0,94 394,70 ± 14,47              410
                           MDL         0,77 ± 0,10      0,85    37,30 ± 2,10              40
                           RF          0,77 ± 0,05      0,85 1236,40 ± 532,54           1830
             ecoli
                           MLP         0,84 ± 0,03      0,88                –              –
                           SVM        0,86 ± 0,04       0,94                –              –
                           CBC         0,93 ± 0,07      1,00    113,20 ± 5,57            119
                           MDL         0,94 ± 0,06      1,00     18,50 ± 1,36             20
                           RF          0,95 ± 0,07      1,00 169,40 ± 192,45             531
             iris
                           MLP         0,93 ± 0,07      1,00                –              –
                           SVM         0,93 ± 0,07      1,00                –              –

   In supervised settings, MDL principle tends to choose the specific classifiers,
some of them have a precision loss close to 0.9. However, the MDL principle
ensures high classification accuracy.
   One interesting direction for future work is to study how some interesting-
ness measures, such as stability, might be embedded into the MDL approach.
Another interesting direction is to study connection between the MDL principle
and Pareto optimality.

Acknowledgements
The work was supported by the Russian Science Foundation under grant 17-
11-01294 and performed at National Research University Higher School of Eco-
nomics, Moscow, Russia.

References
 1. Aggarwal, C.C., Han, J.: Frequent pattern mining. Springer (2014)
 2. Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A., Nielsen, H.: Assessing the
    accuracy of prediction algorithms for classification: an overview. Bioinformatics
    16(5), 412–424 (2000)
 3. Buzmakov, A., Kuznetsov, S.O., Napoli, A.: Fast generation of best interval pat-
    terns for nonmonotonic constraints. In: Joint European Conference on Machine
    Learning and Knowledge Discovery in Databases. pp. 157–172. Springer (2015)
 4. Coenen, F.: The lucs-kdd discretised/normalised arm and carm data library (2003),
    http://www.csc.liv.ac.uk/ frans/KDD/Software/LUCS KDD DN
 5. Ganter, B., Wille, R.: Formal concept analysis: Logical foundations (1999)
 6. Ganter, B., Kuznetsov, S.O.: Formalizing hypotheses with concepts. In: Interna-
    tional Conference on Conceptual Structures. pp. 342–356. Springer (2000)
 7. Grünwald, P.D.: The minimum description length principle. MIT press (2007)
 8. Kuznetsov, S.O.: Machine learning and formal concept analysis. In: Eklund, P.
    (ed.) Concept Lattices. Springer Berlin Heidelberg, Berlin, Heidelberg (2004)
 9. Kuznetsov, S.O., Makhalova, T.: On interestingness measures of formal concepts.
    Information Sciences 442-443, 202 – 219 (2018)
10. Vreeken, J., Van Leeuwen, M., Siebes, A.: Krimp: mining itemsets that compress.
    Data Mining and Knowledge Discovery 23(1), 169–214 (2011)