<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A First Study on What MDL Can Do for FCA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tatiana Makhalova</string-name>
          <email>tpmakhalova@hse.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergei O. Kuznetsov</string-name>
          <email>skuznetsov@hse.ru</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amedeo Napoli</string-name>
          <email>amedeo.napoil@loria.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>36, Department of Computer Science, Palacky University Olomouc</institution>
          ,
          <addr-line>2018. Copying permitted only for private and academic purposes</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LORIA, (CNRS - Inria - U. of Lorraine)</institution>
          ,
          <addr-line>BP 239 Vandoeuvre-l`es-Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>3 Kochnovsky Proezd, Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>c paper author(s), 2018. Proceedings volume published and copyrighted by its editors. Paper published in Dmitry I. Ignatov</institution>
          ,
          <addr-line>Lhouari Nourine (Eds.): CLA 2018, pp. 25</addr-line>
        </aff>
      </contrib-group>
      <fpage>25</fpage>
      <lpage>36</lpage>
      <abstract>
        <p>Formal Concept Analysis can be considered as a classification engine able to build classes of objects with a description or concepts and to organize these concepts within a concept lattice. The concept lattice can be navigated for selecting significant concepts. Then the problem of finding significant concepts among the potential exponential number of concepts arises. Some measures exist that can be used for focusing on interesting concepts such as support, stability, and other. MDL (minimum description length) is also a good candidate that was rarely used in FCA by now for such objective. In this paper we argue that MDL can give FCA practitioners a good measure for selecting significant and representative concepts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Formal concept analysis (FCA) plays an important role in Data Mining and
Machine Learning. Concept lattices support mainly unsupervised settings,
improving tasks such as building taxonomies and ontologies, computing implications
and association rules, clustering and solving classification tasks. These tasks in
practice are coupled with the problem of exponential explosion of the number of
formal concepts.</p>
      <p>
        By now, to tackle this issue, a lot of different approaches have been proposed,
including data pre- and postprocessing, background knowledge incorporation,
computing approximate concepts (see [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for an overview). As a result, one
expects to get a small set of interesting, meaningful, non-redundant concepts [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In this paper, we focus on the characterization of such a small set of concepts.
Instead of using an interesting measure in postprocessing step, we propose to
rely on the minimum description length (MDL) principle [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which allows one
to select small sets of diverse and interpretable concepts. Providing the best
lossless compression of the data, the MDL optimal sets of patterns (itemsets)
automatically provides a balance between the quality of fit of the data and the
complexity of the model without any user-defined parameters to be set [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>In this paper we propose a first study on the application of the MDL principle
in FCA settings. To the best of our knowledge, this is one of the first papers to
study the effective use of MDL in the framework of FCA.</p>
      <p>The rest of the paper has the following structure. In Section 2 we remind
the basic definition used in FCA and discuss how FCA can be used to solve
classification problem. Section 3 introduces the MDL principle. In Section 4 we
discuss what MDL can bring to FCA. Section 5 gives conclusion and directions
for future work.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>In this section we recall the main notions that are used in the paper. We study
attribute sets of formal concepts, which have several alternative names, such as
itemset, intent, or pattern. We discuss the application of FCA in unsupervised
and supervised settings.
2.1</p>
      <sec id="sec-2-1">
        <title>FCA. Basic Notions</title>
        <p>
          Here we briefly recall FCA terminology [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. A formal context is a triple (G, M, I),
where G = {g1, g2, ..., gn} is called a set objects, M = {m1, m2, . . . , mk} is
called a set attributes and I ⊆ G × M is a relation called incidence relation, i.e.
(g, m) ∈ I if the object g has the attribute m. The derivation operators (·)0 are
defined for A ⊆ G and B ⊆ M as follows:
        </p>
        <p>A0 = {m ∈ M | ∀g ∈ A : gIm}</p>
        <p>B0 = {g ∈ G | ∀m ∈ B : gIm}
A0 is the set of attributes common to all objects of A and B0 is the set of objects
sharing all attributes of B. An object g is said to contain a pattern (set of items)
B ⊆ M if B ⊆ g0. The double application of (·)0 is a closure operator, i.e. (·)00
is extensive, idempotent and monotone. Sets A ⊆ G, B ⊆ M , such that A = A00
and B = B00, are said to be closed.</p>
        <p>A (formal) concept is a pair (A, B), where A ⊆ G, B ⊆ M and A0 = B,
B0 = A. A is called the (formal) extent and B is called the (formal) intent of the
concept (A, B). A partial order 6 is defined on the set of concepts as follows:
(A, B) ≤ (C, D) iff A ⊆ C (D ⊆ B), a pair (A, B) is a subconcept of (C, D),
while (C, D) is a superconcept of (A, B).</p>
        <p>The number of formal concepts grows exponentially w.r.t. the size of a formal
context, i.e. the number of objects in G and attributes in M . Thus, it becomes
almost impossible to analyze and interpret the whole set of generated concepts.
Pattern discovery techniques are designed to solve this problem. The goal of
pattern discovery within the framework of FCA is to find a non-redundant set of
concepts that are interesting w.r.t. specified constrains/interestingness criterion.
The criterion can be applied to both intent and extent, whereas pattern discovery
in general is related solely to the itemset assessment.</p>
        <p>Example. Let us consider the toy example given in Table 1. We will use either D1
or D2 to compute classifiers and the remaining objects will be used to assess the
quality of the classifiers. The set of attributes M includes columns m1, . . . , m9.
s
n
taad iittro
a
p objects
g1 dog
g2 cat
The additional attribute “target”, i.e., class labels, is not taken into account
under unsupervised settings.</p>
        <p>Filtering concepts based on their extent and/or intent belongs to class of
unsupervised problems, since any supplemental information is unavailable. In
the next subsection we consider the problem of concept selection in supervised
settings.
In supervised settings along with the objects and their descriptions an additional
attribute w is given. It specifies the class of an object. We denote the set of its
values by ε.</p>
        <p>We shall confine ourselves to two-class classification and study the simplest
case, where each object belongs to a single class. In the defined settings ε =
{+, −}. We consider the case where a set of objects G = G+ ∪ G− is divided
into 2 disjoint subsets, i.e., a set of positive examples G+, negative examples
G−. In practice, a set Gτ of unlabeled examples appears as well. The objects
are described by attributes from M and the target attribute w is defined as
follows: gIw = “ + ” for g ∈ G+ and gIw = “ − ” for g ∈ G−. The objects from
G+ ∪ G− compose training and test sets, which are used to generate concepts
and to estimate quality of classifiers, respectively.</p>
        <p>It is assumed that there exists an unknown function that maps each object
g ∈ G (or its description g0 ⊆ M ) to an element in ε. The goal is to reconstruct
as accurately as possible the unknown function using an observable subset of
labeled objects. To do that, one builds classifiers, which can be constructed by
means of FCA as follows.</p>
        <p>For each concept (A, B) a class label e ∈ ε is defined by majority of labeled
objects in the extent, i.e. class((A, B)) = arg maxe∈ε |Ae|/|A|, where Ae = Ge ∩
A. To classify an unlabeled object g w.r.t. a pair (B, e) we set the following
classification principle:
(B, e)(g) =
(e, if B ⊆ g0
∅, otherwise.</p>
        <p>(1)</p>
        <p>According to Formula 1, we get a non-empty response e from (B, e) if an
object description g0 contains attribute set B. To simplify notation we will write
Bˆe instead of (B, e). In general, B could be any itemset, not necessarily closed.</p>
        <p>
          The details on classification problem in terms of FCA can be found in [
          <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
          ].
        </p>
        <p>To identify the best classifier a test set G∗ ⊆ G+ ∪ G− is used, we will write
G∗+ and G∗ for subsets of positive and negative objects in the test set G∗, i.e., for
−
G∗ ∩ G+ and G∗ ∩ G−, respectively. In our paper we estimate classifiers using the
measures listed below and provide small examples of their usage (see Table 2).
Precision measures how many correct answers are given by the classifier:
prec(Bˆe, G∗) = ng | Bˆe(g) = e, g ∈ Ge∗o / ng | Bˆe(g) = e, g ∈ G∗o . (2)
Recall measures how many objects from a target class are characterized by
the classifier (i.e. whether a classifier is specific or general):</p>
        <p>recall(Bˆe, G∗) = ng | Bˆe(g) = e, g ∈ Ge∗o /|Ge∗|. (3)
Support measures how many objects can be classified (correctly or not):
sup(Bˆe, G∗) = ng | Bˆe(g) = e, g ∈ G∗o /|G∗|.
(4)</p>
        <p>The accuracy takes into account examples from the remaining classes ε \ {e}
unclassified by Bˆe:
acc(Bˆe, G∗) =
ng | Bˆe(g) = e, g ∈ Ge∗o ∪ ng | Bˆe(g) = ∅, g ∈ Gc∗, c ∈ ε \ {e}o
|G∗|</p>
        <p>
          .
(5)
However, other measures can also be examined [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], e.g., F1 score, AUC, etc.
        </p>
        <p>In the next section we consider an ensemble of classifiers based on single
concept-based classifiers defined in Formula 1.
2.3</p>
      </sec>
      <sec id="sec-2-2">
        <title>Concept-based Classifiers</title>
        <p>A set of the classifiers S = nBˆeo , e ∈ ε with a rule for aggregation of their
j j∈J
responses constitute an ensemble of classifiers. We call an ensemble
conceptbased classifiers (CBC) if the itemsets B are intents of formal concepts. As an
aggregation rule the following principles might be chosen: class of the majority
of labels, classification if responses are the same class labels, the highest
priority class from the set of responses.The best CBCs are those that ensure high
accuracy and have a small size of S.</p>
        <p>Example. Let us turn back to the running example from Table 1. The
supplementary information on class labels (the target attribute w) is given in column
“target”. We consider the following set of classifiers:</p>
        <p>S = {({“cold-resistant”} , −), ({“4 legs”, “change size”} , +)} .</p>
        <p>As a rule for aggregation of responses we use the priority principle, let us
suppose that “–” class has higher priority than “+”, thus we can apply classifiers
from “–” class and, if the object remains unclassified, we try to classify it with
“+” class classifiers.</p>
        <p>CBC works for g10 as follows. A classifiers with the highest priority (“–” class)
are firstly applied. Since {“cold-resistant”} 6⊆ g100, we get an empty response and
turn to the classifiers of lower priority (“+” class). {“4 legs”, “change size”}⊆
g100, we get “+” response and classify g10 as an member of “+” class. The object
g11 is classified as “–”, since {“cold-resistant”}⊆ g101. g8 remains unclassified,
since we get empty responses from all classifiers in S, g9 is misclassified by
({“cold-resistant”} , –).</p>
        <p>Thus, a formal concept is a well-interpreted, quite intuitive and handy tool
for describing subsets of objects both un- and supervised settings. However,
as it was mentioned earlier, the huge number of generated concepts hampers
interpretation of the results as well as its practical application.</p>
        <p>In the next section we provide an approach that can be used to select a small
set of diverse concepts.
3
3.1</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Minimal Description Length Principle</title>
      <p>
        Describing Data with MDL . Unsupervised Settings
The MDL principle in the context of pattern mining is formulated as follows:
the best set of patterns is the set that best compresses the database [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>The main element of this approach is a code table (CT), which is composed
of “some” itemsets with their length. The best code table minimizes the total
length in bits L(D, CT ) = L(D | CT ) + L(CT | D), where L(D | CT ) is the
length of the dataset encoded with the code table CT and L(CT | D) is the
length of the code table CT computed w.r.t. D. To encode an object g in a
dataset one needs to select a subset of disjoint itemsets that cover all attributes
of g. By u(B) = | {t ∈ D | B ∈ cover(t)} | we denote the usage of itemset B
in dataset D, i.e., how many times B is used to cover objects in D, where
U = PB∈CT u(B) is the total usage of itemsets. The principles of building code
tables will be discussed further.</p>
      <p>To define the length of an itemset we use an optimal prefix code given by
Shannon entropy, i.e., l(B) = −logP r(B), where probability is defined as follows:
P r(B) = u(B)/U . Thus, the itemsets with higher frequency have smaller code
lengths. The details on itemset encoding and memory that is needed to store
itemsets of code table are out of scope of this paper. Since we are interested in the
length of the description rather than in the encoding itself (materialized codes),
we consider simplified version of L(CT, D) where only terms that characterize
“specificity” of itemsets for the dataset are taken into account:</p>
      <p>L(D | CT ) = X</p>
      <p>X
g∈D B∈cover(g)
l(B) = − X u(B) log</p>
      <p>B∈CT
u(B)</p>
      <p>U
,
L(CT | D) =</p>
      <p>X code(B) + l(B),</p>
      <p>B∈CT
where code(B) is the length in bits to store itemset B in a code table.
Principles of computing a code table. A code table is computed in an incremental
manner: starting from a set of single-attribute patterns, i.e. {{m} | m ∈ M }.
The optimization procedure is based on adding a new itemset to the code table,
correcting the information about usage of the other itemsets in the code table,
recomputing the itemset lengths and re-encoding the data. At each iteration a
new pattern can be added or not to the code table.</p>
      <p>
        A set of candidates might be composed of any kind of patterns: arbitrary
itemsets, closed ones, δ-itemsets, etc. The items in a set of candidates are sorted
w.r.t. chosen interestingness measures, in particular, in Krimp [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] patterns are
ordered by the length of itemset (its size of attribute set) and frequency.
      </p>
      <p>Note that the problem of computing optimal code table implies exhaustive
search for the best combination of patterns, and in practice some heuristics are
used.</p>
      <p>Example. Let us consider the principle of computing a CT using the running
example. The iterative process of updating a CT for class “animal” and
recomputing the cover of D2 is described in Table 3. As candidates we use the set of
concepts sorted by area, i.e., frequency × length, the ordered list of candidates
with their areas is given in columns “Candidate set, area”. Columns “CT”
correspond to the CT at each iteration. The CT contains itemsets and the number
of times each itemset is used to cover objects. The covering is given in columns
“Data with covering”. At the beginning it consists of single-attribute itemsets.
The first object m1m2m3m6 is covered by four single-attribute itemsets. The first
candidate is m1m2m3 with an area equal to 6. At the next step this candidate
is used to cover attributes in the dataset covered by single-attribute itemsets.
With the chosen candidate, the covering for two objects is changed (compare
the first two objects in columns “Data with covering” in “Step 0” and “Step
1”). Step by step a new itemset with the maximal area is added to the CT. If
a new CT compresses the data better then the old one, the itemset is accepted
to the CT, otherwise, it is removed from both the CT and the candidate set. If
the CT is changed, the usage of itemsets in the CT and area for candidates are
recomputed.
Being purely unsupervised, the MDL principle can be adapted for usage in
supervised settings. The idea is to find a compressed representation for objects
using code tables of each target class separately. Classes have their own code
tables. A code table consists of typical patterns and their lengths (the more
typical patterns have shorter lengths). To classify a new object, its set of attributes
is covered by itemsets from code tables of each class. Then, the encoding lengths
for each class are computed and the class that corresponds to the minimal
encoding length is assigned to the object. The length reflects typicality of an object
for a particular class (code table).</p>
      <p>
        Example. Consider classification with code tables “CT1” and “CT2” from
Table 4 that have been computed on sets D1 and D2, respectively. The details on
the computing of the code tables is out of the scope of this paper (see the Krimp
algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). Each column“CTi” contains code tables CTA and CTNA for
“animal” and “not animal” classes, respectively. Let us consider how an object
g9 is classified with the code tables from “CT1”. The main steps of the covering
of g90 are given in Table 5. To find a covering we use a greedy strategy, i.e., we
start from the first itemset in a code table and then stop iterating over itemsets
when all attributes of the object are covered. To cover g90 with CTNA we take the
first itemset m1m4m5m8 (see Step 1 in Table 5), it does not cover g90, at the next
step we take the second itemset m4m9, it covers a subset of g90 and m3 remains
uncovered (see Step 2). The iterations over itemsets from CTNA continues until
all the attributes of g9 will be covered.
      </p>
      <p>Classification with code tables from Table 4 are given in Table 6. The
objects are covered by itemsets from tables of “animal” and “not animal” as it
is described in Table 5). The class where the object has the shortest length is
assigned to this object.</p>
      <p>In this section we considered how the MDL principle is used to select patterns
(itemsets) in un- and supervised settings. In the next section we study how MDL
works in the FCA framework.
code table
“animal”</p>
      <p>CTA</p>
      <p>CT1</p>
      <p>
        CT1
code table
“animal”, CTA
s
t
e
s
m
e
t
i
To show that MDL can improve the practical application of FCA, in this section
we discuss the results of experiments on the embedding of MDL within FCA.
We used the discretized datasets from LUCS-KDD repository [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We split the
data into 10 parts and in each of 10 experiments we use 9 of them as a training
set and one as a test set. The average performance is reported in the paper. To
compute code tables and to cover objects the Krimp algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is used.
In unsupervised learning (where the target attribute is not given), one is
interested in a small number of meaningful patterns. In our experiments we compute
the set of closed itemsets and apply the MDL principle to them. For MDL we
,
      </p>
      <p>,</p>
      <p>CT2 gg110001 == ((mm11)m(m2m4)3()m(m5)7()m8)</p>
      <p>Encoding with itemsets not animal decision
g80 = (m3)(m7) , not animal
g90 = (m3)(m4m9) , , not animal
g100 = (m1)(m2)(m3)(m7) , , , animal
g101 = (m1m4m5m8) not animal
g100 = (m1)(m2)(m3)(m7) , , , animal
g101 = (m1m4m5m8) not animal
,</p>
      <p>,
sort itemsets in the candidate set by “length” (the cardinality of intent) and
“frequency” (the cardinality of extent).</p>
      <p>MDL-optimal concepts are concepts whose intents are included in a code
table with a non-empty usage. The results of the experiments are given in
Table 7. For instance, the “auto” dataset consists of 205 objects and 135 binary
attributes. The total number of formal concepts is 67 557, 57 of them are
MDLoptimal.</p>
      <p>Our experiments show that the selected itemsets might be shorter or longer
on average than the itemsets in the whole set of closed concepts (see column
“avg. length of intent” in Table 7). However, around 2% (12 % at most) of the
concepts are selected with the MDL-principle. Thus, MDL can be considered as
a threshold-free alternative for selection of interesting itemsets. It is important
to notice that the subset of MDL-optimal itemsets is composed of diverse
patterns and expert assumptions on interestingness of concepts can be embedded
by ordering candidates w.r.t. particular interestingness measures. Since a greedy
strategy is used to make a code table, one gets a set of diverse itemsets that are
in agreement with interestingness.</p>
      <p>Classifier Comparison. FCA in Supervised Settings
In this section we study both the accuracy of ensembles of classifiers S and
their basic elements Bˆ. We also compare the accuracy of the ensembles with
commonly used classification methods, e.g., random forest, multilayer perceptron
and support vector machine.</p>
      <p>Comparison of single classifiers. Here we consider formal concepts as single
classifiers. We study precision, recall (Formulas 2 and 3, respectively) and
precision loss (Formula 6). Formula 6 is also used as a measure of overfitting, i.e.,
cases where accuracy on a test set is worse than on a training one. If precision
loss is close to 1, we call it overfitting, the classifier makes much more errors on
a test set than on a training set. If precision loss is close to 0, we get “expected”
precision on a test set.</p>
      <p>prloss(Bˆe, G+ ∪ G−) = precision(Bˆe, G \ G∗) − precision(Bˆe, G∗)
(6)</p>
      <p>In Figure 1 we show the performance of single concepts in the following
2dimensional spaces: “precision” and “precision loss” (blue shapes), “recall” and
“precision loss” (green shapes). Due to lack of space we provide two typical
kinds of distributions using results for “Breast cancer” and “Wine” datasets.
Precision loss is shown on the vertical axis, recall and precision are combined in
the horizontal axis. The pictures are density plots for the (sub)set of concepts
in the chosen dimensions. Intense-colored regions correspond to the regions with
high concentration of concepts.</p>
      <p>Let us consider the classifiers of “Breast cancer” dataset (the first line in
Figure 1). The first figure shows that most concepts have precision loss close to
0, i.e., they have similar precision both on training and test set. Classifiers with
precision loss close to 0 are preferable, since they have the “expected” precision
even on unobservable data. Let us consider the position on the horizontal axis.
The blue shape (“precision” axis) is located close to 1, it means that most
concepts have high precision on a training set. A long stretch of the green shape
along the axis means that the set of classifiers consists of both specific (having
relatively big intents) and general concepts (recall is from 0 to 1).</p>
      <p>The second plot corresponds to MDL-optimal classifiers. In both spaces,
i.e., “precision” and “precision loss” (blue shapes), “recall” and “precision loss”
(green shapes), classifiers are concentrated in two points on the “precision loss”
axis, the bottom shape is brighter than the upper one. It means that a lot of
classifiers have similar precision on training and test sets, and there are several
classifiers that have much smaller precision on a test set (they are overfitting
classifiers). Since the blue shapes are located close to 0 on the horizontal axis,
we conclude that the classifiers are very precise on a training set. The
concentration of the green shapes around 0 on the horizontal axis means that most
classifiers have recall close to 0, thus, the classifiers are very specific.</p>
      <p>Classifiers of “Wine” dataset demonstrate another typical distribution of the
classifiers. We can read the plots as it is done above. Here we discuss the key
difference between two kinds of datasets.</p>
      <p>In our experiments, the set of classifiers on the whole set of formal concepts
was comprised of either mostly one type of classifiers with precision loss close
to 0 or two types of classifiers: with precision loss close to 0 and to 1. Thus, the
set of concept-classifiers contains either mostly non-overfitting (good) classifiers
or non- and overfitting ones. MDL-based subset usually includes both non- and
overfitted classifiers, all these classifiers are quite specific (have big intents).</p>
      <p>A reasonable question arises: what is the accuracy of ensembles of classifiers
built on such different types of concepts? In the next paragraph we examine
the accuracy of ensembles of classifiers that are constituted by MLD-optimal
subsets.
Comparison of ensembles of classifiers In this section we compare
ensembles built on a set of formal concepts (“CBC”) and its MDL-optimal subset
(“MDL”) with commonly used classification methods like random forests (RFs),
multilayer perceptrons3 (MLP) and support vector machines (SVM). We study
the average accuracy on a test set, and the results are summarized in Table 8.
As it was noticed above, we split our dataset into 10 folds (9:1 for train and test
sets). The maximal accuracy over the 10 folds is also reported in the table. We
pay our attention to the number of classifiers that constitute an ensemble. The
smaller the number of classifiers the faster one can obtain the response and the
better this response can be interpreted.</p>
      <p>Our experiments show that among itemset-based classifiers (CBC, MDL,
RF) MDL-based approach demonstrate quite good performances and have a
much smaller set of classifiers. An ensemble with a small number of classifiers
performs faster and is better interpretable. Thus, it is easier to classify with
MDL-ensembles and understand the obtained results as well.
5</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper we have addressed the problem of selecting meaningful,
nonredundant sets of formal concepts. We have proposed to use the MDL principle
and to show how the expert understanding of interestingness might be
incorporated into it.</p>
      <p>The MDL principle ensures a good compression even when the set of formal
concepts is huge (for example, 24 MDL-optimal among 6 432 concepts for “Breast
cancer”, 392 among 176 537 concepts for “nursery” dataset).
3 We select the configuration that ensures the best accuracy among the following ones:
(100:50:50),(100:50:25),(50:50:50)</p>
      <p>In supervised settings, MDL principle tends to choose the specific classifiers,
some of them have a precision loss close to 0.9. However, the MDL principle
ensures high classification accuracy.</p>
      <p>One interesting direction for future work is to study how some
interestingness measures, such as stability, might be embedded into the MDL approach.
Another interesting direction is to study connection between the MDL principle
and Pareto optimality.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The work was supported by the Russian Science Foundation under grant
1711-01294 and performed at National Research University Higher School of
Economics, Moscow, Russia.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aggarwal</surname>
          </string-name>
          , C.C.,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J</given-names>
          </string-name>
          .:
          <source>Frequent pattern mining</source>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baldi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brunak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chauvin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andersen</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Assessing the accuracy of prediction algorithms for classification: an overview</article-title>
          .
          <source>Bioinformatics</source>
          <volume>16</volume>
          (
          <issue>5</issue>
          ),
          <fpage>412</fpage>
          -
          <lpage>424</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Buzmakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Fast generation of best interval patterns for nonmonotonic constraints</article-title>
          .
          <source>In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          . pp.
          <fpage>157</fpage>
          -
          <lpage>172</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Coenen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>The lucs-kdd discretised/normalised arm and carm data library (</article-title>
          <year>2003</year>
          ), http://www.csc.liv.ac.uk/ frans/KDD/Software/LUCS KDD DN
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
          </string-name>
          , R.:
          <article-title>Formal concept analysis: Logical foundations (</article-title>
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          :
          <article-title>Formalizing hypotheses with concepts</article-title>
          .
          <source>In: International Conference on Conceptual Structures</source>
          . pp.
          <fpage>342</fpage>
          -
          <lpage>356</lpage>
          . Springer (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Gru¨nwald, P.D.:
          <article-title>The minimum description length principle</article-title>
          . MIT press (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.:</given-names>
          </string-name>
          <article-title>Machine learning and formal concept analysis</article-title>
          . In: Eklund,
          <string-name>
            <surname>P</surname>
          </string-name>
          . (ed.) Concept Lattices. Springer Berlin Heidelberg, Berlin, Heidelberg (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makhalova</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>On interestingness measures of formal concepts</article-title>
          .
          <source>Information Sciences 442-443</source>
          ,
          <fpage>202</fpage>
          -
          <lpage>219</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Vreeken</surname>
            , J., Van Leeuwen,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siebes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Krimp: mining itemsets that compress</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <volume>23</volume>
          (
          <issue>1</issue>
          ),
          <fpage>169</fpage>
          -
          <lpage>214</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>