Global Optimization in Learning with Important Data: an FCA-Based Approach

Yury Kashnitsky, Sergei O. Kuznetsov
National Research University Higher School of Economics, Moscow, Russia
{ykashnitsky, skuznetsov}@hse.ru

Abstract. Nowadays decision tree learning is one of the most popular classification and regression techniques. Though decision trees are not accurate on their own, they make very good base learners for advanced tree-based methods such as random forests and gradient boosted trees. However, applying ensembles of trees deteriorates the interpretability of the final model. Another problem is that decision tree learning can be seen as a greedy search for a good classification hypothesis in terms of some information-based criterion such as Gini impurity or information gain, whereas for small data sets a global search might be possible. In this paper, we propose an FCA-based lazy classification technique where each test instance is classified with a set of the best (in terms of some information-based criterion) rules. In a set of benchmarking experiments, the proposed strategy is compared with decision tree and nearest neighbor learning.

Keywords: Formal Concept Analysis, lazy learning, global optimization

1 Introduction

The classification task in machine learning aims to use historical data (a training set) to predict the unknown values of a discrete target variable for new data (a test set). While there are dozens of popular methods for solving the classification problem, choosing a method for a particular task usually involves an accuracy-interpretability trade-off. Neural networks, random forests and ensemble techniques (boosting, bagging, stacking etc.) are known to outperform simple methods in difficult tasks. Kaggle competitions also bear testimony to that: usually, winners resort to ensemble techniques, mainly to gradient boosting [1]. The mentioned algorithms are widespread in application scenarios where classification performance is the main objective. In optical character recognition, voice recognition, information retrieval and many other tasks we are typically satisfied with a trained model if it has a low generalization error.

However, in lots of applications we need a model to be interpretable as well as accurate. Classification rules built from data and examined by experts may be justified or proved. In medical diagnostics, when making highly responsible decisions, e.g., predicting whether a patient has cancer (i.e., dealing with "important data"), experts prefer to extract readable rules from a machine learning model in order to "understand" it and justify the decision. In credit scoring, for instance, applying ensemble techniques can be very effective, but the model is often obliged to have "sound business logic", that is, to be interpretable [2].

2 Related work

Eager (non-lazy) algorithms construct classifiers that contain an explicit hypothesis mapping unlabelled test instances to their predicted labels. A decision tree classifier, for example, uses a stored model to classify instances by tracing the instance through the tests at the interior nodes until a leaf containing the label is reached. In eager algorithms, the main work is done at the phase of building a classifier. In the lazy classification paradigm [3], however, no explicit model is constructed, and the inductive process is carried out by a classifier that maps each test instance to a label directly from the training set.
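To make the distinction concrete, the following Python sketch contrasts the two paradigms with off-the-shelf scikit-learn estimators (the dataset and the parameter values here are arbitrary placeholders, not those used later in the paper):

# Illustrative contrast between eager and lazy classification in scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Eager: an explicit model (a tree of attribute tests) is built once at training
# time; each test instance is then traced from the root to a leaf.
tree = DecisionTreeClassifier(min_samples_leaf=5).fit(X_train, y_train)

# Lazy: no explicit model is built; the work is deferred to prediction time,
# when each test instance is compared against the stored training set.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print(tree.score(X_test, y_test), knn.score(X_test, y_test))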
2.1 Lazy decision trees

The authors of [4] point out the following problem with decision tree learning: while the entropy measures used in C4.5 and ID3 are guaranteed to decrease on average, the entropy of a specific child may not change or may even increase. In other words, a single decision tree may find only a locally optimal hypothesis in terms of an information-based criterion such as Gini impurity or pairwise mutual information, and using a single tree may lead to many splits that are irrelevant for a given test instance. A decision tree built for each test instance individually can avoid splits on attributes that are irrelevant for that specific instance. Thus, such "customized" decision trees (actually classification paths) built for a specific test instance may be much shorter and hence may provide a short explanation for the classification.

2.2 Lazy associative classification

Associative classifiers build a classifier from association rules mined in the training data; such rules have the class attribute in their conclusion. This approach was shown to yield improved accuracy over decision trees, since associative classifiers perform a global search for rules satisfying some quality constraints [5]. Decision trees, on the contrary, perform a greedy search for rules by selecting the most promising attributes. Unfortunately, associative classifiers tend to output too many rules, many of which might never be used for the classification of a given test instance. The lazy associative classification algorithm overcomes these problems by generating only those rules whose premises are subsets of the test instance's attributes [5]. Thus, in the lazy associative classification paradigm only the rules that might actually be used to classify the test instance are generated, which leads to a reduced set of classification rules for each test instance.

2.3 Decision trees in terms of Formal Concept Analysis

In [6] the authors utilize concept lattices to represent each concept intent (a closed set of attributes) as a decision tree node and a concept lattice itself as a set of overlapping decision trees. The construction of a decision tree is thus reduced to selecting one of the downward paths in a concept lattice via some information criterion.

2.4 Lazy classification for complex structure data

A modification of the lazy classification algorithm capable of handling complex structure data was first proposed in [7]. The main difference from the lazy associative classification algorithm is that the method is designed to analyze arbitrary objects with complex descriptions (intervals, sequences, graphs etc.). This setting was implemented for interval credit scoring data [8] and for graphs in a toxicology prediction task [9].

3 Definitions

Here we introduce some notions from Formal Concept Analysis [10] which help us to organize the search space of classification hypotheses.

Definition 1. A formal context in FCA is a triple K = (G, M, I) where G is a set of objects, M is a set of attributes, and the binary relation I ⊆ G × M shows which object possesses which attribute; gIm denotes that object g has attribute m. For subsets of objects and attributes A ⊆ G and B ⊆ M the Galois operators are defined as follows:

A′ = {m ∈ M | gIm ∀g ∈ A},
B′ = {g ∈ G | gIm ∀m ∈ B}.

A pair (A, B) such that A ⊆ G, B ⊆ M, A′ = B and B′ = A is called a formal concept of the context K. The sets A and B are closed and are called the extent and the intent of the formal concept (A, B), respectively.
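A minimal Python sketch of the prime (Galois) operators may help to clarify Definition 1; the toy context below is hypothetical and serves only to illustrate the definitions:

# Minimal sketch of the Galois (prime) operators from Definition 1.
G = {1, 2, 3, 4}                                  # objects
M = {'a', 'b', 'c'}                               # attributes
I = {(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'b'), (3, 'c'), (4, 'c')}

def up(A):        # A' = attributes shared by all objects in A
    return {m for m in M if all((g, m) in I for g in A)}

def down(B):      # B' = objects having all attributes in B
    return {g for g in G if all((g, m) in I for m in B)}

A = {1, 2}
B = up(A)                                         # {'a', 'b'}
print(B, down(B))                                 # down(B) == A, so ({1, 2}, {'a', 'b'})
                                                  # is a formal concept of this context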
Example 1. Let us consider a "classical" toy example of a classification task. The training set is represented in Table 1. All categorical attributes are binarized into "dummy" attributes. The table shows a formal context K = (G, M, I) with G = {1, . . . , 10}, M = {or, oo, os, tc, tm, th, hn, w} (we omit the class attribute "play") and I a binary relation defined on G × M, where an element of the relation is represented by a cross (×) in the corresponding cell of the table. A concept lattice for this formal context is depicted to the right of Table 1. It should be read as follows: for a given element (formal concept) of the lattice, its intent (a closed set of attributes) is given by all attribute labels that can be reached in an ascending lattice traversal. Similarly, the extent (a closed set of objects) of a certain lattice element (formal concept) can be traced in a descending lattice traversal from the given point. For instance, a big blue-and-black circle depicts the formal concept ({1, 2, 5}, {or, tc, hn}).

Such a concept lattice is a concise way of representing all closed itemsets (formal concepts' intents) of a formal context. Closed itemsets, in turn, can serve as a condensed representation of classification rules [11]. In what follows, we develop the idea of a hypothesis search space represented by a concept lattice.

Table 1. A toy classification problem and a concept lattice of the corresponding formal context. Attributes: or – outlook = rainy, oo – outlook = overcast, os – outlook = sunny, tc – temperature = cool, tm – temperature = mild, th – temperature = high, hn – humidity = normal, w – windy, play – whether to play tennis or not (class attribute). [The table body is a binary object-attribute incidence table over objects 1-10; object 10 is the unlabeled test instance. The concept lattice of this context is shown to the right of the table.]

4 Concept lattice as a hypothesis search space

Below we describe and illustrate the proposed approach for the binary- and numeric-attribute cases in binary classification. The approach naturally extends to the multiclass case with the corresponding adjustments to the information criteria formulas.

4.1 Binary-attribute case

In the case of training and test data represented as binary tables, the proposed procedure is given as Algorithm 1.

Let K_train = (G_train, M_0 ∪ M̄_0 ∪ {c_train}, I_train) and K_test = (G_test, M_0 ∪ M̄_0, I_test) be formal contexts representing a training set and a test set, respectively. We state explicitly that the set of attributes is dichotomized: M = M_0 ∪ M̄_0, where for every g ∈ G_train and m ∈ M_0 there is a complementary attribute m̄ ∈ M̄_0 such that g I_train m ⟺ ¬(g I_train m̄). Let CbO(K, min_supp) denote the algorithm used to find all formal concepts of a formal context K with support greater than or equal to min_supp (by default we use a modification of the In-Close2 implementation [12] of the Close-by-One algorithm [13]). Let inf denote an information criterion used to rate classification rules (we use Gini impurity by default). Finally, let min_supp and n_rules be the parameters of the algorithm: the minimal support of each classification rule's premise and the number of rules used to predict each test instance's class attribute.

With these notations, the main steps of the proposed algorithm for each test instance are the following:

1. For each test object g_t we leave only its attributes in the training set (step 1 in Algorithm 1). Formally, we build a new formal context K_t = (G_train, g_t′ ∪ {c_train}, I_train) with the same objects G_train as in the training context K_train and with the attributes g_t′ of the test object together with the class attribute c_train. We clarify what this means in the case of real-valued attributes in Subsection 4.2.
2. Using CbO(K_t, min_supp), find all formal concepts of the formal context K_t satisfying the constraint on minimal support. We build formal concepts in a top-down manner (increasing the number of attributes) and backtrack whenever the support of a formal concept (the relative size of its extent) falls below min_supp. The parameter min_supp bounds from below the support of any hypothesis mined to classify the test object and is therefore analogous to the min_samples_leaf parameter of a decision tree. While generating formal concepts, we keep track of the values of the class attribute for all training objects having all the corresponding attributes (i.e., for all objects in the formal concept's extent), and we calculate the value of the information criterion inf (Gini impurity by default) for each formal concept intent.

3. The mined formal concepts are then sorted by the value of the criterion inf from the "best" to the "worst".

4. Retaining the first n_rules concepts with the best values of the chosen information criterion, we obtain a set of rules to classify the current test object. For each concept we define a classification rule with the concept intent as the antecedent and the most common value of the class attribute among the objects of the concept extent as the consequent.

5. Finally, we predict the value of the class attribute for the current test object via a simple majority vote among the consequents of the n_rules "best" classification rules. We also save the rules for each test object in a dictionary r_test.

4.2 Numeric-attribute case

In our approach, we deal with numeric attributes similarly to what is done in the CART algorithm [14]: we sort the values of a numeric attribute and place binarization thresholds at the points where the target attribute changes. Let us demonstrate step 1 of Algorithm 1 for binary and numeric attributes with a sample from the Kaggle "Titanic: Machine Learning from Disaster" competition dataset.¹

¹ https://www.kaggle.com/c/titanic

Algorithm 1. Lazy Lattice-based Optimization (LLO)

Input: K_train = (G_train, M_0 ∪ M̄_0 ∪ {c_train}, I_train);
       K_test = (G_test, M_0 ∪ M̄_0, I_test);
       min_supp ∈ R_+, n_rules ∈ N;
       CbO(K, min_supp) – an algorithm returning all formal concepts of K with support at least min_supp;
       sort(S, inf) – sorting of a set of formal concepts by the information criterion inf
Output: c_test, r_test

c_test = ∅, r_test = ∅
for g_t ∈ G_test do
  1. K_t = (G_train, g_t′ ∪ {c_train}, I_train)
  2. S_t = {(A, B) | A ⊆ G_train, B ⊆ g_t′, A′ = B, B′ = A, |A| / |G_train| ≥ min_supp} = CbO(K_t, min_supp)
  3. S_t = sort(S_t, inf)
  4. {B_i}, i = 1, . . . , n_rules – the intents of the first n_rules concepts (A_i, B_i) ∈ S_t
  5. c_i = the most common value of c_train among the objects of B_i′, i = 1, . . . , n_rules
  6. r_test[g_t] = {B_i → c_i | i = 1, . . . , n_rules}
  7. c_test[g_t] = the most common value among c_1, . . . , c_{n_rules}
end for

Example 2. Table 2 shows a sample from the Titanic dataset. Let us build a formal context to classify passenger no. 7 with attributes Pclass = 2, Age = 28, City = C. If we sort the data by age in ascending order, we see where the target attribute "Survived" switches from 0 to 1 or vice versa:

Age       16  18  30  39  42  62
Survived   1   0   0   1   0   1

Thus we obtain a set of thresholds to discretize the attribute "Age": T = {17, 34.5, 40.5, 52}. The formal context K_7 (corresponding to K_t for t = 7 in Algorithm 1) is presented in Table 3.
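The discretization used in Example 2 can be sketched in Python as follows (a minimal illustration; the function name is ours, and only the threshold rule described above, midpoints between consecutive values where the class label changes, is implemented):

# Sketch of the binarization of a numeric attribute described in Section 4.2.
def class_change_thresholds(values, labels):
    pairs = sorted(zip(values, labels))           # sort by attribute value
    thresholds = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:                 # class label changes here
            thresholds.append((v1 + v2) / 2)      # place threshold at the midpoint
    return thresholds

# Ages and survival labels of the six training passengers from Example 2
ages =     [39, 16, 62, 42, 30, 18]
survived = [ 1,  1,  1,  0,  0,  0]
print(class_change_thresholds(ages, survived))    # [17.0, 34.5, 40.5, 52.0]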
4.3 Complexity

The algorithm is based on the Close-by-One lattice-building algorithm, whose time complexity is shown in [15] to be O(|G||M|²|L|) for a formal context (G, M, I) and the corresponding lattice L. Put simply, the complexity is linear in the number of objects, quadratic in the number of attributes and linear in the number of formal concepts built.

In the proposed algorithm, Close-by-One is run for each test object (step 2 in Algorithm 1), and for each formal concept the value of the information criterion is calculated. Computing entropy or the Gini index is linear in the number of objects, as it requires calculating the supports of attribute sets; this is done on the fly while building the lattice (step 2 in Algorithm 1).

Table 2. A sample from the Titanic dataset. Attributes: "Pclass" – passenger's class, "City" – boarding city (here Cherbourg or Southampton), "Age" – passenger's age, "Survived" – whether the passenger survived the Titanic disaster.

Id  Pclass  Age  City  Survived
 1    3      39   S       1
 2    3      16   S       1
 3    1      62   C       1
 4    3      42   S       0
 5    2      30   C       0
 6    2      18   C       0
 7    2      28   C       ?
 8    1      47   C       ?

Table 3. The formal context built to classify test passenger no. 7.

Id  Pclass≠1  Pclass=2  Pclass≠3  Age≥17  Age≤34.5  Age≤40.5  Age≤52  City=C  Survived
 1     ×                             ×                  ×        ×                1
 2     ×                                      ×         ×        ×                1
 3                         ×         ×                                     ×      1
 4     ×                             ×                           ×                0
 5     ×         ×         ×         ×        ×         ×        ×         ×      0
 6     ×         ×         ×         ×        ×         ×        ×         ×      0

Therefore, the time complexity of classifying |G_test| test instances with the proposed algorithm based on a training formal context (G, M, I) is approximately O(|G_test||G||M|²|L̄|), where |L̄| is the average lattice size over the formal contexts described in step 2 of Algorithm 1.

5 Example

Let us illustrate the proposed algorithm with the toy example from Table 1. To classify object no. 10, we perform the following steps according to Algorithm 1:

1. We fix Gini impurity as the information criterion of interest and the parameters min_supp = 0.5 and n_rules = 3. Thus, we are going to classify the test instance with 3 rules, each supported by at least 5 objects and having the highest gain in Gini impurity.

2. The case Outlook = sunny, Temperature = cool, Humidity = high, Windy = false corresponds to the attributes os and tc of the original attribute set (the humidity is not normal and it is not windy). Taking the negations of the attributes into account, the case is described by the set {¬or, ¬oo, os, tc, ¬tm, ¬th, ¬hn, ¬w}.

3. We build a formal context whose objects are the training set instances and whose attributes are those of the test instance, {¬or, ¬oo, os, tc, ¬tm, ¬th, ¬hn, ¬w}. The corresponding binary table is shown in Table 4.

Table 4. The training set instances restricted to the attributes of the test instance Outlook = sunny, Temperature = cool, Humidity = high, Windy = false. Attributes: ¬or – outlook is not rainy, ¬oo – outlook is not overcast, os – outlook = sunny, tc – temperature = cool, ¬tm – temperature is not mild, ¬th – temperature is not high, ¬hn – humidity is not normal, ¬w – not windy, play – whether to play tennis or not (class attribute). A concept lattice built from this formal context is shown to the right of the table; the horizontal line separates the concepts whose extents comprise at least 5 objects. [The table body is a binary object-attribute incidence table over the nine training objects.]

4. A concept lattice organizing all formal concepts of this formal context is shown to the right of Table 4. The horizontal line separates the concepts whose extents contain at least 5 objects (those above the line satisfy min_supp ≥ 0.5).

5. The 9 formal concepts satisfying the min_supp ≥ 0.5 constraint give rise to 9 classification rules. The top 3 rules with the highest gain in Gini impurity are given in Table 5.

Table 5. Top 3 rules to classify the test instance Outlook = sunny, Temperature = cool, Humidity = high, Windy = false.

Rule                                                      Gini gain
{not windy, temperature not mild} → play                    0.278
{outlook not overcast, temperature not high} → play         0.111
{outlook not overcast, temperature not mild} → play         0.044

6. The "best" rules mined in the previous step unanimously classify the test instance Outlook = sunny, Temperature = cool, Humidity = high, Windy = false as appropriate for playing tennis.
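For concreteness, here is a simplified Python sketch of the per-instance procedure of Algorithm 1 for the binary-attribute case. It enumerates closed attribute sets by brute-force closure of subsets of the test instance's attributes instead of running Close-by-One, which is only feasible for very small contexts; the helper names and the particular definition of Gini gain (base impurity minus the weighted impurity of the rule's extent and its complement) are our own choices, so the numbers it produces are not guaranteed to reproduce Table 5 exactly.

# Simplified brute-force sketch of Algorithm 1 (LLO) for the binary-attribute case.
from itertools import combinations
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def llo_classify(train, labels, test_attrs, min_supp=0.5, n_rules=3):
    # train: dict object -> set of binary attributes; labels: dict object -> class
    G = list(train)
    def extent(B):                                 # B' within the lazy context K_t
        return frozenset(g for g in G if B <= train[g])
    def intent(A):                                 # A' restricted to the test attributes
        return frozenset(m for m in test_attrs if all(m in train[g] for g in A))

    base = gini([labels[g] for g in G])
    rules = {}                                     # closed premise -> (gain, predicted class)
    for k in range(len(test_attrs) + 1):
        for seed in combinations(sorted(test_attrs), k):
            A = extent(frozenset(seed))
            if len(A) / len(G) < min_supp:         # support constraint
                continue
            B = intent(A)                          # closure: (A, B) is a formal concept
            if B in rules:
                continue
            inside = [labels[g] for g in A]
            outside = [labels[g] for g in G if g not in A]
            gain = base - (len(inside) * gini(inside) + len(outside) * gini(outside)) / len(G)
            rules[B] = (gain, Counter(inside).most_common(1)[0][0])
    best = sorted(rules.items(), key=lambda kv: -kv[1][0])[:n_rules]
    votes = Counter(cls for _, (gain, cls) in best)
    return votes.most_common(1)[0][0], best        # predicted class and the rules used

# Hypothetical usage on a binarized training set:
# train = {1: {'os', 'tc'}, 2: {'or', 'tc', 'hn'}, ...}; labels = {1: 'yes', 2: 'no', ...}
# llo_classify(train, labels, test_attrs={'os', 'tc'}, min_supp=0.5, n_rules=3)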
6 Experiments

As we have stated, in this paper we deal with "important data" problems, those where both accurate and interpretable results are needed. We compare the proposed classification algorithm (denoted LLO, for "Lazy Lattice-based Optimization") with the Scikit-learn [16] implementations of CART [14] and kNN on several datasets from the UCI machine learning repository.² We used pairwise mutual information as the criterion for rule selection. CART and kNN parameters were chosen by stratified 5-fold cross-validation and are given in Table 7. The parameter min_supp of LLO was set, for each dataset, to the CART min_samples_leaf value divided by the number of objects. We used n_rules = 5 classification rules to vote for a test instance's label.

The results are given in Tables 6 and 7. As can be seen, the proposed approach performs better than CART on most of the datasets, while kNN is often better when the number of attributes is not high. The running times of LLO are admittedly far from perfect, which is due to the computationally demanding nature of the algorithm.

² http://repository.seasr.org/Datasets/UCI/csv/
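A sketch of this tuning protocol with scikit-learn is given below; the dataset and the parameter grids are placeholders rather than the exact settings used in our experiments:

# Sketch of the evaluation protocol: CART and kNN parameters are tuned with
# stratified 5-fold cross-validation, then accuracy and F1 are reported.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for name, estimator, grid in [
    ("CART", DecisionTreeClassifier(random_state=0), {"min_samples_leaf": [1, 2, 4, 8]}),
    ("kNN", KNeighborsClassifier(), {"n_neighbors": [5, 10, 20, 50]}),
]:
    search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy").fit(X, y)
    acc = cross_val_score(search.best_estimator_, X, y, cv=cv, scoring="accuracy")
    f1 = cross_val_score(search.best_estimator_, X, y, cv=cv, scoring="f1")
    print(name, search.best_params_, round(acc.mean(), 3), round(f1.mean(), 3))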
Table 6. Accuracy and F1-score in classification experiments with the UCI machine learning datasets. "CART acc" stands for "5-fold cross-validation accuracy of the CART algorithm", . . . , "LLO F1" stands for "5-fold cross-validation F1 score of the Lazy Lattice-based Optimization algorithm".

dataset         CART acc  kNN acc  LLO acc  CART F1  kNN F1  LLO F1
audiology         0.743    0.442    0.758    0.725    0.336   0.736
breast-cancer     0.738    0.727    0.769    0.477    0.66    0.694
breast-w          0.936    0.773    0.942    0.909    0.734   0.921
colic             0.647    0.644    0.653    0.619    0.569   0.664
heart-h           0.782    0.837    0.791    0.664    0.831   0.787
heart-statlog     0.804    0.848    0.816    0.761    0.846   0.823
hepatitis         0.794    0.794    0.782    0.867    0.702   0.755
hypothyroid       0.975    0.923    0.968    0.974    0.886   0.948
ionosphere        0.9      0.783    0.924    0.923    0.757   0.938
kr-vs-kp          0.98     0.761    0.98     0.981    0.756   0.984
letter            0.769    0.711    0.774    0.769    0.645   0.771
lymph             0.818    0.831    0.82     0.806    0.813   0.85
primary-tumor     0.425    0.469    0.457    0.376    0.418   0.409
segment           0.938    0.872    0.947    0.938    0.869   0.928
sonar             0.697    0.663    0.73     0.665    0.658   0.718
soybean           0.877    0.89     0.88     0.868    0.883   0.879
splice            0.943    0.833    0.956    0.943    0.832   0.948
vehicle           0.708    0.677    0.692    0.708    0.667   0.62
vote              0.956    0.929    0.968    0.946    0.929   0.955
vowel             0.436    0.405    0.442    0.428    0.387   0.406
waveform-5000     0.761    0.834    0.783    0.761    0.583   0.774

Table 7. Parameters and runtimes in classification experiments with the UCI machine learning datasets. "CART msl" stands for the minimal required number of objects in each leaf of a CART tree (min_samples_leaf), "kNN k" is the number of neighbors used by the kNN algorithm.

dataset         # objects  # attr  CART msl  kNN k  CART time  kNN time  LLO time
audiology          226       94       1        10      0.31      0.52      9.97
breast-cancer      286       39       4        20      0.29      0.52      5.35
breast-w           699       89       6        10      0.29      0.83    258.37
colic              368       59       1        30      0.3       0.52      6.41
heart-h            294       24       2        20      0.3       0.52      0.89
heart-statlog      270       13       5        45      0.3       0.53      3.76
hepatitis          155      285       2        10      0.29      0.55     62.9
hypothyroid       3772      126       7        15      0.63      1.39    298.84
ionosphere         351       34       4        10      0.41      0.54      2.03
kr-vs-kp          3196       38       1        50      0.4       2.03     23.15
letter           20000      256       1        71     38.04     67.85   6607.2
lymph              148       50       1        10      0.29      0.52      4.7
primary-tumor      339       26       4        15      0.3       0.52      1.59
segment           2310       19       1        10      1.05      0.83      4.17
sonar              208       60       3        15      0.41      0.53      3.79
soybean            683       98       1        10      0.3       0.73     32.6
splice            3190      287       6        65      1.3      11.29   1302.57
vehicle            846       18       4        10      0.62      0.63      1.34
vote               435       32       2        10      0.31      0.53      2.65
vowel              990       26       2        35      0.63      0.63      3.29
waveform-5000     5000       40       5        82      3.79      1.34     40.2

7 Conclusions and further work

In this paper, we have shown how searching for classification hypotheses in a formal concept lattice for each test instance individually may yield accurate results while keeping the classification model interpretable. The proposed strategy is computationally demanding, but it may be used for "small data" problems where prediction delay is not as important as classification accuracy and interpretability.

Further, we plan to extend the idea of searching for classification hypotheses in a concept lattice to complex structure data such as molecular graphs, applying the same strategy of lazy classification by searching for succinct classification rules in a pattern concept lattice. The designed framework might help to learn sets of rules for tasks such as biological activity (toxicology, mutagenicity, etc.) prediction. We are also going to interpret random forests as a search for an optimal hypothesis in a concept lattice and to try to compete with this popular classification method.

References

1. Tsoumakas, G., Papadopoulos, A., Qian, W., Vologiannidis, S., D'yakonov, A., Puurula, A., Read, J., Svec, J., Semenov, S.: WISE 2014 challenge: Multi-label classification of print media articles to topics. In: 15th International Conference on Web Information Systems Engineering (WISE 2014), Proceedings, Part II. Volume 8787 of Lecture Notes in Computer Science, Springer (October 12-14, 2014) 541–548
2. Li, X., Zhong, Y.: An overview of personal credit scoring: Techniques and future work. International Journal of Intelligence Science 2(4A) (2012) 181–189
3. Aha, D.W., ed.: Lazy Learning. Kluwer Academic Publishers, Norwell, MA, USA (1997)
4. Friedman, J.H.: Lazy decision trees. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence, Volume 1. AAAI'96, AAAI Press (1996) 717–724
5. Veloso, A., Meira Jr., W., Zaki, M.J.: Lazy associative classification. In: Proceedings of the Sixth International Conference on Data Mining. ICDM '06, IEEE Computer Society, Washington, DC, USA (2006) 645–654
6. Belohlavek, R., De Baets, B., Outrata, J., Vychodil, V.: Inducing decision trees via concept lattices. International Journal of General Systems 38(4) (2009) 455–467
7. Kuznetsov, S.O.: Scalable knowledge discovery in complex data with pattern structures. In Maji, P., Ghosh, A., Murty, M.N., Ghosh, K., Pal, S.K., eds.: PReMI. Volume 8251 of Lecture Notes in Computer Science, Springer (2013) 30–39
8. Masyutin, A., Kashnitsky, Y., Kuznetsov, S.O.: Lazy classification with interval pattern structures: Application to credit scoring. In: CEUR Workshop Proceedings. Volume 1430 (2015) 43–54
9. Kashnitsky, Y., Kuznetsov, S.O.: Lazy associative graph classification. In: CEUR Workshop Proceedings. Volume 1430 (2015) 63–74
10. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. 1st edn. Springer-Verlag New York, Inc., Secaucus, NJ, USA (1997)
11. Hata, I., Veloso, A., Ziviani, N.: Learning accurate and interpretable classifiers using optimal multi-criteria rules. JIDM 4(3) (2013) 204–219
12. Andrews, S.: In-Close2, a high performance formal concept miner. In: Conceptual Structures for Discovering Knowledge, 19th International Conference on Conceptual Structures, ICCS 2011, Derby, UK, Proceedings (2011) 50–62
13. Kuznetsov, S.O.: A fast algorithm for computing all intersections of objects from an arbitrary semilattice. Nauchno-Tekhnicheskaya Informatsiya, Seriya 2: Informatsionnye Protsessy i Sistemy (1) (1993) 17–20
14. Breiman, L., Friedman, J., Stone, C., Olshen, R.: Classification and Regression Trees. The Wadsworth and Brooks-Cole Statistics-Probability Series. Taylor & Francis (1984)
15. Kuznetsov, S.O., Obiedkov, S.A.: Comparing performance of algorithms for generating concept lattices. Journal of Experimental & Theoretical Artificial Intelligence 14(2-3) (2002) 189–216
16. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011) 2825–2830