Interval Pattern Concept Lattice as a Classifier Ensemble

Yury Kashnitsky, Sergei O. Kuznetsov

National Research University Higher School of Economics, Moscow, Russia
{ykashnitsky,skuznetsov}@hse.ru

Abstract. Decision tree learning is one of the most popular classification techniques. However, by its nature it is a greedy approach to finding a classification hypothesis that optimizes some information-based criterion. It is very fast but may lead to suboptimal classification hypotheses. Moreover, while decision trees are easily interpretable, ensembles of trees (random forests and gradient-boosted trees) are not, which is crucial in some domains, such as medical diagnostics or bank credit scoring. In the case of such "small but important data" problems one is not obliged to perform a greedy search for classification hypotheses, and therefore alternatives to decision tree learning may be considered. In this paper, we propose an FCA-based classification technique where each test instance is classified with a set of the best (in terms of some information-based criterion) classification rules. In a set of benchmarking experiments, the proposed strategy is compared with decision tree and nearest neighbor learning.

Keywords: machine learning, classification, decision tree learning, formal concept analysis, pattern structures

1 Introduction

The classification task in machine learning aims to use some historical data (a training set) to predict unknown discrete variables in unseen data (a test set). While there are dozens of popular methods for solving the classification problem, there is usually an accuracy-interpretability trade-off when choosing a method for a particular task. Neural networks, random forests and ensemble techniques (boosting, bagging, stacking, etc.) are known to outperform simple methods in difficult tasks. Kaggle competitions also bear testimony to that: usually, winners resort to ensemble techniques, mainly to gradient boosting [13]. The mentioned algorithms are widespread in application scenarios where classification performance is the main objective. In Optical Character Recognition, voice recognition, information retrieval and many other tasks we are typically satisfied with a trained model if it has a low generalization error.

However, in many applications we need a model to be interpretable as well as accurate. Classification rules built from data and examined by experts may be justified or validated. In medical diagnostics, when making highly responsible decisions (e.g., predicting whether a patient has cancer), experts prefer to extract readable rules from a machine learning model in order to "understand" it and justify the decision. In credit scoring, for instance, applying ensemble techniques can be very effective, but the model is often required to have "sound business logic", that is, to be interpretable [10].

In what follows, we introduce some notions from Formal Concept Analysis (FCA) [5] and provide a technique to express decision tree learning in terms of a search for a hypothesis in a concept lattice (Section 3). In Section 4, we propose an algorithm whose design guarantees that each test object is classified with a rule at least as good (in terms of some criterion such as information gain or Gini impurity) as the one used by a decision tree.
Finally, we discuss the results of experiments with several popular datasets (Section 5), draw conclusions and outline directions for further work on the proposed ideas.

2 Pattern Structures and Projections

Pattern structures are a natural extension of Formal Concept Analysis to objects with arbitrary partially ordered descriptions [4].

Definition 1. Let G be a set (of objects), let (D, ⊓) be a meet-semilattice (of all possible object descriptions) and let δ : G → D be a mapping between objects and descriptions. The set δ(G) := {δ(g) | g ∈ G} generates a complete subsemilattice (D_δ, ⊓) of (D, ⊓) if every subset X of δ(G) has an infimum ⊓X in (D, ⊓). A pattern structure is a triple (G, (D, ⊓), δ), provided that the set δ(G) generates a complete subsemilattice (D_δ, ⊓) [4, 9].

Definition 2. Patterns are elements of D. Patterns are naturally ordered by the subsumption relation ⊑: given c, d ∈ D, one has c ⊑ d ⇔ c ⊓ d = c. The operation ⊓ is also called a similarity operation.

A pattern structure (G, (D, ⊓), δ) gives rise to the following derivation operators (·)^□:

    A^□ = ⊓_{g ∈ A} δ(g)              for A ⊆ G,
    d^□ = {g ∈ G | d ⊑ δ(g)}          for d ∈ (D, ⊓).

Pairs (A, d) satisfying A ⊆ G, d ∈ D, A^□ = d, and A = d^□ are called pattern concepts of (G, (D, ⊓), δ). As in classical FCA, pattern concepts form a pattern concept lattice. In case it is too computationally demanding to build the whole lattice, projections are used to simplify object descriptions and speed up the construction of a pattern concept lattice.

Definition 3. A projection [4] of a semilattice (D, ⊓) is a kernel function ψ : D → D, i.e., for all x, y ∈ D:

– x ⊑ y ⇒ ψ(x) ⊑ ψ(y) (monotonicity)
– ψ(x) ⊑ x (contractivity)
– ψ(ψ(x)) = ψ(x) (idempotence)

3 Interval Pattern Structure Projections

The theoretical part of the proposed approach is based on Formal Concept Analysis and pattern structures, in particular on Interval Pattern Structures [6], which provide a way to apply FCA techniques to data with numeric attributes. Unfortunately, the size of the concept lattice is usually too large for it to be used efficiently in learning [11]. Hence, we introduce a so-called discretizing projection on interval pattern structures, which helps to build more general object descriptions based on numeric attributes.

Definition 4. Let (G, (D, ⊓), δ) be an interval pattern structure. Let T_i = {τ_{i1}, ..., τ_{it_i}}, i = 1, ..., m, be m sets of real numbers, where m is the number of intervals in each description d ∈ D. Then

    ψ(⟨[a_i, b_i]⟩_{i ∈ [1,m]}) = ⟨[max{τ | τ ∈ T_i ∪ {−∞, +∞}, τ ≤ a_i}, min{τ | τ ∈ T_i ∪ {−∞, +∞}, τ ≥ b_i}]⟩_{i ∈ [1,m]}

is called a discretizing projection of the semilattice (D, ⊓). The discretizing projection of Def. 4 is indeed a projection in the sense of Def. 3.

Example 1. Consider a toy dataset with only 4 objects and 1 numeric attribute, as shown in Fig. 1 (left). In order to apply decision tree learning to some classification task with this dataset, one would apply some discretization method to produce binary attributes from attribute a. Consider the discretization shown in Fig. 1 (middle). The corresponding concept lattice is shown in the same figure on the right-hand side.

Fig. 1. A toy many-valued context (left), its discretization (middle) and the corresponding concept lattice (right, diagram not reproduced here):

        a  | a ≤ 1.5 | a ≤ 3.5 | a ≥ 1.5 | a ≥ 3.5
    1   1  |    ×    |    ×    |         |
    2   2  |         |    ×    |    ×    |
    3   3  |         |    ×    |    ×    |
    4   4  |         |         |    ×    |    ×

The mapping

    ψ([a, b]) = [max{τ | τ ∈ T⁺, τ ≤ a}, min{τ | τ ∈ T⁺, τ ≥ b}]   with   T⁺ = {−∞, 1.5, 3.5, +∞}

is a projection of the semilattice built for a context that arises from the interordinal scaling of the initial many-valued numerical context.
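To make Def. 4 concrete, here is a minimal Python sketch of the discretizing projection applied to interval descriptions; the function name and the data layout (a description as a list of (a_i, b_i) pairs, thresholds as a list of sets T_i) are our own illustration rather than part of any reference implementation.

```python
def discretizing_projection(description, thresholds):
    """A sketch of the discretizing projection psi from Def. 4.

    description -- a list of (a_i, b_i) interval pairs, one per numeric attribute
    thresholds  -- a list of threshold sets T_i, one per attribute
    """
    projected = []
    for (a, b), T in zip(description, thresholds):
        grid = sorted(set(T) | {float("-inf"), float("inf")})
        low = max(t for t in grid if t <= a)   # largest grid point not above a
        high = min(t for t in grid if t >= b)  # smallest grid point not below b
        projected.append((low, high))
    return projected

# Example 1: a single attribute a with T+ = {1.5, 3.5}
print(discretizing_projection([(2, 3)], [{1.5, 3.5}]))  # [(1.5, 3.5)]
print(discretizing_projection([(1, 1)], [{1.5, 3.5}]))  # [(-inf, 1.5)]
```

Applying the function twice yields the same result, which illustrates the idempotence required by Def. 3.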
See [8] for more details on the link between interordinal scaling and interval pattern structures. The pattern concept lattice corresponding to the discretizing projection ψ([a, b]) is isomorphic to the concept lattice of the discretized context shown in Fig. 1 (middle). Introducing a discretizing projection is a general way to express any discretization procedure (an essential part of decision tree learning algorithms) in terms of FCA.

4 Learning with Pattern Concept Lattices

For classification tasks with complex data we propose Algorithm 1. The main idea is to find, for each test instance, classification rules that maximize some information criterion (Gini index, pairwise mutual information, etc.). In the case of interval pattern structures, the algorithm by its design guarantees that each test instance is classified with a rule at least as good (in terms of the information criterion) as the one used by a decision tree. We apply a modification of the Close-by-One algorithm [7] to build all pattern concepts, which form the search space for classification rules.

Let PS_train = (G_train, ((D, ⊓), c_train), δ_train) and PS_test = (G_test, (D, ⊓), δ_test) be two pattern structures corresponding to the training and test sets in a classification task. Let CbO_PS(PS, min_supp) be the algorithm used to find all pattern concepts of a pattern structure PS with support greater than or equal to min_supp. Let inf : D × c_train → R be an information criterion used to rate classification rules (we use Gini impurity by default). Finally, let min_supp and n_rules be the parameters of the algorithm: the minimal support of each classification rule's premise and the number of rules used to predict each test instance's class attribute. With these designations, the main steps of the proposed algorithm are as follows (a Python sketch of the rule-ranking and voting steps is given after the list):

1. Initialize a list of predicted labels c_test for the test instances and a dictionary r_test of classification rules for each test instance.
2. Calculate the proportion of positive objects in the training set: f_pos = |c⁺_train| / |G_train|, where c⁺_train ⊆ G_train denotes the set of positive training objects.
3. With the CbO_PS algorithm, find S, a dictionary of all pattern concepts of the pattern structure PS_train with support greater than or equal to min_supp; along the way, calculate the value of the criterion inf for each concept intent (intents are the keys of S, criterion values are its values).
4. Sort S by its values.
5. For each test instance g_t ∈ G_test:
   – Find the first n_rules concept intents from the sorted S such that (A_i, d_i) ∈ S and d_i ⊑ δ_test(g_t), i = 1, ..., n_rules.
   – For each such "top-ranked" concept intent d_i, determine f_i⁺, the proportion of positive objects in its extent: f_i⁺ = |d_i^□ ∩ c⁺_train| / |d_i^□|.
   – Thus, form {d_i → f_i⁺}_{i ∈ [1, n_rules]}, a set of classification rules for g_t. Set r_test[t] equal to this set of rules.
   – Predict the class attribute of g_t as the indicator of the average confidence of the rules in r_test[t] being greater than or equal to the proportion of positive objects in the training set:

        c_test[t] = [ Σ_{i=1}^{n_rules} f_i⁺ ≥ f_pos · n_rules ].
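As an illustration, below is a minimal Python sketch of the rule-ranking and voting steps above, assuming the pattern concepts have already been mined with the required minimal support and that descriptions are lists of (low, high) interval pairs as in the previous sketch; the function names, the ranking direction and the fallback for the case of no applicable rules are our own assumptions rather than the exact implementation used in the experiments.

```python
def covers(premise, description):
    """d is subsumed by delta(g) for interval patterns: every premise interval
    contains the corresponding interval of the object description."""
    return all(lo <= a and b <= hi
               for (lo, hi), (a, b) in zip(premise, description))

def predict_one(description, concepts, f_pos, n_rules=10):
    """Classify one test object with its n_rules best applicable rules.

    concepts -- iterable of (intent, criterion_value, f_plus) triples, where
                f_plus is the share of positive objects in the concept's
                extent, computed on the training set.
    Returns the predicted label and the set of rules used.
    """
    # Step 4: rank concepts by the information criterion
    # (assumption: lower Gini impurity of the extent is better).
    ranked = sorted(concepts, key=lambda c: c[1])
    # Step 5: keep the first n_rules intents that cover the test object.
    rules = [(intent, f_plus) for intent, _, f_plus in ranked
             if covers(intent, description)][:n_rules]
    if not rules:
        # Assumption: fall back to the majority class if no rule applies.
        return int(f_pos >= 0.5), []
    # Vote: indicator that the average rule confidence is at least f_pos.
    avg_confidence = sum(f for _, f in rules) / len(rules)
    return int(avg_confidence >= f_pos), rules
```

Algorithm 1 below additionally records r_test[t] for every test object and mines the concepts with CbO_PS beforehand.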
Algorithm 1 Concept Lattice-Based Rule-learner (CoLiBRi)

Input: PS_train = (G_train, ((D, ⊓), c_train), δ_train);
       PS_test = (G_test, (D, ⊓), δ_test);
       min_supp ∈ R⁺, n_rules ∈ N;
       CbO_PS(PS, min_supp) : PS → S;
       inf : D × c_train → R;
       sort(S, inf) : S → S
Output: c_test, r_test

c_test = ∅, r_test = ∅
f_pos = |c⁺_train| / |G_train|
S = {(A, d) : inf(d, c_train) | A ⊆ G_train, d ∈ D, A^□ = d, d^□ = A, |A| ≥ min_supp} = CbO_PS(PS_train, min_supp)
S = sort(S, inf)
for g_t ∈ G_test do
    {d_i}_{i ∈ [1, n_rules]} = first n_rules intents d such that (A, d) ∈ S and d ⊑ δ_test(g_t)
    f_i⁺ = |d_i^□ ∩ c⁺_train| / |d_i^□|
    r_test[t] = {d_i → f_i⁺}_{i ∈ [1, n_rules]}
    c_test[t] = [ Σ_{i=1}^{n_rules} f_i⁺ ≥ f_pos · n_rules ]
end for

In the case of a classification task with numeric attributes we apply the same Algorithm 1 to interval pattern structures. To make it tractable, we apply it to the projections ψ(PS_train) and ψ(PS_test) of the training and test interval pattern structures, where ψ(PS) is the discretizing pattern structure projection defined in Def. 4.

5 Experiments

We compare the proposed classification algorithm (denoted "CoLiBRi" for "Concept Lattice-Based Rule-learner") with Scikit-learn [12] implementations of CART [2], Random Forest [1] and kNN on several datasets from the UCI machine learning repository.¹

¹ http://repository.seasr.org/Datasets/UCI/csv/

dataset        | DT acc | RF acc | kNN acc | CoLiBRi acc | DT F1 | RF F1 | kNN F1 | CoLiBRi F1
audiology      |  0.75  |  0.80  |  0.63   |   0.79*     | 0.71  | 0.74  | 0.58   |   0.74
balance-scale  |  0.63  |  0.66  |  0.76   |   0.65      | 0.58  | 0.63  | 0.75   |   0.61
breast_cancer  |  0.70  |  0.74  |  0.73   |   0.76      | 0.45  | 0.42  | 0.38   |   0.44*
car            |  0.75  |  0.78* |  0.71   |   0.79      | 0.75  | 0.76  | 0.71   |   0.76
hayes-roth     |  0.84* |  0.83* |  0.49   |   0.86      | 0.84* | 0.82  | 0.49   |   0.85
lymph          |  0.80  |  0.83  |  0.86   |   0.83      | 0.77  | 0.85  | 0.84*  |   0.84*
mol_bio_prom   |  0.78  |  0.83  |  0.83   |   0.82*     | 0.78  | 0.84  | 0.80   |   0.83*
nursery        |  0.64  |  0.65  |  0.72   |   0.65      | 0.62  | 0.62  | 0.70   |   0.62
primary_tumor  |  0.41  |  0.46  |  0.41   |   0.45*     | 0.37  | 0.41  | 0.37   |   0.40*
solar_flare    |  0.70* |  0.70* |  0.63   |   0.72      | 0.67  | 0.69* | 0.60   |   0.71
soybean        |  0.91* |  0.91* |  0.92   |   0.91*     | 0.91* | 0.93  | 0.92*  |   0.91*
spect_train    |  0.61  |  0.69  |  0.68*  |   0.70      | 0.34  | 0.36* | 0.23   |   0.38
tic-tac-toe    |  0.79  |  0.79  |  0.85   |   0.78      | 0.82  | 0.86  | 0.89   |   0.85

Table 1. Accuracies and F1-scores in classification experiments with the UCI machine learning datasets. "DT acc" and "DT F1" stand for the average 5-run 5-fold CV accuracy and F1 score of the CART algorithm, and so on; "CoLiBRi F1" stands for the average 5-run 5-fold CV F1 score of the proposed CoLiBRi algorithm.

We used Gini impurity as the criterion for rule selection and MDL [3] for continuous feature discretization. CART, Random Forest and kNN parameters (min_samples_leaf ∈ [1, 10] for the tree-based algorithms and k ∈ {1, 2, 5, 15, 30, 50} for kNN) were chosen by stratified 5-fold cross-validation. We built 10 trees for each instance of the Random Forest classifier. The parameter min_supp for CoLiBRi was taken equal to CART's min_samples_leaf for each dataset. We used n_rules = 10 classification rules to vote for each test instance's label. The described algorithms were implemented in Python 2.7.3 and run on a 4-CPU machine with 4 GB RAM.

The results are presented in Table 1. Each entry stands for the average metric (accuracy or F1-score) over 5 runs of 5-fold cross-validation. In the table, the algorithm with the best performance on each metric is boldfaced; other algorithms whose performance is not statistically distinguishable from the best one at p = 0.05 (paired t-tests over the 5 runs) are marked with an asterisk.
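For reference, the baseline parameter search described above can be sketched with scikit-learn roughly as follows. This uses the current scikit-learn API rather than the Python 2.7.3 environment of the original experiments, the model-selection metric is assumed to be accuracy, and data loading is omitted.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def tune_baselines(X, y, seed=0):
    """Grid-search the baseline parameters from the experimental setup:
    min_samples_leaf in [1, 10] for the tree-based models, k for kNN."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    candidates = {
        "DT": (DecisionTreeClassifier(random_state=seed),
               {"min_samples_leaf": list(range(1, 11))}),
        "RF": (RandomForestClassifier(n_estimators=10, random_state=seed),
               {"min_samples_leaf": list(range(1, 11))}),
        "kNN": (KNeighborsClassifier(),
                {"n_neighbors": [1, 2, 5, 15, 30, 50]}),
    }
    best = {}
    for name, (model, grid) in candidates.items():
        search = GridSearchCV(model, grid, scoring="accuracy", cv=cv)
        search.fit(X, y)
        best[name] = (search.best_params_, search.best_score_)
    return best
```

The 5-run averages reported in Table 1 would then correspond to repeating such a cross-validation with five different random seeds.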
Other algorithm’s whose performance is not statistically distinguishable from the best algorithm at p = 0.05 using paired t-tests on the 5 runs are *’ed. The best parameters for each algorithm are mentioned in Table 2. As it can be seen, the proposed approach performs better than CART and is statistically indistinguishable from RF on most of the datasets. Surprising enough, kNN seems to be the best-performer (on average over all datasets) in terms of accuracy but not F1-score. Conclusions and further work In this paper, we have shown how searching for classification hypotheses in a formal concept lattice may yield accurate results while providing interpretable classification rules. Further we plan to test the proposed strategy in classification tasks such as predicting biological activity (toxicology, mutagenicity, etc.) and telecom client dataset DT min_samples_leaf RF min_samples_leaf kNN k audiology 1 1 2 balance-scale 6 1 50 breast cancer 4 3 5 car 3 2 5 hayses-roth 3 1 15 lymph 1 1 5 mol-bio-prom 3 3 5 nursery 3 4 50 primary tumor 4 4 30 solar flare 3 1 30 soybean 1 1 2 spect train 9 5 10 tic-tac-toe 10 3 10 Table 2. Best parameters in classification experiments with the UCI ma- chine learning datasets. CoLiBRi’s min_supp is taken equal to CART’s min_samples_leaf for each dataset. satisfaction where objects have complex descriptions (graphs and sequences cor- respondingly). We also plan to introduce some randomization in mining rules for each test instance (as it is done with random forests) in order to further improve the classification quality. References [1] L. Breiman. “Random Forests”. In: Machine Learning 45.1 (Oct. 2001), pp. 5–32. issn: 0885-6125. [2] L. Breiman et al. Classification and Regression Trees. The Wadsworth and Brooks-Cole statistics-probability series. Taylor & Francis, 1984. [3] U. M. Fayyad and K. B. Irani. “Multi-Interval Discretization of Continuous- Valued Attributes for Classification Learning.” In: IJCAI. 1993, pp. 1022– 1029. [4] B. Ganter and S. O. Kuznetsov. “Pattern Structures and Their Projec- tions”. In: Conceptual Structures: Broadening the Base. Ed. by Harry Del- ugach and Gerd Stumme. Vol. 2120. Lecture Notes in Computer Science. Berlin/Heidelberg: Springer, 2001, pp. 129–142. [5] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Founda- tions. 1st. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 1997. [6] M. Kaytoue et al. “Mining gene expression data with pattern structures in formal concept analysis”. en. In: Information Sciences 181.10 (May 2011), pp. 1989–2001. [7] S. O. Kuznetsov. “A fast algorithm for computing all intersections of ob- jects from an arbitrary semilattice”. In: Nauchno-Tekhnicheskaya Infor- matsiya Seriya 2-Informatsionnye Protsessy I Sistemy 1 (1993), pp. 17– 20. [8] S. O. Kuznetsov. “Fitting Pattern Structures to Knowledge Discovery in Big Data”. In: Formal Concept Analysis, 11th International Conference, ICFCA 2013, Dresden, Germany, May 21-24, 2013. Proceedings. Vol. 7880. Lecture Notes in Computer Science. Springer, 2013, pp. 254–266. [9] S. O. Kuznetsov. “Scalable Knowledge Discovery in Complex Data with Pattern Structures”. In: PReMI. Ed. by Pradipta Maji et al. Vol. 8251. Lecture Notes in Computer Science. Springer, 2013, pp. 30–39. [10] X. Li and Y. Zhong. “An Overview of Personal Credit Scoring: Techniques and Future Work”. In: International Journal of Intelligence Science 2.4A (2012), pp. 181–189. [11] A. Masyutin, Y. Kashnitsky, and S. O. Kuznetsov. 
"Lazy classification with interval pattern structures: Application to credit scoring". In: Proceedings of the 4th International Workshop "What can FCA do for Artificial Intelligence?" (FCA4AI 2015), co-located with the International Joint Conference on Artificial Intelligence (IJCAI 2015), Buenos Aires, Argentina, July 25, 2015. Vol. 1430. 2015, pp. 43–54.
[12] F. Pedregosa et al. "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[13] G. Tsoumakas et al. "WISE 2014 Challenge: Multi-label Classification of Print Media Articles to Topics". In: 15th International Conference on Web Information Systems Engineering (WISE 2014), Proceedings, Part II. Vol. 8787. Lecture Notes in Computer Science. Springer, 2014, pp. 541–548.