Can FCA-based Recommender System Suggest a Proper Classifier?

Yury Kashnitsky and Dmitry I. Ignatov
National Research University Higher School of Economics
Scientific-Educational Laboratory for Intelligent Systems and Structural Analysis
Moscow, Russia
ykashnitsky@hse.ru, dignatov@hse.ru

Abstract. The paper briefly introduces multiple classifier systems and describes a new algorithm which improves classification accuracy by recommending a proper algorithm for the classification of each object. The recommendation is made under the assumption that a classifier is likely to predict the label of an object correctly if it has correctly classified its neighbors. The process of assigning a classifier to each object is based on Formal Concept Analysis. We explain the idea of the algorithm with a toy example and describe our first experiments with real-world datasets.

1 Introduction

The topic of Multiple Classifier Systems (MCSs) is well studied in the machine learning community [1]. Such algorithms appear under different names: mixture of experts, committee machines, classifier ensembles, classifier fusion and others. The underlying idea of all these systems is to train several (base) classifiers on a training set and to combine their predictions in order to classify objects from a test set [1]. This idea probably dates back to as early as the 18th century. Condorcet's jury theorem, formulated in 1785 [2], states that if a population makes a group decision and each voter is more likely than not to vote correctly, then adding more voters increases the probability that the majority decision is correct; the probability that the majority votes correctly tends to 1 as the number of voters increases. Similarly, if we have multiple weak classifiers (meaning that each classifier's error on its training data is less than 50% but greater than 0%), we can combine their predictions and boost the classification accuracy as compared to that of each single base classifier.

Among the most popular MCSs are bagging [3], boosting [7], random forests [9], and stacked generalization (or stacking) [10]. In this paper, we present one more algorithm of this type, the Recommender-based Multiple Classifier System (RMCS). Here the underlying proposition is that a classifier is likely to predict the label of an object from a test set correctly if it has correctly classified its neighbors from the training set.

The paper is organized as follows. In Section 2, we discuss bagging, boosting and stacking. In Section 3, we introduce the basic definitions of Formal Concept Analysis (FCA). Section 4 provides an example of the execution of the proposed RMCS algorithm on a toy synthetic dataset. Then, Section 5 describes the RMCS algorithm itself. Further, the results of the experiments with real data are presented. Section 7 concludes the paper.

2 Multiple Classifier Systems

In this section, we consider several well-known multiple classifier systems.

2.1 Bagging

The bootstrap sampling technique has been used in statistics for many years. Bootstrap aggregating, or bagging, is one of the applications of bootstrap sampling in machine learning. As sufficiently large data sets are often expensive or impossible to obtain, with bootstrap sampling multiple random samples are created from the source data by sampling with replacement. Samples may overlap or contain duplicate items, yet the combined results are usually more accurate than what a single sample of the entire source data achieves. In machine learning, the bootstrap samples are often used to train classifiers. Each of these classifiers can classify new instances by making a prediction; the predictions are then combined to obtain a final classification.

The aggregation step of bagging is only helpful if the classifiers are different. This only happens if small changes in the training data can result in large changes in the resulting classifier, that is, if the learning method is unstable [3].
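To make the procedure concrete, here is a minimal sketch of bagging by hand (our illustration, not taken from the paper): it draws bootstrap samples, trains a scikit-learn decision tree on each of them (an unstable learner, as discussed above), and aggregates the predictions by majority vote. The arrays and the number of estimators are placeholders; numpy arrays with non-negative integer class labels are assumed.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_estimators=25, random_state=0):
    """Train decision trees on bootstrap samples and aggregate by majority vote."""
    rng = np.random.RandomState(random_state)
    n = len(X_train)
    votes = []
    for _ in range(n_estimators):
        # bootstrap sample: draw n indices with replacement
        idx = rng.randint(0, n, size=n)
        tree = DecisionTreeClassifier(random_state=random_state)
        tree.fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.array(votes)  # shape: (n_estimators, len(X_test))
    # majority vote over the ensemble for each test object
    # (assumes non-negative integer class labels)
    return np.array([np.bincount(col).argmax() for col in votes.T])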
2.2 Boosting

The idea of boosting is to iteratively train classifiers with a weak learner (one whose error is less than 50% but greater than 0%) [4]. After each classifier is trained, its accuracy is measured, and misclassified instances are emphasized. Then the algorithm trains a new classifier on the modified dataset. At classification time, the boosting classifier combines the results of the individual classifiers it trained.

Boosting was originally proposed by Schapire and Freund [5,6]. In their Adaptive Boosting, or AdaBoost, algorithm, each of the training instances starts with a weight that tells the base classifier its relative importance [7]. At the initial step the weights of n instances are evenly distributed as 1/n. The individual classifier training algorithm should take these weights into account, resulting in different classifiers after each round of reweighting and reclassification. Each classifier also receives a weight based on its accuracy; its output at classification time is multiplied by this weight.

Freund and Schapire proved that, if the base classifier used by AdaBoost has an error rate of just slightly less than 50%, the training error of the meta-classifier will approach zero exponentially fast [7]. For a two-class problem the base classifier only needs to be slightly better than chance to achieve this error rate. For problems with more than two classes, an error rate of less than 50% is harder to achieve. Boosting appears to be vulnerable to overfitting; however, in tests it rarely overfits excessively [8].

2.3 Stacked generalization

In stacked generalization, or stacking, each individual classifier is called a level-0 model. Each may vote, or may have its output sent to a level-1 model, another classifier that tries to learn which level-0 models are most reliable. Level-1 models are usually more accurate than simple voting, provided they are given the class probability distributions from the level-0 models and not just the single predicted class [10].

3 Introduction to Formal Concept Analysis

3.1 Main definitions

A formal context in FCA is a triple K = (G, M, I), where G is a set of objects, M is a set of attributes, and the binary relation I ⊆ G × M shows which object possesses which attribute. gIm denotes that object g has attribute m. For subsets of objects and attributes A ⊆ G and B ⊆ M, the Galois operators are defined as follows:

A′ = {m ∈ M | gIm ∀g ∈ A},
B′ = {g ∈ G | gIm ∀m ∈ B}.

A pair (A, B) such that A ⊆ G, B ⊆ M, A′ = B and B′ = A, is called a formal concept of the context K. The sets A and B are closed and are called the extent and the intent of the formal concept (A, B), respectively. For a set of objects A, the set A′ of their common attributes describes the similarity of the objects of A, and the closed set A″ is a cluster of similar objects (with the set of common attributes A′) [11].

The number of formal concepts of a context K = (G, M, I) can be quite large (2^min{|G|, |M|} in the worst case), and the problem of computing this number is #P-complete [12]. There exist some ways to reduce the number of formal concepts, for instance, choosing concepts by stability, index or extent size [13].
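To make the definitions above concrete, the following sketch (our illustration) computes the Galois operators on a small made-up context given as a set of (object, attribute) pairs and checks whether a pair (A, B) is a formal concept.

def prime_objects(A, context, M):
    """A' : attributes shared by every object in A."""
    return {m for m in M if all((g, m) in context for g in A)}

def prime_attributes(B, context, G):
    """B' : objects having every attribute in B."""
    return {g for g in G if all((g, m) in context for m in B)}

def is_formal_concept(A, B, context, G, M):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    return prime_objects(A, context, M) == B and prime_attributes(B, context, G) == A

# a toy context: objects 1..3, attributes 'a', 'b', 'c'
G = {1, 2, 3}
M = {'a', 'b', 'c'}
context = {(1, 'a'), (1, 'b'), (2, 'a'), (3, 'b'), (3, 'c')}

print(prime_objects({1, 2}, context, M))                   # {'a'}
print(prime_attributes({'a'}, context, G))                 # {1, 2}
print(is_formal_concept({1, 2}, {'a'}, context, G, M))     # True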
For a context (G, M, I), a concept X = (A, B) is less general than or equal to a concept Y = (C, D) (denoted X ≤ Y) if A ⊆ C or, equivalently, D ⊆ B. For two concepts X and Y such that X ≤ Y and there is no concept Z with Z ≠ X, Z ≠ Y, X ≤ Z ≤ Y, the concept X is called a lower neighbor of Y, and Y is called an upper neighbor of X. This relationship is denoted by X ≺ Y. Formal concepts, ordered by this relationship, form a complete concept lattice, which can be represented by a Hasse diagram [14]. Several algorithms for building formal concepts (including Close by One) and constructing concept lattices are also studied in [14]. One can refer to [11] and [15] for examples of formal contexts, concepts and lattices together with their applications. Section 4 also shows the usage of the FCA apparatus in a concrete task.

However, in some applications there is no need to find all formal concepts of a formal context or to build the whole concept lattice. Concept lattices restricted to concepts with frequent intents are called iceberg lattices. They were shown to serve as a condensed representation of association rules and frequent itemsets in data mining [15]. Here we modified the Close by One algorithm slightly in order to obtain only the upper-most concept of a formal context and its lower neighbors. The description of the algorithm and the details of its modification are beyond the scope of this paper.

4 A toy example

Let us demonstrate the way RMCS works on the toy synthetic dataset shown in Table 1. We consider a binary classification problem with 8 objects comprising a training set and 2 objects in a test set. Each object has 4 binary attributes and a target attribute (class). Suppose we train 4 classifiers on this data and try to predict labels for objects 9 and 10. Using FCA terms, we denote by G = {1, 2, ..., 10} the whole set of objects, Gtest = {9, 10} the test set, Gtrain = G \ Gtest the training set, M = {m1, m2, m3, m4} the attribute set, and C = {cl1, cl2, cl3, cl4} the set of classifiers.

Table 1. A sample data set of 10 objects with 4 binary attributes m1-m4 and a binary target class: objects 1, 2, 4, 5 and 7 have label 1, objects 3, 6 and 8 have label 0; object 9 possesses all four attributes, object 10 possesses two of them, and the labels of these two test objects are to be predicted.

Here we run leave-one-out cross-validation on this training set for the 4 classifiers. Further, we fill in Table 2, where a cross for object i and classifier clj means that clj correctly classifies object i in the process of cross-validation. To clarify, a cross for object 3 and classifier cl4 means that, after being trained on the whole training set except object 3 (i.e. on objects {1, 2, 4, 5, 6, 7, 8}), classifier cl4 correctly predicted the label of object 3.

Table 2. A classification context

G/C   cl1  cl2  cl3  cl4
1      x         x    x
2           x    x
3      x              x
4           x    x
5      x    x
6      x    x         x
7           x         x
8           x    x    x

Let us consider Table 2 as a formal context with objects G and attributes C (so now the classifiers play the role of attributes). We refer to it as the classification context. The concept lattice of this context is presented in Fig. 1.

Fig. 1. The concept lattice of the classification context

As mentioned above, the number of formal concepts of a context K = (G, M, I) can be exponential in the worst case, but for this toy example it is possible to draw the whole lattice diagram. Fortunately, we do not need to build the whole lattice in the RMCS algorithm; we only keep track of its top concepts. Here are these top concepts: (G, ∅), ({1, 3, 5, 6}, {cl1}), ({2, 4, 5, 6, 7, 8}, {cl2}), ({1, 2, 4, 8}, {cl3}), ({1, 3, 6, 7, 8}, {cl4}).
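These top concepts can be read off Table 2 directly. The sketch below (our simplified stand-in for the topCbO function mentioned in Section 3, not the authors' implementation) encodes the classification context as a dictionary from classifiers to the sets of correctly classified objects, computes the concept generated by each classifier, and keeps the upper-most concept together with the maximal classifier-generated concepts:

# Table 2 as a dict: classifier -> set of training objects it classified correctly
classification_context = {
    'cl1': {1, 3, 5, 6},
    'cl2': {2, 4, 5, 6, 7, 8},
    'cl3': {1, 2, 4, 8},
    'cl4': {1, 3, 6, 7, 8},
}
G_train = set(range(1, 9))

def top_concepts(context, objects):
    """Return the upper-most concept and its lower neighbors as (extent, intent) pairs."""
    # upper-most concept: all objects, and the classifiers correct on every object
    top_intent = {cl for cl, ext in context.items() if ext == objects}
    concepts = [(objects, top_intent)]
    # candidate lower neighbors: concepts generated by single classifiers
    candidates = {}
    for cl, ext in context.items():
        # intent = closure of the extent: all classifiers correct on every object of ext
        intent = frozenset(c for c, e in context.items() if ext <= e)
        candidates[frozenset(ext)] = intent
    # keep only the extents that are maximal among the candidates
    for ext, intent in candidates.items():
        if not any(ext < other for other in candidates if other != ext):
            concepts.append((set(ext), set(intent)))
    return concepts

for extent, intent in top_concepts(classification_context, G_train):
    print(sorted(extent), sorted(intent))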
To classify the objects from Gtest, we first find their k nearest neighbors from Gtrain according to some distance metric. In this case, we use k = 3 and the Hamming distance. Under these conditions, we find that the three nearest neighbors of object 9 are 4, 5 and 7, while those of object 10 are 1, 6 and 8.

Then, we take these sets of nearest neighbors, Neighb9 = {4, 5, 7} and Neighb10 = {1, 6, 8}, and find the maximal intersections of these sets with the extents of the formal concepts presented above (ignoring the concept (G, ∅)). The intents (i.e. classifiers) of the corresponding concepts are given as recommendations for the objects from Gtest. The procedure is summarized in Table 3.

Table 3. Recommending classifiers for objects from Gtest

Gtest | 1st, 2nd, 3rd nearest neighbors | Neighbors | Classification concept whose extent gives the maximal intersection with the Neighbors set | Recommended classifier
9     | 4, 5, 7                         | {4, 5, 7} | ({2, 4, 5, 6, 7, 8}, {cl2})                                                               | cl2
10    | 1, 6, 8                         | {1, 6, 8} | ({1, 3, 6, 7, 8}, {cl4})                                                                  | cl4

Finally, the RMCS algorithm predicts the same labels for objects 9 and 10 as classifiers cl2 and cl4 do, respectively; a code sketch of this selection step is given after the remarks below. Lastly, let us make the following remarks:

1. We would not have ignored the upper-most concept with extent G if it did not have an empty intent. That is, if the top concept of the classification context had the form (G, {clj}), it would mean that clj correctly classified all objects from the training set, and we would therefore recommend it for the objects from the test set.

2. Another situation may occur when two or more classifiers turn out to be equally good at classifying the objects from Gtrain. That would mean that the corresponding columns of the classification table are identical and, therefore, the intent of some classification concept comprises more than one classifier. In this case, we do not have any argument for preferring one classifier to another and, hence, the final label is defined by voting among the labels predicted by these classifiers.

3. Here we considered an input dataset with binary attributes and a binary target class. However, the idea of the RMCS algorithm is still applicable to datasets with numeric attributes and to multi-class classification problems.
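The following sketch (again our illustration, reusing classification_context and top_concepts from the previous listing) reproduces the selection of Table 3: for each test object it intersects the set of its nearest neighbors with the extents of the top concepts and picks the concept with the largest intersection.

# nearest neighbors of the test objects, taken from Table 3
neighbors = {9: {4, 5, 7}, 10: {1, 6, 8}}

concepts = top_concepts(classification_context, G_train)

for obj, neighb in neighbors.items():
    # ignore the upper-most concept if its intent is empty
    candidates = [(ext, intent) for ext, intent in concepts if intent]
    extent, intent = max(candidates, key=lambda c: len(c[0] & neighb))
    print('object %d: recommended classifier(s) %s' % (obj, sorted(intent)))

# object 9: recommended classifier(s) ['cl2']
# object 10: recommended classifier(s) ['cl4']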
5 Recommender-based Multiple Classifier System

In this section, we discuss the Recommender-based Multiple Classifier System (RMCS). The pseudocode of the RMCS algorithm is presented in the listing Algorithm 1.

The inputs of the algorithm are the following:

1. {Xtrain, ytrain} is a training set and Xtest is a test set;
2. C = {cl1, cl2, ..., clK} is a set of K base classifiers. The algorithm is intended to achieve a classification accuracy exceeding that of the base classifiers;
3. dist(x1, x2) is a distance function for objects, defined in the attribute space. This might be the Minkowski (including Hamming and Euclidean) distance, a distance weighted by attribute importance, and others;
4. k and n_fold are parameters; their meaning is explained below;
5. topCbO(context) is a function for building the upper-most concept of a formal context and its lower neighbors. Strictly speaking, it is not an input of the algorithm, but RMCS uses it.

The algorithm includes the following steps:

1. Cross-validation on the training set. All K classifiers are trained on n_fold − 1 folds of Xtrain. Then a classification table (or context) is formed, where a cross is put for object i and classifier clj if clj correctly classifies object i after training on n_fold − 1 folds (object i belonging to the remaining fold);
2. Running the base classifiers. All K classifiers are trained on the whole Xtrain. Then a table of predictions is formed, where position (i, j) keeps the label predicted for object i from Xtest by classifier clj;
3. Building the top formal concepts of the classification context. The topCbO algorithm is run in order to build the upper formal concepts of the classification context. These concepts have the largest possible number of objects in their extents and the minimal possible number of classifiers in their intents (not counting the upper-most concept);
4. Finding neighbors of the objects from Xtest. The objects of the test set are processed one by one. For every object from Xtest we find its k nearest neighbors from Xtrain according to the selected distance dist(x1, x2). Let us say these k objects form a set Neighbors. Then, we search for a concept of the classification context whose extent yields the maximal intersection with the set Neighbors. If the intent of the upper-most concept is an empty set (i.e., no classifier correctly predicted the labels of all objects from Xtrain, which is mostly the case), then the upper-most concept (G, ∅) is ignored. Thus, we select a classification concept, and its intent is a set of classifiers Csel;
5. Classification. If Csel consists of just one classifier, we predict for the current object from Xtest the same label as this classifier does. If there are several selected classifiers, then the predicted label is defined by the majority rule.

Algorithm 1 Recommender-based Multiple Classifier System

Input: {Xtrain, ytrain}, Xtest are training and test sets; C = {cl1, cl2, ..., clK} is a set of base classifiers; topCbO(context) is a function for building the upper-most concept of a formal context and its lower neighbors; dist(x1, x2) is a distance function defined in the attribute space; k is a parameter (the number of neighbors); n_fold is the number of folds for cross-validation on the training set
Output: ytest, the predicted labels for the objects from Xtest

  train_class_context = [][] (a 2-D array)
  test_class_context = [][] (a 2-D array)
  for i ∈ 0 ... len(Xtrain) − 1 do
      for cl ∈ 0 ... len(C) − 1 do
          train classifier cl on (n_fold − 1) folds not including object Xtrain[i]
          pred = predicted label for Xtrain[i] by classifier cl
          train_class_context[i][cl] = (pred == ytrain[i])
      end for
  end for
  for cl ∈ 0 ... len(C) − 1 do
      train classifier cl on the whole Xtrain
      pred = predicted labels for Xtest by classifier cl
      test_class_context[:][cl] = pred
  end for
  top_concepts = topCbO(train_class_context)
  for i ∈ 0 ... len(Xtest) − 1 do
      Neighbors = k nearest neighbors of Xtest[i] from Xtrain according to dist(x1, x2)
      concept = argmax(|c.extent ∩ Neighbors|), c ∈ top_concepts
      Csel = concept.intent
      labels = predictions for Xtest[i] made by the classifiers from Csel
      ytest[i] = argmax(count_freq(labels))
  end for
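For readers who prefer code, here is a compact Python sketch of our reading of Algorithm 1 (not the authors' implementation). It assumes scikit-learn-style classifiers with fit/predict, numpy arrays with non-negative integer class labels, and an L1 distance (which coincides with the Hamming distance on binary attributes), and it reuses the top_concepts helper from Section 4 in place of topCbO.

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def rmcs_predict(classifiers, X_train, y_train, X_test, k=3, n_fold=4):
    """Recommender-based MCS: recommend a classifier per test object via top concepts."""
    n = len(X_train)
    # 1. cross-validation: record which classifier is correct on which training object
    context = {j: set() for j in range(len(classifiers))}
    folds = KFold(n_splits=n_fold, shuffle=True, random_state=0).split(X_train)
    for train_idx, test_idx in folds:
        for j, cl in enumerate(classifiers):
            model = clone(cl).fit(X_train[train_idx], y_train[train_idx])
            correct = test_idx[model.predict(X_train[test_idx]) == y_train[test_idx]]
            context[j].update(correct.tolist())
    # 2. base classifiers trained on the whole training set
    fitted = [clone(cl).fit(X_train, y_train) for cl in classifiers]
    test_preds = np.array([m.predict(X_test) for m in fitted])  # shape (K, len(X_test))
    # 3. top concepts of the classification context (helper from Section 4),
    #    dropping the upper-most concept when its intent is empty
    concepts = [c for c in top_concepts(context, set(range(n))) if c[1]]
    # 4.-5. neighbors, concept selection and majority prediction
    y_test = []
    for i, x in enumerate(X_test):
        dists = np.abs(X_train - x).sum(axis=1)        # L1 = Hamming on binary data
        neighb = set(np.argsort(dists)[:k].tolist())
        extent, intent = max(concepts, key=lambda c: len(c[0] & neighb))
        votes = [test_preds[j][i] for j in intent]
        y_test.append(np.bincount(np.array(votes, dtype=int)).argmax())
    return np.array(y_test)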
6 Experiments

The algorithm described above was implemented in Python 2.7.3 and tested on a dual-core machine (Intel Core i3-370M, 2.4 GHz) with 3.87 GB RAM. We used four UCI datasets in these experiments: mushrooms, ionosphere, digits, and nursery.[1] Each of the datasets was divided into training and test sets in the proportion 70:30.

[1] http://archive.ics.uci.edu/ml/datasets

We ran 3 classifiers implemented in the scikit-learn library[2] (written in Python), which also served as base classifiers for the RMCS algorithm. These were a Support Vector Machine with a Gaussian kernel (svm.SVC()), logistic regression (sklearn.linear_model.LogisticRegression()) and a k Nearest Neighbors classifier (sklearn.neighbors.classification.KNeighborsClassifier()). The classification accuracy of each classifier on each dataset is presented in Table 4 along with the particular parameter settings (a sketch of this setup is given at the end of the section). Moreover, for comparison, the results of scikit-learn's implementation of bagging with SVM as a base classifier and of AdaBoost on decision stumps[3] are presented.

[2] http://scikit-learn.org
[3] https://github.com/pbharrin/machinelearninginaction/tree/master/Ch07

Table 4. Classification accuracy of 6 algorithms on 4 UCI datasets: mushrooms (1), ionosphere (2), digits (3), and nursery (4)

Data | SVM, RBF kernel (C=1, γ=0.02) | Logit (C=10) | kNN (euclidean, k=3) | RMCS (k=3, n_folds=4) | Bagging SVM (C=1, γ=0.02), 50 estimators | AdaBoost on decision stumps, 50 iterations
1 | 0.998 (t=0.24 sec.) | 0.996 (t=0.17 sec.) | 0.989 (t=1.2*10^-2 sec.) | 0.997 (t=29.45 sec.) | 0.998 (t=3.35 sec.) | 0.998 (t=44.86 sec.)
2 | 0.906 (t=5.7*10^-3 sec.) | 0.868 (t=10^-2 sec.) | 0.858 (t=8*10^-4 sec.) | 0.933 (t=3.63 sec.) | 0.896 (t=0.24 sec.) | 0.934 (t=22.78 sec.)
3 | 0.917 (t=0.25 sec.) | 0.87 (t=0.6 sec.) | 0.857 (t=1.1*10^-2 sec.) | 0.947 (t=34.7 sec.) | 0.92 (t=4.12 sec.) | 0.889 (t=120.34 sec.)
4 | 0.914 (t=3.23 sec.) | 0.766 (t=0.3 sec.) | 0.893 (t=3.1*10^-2 sec.) | 0.927 (t=220.6 sec.) | 0.913 (t=38.52 sec.) | 0.903 (t=1140 sec.)

Data | SVM, RBF kernel (C=10^3, γ=0.02) | Logit (C=10^3) | kNN (minkowski, p=1, k=5) | RMCS (k=5, n_folds=10) | Bagging SVM (C=10^3, γ=0.02), 50 estimators | AdaBoost on decision stumps, 100 iterations
1 | 0.998 (t=0.16 sec.) | 0.999 (t=0.17 sec.) | 0.999 (t=1.2*10^-2 sec.) | 0.999 (t=29.45 sec.) | 0.999 (t=3.54 sec.) | 0.998 (t=49.56 sec.)
2 | 0.906 (t=4.3*10^-3 sec.) | 0.868 (t=10^-2 sec.) | 0.887 (t=8*10^-4 sec.) | 0.9 (t=3.63 sec.) | 0.925 (t=0.23 sec.) | 0.934 (t=31.97 sec.)
3 | 0.937 (t=0.22 sec.) | 0.87 (t=0.6 sec.) | 0.847 (t=1.1*10^-2 sec.) | 0.951 (t=34.7 sec.) | 0.927 (t=4.67 sec.) | 0.921 (t=131.6 sec.)
4 | 0.969 (t=2.4 sec.) | 0.794 (t=0.3 sec.) | 0.945 (t=3*10^-2 sec.) | 0.973 (t=580.2 sec.) | 0.92 (t=85.17 sec.) | 0.912 (t=2484 sec.)

As we can see, RMCS outperformed its base classifiers in all cases, while it turned out to be better than bagging only in the case of multi-class classification problems (the digits and nursery datasets).
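For reference, the base classifiers described above can be set up in scikit-learn roughly as follows. This is only a sketch with the parameter values from the first block of Table 4; scikit-learn's bundled digits data is used here merely as a stand-in for the UCI datasets, max_iter is raised only to ensure convergence, and the authors' exact evaluation code is not reproduced.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# base classifiers with the parameter values from the first block of Table 4
base_classifiers = {
    'SVM (RBF)': SVC(C=1, gamma=0.02),
    'Logit': LogisticRegression(C=10, max_iter=1000),
    'kNN': KNeighborsClassifier(n_neighbors=3, metric='euclidean'),
}

# 70:30 train/test split, as in the paper
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, cl in base_classifiers.items():
    cl.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, cl.predict(X_te)))

In the earlier sketch of Algorithm 1, list(base_classifiers.values()) would play the role of the classifier pool C.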
7 Conclusion

In this paper, we described the underlying idea of multiple classifier systems and discussed bagging, boosting and stacking. Then, we proposed a multiple classifier system which turned out to outperform its base classifiers, as well as two particular implementations of bagging and AdaBoost, on two multi-class classification problems. Our further work on the algorithm will continue in the following directions: exploring the impact of different distance metrics (such as ones based on attribute importance or information gain) on the algorithm's performance, experimenting with various types of base classifiers, investigating the conditions preferable for RMCS (in particular, when it outperforms bagging and boosting), improving the execution time of the algorithm, and analyzing RMCS's overfitting.

Acknowledgements. The authors would like to thank their colleague from the Higher School of Economics, Sergei Kuznetsov, as well as Jaume Baixeries and Konstantin Vorontsov, for inspirational discussions which directly or implicitly influenced this study.

References

1. Izenman, A. J.: Committee Machines. In: Modern Multivariate Statistical Techniques, pp. 505-550. Springer, New York (2008)
2. Condorcet, M.-J.-A.-N.: Essay on the Application of Analysis to the Probability of Majority Decisions (1785)
3. Breiman, L.: Bagging Predictors. Machine Learning 24(2), 123-140 (1996)
4. Schapire, R. E.: The Strength of Weak Learnability. Machine Learning 5, 197-227 (1990)
5. Freund, Y.: Boosting a Weak Learning Algorithm by Majority. Information and Computation 121(2), 256-285 (1995)
6. Freund, Y., Schapire, R. E.: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55, 119-139 (1997)
7. Freund, Y., Schapire, R. E.: A Short Introduction to Boosting (1999)
8. Dietterich, T. G.: Ensemble Methods in Machine Learning. In: Multiple Classifier Systems, LBCS-1857, pp. 1-15. Springer (2000)
9. Breiman, L.: Random Forests. Machine Learning 45(1), 5-32 (2001)
10. Wolpert, D. H.: Stacked Generalization. Neural Networks 5, 241-259 (1992)
11. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. Springer-Verlag New York, Inc., Secaucus, NJ, USA (1997)
12. Kuznetsov, S. O.: On Computing the Size of a Lattice and Related Decision Problems. Order 18(4), 313-321 (2001)
13. Kuznetsov, S. O.: On Stability of a Formal Concept. Annals of Mathematics and Artificial Intelligence 49(1-4), 101-115 (2007)
14. Kuznetsov, S. O., Obiedkov, S.: Comparing Performance of Algorithms for Generating Concept Lattices. Journal of Experimental and Theoretical Artificial Intelligence 14, 189-216 (2002)
15. Stumme, G., Taouil, R., Bastide, Y., Pasquier, N., Lakhal, L.: Intelligent Structuring and Reducing of Association Rules with Formal Concept Analysis. In: Baader, F., Brewka, G., Eiter, T. (eds.) KI 2001: Advances in Artificial Intelligence, pp. 335-350. Springer, Berlin Heidelberg (2001)