Introduction

A First Study on What MDL Can Do for FCA

Tatiana Makhalova

tpmakhalova@hse.ru 1 2 3

Sergei O. Kuznetsov

skuznetsov@hse.ru 2 3

Amedeo Napoli

amedeo.napoil@loria.fr 1 3 0 36, Department of Computer Science, Palacky University Olomouc , 2018. Copying permitted only for private and academic purposes 1 LORIA, (CNRS - Inria - U. of Lorraine) , BP 239 Vandoeuvre-l`es-Nancy , France 2 National Research University Higher School of Economics , 3 Kochnovsky Proezd, Moscow , Russia 3 c paper author(s), 2018. Proceedings volume published and copyrighted by its editors. Paper published in Dmitry I. Ignatov , Lhouari Nourine (Eds.): CLA 2018, pp. 25

25 36

Formal Concept Analysis can be considered as a classification engine able to build classes of objects with a description or concepts and to organize these concepts within a concept lattice. The concept lattice can be navigated for selecting significant concepts. Then the problem of finding significant concepts among the potential exponential number of concepts arises. Some measures exist that can be used for focusing on interesting concepts such as support, stability, and other. MDL (minimum description length) is also a good candidate that was rarely used in FCA by now for such objective. In this paper we argue that MDL can give FCA practitioners a good measure for selecting significant and representative concepts.

Introduction

Formal concept analysis (FCA) plays an important role in Data Mining and Machine Learning. Concept lattices support mainly unsupervised settings, improving tasks such as building taxonomies and ontologies, computing implications and association rules, clustering and solving classification tasks. These tasks in practice are coupled with the problem of exponential explosion of the number of formal concepts.

By now, to tackle this issue, a lot of different approaches have been proposed, including data pre- and postprocessing, background knowledge incorporation, computing approximate concepts (see [ 9 ] for an overview). As a result, one expects to get a small set of interesting, meaningful, non-redundant concepts [ 3 ].

In this paper, we focus on the characterization of such a small set of concepts. Instead of using an interesting measure in postprocessing step, we propose to rely on the minimum description length (MDL) principle [ 7 ], which allows one to select small sets of diverse and interpretable concepts. Providing the best lossless compression of the data, the MDL optimal sets of patterns (itemsets) automatically provides a balance between the quality of fit of the data and the complexity of the model without any user-defined parameters to be set [ 1 ].

In this paper we propose a first study on the application of the MDL principle in FCA settings. To the best of our knowledge, this is one of the first papers to study the effective use of MDL in the framework of FCA.

The rest of the paper has the following structure. In Section 2 we remind the basic definition used in FCA and discuss how FCA can be used to solve classification problem. Section 3 introduces the MDL principle. In Section 4 we discuss what MDL can bring to FCA. Section 5 gives conclusion and directions for future work. 2

Preliminaries

In this section we recall the main notions that are used in the paper. We study attribute sets of formal concepts, which have several alternative names, such as itemset, intent, or pattern. We discuss the application of FCA in unsupervised and supervised settings. 2.1

FCA. Basic Notions

Here we briefly recall FCA terminology [ 5 ]. A formal context is a triple (G, M, I), where G = {g1, g2, ..., gn} is called a set objects, M = {m1, m2, . . . , mk} is called a set attributes and I ⊆ G × M is a relation called incidence relation, i.e. (g, m) ∈ I if the object g has the attribute m. The derivation operators (·)0 are defined for A ⊆ G and B ⊆ M as follows:

A0 = {m ∈ M | ∀g ∈ A : gIm}

B0 = {g ∈ G | ∀m ∈ B : gIm} A0 is the set of attributes common to all objects of A and B0 is the set of objects sharing all attributes of B. An object g is said to contain a pattern (set of items) B ⊆ M if B ⊆ g0. The double application of (·)0 is a closure operator, i.e. (·)00 is extensive, idempotent and monotone. Sets A ⊆ G, B ⊆ M , such that A = A00 and B = B00, are said to be closed.

A (formal) concept is a pair (A, B), where A ⊆ G, B ⊆ M and A0 = B, B0 = A. A is called the (formal) extent and B is called the (formal) intent of the concept (A, B). A partial order 6 is defined on the set of concepts as follows: (A, B) ≤ (C, D) iff A ⊆ C (D ⊆ B), a pair (A, B) is a subconcept of (C, D), while (C, D) is a superconcept of (A, B).

The number of formal concepts grows exponentially w.r.t. the size of a formal context, i.e. the number of objects in G and attributes in M . Thus, it becomes almost impossible to analyze and interpret the whole set of generated concepts. Pattern discovery techniques are designed to solve this problem. The goal of pattern discovery within the framework of FCA is to find a non-redundant set of concepts that are interesting w.r.t. specified constrains/interestingness criterion. The criterion can be applied to both intent and extent, whereas pattern discovery in general is related solely to the itemset assessment.

Example. Let us consider the toy example given in Table 1. We will use either D1 or D2 to compute classifiers and the remaining objects will be used to assess the quality of the classifiers. The set of attributes M includes columns m1, . . . , m9. s n taad iittro a p objects g1 dog g2 cat The additional attribute “target”, i.e., class labels, is not taken into account under unsupervised settings.

Filtering concepts based on their extent and/or intent belongs to class of unsupervised problems, since any supplemental information is unavailable. In the next subsection we consider the problem of concept selection in supervised settings. In supervised settings along with the objects and their descriptions an additional attribute w is given. It specifies the class of an object. We denote the set of its values by ε.

We shall confine ourselves to two-class classification and study the simplest case, where each object belongs to a single class. In the defined settings ε = {+, −}. We consider the case where a set of objects G = G+ ∪ G− is divided into 2 disjoint subsets, i.e., a set of positive examples G+, negative examples G−. In practice, a set Gτ of unlabeled examples appears as well. The objects are described by attributes from M and the target attribute w is defined as follows: gIw = “ + ” for g ∈ G+ and gIw = “ − ” for g ∈ G−. The objects from G+ ∪ G− compose training and test sets, which are used to generate concepts and to estimate quality of classifiers, respectively.

It is assumed that there exists an unknown function that maps each object g ∈ G (or its description g0 ⊆ M ) to an element in ε. The goal is to reconstruct as accurately as possible the unknown function using an observable subset of labeled objects. To do that, one builds classifiers, which can be constructed by means of FCA as follows.

For each concept (A, B) a class label e ∈ ε is defined by majority of labeled objects in the extent, i.e. class((A, B)) = arg maxe∈ε |Ae|/|A|, where Ae = Ge ∩ A. To classify an unlabeled object g w.r.t. a pair (B, e) we set the following classification principle: (B, e)(g) = (e, if B ⊆ g0 ∅, otherwise.

(1)

According to Formula 1, we get a non-empty response e from (B, e) if an object description g0 contains attribute set B. To simplify notation we will write Bˆe instead of (B, e). In general, B could be any itemset, not necessarily closed.

The details on classification problem in terms of FCA can be found in [ 6, 8 ].

To identify the best classifier a test set G∗ ⊆ G+ ∪ G− is used, we will write G∗+ and G∗ for subsets of positive and negative objects in the test set G∗, i.e., for − G∗ ∩ G+ and G∗ ∩ G−, respectively. In our paper we estimate classifiers using the measures listed below and provide small examples of their usage (see Table 2). Precision measures how many correct answers are given by the classifier: prec(Bˆe, G∗) = ng | Bˆe(g) = e, g ∈ Ge∗o / ng | Bˆe(g) = e, g ∈ G∗o . (2) Recall measures how many objects from a target class are characterized by the classifier (i.e. whether a classifier is specific or general):

recall(Bˆe, G∗) = ng | Bˆe(g) = e, g ∈ Ge∗o /|Ge∗|. (3) Support measures how many objects can be classified (correctly or not): sup(Bˆe, G∗) = ng | Bˆe(g) = e, g ∈ G∗o /|G∗|. (4)

The accuracy takes into account examples from the remaining classes ε \ {e} unclassified by Bˆe: acc(Bˆe, G∗) = ng | Bˆe(g) = e, g ∈ Ge∗o ∪ ng | Bˆe(g) = ∅, g ∈ Gc∗, c ∈ ε \ {e}o |G∗|

. (5) However, other measures can also be examined [ 2 ], e.g., F1 score, AUC, etc.

In the next section we consider an ensemble of classifiers based on single concept-based classifiers defined in Formula 1. 2.3

Concept-based Classifiers

A set of the classifiers S = nBˆeo , e ∈ ε with a rule for aggregation of their j j∈J responses constitute an ensemble of classifiers. We call an ensemble conceptbased classifiers (CBC) if the itemsets B are intents of formal concepts. As an aggregation rule the following principles might be chosen: class of the majority of labels, classification if responses are the same class labels, the highest priority class from the set of responses.The best CBCs are those that ensure high accuracy and have a small size of S.

Example. Let us turn back to the running example from Table 1. The supplementary information on class labels (the target attribute w) is given in column “target”. We consider the following set of classifiers:

S = {({“cold-resistant”} , −), ({“4 legs”, “change size”} , +)} .

As a rule for aggregation of responses we use the priority principle, let us suppose that “–” class has higher priority than “+”, thus we can apply classifiers from “–” class and, if the object remains unclassified, we try to classify it with “+” class classifiers.

CBC works for g10 as follows. A classifiers with the highest priority (“–” class) are firstly applied. Since {“cold-resistant”} 6⊆ g100, we get an empty response and turn to the classifiers of lower priority (“+” class). {“4 legs”, “change size”}⊆ g100, we get “+” response and classify g10 as an member of “+” class. The object g11 is classified as “–”, since {“cold-resistant”}⊆ g101. g8 remains unclassified, since we get empty responses from all classifiers in S, g9 is misclassified by ({“cold-resistant”} , –).

Thus, a formal concept is a well-interpreted, quite intuitive and handy tool for describing subsets of objects both un- and supervised settings. However, as it was mentioned earlier, the huge number of generated concepts hampers interpretation of the results as well as its practical application.

In the next section we provide an approach that can be used to select a small set of diverse concepts. 3 3.1

Minimal Description Length Principle

Describing Data with MDL . Unsupervised Settings The MDL principle in the context of pattern mining is formulated as follows: the best set of patterns is the set that best compresses the database [ 10 ].

The main element of this approach is a code table (CT), which is composed of “some” itemsets with their length. The best code table minimizes the total length in bits L(D, CT ) = L(D | CT ) + L(CT | D), where L(D | CT ) is the length of the dataset encoded with the code table CT and L(CT | D) is the length of the code table CT computed w.r.t. D. To encode an object g in a dataset one needs to select a subset of disjoint itemsets that cover all attributes of g. By u(B) = | {t ∈ D | B ∈ cover(t)} | we denote the usage of itemset B in dataset D, i.e., how many times B is used to cover objects in D, where U = PB∈CT u(B) is the total usage of itemsets. The principles of building code tables will be discussed further.

To define the length of an itemset we use an optimal prefix code given by Shannon entropy, i.e., l(B) = −logP r(B), where probability is defined as follows: P r(B) = u(B)/U . Thus, the itemsets with higher frequency have smaller code lengths. The details on itemset encoding and memory that is needed to store itemsets of code table are out of scope of this paper. Since we are interested in the length of the description rather than in the encoding itself (materialized codes), we consider simplified version of L(CT, D) where only terms that characterize “specificity” of itemsets for the dataset are taken into account:

L(D | CT ) = X

X g∈D B∈cover(g) l(B) = − X u(B) log

B∈CT u(B)

U , L(CT | D) =

X code(B) + l(B),

B∈CT where code(B) is the length in bits to store itemset B in a code table. Principles of computing a code table. A code table is computed in an incremental manner: starting from a set of single-attribute patterns, i.e. {{m} | m ∈ M }. The optimization procedure is based on adding a new itemset to the code table, correcting the information about usage of the other itemsets in the code table, recomputing the itemset lengths and re-encoding the data. At each iteration a new pattern can be added or not to the code table.

A set of candidates might be composed of any kind of patterns: arbitrary itemsets, closed ones, δ-itemsets, etc. The items in a set of candidates are sorted w.r.t. chosen interestingness measures, in particular, in Krimp [ 10 ] patterns are ordered by the length of itemset (its size of attribute set) and frequency.

Note that the problem of computing optimal code table implies exhaustive search for the best combination of patterns, and in practice some heuristics are used.

Example. Let us consider the principle of computing a CT using the running example. The iterative process of updating a CT for class “animal” and recomputing the cover of D2 is described in Table 3. As candidates we use the set of concepts sorted by area, i.e., frequency × length, the ordered list of candidates with their areas is given in columns “Candidate set, area”. Columns “CT” correspond to the CT at each iteration. The CT contains itemsets and the number of times each itemset is used to cover objects. The covering is given in columns “Data with covering”. At the beginning it consists of single-attribute itemsets. The first object m1m2m3m6 is covered by four single-attribute itemsets. The first candidate is m1m2m3 with an area equal to 6. At the next step this candidate is used to cover attributes in the dataset covered by single-attribute itemsets. With the chosen candidate, the covering for two objects is changed (compare the first two objects in columns “Data with covering” in “Step 0” and “Step 1”). Step by step a new itemset with the maximal area is added to the CT. If a new CT compresses the data better then the old one, the itemset is accepted to the CT, otherwise, it is removed from both the CT and the candidate set. If the CT is changed, the usage of itemsets in the CT and area for candidates are recomputed. Being purely unsupervised, the MDL principle can be adapted for usage in supervised settings. The idea is to find a compressed representation for objects using code tables of each target class separately. Classes have their own code tables. A code table consists of typical patterns and their lengths (the more typical patterns have shorter lengths). To classify a new object, its set of attributes is covered by itemsets from code tables of each class. Then, the encoding lengths for each class are computed and the class that corresponds to the minimal encoding length is assigned to the object. The length reflects typicality of an object for a particular class (code table).

Example. Consider classification with code tables “CT1” and “CT2” from Table 4 that have been computed on sets D1 and D2, respectively. The details on the computing of the code tables is out of the scope of this paper (see the Krimp algorithm [ 10 ]). Each column“CTi” contains code tables CTA and CTNA for “animal” and “not animal” classes, respectively. Let us consider how an object g9 is classified with the code tables from “CT1”. The main steps of the covering of g90 are given in Table 5. To find a covering we use a greedy strategy, i.e., we start from the first itemset in a code table and then stop iterating over itemsets when all attributes of the object are covered. To cover g90 with CTNA we take the first itemset m1m4m5m8 (see Step 1 in Table 5), it does not cover g90, at the next step we take the second itemset m4m9, it covers a subset of g90 and m3 remains uncovered (see Step 2). The iterations over itemsets from CTNA continues until all the attributes of g9 will be covered.

Classification with code tables from Table 4 are given in Table 6. The objects are covered by itemsets from tables of “animal” and “not animal” as it is described in Table 5). The class where the object has the shortest length is assigned to this object.

In this section we considered how the MDL principle is used to select patterns (itemsets) in un- and supervised settings. In the next section we study how MDL works in the FCA framework. code table “animal”

CTA

CT1

CT1 code table “animal”, CTA s t e s m e t i To show that MDL can improve the practical application of FCA, in this section we discuss the results of experiments on the embedding of MDL within FCA. We used the discretized datasets from LUCS-KDD repository [ 4 ]. We split the data into 10 parts and in each of 10 experiments we use 9 of them as a training set and one as a test set. The average performance is reported in the paper. To compute code tables and to cover objects the Krimp algorithm [ 10 ] is used. In unsupervised learning (where the target attribute is not given), one is interested in a small number of meaningful patterns. In our experiments we compute the set of closed itemsets and apply the MDL principle to them. For MDL we ,

CT2 gg110001 == ((mm11)m(m2m4)3()m(m5)7()m8)

Encoding with itemsets not animal decision g80 = (m3)(m7) , not animal g90 = (m3)(m4m9) , , not animal g100 = (m1)(m2)(m3)(m7) , , , animal g101 = (m1m4m5m8) not animal g100 = (m1)(m2)(m3)(m7) , , , animal g101 = (m1m4m5m8) not animal ,

, sort itemsets in the candidate set by “length” (the cardinality of intent) and “frequency” (the cardinality of extent).

MDL-optimal concepts are concepts whose intents are included in a code table with a non-empty usage. The results of the experiments are given in Table 7. For instance, the “auto” dataset consists of 205 objects and 135 binary attributes. The total number of formal concepts is 67 557, 57 of them are MDLoptimal.

Our experiments show that the selected itemsets might be shorter or longer on average than the itemsets in the whole set of closed concepts (see column “avg. length of intent” in Table 7). However, around 2% (12 % at most) of the concepts are selected with the MDL-principle. Thus, MDL can be considered as a threshold-free alternative for selection of interesting itemsets. It is important to notice that the subset of MDL-optimal itemsets is composed of diverse patterns and expert assumptions on interestingness of concepts can be embedded by ordering candidates w.r.t. particular interestingness measures. Since a greedy strategy is used to make a code table, one gets a set of diverse itemsets that are in agreement with interestingness.

Classifier Comparison. FCA in Supervised Settings In this section we study both the accuracy of ensembles of classifiers S and their basic elements Bˆ. We also compare the accuracy of the ensembles with commonly used classification methods, e.g., random forest, multilayer perceptron and support vector machine.

Comparison of single classifiers. Here we consider formal concepts as single classifiers. We study precision, recall (Formulas 2 and 3, respectively) and precision loss (Formula 6). Formula 6 is also used as a measure of overfitting, i.e., cases where accuracy on a test set is worse than on a training one. If precision loss is close to 1, we call it overfitting, the classifier makes much more errors on a test set than on a training set. If precision loss is close to 0, we get “expected” precision on a test set.

prloss(Bˆe, G+ ∪ G−) = precision(Bˆe, G \ G∗) − precision(Bˆe, G∗) (6)

In Figure 1 we show the performance of single concepts in the following 2dimensional spaces: “precision” and “precision loss” (blue shapes), “recall” and “precision loss” (green shapes). Due to lack of space we provide two typical kinds of distributions using results for “Breast cancer” and “Wine” datasets. Precision loss is shown on the vertical axis, recall and precision are combined in the horizontal axis. The pictures are density plots for the (sub)set of concepts in the chosen dimensions. Intense-colored regions correspond to the regions with high concentration of concepts.

Let us consider the classifiers of “Breast cancer” dataset (the first line in Figure 1). The first figure shows that most concepts have precision loss close to 0, i.e., they have similar precision both on training and test set. Classifiers with precision loss close to 0 are preferable, since they have the “expected” precision even on unobservable data. Let us consider the position on the horizontal axis. The blue shape (“precision” axis) is located close to 1, it means that most concepts have high precision on a training set. A long stretch of the green shape along the axis means that the set of classifiers consists of both specific (having relatively big intents) and general concepts (recall is from 0 to 1).

The second plot corresponds to MDL-optimal classifiers. In both spaces, i.e., “precision” and “precision loss” (blue shapes), “recall” and “precision loss” (green shapes), classifiers are concentrated in two points on the “precision loss” axis, the bottom shape is brighter than the upper one. It means that a lot of classifiers have similar precision on training and test sets, and there are several classifiers that have much smaller precision on a test set (they are overfitting classifiers). Since the blue shapes are located close to 0 on the horizontal axis, we conclude that the classifiers are very precise on a training set. The concentration of the green shapes around 0 on the horizontal axis means that most classifiers have recall close to 0, thus, the classifiers are very specific.

Classifiers of “Wine” dataset demonstrate another typical distribution of the classifiers. We can read the plots as it is done above. Here we discuss the key difference between two kinds of datasets.

In our experiments, the set of classifiers on the whole set of formal concepts was comprised of either mostly one type of classifiers with precision loss close to 0 or two types of classifiers: with precision loss close to 0 and to 1. Thus, the set of concept-classifiers contains either mostly non-overfitting (good) classifiers or non- and overfitting ones. MDL-based subset usually includes both non- and overfitted classifiers, all these classifiers are quite specific (have big intents).

A reasonable question arises: what is the accuracy of ensembles of classifiers built on such different types of concepts? In the next paragraph we examine the accuracy of ensembles of classifiers that are constituted by MLD-optimal subsets. Comparison of ensembles of classifiers In this section we compare ensembles built on a set of formal concepts (“CBC”) and its MDL-optimal subset (“MDL”) with commonly used classification methods like random forests (RFs), multilayer perceptrons3 (MLP) and support vector machines (SVM). We study the average accuracy on a test set, and the results are summarized in Table 8. As it was noticed above, we split our dataset into 10 folds (9:1 for train and test sets). The maximal accuracy over the 10 folds is also reported in the table. We pay our attention to the number of classifiers that constitute an ensemble. The smaller the number of classifiers the faster one can obtain the response and the better this response can be interpreted.

Our experiments show that among itemset-based classifiers (CBC, MDL, RF) MDL-based approach demonstrate quite good performances and have a much smaller set of classifiers. An ensemble with a small number of classifiers performs faster and is better interpretable. Thus, it is easier to classify with MDL-ensembles and understand the obtained results as well. 5

Conclusion

In this paper we have addressed the problem of selecting meaningful, nonredundant sets of formal concepts. We have proposed to use the MDL principle and to show how the expert understanding of interestingness might be incorporated into it.

The MDL principle ensures a good compression even when the set of formal concepts is huge (for example, 24 MDL-optimal among 6 432 concepts for “Breast cancer”, 392 among 176 537 concepts for “nursery” dataset). 3 We select the configuration that ensures the best accuracy among the following ones: (100:50:50),(100:50:25),(50:50:50)

In supervised settings, MDL principle tends to choose the specific classifiers, some of them have a precision loss close to 0.9. However, the MDL principle ensures high classification accuracy.

One interesting direction for future work is to study how some interestingness measures, such as stability, might be embedded into the MDL approach. Another interesting direction is to study connection between the MDL principle and Pareto optimality.

Acknowledgements

The work was supported by the Russian Science Foundation under grant 1711-01294 and performed at National Research University Higher School of Economics, Moscow, Russia.

1. Aggarwal , C.C., Han , J .: Frequent pattern mining . Springer ( 2014 )

2. Baldi , P. , Brunak , S. , Chauvin , Y. , Andersen , C.A. , Nielsen , H. : Assessing the accuracy of prediction algorithms for classification: an overview . Bioinformatics 16 ( 5 ), 412 - 424 ( 2000 )

3. Buzmakov , A. , Kuznetsov , S.O. , Napoli , A. : Fast generation of best interval patterns for nonmonotonic constraints . In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases . pp. 157 - 172 . Springer ( 2015 )

4. Coenen , F. : The lucs-kdd discretised/normalised arm and carm data library ( 2003 ), http://www.csc.liv.ac.uk/ frans/KDD/Software/LUCS KDD DN

5. Ganter , B. , Wille , R.: Formal concept analysis: Logical foundations ( 1999 )

6. Ganter , B. , Kuznetsov , S.O. : Formalizing hypotheses with concepts . In: International Conference on Conceptual Structures . pp. 342 - 356 . Springer ( 2000 )

7. Gru¨nwald, P.D.: The minimum description length principle . MIT press ( 2007 )

8. Kuznetsov , S.O.: Machine learning and formal concept analysis . In: Eklund, P . (ed.) Concept Lattices. Springer Berlin Heidelberg, Berlin, Heidelberg ( 2004 )

9. Kuznetsov , S.O. , Makhalova , T. : On interestingness measures of formal concepts . Information Sciences 442-443 , 202 - 219 ( 2018 )

10. Vreeken , J., Van Leeuwen, M. , Siebes , A. : Krimp: mining itemsets that compress . Data Mining and Knowledge Discovery 23 ( 1 ), 169 - 214 ( 2011 )