                Extracting Decision Trees from
               Interval Pattern Concept Lattices

    Zainab Assaghir1 , Mehdi Kaytoue2 , Wagner Meira Jr.2 and Jean Villerd3
                 1
                   INRIA Nancy Grand Est / LORIA, Nancy, France
           2
             Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
       3
         Institut National de Recherche Agronomique / Ensaia, Nancy, France
           Zainab.Assaghir@loria.fr, {kaytoue,meira}@dcc.ufmg.br,
                            Jean.Villerd@nancy.inra.fr




       Abstract. Formal Concept Analysis (FCA) and concept lattices have
       shown their effectiveness for binary clustering and concept learning.
       Moreover, several links between FCA and unsupervised data mining
       tasks such as itemset mining and association rule extraction have been
       emphasized. Several works have also studied FCA in a supervised framework,
       showing that popular machine learning tools such as decision trees can be
       extracted from concept lattices. In this paper, we investigate the links
       between FCA and decision trees with numerical data. Recent works showed
       the efficiency of "pattern structures" for handling numerical data in FCA,
       compared to traditional discretization methods such as conceptual scaling.



1    Introduction

Decision trees (DT) are among the most popular classification tools, especially
for their readability [1]. Connections between DT induction and FCA have been
widely studied in the context of binary and nominal features [2], including struc-
tural links between decision trees and dichotomic lattices [8] and lattice-based
learning [7]. However, handling numerical data raises specific issues in FCA. In
this paper, we investigate the links between FCA and decision trees with numerical
data and a binary target attribute. We use an extension of Formal Concept Analysis
called interval pattern structures to extract sets of positive and negative
hypotheses from numerical data. Then, we propose an algorithm that extracts
decision trees from the minimal positive and negative hypotheses.
    The paper is organised as follows. Section 2 presents the basics of FCA and
one of its extensions, called interval pattern structures, for numerical data. Sec-
tion 3 recalls basic notions of decision trees. Then, in Section 4, we introduce some
definitions showing the links between interval pattern structures and decision
trees, and a first algorithm for building decision trees from the minimal positive
and negative hypotheses extracted from the pattern structures.

2    Pattern structures in formal concept analysis
Formal contexts and concept lattices. We assume that the reader is familiar
with FCA, and recall here the most important definitions from [3]. Basically, data
are represented as a binary table called a formal context (G, M, I) that represents
a relation I between a set of objects G and a set of attributes M. The statement
(g, m) ∈ I is interpreted as "the object g has attribute m". The two operators (·)′
define a Galois connection between the powersets (2^G, ⊆) and (2^M, ⊆), with
A ⊆ G and B ⊆ M:

                A′ = {m ∈ M | ∀g ∈ A : gIm}    for A ⊆ G,
                B′ = {g ∈ G | ∀m ∈ B : gIm}    for B ⊆ M.

For A ⊆ G, B ⊆ M, a pair (A, B) such that A′ = B and B′ = A is called a
(formal) concept. In (A, B), the set A is called the extent and the set B the
intent of the concept (A, B). The set of all concepts is partially ordered by
(A1, B1) ≤ (A2, B2) ⇔ A1 ⊆ A2 (⇔ B2 ⊆ B1) and forms a complete lattice
called the concept lattice of the formal context (G, M, I).
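
As an illustration, here is a minimal Python sketch (ours, not part of the paper; the
toy context is invented) of the two derivation operators of a formal context.

```python
# Minimal sketch of the derivation operators (·)' of a formal context (G, M, I).
# The incidence relation I is given as a set of (object, attribute) pairs.

def extent_prime(A, M, I):
    """A' : attributes shared by all objects of A."""
    return {m for m in M if all((g, m) in I for g in A)}

def intent_prime(B, G, I):
    """B' : objects having all attributes of B."""
    return {g for g in G if all((g, m) in I for m in B)}

# Toy context (hypothetical): three objects, two attributes.
G, M = {1, 2, 3}, {"a", "b"}
I = {(1, "a"), (1, "b"), (2, "a"), (3, "b")}

B = extent_prime({1, 2}, M, I)   # {'a'}
A = intent_prime(B, G, I)        # {1, 2}
print((A, B))                    # ({1, 2}, {'a'}) is a formal concept: A' = B and B' = A
```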
    In many applications, data are complex, involving numbers, intervals, graphs,
etc. (e.g. Table 1), and need to be conceptually scaled into formal contexts. Instead
of transforming the data, which leads to representation and computational difficulties,
one may work directly on the original data. Indeed, to handle complex data in FCA,
Ganter & Kuznetsov [4] defined pattern structures: objects whose descriptions admit
a similarity operator which induces a semi-lattice on data descriptions. The basic
theorem of FCA then naturally holds. We recall here their basic definitions, and
present interval pattern structures from [5] to handle numerical data.
    Pattern structures. Formally, let G be a set of objects, let (D, ⊓) be
a meet-semi-lattice of potential object descriptions and let δ : G −→ D be a
mapping. Then (G, (D, ⊓), δ) is called a pattern structure. Elements of D are
called patterns and are ordered by a subsumption relation ⊑ such that, given
c, d ∈ D, one has c ⊑ d ⇐⇒ c ⊓ d = c. A pattern structure (G, (D, ⊓), δ) gives
rise to the following derivation operators (·)□, given A ⊆ G and a pattern
d ∈ (D, ⊓):

                                 A□ = ⊓_{g∈A} δ(g)
                                 d□ = {g ∈ G | d ⊑ δ(g)}

These operators form a Galois connection between (2^G, ⊆) and (D, ⊑). (Pattern)
concepts of (G, (D, ⊓), δ) are pairs of the form (A, d), A ⊆ G, d ∈ (D, ⊓), such
that A□ = d and d□ = A. For a pattern concept (A, d), d is called a pattern
intent and is the common description of all objects in A, called the pattern extent.
When partially ordered by (A1, d1) ≤ (A2, d2) ⇔ A1 ⊆ A2 (⇔ d2 ⊑ d1), the set
of all concepts forms a complete lattice called a (pattern) concept lattice.
    Interval pattern structures. Pattern structures allow us to consider com-
plex data in full compliance with the FCA formalism. This requires defining a meet
operator on object descriptions, inducing their partial order. Concerning numer-
ical data, an interesting possibility presented in [5] is to define the meet operator
as interval convexification. Indeed, the "similarity" or "intersection" of two real
numbers (or of two intervals) may be expressed by the fact that they lie within
some (larger) interval, this interval being the smallest one containing both.
Formally, given two intervals [a1, b1] and [a2, b2], with a1, b1, a2, b2 ∈ R, one has:

                     [a1, b1] ⊓ [a2, b2] = [min(a1, a2), max(b1, b2)]
                     [a1, b1] ⊑ [a2, b2] ⇔ [a1, b1] ⊇ [a2, b2]

The definition of ⊓ implies that smaller intervals subsume the larger intervals that
contain them. This runs counter to the usual intuition and is explained by the fact
that ⊓ behaves as a union (the convex hull is indeed the union of the two intervals
plus the holes between them).
    These definitions of ⊓ and ⊑ can be applied component-wise to vectors of
numbers or intervals, e.g. in Table 1 where objects are described by vectors of
values, each dimension corresponding to an attribute. For example,
⟨[5, 7.2], [1, 1.8]⟩ ⊑ ⟨[5, 7], [1, 1.4]⟩ since [5, 7.2] ⊑ [5, 7] and [1, 1.8] ⊑ [1, 1.4].
    Now that vectors of intervals form a meet-semi-lattice, numerical data such
as Table 1 give rise to a pattern structure and a pattern concept lattice. An
example of application of the concept forming operators (·)□ is given below. The
corresponding pattern structure is (G, (D, ⊓), δ) with G = {p1, ..., p4, n1, ..., n3},
and d ∈ D is a vector whose i-th component corresponds to attribute mi.

            {p2, p3}□ = δ(p2) ⊓ δ(p3) = ⟨[5, 5.9], [2, 3.2], [3.5, 4.8], [1, 1.8]⟩
            {p2, p3}□□ = {p2, p3, p4}

As detailed in [5], vectors of intervals can be seen as hyperrectangles in Eu-
clidean space: the first (·)□ operator gives the smallest rectangle containing some
object descriptions, while the second returns the set of objects whose descriptions
are rectangles included in the rectangle given as argument. Accordingly,
({p2, p3, p4}, ⟨[5, 5.9], [2, 3.2], [3.5, 4.8], [1, 1.8]⟩) is a pattern concept. All the pattern
concepts of an interval pattern structure form a concept lattice. Intuitively, lower
concepts have few objects and "small" intervals, while higher concepts have
"larger" intervals. An example of such a lattice is given later.
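
The following Python sketch (ours; the dictionary layout and function names are
assumptions, not the authors' code) implements the interval meet ⊓, the subsumption
test ⊑ and the two derivation operators (·)□ on the data of Table 1, and checks the
pattern concept given above.

```python
from functools import reduce

# Descriptions of the objects of Table 1 as vectors of (degenerate) intervals.
delta = {
    "p1": ((7, 7), (3.2, 3.2), (4.7, 4.7), (1.4, 1.4)),
    "p2": ((5, 5), (2, 2), (3.5, 3.5), (1, 1)),
    "p3": ((5.9, 5.9), (3.2, 3.2), (4.8, 4.8), (1.8, 1.8)),
    "p4": ((5.5, 5.5), (2.3, 2.3), (4, 4), (1.3, 1.3)),
    "n1": ((6.7, 6.7), (3.3, 3.3), (5.7, 5.7), (2.1, 2.1)),
    "n2": ((7.2, 7.2), (3.2, 3.2), (6, 6), (1.8, 1.8)),
    "n3": ((6.2, 6.2), (2.8, 2.8), (4.8, 4.8), (1.8, 1.8)),
}

def meet(c, d):
    """Component-wise convexification: [a1,b1] ⊓ [a2,b2] = [min(a1,a2), max(b1,b2)]."""
    return tuple((min(a1, a2), max(b1, b2)) for (a1, b1), (a2, b2) in zip(c, d))

def subsumed(c, d):
    """c ⊑ d iff c ⊓ d = c, i.e. each interval of c contains the corresponding one of d."""
    return meet(c, d) == c

def square_extent_to_intent(A):
    """A□ : meet of the descriptions of the objects in A."""
    return reduce(meet, (delta[g] for g in A))

def square_intent_to_extent(d):
    """d□ : objects g such that d ⊑ δ(g), i.e. the rectangle δ(g) lies inside the rectangle d."""
    return {g for g, desc in delta.items() if subsumed(d, desc)}

d = square_extent_to_intent({"p2", "p3"})
assert d == ((5, 5.9), (2, 3.2), (3.5, 4.8), (1, 1.8))      # {p2, p3}□
assert square_intent_to_extent(d) == {"p2", "p3", "p4"}     # {p2, p3}□□
```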


3    Decision trees
Among all machine learning tools, decision trees [6, 1] are among the most widely
used. They belong to the family of supervised learning techniques, where data
consist of a set of explanatory attributes (binary, nominal or numerical) that
describe each object, called an example, and one target class attribute that assigns
each example to a nominal class. Many extensions have been proposed, e.g. to
consider a numerical class attribute (regression trees) or other particular cases
depending on the nature of the attributes. In this paper we focus on data consisting

                                  m1   m2   m3   m4   class
                              p1  7    3.2  4.7  1.4  +
                              p2  5    2    3.5  1    +
                              p3  5.9  3.2  4.8  1.8  +
                              p4  5.5  2.3  4    1.3  +
                              n1  6.7  3.3  5.7  2.1  -
                              n2  7.2  3.2  6    1.8  -
                              n3  6.2  2.8  4.8  1.8  -
     Table 1: Example 1: numerical context with an external target attribute (the class column)



of numerical explanatory attributes and a binary class attribute. The aim of de-
cision tree learning is to exhibit the relation between the explanatory attributes and
the class attribute through a set of decision paths. A decision path is a sequence
of tests on the values of explanatory attributes that is a sufficient condition to
assign a new example to one of the two classes. A decision tree gathers a set of
decision paths in a tree structure whose nodes contain tests on explanatory
attributes. Each node has two branches: the left (resp. right) branch leads to
the next test if the new example passes (resp. fails) the current test. When
there are no more tests to perform, the branch points to a class label, which is a
leaf of the tree. The links between FCA and decision tree learning have
been investigated in the case where explanatory attributes are binary [7–10, 2].
However, to our knowledge, no research has been carried out so far in the
case of numerical explanatory attributes. In the next section, we show how pat-
tern structures can be used to extract decision trees from numerical data with
positive and negative examples.


4    Learning in interval pattern structures

In [7], S. Kuznetsov considers a machine learning model in terms of formal concept
analysis. He assumes that the cause of a target property resides in the common at-
tributes of the objects sharing this property. In the following, we adapt this machine
learning model to the case of numerical data.
    Let us consider an interval pattern structure (G, (D, ⊓), δ) together with an external
binary target property. The set of objects G (the training set) is partitioned into
two disjoint sets: the positive examples G+ and the negative examples G−. We thus
obtain two different pattern structures (G+, (D, ⊓), δ) and (G−, (D, ⊓), δ).
Definition 1 (Positive hypothesis). A positive hypothesis h is defined as an
interval pattern of (G+, (D, ⊓), δ) that is not subsumed by any interval pattern
of (G−, (D, ⊓), δ), i.e. not subsumed by the description of any negative example.
Formally, h ∈ D is a positive hypothesis iff

              h□ ∩ G− = ∅    and    ∃A ⊆ G+    such that    A□ = h
Definition 2 (Negative hypothesis). A negative hypothesis h is defined as
an interval pattern of (G−, (D, ⊓), δ) that is not subsumed by any interval pattern
of (G+, (D, ⊓), δ), i.e. not subsumed by the description of any positive example.
Formally, h ∈ D is a negative hypothesis iff

              h□ ∩ G+ = ∅    and    ∃A ⊆ G−    such that    A□ = h

Definition 3 (Minimal hypothesis). A positive (resp. negative) hypothesis h
is minimal iff there is no positive (resp. negative) hypothesis e ≠ h such that
e ⊑ h.

    Going back to the numerical data in Table 1, we now consider the ex-
ternal binary target property and split the object set accordingly into
G+ = {p1, p2, p3, p4} and G− = {n1, n2, n3}. The pattern concept lat-
tice of (G+, (D, ⊓), δ), where D is the semi-lattice of interval vectors and δ is the
mapping associating to each object its description, is given in Figure 1, where
positive hypotheses are marked. Note that neither the interval pattern
⟨[5.5, 7], [2.3, 3.2], [4, 4.8], [1.3, 1.8]⟩ nor ⟨[5, 7], [2, 3.2], [3.5, 4.8], [1, 1.8]⟩
is a positive hypothesis, since both are subsumed by the inter-
val pattern δ(n3) = ⟨[6.2, 6.2], [2.8, 2.8], [4.8, 4.8], [1.8, 1.8]⟩. Therefore, there
are two minimal positive hypotheses: P1 = ⟨[5, 7], [2, 3.2], [3.5, 4.7], [1, 1.4]⟩
and P2 = ⟨[5, 5.9], [2, 3.2], [3.5, 4.8], [1, 1.8]⟩. From (G−, (D, ⊓), δ) (not
shown), we obtain the unique minimal negative hypothesis: N1 =
⟨[6.2, 7.2], [2.8, 3.3], [4.8, 6], [1.8, 2.1]⟩.
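
Continuing the Python sketch at the end of Section 2 (it reuses delta, meet,
square_extent_to_intent and square_intent_to_extent defined there), the hypothesis
test of Definitions 1 and 2 can be sketched as follows; the assertions reproduce the
discussion above.

```python
G_plus, G_minus = {"p1", "p2", "p3", "p4"}, {"n1", "n2", "n3"}

def is_positive_hypothesis(h):
    """Definition 1: h□ contains no negative example and h is an intent of (G+, (D, ⊓), δ)."""
    ext = square_intent_to_extent(h)
    return ext.isdisjoint(G_minus) and square_extent_to_intent(ext & G_plus) == h

P1 = ((5, 7), (2, 3.2), (3.5, 4.7), (1, 1.4))
P2 = ((5, 5.9), (2, 3.2), (3.5, 4.8), (1, 1.8))
assert is_positive_hypothesis(P1) and is_positive_hypothesis(P2)

# ⟨[5, 7], [2, 3.2], [3.5, 4.8], [1, 1.8]⟩ is subsumed by δ(n3), hence not a hypothesis:
assert not is_positive_hypothesis(((5, 7), (2, 3.2), (3.5, 4.8), (1, 1.8)))
```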
    Now, we consider decision trees more formally. Let the training data be de-
scribed by K+− = (G+ ∪ G−, (D, ⊓), δ), with its derivation operator again denoted
by (·)□. In FCA terms, this construction is called subposition.

Definition 4 (Decision path). A sequence ⟨(m1, d1), (m2, d2), ..., (mk, dk)⟩,
for different attributes m1, m2, ..., mk chosen one after another, is called a deci-
sion path of length k if there is no attribute mi occurring with two incomparable
intervals di and ei, and there exists g ∈ G+ ∪ G− such that ⟨d1, d2, ..., dk⟩ ⊑ δ(g)
(i.e. there is at least one example g such that di ⊑ δ(g) for each attribute mi).
For instance, ⟨(m3, [4.8, 6]), (m1, [6.2, 7.2])⟩ is a decision path for Example 1.

If i ≤ k (respectively i < k), the sequence ⟨(m1, d1), (m2, d2), ..., (mi, di)⟩ is
called a subpath (respectively a proper subpath) of the decision path
⟨(m1, d1), (m2, d2), ..., (mk, dk)⟩.

Definition 5 (Full decision path). A sequence ⟨(m1, d1), (m2, d2), ..., (mk, dk)⟩,
for different attributes m1, m2, ..., mk chosen one after another, is called a full de-
cision path of length k if all the objects having (m1, d1), (m2, d2), ..., (mk, dk) (i.e.
all g ∈ G such that di ⊑ δ(g) for each attribute mi) are either only positive or only
negative examples (i.e. they all have the same value, + or −, of the target attribute).

We say that a full decision path is non-redundant if none of its proper subpaths is a
full decision path. The set of tests chosen in a full decision path can be
considered as a sufficient condition for an object to belong to one of the classes + or −.
A decision tree is then defined as a set of full decision paths.
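
A small self-contained Python sketch (ours; the data layout is an assumption) of
Definition 5 on the data of Table 1: a path is full when every training example
passing all of its tests carries the same class label.

```python
# Raw values of Table 1 together with the class label of each example.
data = {"p1": (7, 3.2, 4.7, 1.4, "+"), "p2": (5, 2, 3.5, 1, "+"),
        "p3": (5.9, 3.2, 4.8, 1.8, "+"), "p4": (5.5, 2.3, 4, 1.3, "+"),
        "n1": (6.7, 3.3, 5.7, 2.1, "-"), "n2": (7.2, 3.2, 6, 1.8, "-"),
        "n3": (6.2, 2.8, 4.8, 1.8, "-")}
COL = {"m1": 0, "m2": 1, "m3": 2, "m4": 3}

def is_full_decision_path(path):
    """path: list of tests (attribute, (lo, hi)). Full iff the covered examples are pure."""
    covered = [row[4] for row in data.values()
               if all(lo <= row[COL[m]] <= hi for m, (lo, hi) in path)]
    return len(set(covered)) == 1

assert is_full_decision_path([("m3", (4.8, 6)), ("m1", (6.2, 7.2))])  # covers only n1, n2, n3
assert not is_full_decision_path([("m3", (4.8, 6))])                  # covers p3 and n1, n2, n3
```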
Fig. 1: Lattice of the pattern structure (G+, (D, ⊓), δ), with the positive hypotheses marked, among them the two minimal positive hypotheses P1 and P2.

4.1   A first algorithm for building decision trees from interval pattern structures

In this section, we propose a first algorithm for extracting full decision paths
from the set of minimal positive hypotheses P and the set of minimal negative
hypotheses N. Intuitively, minimal positive (resp. negative) hypotheses describe
the largest areas of the attribute space that gather the maximum number of posi-
tive (resp. negative) examples and no negative (resp. positive) example. Positive
and negative areas may intersect on some dimensions. In Example 1 (see Table 1),
P = {P1, P2} and N = {N1}, and we denote by Pi ∩ Nj the interval vector whose
k-th component is the intersection of the k-th components of Pi and Nj.
Recall that P1 = ⟨[5, 7], [2, 3.2], [3.5, 4.7], [1, 1.4]⟩, P2 =
⟨[5, 5.9], [2, 3.2], [3.5, 4.8], [1, 1.8]⟩ and N1 = ⟨[6.2, 7.2], [2.8, 3.3], [4.8, 6], [1.8, 2.1]⟩.
Then we have:
                              P1 ∩ N1 = ⟨[6.2, 7], [2.8, 3.2], ∅, ∅⟩
                              P2 ∩ N1 = ⟨∅, [2.8, 3.2], [4.8, 4.8], [1.8, 1.8]⟩
    We note that P1 and N1 have no intersection for attributes m3 and m4. This
means that any example whose value for m3 (resp. m4) is contained in P1's interval
for m3 (resp. m4) can directly be classified as positive. Similarly, any example whose
value for m3 (resp. m4) is contained in N1's interval for m3 (resp. m4) can directly
be classified as negative. The same occurs for P2 and N1 on m1.
    Therefore, a full decision path for a minimal positive hypothesis P is defined
as a sequence ⟨(mi, mi(P))⟩ for i ∈ {1, ..., |N|}, where mi is an attribute such that
mi(P ∩ Ni) = ∅ (see footnote 4). A full decision path for a minimal negative
hypothesis N is defined as a sequence ⟨(mj, mj(N))⟩ for j ∈ {1, ..., |P|}, where mj
is an attribute such that mj(N ∩ Pj) = ∅.
    Here are examples of such decision paths (built from P1, P2 and N1 respectively):

                                ⟨(m3, [3.5, 4.7])⟩                      (P1)
                                ⟨(m4, [1, 1.4])⟩                        (P1)
                                ⟨(m1, [5, 5.9])⟩                        (P2)
                                ⟨(m3, [4.8, 6]), (m1, [6.2, 7.2])⟩      (N1)
                                ⟨(m4, [1.8, 2.1]), (m1, [6.2, 7.2])⟩    (N1)
   Decision paths built from P1 and P2 are sequences containing a single
element since |N| = 1. Decision paths built from N1 are sequences containing
two elements since |P| = 2. Two distinct full decision paths can be built from
P1, since there are two attributes on which P1 and N1 do not intersect.
   A positive (resp. negative) decision tree is therefore a set of full decision
paths, one for each minimal positive (resp. negative) hypothesis. For instance:
Footnote 4: For any interval pattern P, the notation mi(P) denotes its interval value for attribute mi.

"if m3 ∈ [3.5, 4.7] then +, else if m1 ∈ [5, 5.9] then +, else −" is an example of a
positive decision tree. An example of a negative decision tree is "if m1 ∈ [6.2, 7.2]
and m3 ∈ [4.8, 6] then −, else +".
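
A sketch (ours) of how such a positive decision tree classifies a new example: the
conjunction attached to each minimal positive hypothesis is tested in turn, the
answer is + on the first success and − when all tests fail.

```python
def classify(example, positive_paths, default="-"):
    """example: dict attribute -> value; positive_paths: one full decision path per hypothesis."""
    for path in positive_paths:
        if all(lo <= example[attr] <= hi for attr, (lo, hi) in path):
            return "+"
    return default

# Non-redundant full decision paths of Example 1 (from P1 and P2 respectively).
positive_paths = [[("m3", (3.5, 4.7))], [("m1", (5, 5.9))]]
assert classify({"m1": 7, "m2": 3.2, "m3": 4.7, "m4": 1.4}, positive_paths) == "+"   # p1
assert classify({"m1": 6.7, "m2": 3.3, "m3": 5.7, "m4": 2.1}, positive_paths) == "-" # n1
```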
    Algorithm 1 describes the computation of full decision paths for minimal pos-
itive hypotheses. The dual algorithm for minimal negative hypotheses is obtained
by interchanging P and N.



 1  Res ← empty array of size |P|;
 2  foreach P ∈ P do
 3      foreach N ∈ N such that ∃mi, mi(P ∩ N) = ∅ do
 4          if (mi, mi(P)) ∉ Res[P] then
 5              Res[P] ← Res[P] ∪ {(mi, mi(P))};
 6      foreach N ∈ N such that ∄mi, mi(P ∩ N) = ∅ do
 7          Res[P] ← Res[P] ∪ { ⋁_{m∈M} (m, m(P) \ m(P ∩ N)) };

 Algorithm 1: Modified algorithm for extracting full decision paths Res (including
 non-redundant ones) for minimal positive hypotheses



     The different steps of the algorithm are detailed below:

line 1: Res will contain a full decision path for each minimal positive hypothesis.
line 2: Process each minimal positive hypothesis P.
line 3: For each minimal negative hypothesis N having at least one attribute
m such that m(P ∩ N) = ∅, choose one of these attributes, called mi below.
line 4: Ensure that mi has not already been selected for another N; this makes it
possible to produce non-redundant full decision paths (see Example 2).
line 5: Add the interval mi(P) to the full decision path of P. The test mi ∈ mi(P)
separates the positive examples covered by P from the negative examples covered
by N.
line 6: Process each minimal negative hypothesis N that has no attribute m such
that m(P ∩ N) = ∅.
line 7: Positive examples covered by P and negative examples covered by N can
be separated by a disjunction of tests m ∈ m(P) \ m(P ∩ N) over the attributes m.
Indeed, there is at least one attribute for which a positive example covered by P
belongs to m(P) and not to m(N); otherwise, N would not be a negative hy-
pothesis.
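
A direct, unoptimized Python transcription of Algorithm 1 (a sketch under our data
layout assumptions, not the authors' implementation); on Example 1 it recovers one
non-redundant full decision path per minimal positive hypothesis.

```python
def inter(i, j):
    """Intersection of two closed intervals, or None when it is empty."""
    lo, hi = max(i[0], j[0]), min(i[1], j[1])
    return (lo, hi) if lo <= hi else None

def full_decision_paths(pos, neg, attrs):
    """pos, neg: name -> interval vector of the minimal positive / negative hypotheses."""
    res = {name: [] for name in pos}                        # Res: one path per P in P (line 1)
    for p_name, P in pos.items():                           # line 2
        for N in neg.values():
            empty = [k for k in range(len(P)) if inter(P[k], N[k]) is None]
            if empty:                                       # line 3: a single test separates P and N
                test = (attrs[empty[0]], P[empty[0]])
                if test not in res[p_name]:                 # line 4: keep the path non-redundant
                    res[p_name].append(test)                # line 5
            else:                                           # line 6: no attribute with empty intersection
                # line 7: record, for each attribute, m(P) and m(P ∩ N) so that the
                # disjunctive test m ∈ m(P) \ m(P ∩ N) can be expressed afterwards.
                disj = tuple((attrs[k], P[k], inter(P[k], N[k])) for k in range(len(P)))
                res[p_name].append(("OR", disj))
    return res

pos = {"P1": ((5, 7), (2, 3.2), (3.5, 4.7), (1, 1.4)),
       "P2": ((5, 5.9), (2, 3.2), (3.5, 4.8), (1, 1.8))}
neg = {"N1": ((6.2, 7.2), (2.8, 3.3), (4.8, 6), (1.8, 2.1))}
print(full_decision_paths(pos, neg, ["m1", "m2", "m3", "m4"]))
# {'P1': [('m3', (3.5, 4.7))], 'P2': [('m1', (5, 5.9))]}
```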



   Note that Example 1 is a particular case where all the negative examples are
gathered in a unique minimal negative hypothesis.
   A few values have been modified in Table 2 in order to produce two minimal
negative hypotheses.

                                      m1   m2   m3   m4   class
                                  p1  7    3.2  4.7  1.4  +
                                  p2  5    2    3.5  1    +
                                  p3  5.9  3.2  4.8  1.8  +
                                  p4  5.5  2.3  4    1.3  +
                                  n1  5.9  3.3  5.7  1.4  -
                                  n2  7.2  3.2  6    1.8  -
                                  n3  6.2  2.8  4.8  1.8  -
                             Table 2: Example 2: training set




   The minimal positive hypotheses P1 and P2 remain unchanged, while there are
now two minimal negative hypotheses:

                      N1 = ⟨[5.9, 7.2], [3.2, 3.3], [5.7, 6], [1.4, 1.8]⟩
                      N2 = ⟨[6.2, 7.2], [2.8, 3.2], [4.8, 6], [1.8, 1.8]⟩
    This leads to the following intersections:

                            P1 ∩ N1 = ⟨[5.9, 7], [3.2, 3.2], ∅, [1.4, 1.4]⟩
                            P1 ∩ N2 = ⟨[6.2, 7], [2.8, 3.2], ∅, ∅⟩
                            P2 ∩ N1 = ⟨[5.9, 5.9], [3.2, 3.2], ∅, [1.4, 1.8]⟩
                            P2 ∩ N2 = ⟨∅, [2.8, 3.2], [4.8, 4.8], [1.8, 1.8]⟩
    Examples of full decision paths computed by Algorithm 1 from P1 are:

                            ⟨(m3, [3.5, 4.7]), (m4, [1, 1.4])⟩     (1)
                            ⟨(m3, [3.5, 4.7]), (m3, [3.5, 4.7])⟩   (2)

Note that neither N1 nor N2 intersects P1 on m3; therefore the full decision
path (2) can be simplified into ⟨(m3, [3.5, 4.7])⟩. More generally, following the pre-
vious definitions, ⟨(m3, [3.5, 4.7])⟩ is a non-redundant full decision path, while
⟨(m3, [3.5, 4.7]), (m4, [1, 1.4])⟩ and ⟨(m3, [3.5, 4.7]), (m3, [3.5, 4.7])⟩ are not. A con-
ditional test has been added in Algorithm 1 (line 4) in order to also produce such
non-redundant full decision paths.
     Finally, a concrete positive decision tree is built from the set of full decision
paths: each node corresponds to a minimal positive hypothesis Pi and contains
a test that is the conjunction of the elements of a full decision path.
The left child contains +, and the right child is a node corresponding to another
minimal positive hypothesis Pj, or − if all minimal positive hypotheses have been
processed.
     An example of a decision tree for Example 2 is: "if m3 ∈ [3.5, 4.7] and m4 ∈
[1, 1.4] then +, else (if m3 ∈ [3.5, 4.8] and m1 ∈ [5, 5.9] then +, else −)".
     We detail below the complete process for Examples 1 and 2.

4.2   Example 1

    P:   P1 = ⟨[5, 7], [2, 3.2], [3.5, 4.7], [1, 1.4]⟩
         P2 = ⟨[5, 5.9], [2, 3.2], [3.5, 4.8], [1, 1.8]⟩
    N:   N1 = ⟨[6.2, 7.2], [2.8, 3.3], [4.8, 6], [1.8, 2.1]⟩
    Intersections:
         P1 ∩ N1 = ⟨[6.2, 7], [2.8, 3.2], ∅, ∅⟩
         P2 ∩ N1 = ⟨∅, [2.8, 3.2], [4.8, 4.8], [1.8, 1.8]⟩
    Full decision paths from P:
         ⟨(m3, [3.5, 4.7])⟩                      (P1)
         ⟨(m4, [1, 1.4])⟩                        (P1)
         ⟨(m1, [5, 5.9])⟩                        (P2)
    Full decision paths from N:
         ⟨(m3, [4.8, 6]), (m1, [6.2, 7.2])⟩      (N1)
         ⟨(m4, [1.8, 2.1]), (m1, [6.2, 7.2])⟩    (N1)



Fig. 2: Decision trees built from Example 1 (panels: full paths from P1 then P2, full paths from P2 then P1, and full paths from N1).

4.3   Example 2

    P:   P1 = ⟨[5, 7], [2, 3.2], [3.5, 4.7], [1, 1.4]⟩
         P2 = ⟨[5, 5.9], [2, 3.2], [3.5, 4.8], [1, 1.8]⟩
    N:   N1 = ⟨[5.9, 7.2], [3.2, 3.3], [5.7, 6], [1.4, 1.8]⟩
         N2 = ⟨[6.2, 7.2], [2.8, 3.2], [4.8, 6], [1.8, 1.8]⟩
    Intersections:
         P1 ∩ N1 = ⟨[5.9, 7], [3.2, 3.2], ∅, [1.4, 1.4]⟩
         P1 ∩ N2 = ⟨[6.2, 7], [2.8, 3.2], ∅, ∅⟩
         P2 ∩ N1 = ⟨[5.9, 5.9], [3.2, 3.2], ∅, [1.4, 1.8]⟩
         P2 ∩ N2 = ⟨∅, [2.8, 3.2], [4.8, 4.8], [1.8, 1.8]⟩
    Full decision paths from P:
         ⟨(m3, [3.5, 4.7])⟩                      (P1)
         ⟨(m3, [3.5, 4.7]), (m4, [1, 1.4])⟩      (P1, redundant)
         ⟨(m3, [3.5, 4.8]), (m1, [5, 5.9])⟩      (P2)
    Full decision paths from N:
         ⟨(m3, [5.7, 6])⟩                        (N1)
         ⟨(m3, [4.8, 6]), (m1, [6.2, 7.2])⟩      (N2)
         ⟨(m4, [1.8, 1.8]), (m1, [6.2, 7.2])⟩    (N2)


4.4   Comparison with traditional decision tree learning approaches

Standard algorithms such as C4.5 produce decision trees whose nodes contain
tests of the form a ≤ v, i.e. the value of attribute a is less than or equal to v, while
our nodes contain conjunctions of tests of the form a ∈ [a1, a2] ∧ b ∈ [b1, b2]. A
solution consists in identifying the minimal and maximal values of each attribute
in the training set, and replacing them by −∞ and +∞ respectively in the
resulting trees (see Figure 4). Moreover, common decision tree induction tech-
niques use information gain maximization (or, equivalently, conditional entropy
minimization) to choose the best split at each node. The conditional entropy of a
split is null when each child node is pure (contains only positive or only negative ex-
amples). When such a perfect split cannot be expressed as an attribute-value test,
it can be shown that the optimal split minimizing conditional entropy maximizes
the number of examples in one pure child node (the proof is omitted due to space
limitations). This optimal split exactly matches our notion of minimal positive
(resp. negative) hypothesis, which corresponds to a description gathering the
maximum number of only positive (resp. negative) examples.
    However, we stress that our algorithm is only a first and naive attempt to
produce decision trees from multi-valued contexts using pattern structures. Its
aim is only to clarify the links between decision tree learning and pattern struc-
tures; it therefore obviously lacks appropriate data structures and optimizations.
We plan to focus our efforts on algorithm optimization and then on rigorous
experiments on standard datasets.
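
The boundary-relaxation step mentioned above can be sketched as follows (ours; the
helper name is hypothetical): an interval bound that reaches the attribute's minimum
(resp. maximum) over the training set is replaced by −∞ (resp. +∞), so that a test
such as m4 ∈ [1, 1.4] becomes the C4.5-style test m4 ≤ 1.4, as in Figure 4.

```python
import math

def relax_bounds(interval, attr_values):
    """Replace bounds touching the training minimum / maximum of the attribute by -inf / +inf."""
    lo, hi = interval
    lo = -math.inf if lo <= min(attr_values) else lo
    hi = math.inf if hi >= max(attr_values) else hi
    return (lo, hi)

m4_column = [1.4, 1, 1.8, 1.3, 2.1, 1.8, 1.8]      # attribute m4 in Table 1
print(relax_bounds((1, 1.4), m4_column))           # (-inf, 1.4), i.e. the test m4 <= 1.4
```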


5     Concluding remarks

In this paper, we studied the links between decision trees and FCA in the par-
ticular context of numerical data. More precisely, we focused on an extension




Fig. 3: Decision trees built from Example 2 (panels: full paths from P1 then P2, full paths from P2 then P1, full paths from N1 then N2, and full paths from N2 then N1).


Fig. 4: Comparison of the decision trees produced by our approach ("if m4 ∈ (−∞, 1.4] then +, else if m1 ∈ (−∞, 5.9] then +, else −") and by the Weka implementation of C4.5 ("if m4 ≤ 1.4 then +, else if m1 ≤ 5.9 then +, else −") for Example 1.



of FCA for numerical data called interval pattern structures, which has recently
gained popularity through its ability to handle numerical data without any dis-
cretization step. We showed that interval pattern structures built from positive and
negative examples reveal positive and negative hypotheses, from which
decision paths and decision trees can be built.
    In future work, we will focus on a comprehensive and rigorous comparison
of our approach with traditional decision tree learning techniques. Moreover,
we will study how to introduce pruning techniques into our approach in order to
avoid overfitting. We will also investigate solutions for handling nominal class
attributes (i.e. more than two classes) and heterogeneous explanatory attributes
(binary, nominal, ordinal, numerical). Finally, notice that interval patterns are
closed, since (·)□□ is a closure operator. In a recent work [11], it has been shown
that whereas a closed interval pattern represents the smallest hyperrectangle
in its equivalence class, interval pattern generators represent the largest hyper-
rectangles. Accordingly, generators are favoured by the minimum description length
(MDL) principle, since they are less constrained. An interesting perspective is to
test their effectiveness for describing minimal hypotheses in the present work.


References

 1. Quinlan, J.: Induction of decision trees. Machine learning 1(1) (1986) 81–106
 2. Fu, H., Njiwoua, P., Nguifo, E.: A comparative study of FCA-based supervised
    classification algorithms. Concept Lattices (2004) 219–220
 3. Ganter, B., Wille, R.: Formal Concept Analysis. Springer-Verlag (1999)
 4. Ganter, B., Kuznetsov, S.O.: Pattern structures and their projections. In: ICCS
    ’01: Proceedings of the 9th International Conference on Conceptual Structures,
    Springer-Verlag (2001) 129–142
 5. Kaytoue, M., Kuznetsov, S.O., Napoli, A., Duplessis, S.: Mining gene expression
    data with pattern structures in formal concept analysis. Inf. Sci. 181(10) (2011)
    1989–2001
 6. Breiman, L.: Classification and regression trees. Chapman & Hall/CRC (1984)

 7. Kuznetsov, S.O.: Machine learning and formal concept analysis. Int. Conf. on
    Formal Concept Analysis, LNCS 2961, (2004) 287–312
 8. Guillas, S., Bertet, K., Ogier, J.: A generic description of the concept lattices
    classifier: Application to symbol recognition. Graphics Recognition. Ten Years
    Review and Future Perspectives (2006) 47–60
 9. Nijssen, S., Fromont, E.: Mining optimal decision trees from itemset lattices. In:
    Proceedings of the 13th ACM SIGKDD international conference on knowledge
    discovery and data mining, ACM (2007) 530–539
10. Nguifo, E., Njiwoua, P.: Iglue: A lattice-based constructive induction system. In-
    telligent data analysis 5(1) (2001) 73
11. Kaytoue, M., Kuznetsov, S.O., Napoli, A.: Revisiting Numerical Pattern Mining
    with Formal Concept Analysis. In: International Joint Conference on Artificial
    Intelligence (IJCAI), Barcelona, Spain (2011)