=Paper=
{{Paper
|id=Vol-2668/paper10
|storemode=property
|title=A Link between Pattern Structures and Random Forests
|pdfUrl=https://ceur-ws.org/Vol-2668/paper10.pdf
|volume=Vol-2668
|authors=Tilo Krause,Lars Lumpe,Stefan E. Schmidt
|dblpUrl=https://dblp.org/rec/conf/cla/KrauseLS20
}}
==A Link between Pattern Structures and Random Forests==
Tilo Krause, Lars Lumpe and Stefan E. Schmidt
Institut für Algebra, Technische Universität Dresden
tilokrause@hotmail.de, larslumpe@gmail.com, midt1@msn.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in Francisco J. Valverde-Albacete, Martin Trnecka (Eds.): Proceedings of the 15th International Conference on Concept Lattices and Their Applications, CLA 2020, pp. 131-143, 2020.

Abstract. Our paper shows that decision trees and random forests can be described via pattern structures. This new perspective leads to a better understanding of random forests and delivers a new way of analyzing complex data with the help of pattern structures.

1 Introduction

Pattern structures within the framework of formal concept analysis were introduced in [5]. Since then, they have been the subject of further investigation, as in [3, 11, 12, 13, 14], and have turned out to be an important tool for analyzing various real-world applications (cf. [5, 7-10]). Previous investigations such as [6] show the connection between decision trees and formal concept analysis. In this paper we present a new application and a link between pattern structures and random forests. First, we introduce embedded pattern structures, which enable us to create pattern structures from pattern setups. Subsequently, we apply our findings to a data set [4] and train a random forest to predict the quality of red wines. Our approach leads to a better understanding of pattern structures and random forests. We also provide a useful visualization to interpret a random forest and construct a model to predict classes of red wines.

2 Preliminaries

We start with a few general definitions.

Definition 1 (Adjunction). Let P = (P, ≤) and L = (L, ≤) be posets; furthermore let f : P → L and g : L → P be maps.
(1) The pair (f, g) is an adjunction w.r.t. (P, L) if fx ≤ y is equivalent to x ≤ gy for all x ∈ P and y ∈ L. In this case, we refer to (P, L, f, g) as a poset adjunction.
(2) g is residual from L to P if the preimage of a principal filter in P under g is always a principal filter in L, that is, for every x ∈ P there exists y ∈ L s.t.
 g^{-1}{s ∈ P | x ≤ s} = {t ∈ L | y ≤ t}.

Definition 2 (restriction). Let P := (P, ≤_P) be a poset; then, for every set U, the poset P|U := (U, ≤_P ∩ (U × U)) is called the restriction of P onto U.

Definition 3. Let f : A → B be a surjective map; then IP_f : B → 2^A, b ↦ f^{-1}b is a partition of A. We call IP_f an f-induced partition of A.

Note: We write f : A ↠ B for a surjective map f from A to B. If we consider a poset of patterns, it often arises as the dual of a given poset.

Definition 4 (Dedekind-MacNeille completion). Let D := (D, ≤) be an ordered set. For every subset A ⊆ D we call A↑ := {d ∈ D | a ≤ d for all a ∈ A} the set of all upper bounds of A. The set of all lower bounds of A is defined dually and denoted by A↓. A cut of D is a pair (A, B) with A, B ⊆ D, A↑ = B and A = B↓. It is well known that these cuts, ordered by
 (A1, B1) ≤ (A2, B2) :⇔ A1 ⊆ A2 (⇔ B2 ⊆ B1),
form a complete lattice, the Dedekind-MacNeille completion. It is the smallest complete lattice containing a subset that is order isomorphic with D, and it is denoted by DMN(D).

Definition 5 (Interval Poset). Let P := (P, ≤_P) be a poset. Then we call IntP := {[p, q]_P | p, q ∈ P} with [p, q]_P := {t ∈ P | p ≤_P t ≤_P q} the set of all intervals of P, and we refer to IntP := (IntP, ⊇) as the interval poset of P.

Remark:
(i) Let P be a poset. Then IntP is an upper bounded poset, that is, ∅ is the greatest element of IntP, provided P has at least 2 elements. Furthermore, if P is a (complete) lattice, then so is IntP.
(ii) If A is a set of attributes, then (ℝ^A, ≤) := (ℝ, ≤)^A is a lattice. With (i) it follows that Int(ℝ^A, ≤) is a lower bounded lattice.
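To make Remark (ii) concrete: in the interval poset of (ℝ^A, ≤) the infimum of two intervals is the smallest interval containing both, taken coordinate-wise. The following minimal Python sketch is an illustration only (the helper names are not from the paper); it encodes intervals of ℝ^A as pairs of bound vectors, the order of IntP as reverse inclusion, and this infimum.

 # Minimal sketch: intervals of R^A as pairs of coordinate-wise bounds,
 # the order of IntP (reverse inclusion), and the infimum of two intervals.
 def leq(x, y):
     """x <= y coordinate-wise in R^A (x, y are tuples of equal length)."""
     return all(a <= b for a, b in zip(x, y))
 
 def contains(i, j):
     """[p, q] contains [r, s] iff p <= r and s <= q."""
     (p, q), (r, s) = i, j
     return leq(p, r) and leq(s, q)
 
 def interval_leq(i, j):
     """Order of IntP = (IntP, ⊇): i <= j iff i ⊇ j."""
     return contains(i, j)
 
 def interval_inf(i, j):
     """Infimum in (IntP, ⊇): the smallest interval containing i and j."""
     (p, q), (r, s) = i, j
     return (tuple(min(a, b) for a, b in zip(p, r)),
             tuple(max(a, b) for a, b in zip(q, s)))
 
 # Two interval patterns over the attribute set A = {m1, m2}:
 i = ((4.0, 10.0), (6.0, 12.0))
 j = ((5.0, 11.0), (7.0, 14.0))
 print(interval_inf(i, j))                   # ((4.0, 10.0), (7.0, 14.0))
 print(interval_leq(interval_inf(i, j), i))  # True: the infimum lies below both

This coordinate-wise "smallest enclosing interval" is the similarity operation on interval patterns used in [8].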
The main definitions of FCA are based on a binary relation I between a set of so-called objects G and a set of so-called attributes M. However, in many real-world knowledge discovery problems, researchers have to deal with data sets that are much more complex than binary data tables. In our case, there is a set of numerical attributes, such as the amount of acetic acid, the density, the amount of salt, etc., describing the quality of a red wine. To deal with this kind of data, pattern structures are a useful tool.

Definition 6 (pattern setup, pattern structure). A triple P = (G, D, δ) is a pattern setup if G is a set, D = (D, ⊑) is a poset, and δ : G → D is a map. In case every subset of δG := {δg | g ∈ G} has an infimum in D, we refer to P as a pattern structure.

In the sequel we need Definition 5 and an application of Theorem 1 from [13]; both are listed below.

Definition 7 (pattern morphism). If G = (G, D, δ) and H = (H, E, ε) are pattern setups, then a pair (f, ϕ) forms a pattern morphism from G to H if f : G → H is a map and ϕ is a residual map from D to E satisfying ϕ ∘ δ = ε ∘ f, that is, the following diagram commutes:

          f
    G ────────→ H
    │           │
   δ│           │ε
    ↓           ↓
    D ────────→ E
          ϕ

Theorem 1. Let G be a pattern structure and H be a pattern setup. If (f, ϕ) is a pattern morphism from G to H with f being surjective, then H is also a pattern structure.

In this paper we want to present a connection between pattern structures and random forests. Decision trees are an important data mining tool, used to predict classes of data sets. Many applications of decision trees build a model on a so-called training set and then try to predict classes of unseen data (test sets). For an introduction to the subject, we recommend [1]. Before we give an abstract definition of a decision tree, we want to present a way to construct one.

Construction 1 (decision tree). We start at the root node and split the data on the feature that results in the best splitting measure (e.g. least Gini impurity). In an iterative process, we repeat this splitting procedure at each child node until the nodes are pure (contain only one class) or a termination rule is fulfilled. Nodes that do not split the data any further are called leaves.

Definition 8 (decision tree). A finite tree is a triple T := (T, ≤_T, 0_T) consisting of a finite poset (T, ≤_T) with least element 0_T in which every principal downset is totally ordered. The least element is called the root node. The maximal elements of T are called the leaves of T; let LeT denote the set of leaves of T. A triple T := (G, T, λ) is called a finite decision tree, or decision tree for short, if G is a finite set, T := (T, ≤_T, 0_T) is a finite tree and λ : G ↠ LeT is a surjective map. The map
 τ : T → 2^{LeT}, t ↦ {b ∈ LeT | t ≤_T b}
assigns to every node the set of leaves greater than or equal to that node (in the sense of ≤_T). We call T^c := (T^c, ≤_T, 0_T) with T^c := T ∪ {1_T} the tree completion of T and T^c a complete tree. This leads to a map λ̄ : G → T^c, g ↦ λg. In the present context we refer to T^s := (G, T, T^c, λ, λ̄, τ) as a decision tree setup.

Remark:
(i) A tree completion T^c is the smallest complete lattice which contains T.
(ii) The λ-induced partition of G is given by IP_λ : LeT → 2^G, b ↦ λ^{-1}b.

The following example illustrates the definition of a decision tree.

 G    m1   m2   class
 w0    4   14     0
 w1    5   12     0
 w2    6   11     1
 w3    7   10     1
 w4    8   13     0
 w5    9   15     0

Fig. 1. Decision tree example. Left: the data; right: the constructed tree, whose root {w0, ..., w5} is split into the nodes {w0, w1, w2, w3} and {w4, w5}, and whose node {w0, w1, w2, w3} is split further into the leaves {w0, w1} and {w2, w3}.
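The decision tree of Fig. 1 and the maps of Definition 8 can be illustrated in a few lines of Python. The sketch below is not code from the paper and assumes scikit-learn; a greedy learner may choose different splits than the tree drawn in Fig. 1, but the point is the surjective leaf map λ : G ↠ LeT (read off with apply) and the induced partition IP_λ of G.

 # Sketch of Definition 8 on the data of Fig. 1 (assumes scikit-learn).
 from collections import defaultdict
 from sklearn.tree import DecisionTreeClassifier
 
 G = ["w0", "w1", "w2", "w3", "w4", "w5"]
 X = [[4, 14], [5, 12], [6, 11], [7, 10], [8, 13], [9, 15]]  # attributes m1, m2
 y = [0, 0, 1, 1, 0, 0]                                      # classes
 
 tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
 
 # λ : G ↠ LeT — every object is mapped to the leaf it ends up in.
 leaf_of = dict(zip(G, tree.apply(X)))
 
 # IP_λ : LeT → 2^G — the λ-induced partition of G into one block per leaf.
 partition = defaultdict(set)
 for g, leaf in leaf_of.items():
     partition[leaf].add(g)
 
 print(leaf_of)          # leaf index for every object
 print(dict(partition))  # the blocks of the partition (pure on this toy data)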
In [2] Leo Breiman introduced random forests, which lead to significant improvements in classification accuracy by growing an ensemble of decision trees and letting them vote for the most popular class. In this paper we use the following definition of a random forest.

Definition 9 (random forest). Let T_i^s := (G, T_i, T_i^c, λ_i, λ̄_i, τ_i) be a decision tree setup for every i ∈ I. Then we call F := (G, T_prod, λ) with
 T_prod := ∏_{i∈I} T_i (as a set),  T_prod := ∏_{i∈I} T_i (as an ordered product)  and  λ : G → ∏_{i∈I} T_i, g ↦ {λ_i g}_{i∈I}
a random forest.
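As a small illustration of Definition 9 (again not code from the paper, assuming scikit-learn): the map λ of a random forest sends every object g to the family {λ_i g}_{i∈I} of leaves it reaches in the individual trees, and RandomForestClassifier.apply returns exactly this leaf vector, one index per tree.

 # Sketch of the map λ : G → ∏_{i∈I} T_i of Definition 9 (assumes scikit-learn).
 from sklearn.ensemble import RandomForestClassifier
 
 G = ["w0", "w1", "w2", "w3", "w4", "w5"]
 X = [[4, 14], [5, 12], [6, 11], [7, 10], [8, 13], [9, 15]]  # attributes m1, m2
 y = [0, 0, 1, 1, 0, 0]
 
 forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
 
 # apply() returns, for every object, the leaf it reaches in every tree,
 # i.e. the leaf vector {λ_i g}_{i∈I} of the product construction.
 leaf_vectors = forest.apply(X)   # shape: (|G|, |I|)
 for g, vec in zip(G, leaf_vectors):
     print(g, tuple(vec))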
3 Embedded Pattern Structures

The following definition establishes the connection between pattern setups and pattern structures.

Definition 10 (embedded pattern structure). Let L be a complete lattice and P := (G, D, δ) a pattern setup with D := L|D. Then we call P^e := (G, L, D, δ̄) with δ̄ : G → L, g ↦ δg the embedded pattern structure. We say the pattern setup P is embedded in the pattern structure (G, L, δ̄).

The following two constructions demonstrate how to build an embedded pattern structure from a pattern setup.

Construction 2. Let G be a set and let D := (D, ≤) be a poset. Then, for every map ϱ : G → D, the elementary pattern structure is given by P_ϱ := (G, L, ε) with L := (2^D, ⊇) and ε : G → 2^D, g ↦ {ϱg}. Hence, the pattern setup (G, D, ϱ) is embedded in the pattern structure P_ϱ.

Construction 3. For every pattern setup P := (G, D, δ), a pattern structure is given through the Dedekind-MacNeille completion as (G, DMN(D), δ̄) with δ̄ : G → DMN(D), g ↦ δg.

4 Making the Link

In this section we present a way to construct a pattern structure from a given decision tree. The starting point is the following pattern setup.

Construction 4. Let T^s := (G, T, T^c, λ, λ̄, τ) be a decision tree setup. Then P := (G, T, λ) is a pattern setup. We embed this pattern setup in the embedded pattern structure P^e := (G, T^c, T, λ̄). This leads to the pattern structure (G, T^c, λ̄).

This construction provides a connection between decision trees and pattern structures. Below we show the link between pattern structures and random forests. For this purpose we have to look at the product of pattern structures.

Construction 5. Let I be a finite set and P_i := (G, D_i, δ_i) a pattern structure for every i ∈ I. Then P := (G, D, δ) with
 D := ∏_{i∈I} D_i (as a set),  D := ∏_{i∈I} D_i (as an ordered product)  and  δ : G → ∏_{i∈I} D_i, g ↦ {δ_i g}_{i∈I}
is a pattern structure, too.

Our construction shows that a product of pattern structures is again a pattern structure. A decision tree can be described via a pattern structure and, since a random forest is a product of decision trees, a random forest is characterized by a pattern structure, too. The following construction lays this out.

Construction 6. Let I be an index set, T_i^s := (G, T_i, T_i^c, λ_i, λ̄_i, τ_i) a decision tree setup for every i ∈ I, and F := (G, T_prod, λ) the corresponding random forest. Construction 4 provides us with the embedded pattern structures P_i^e := (G, T_i^c, T_i, λ̄_i) for every tree T_i. Then P := (G, T, λ̄) with
 T := ∏_{i∈I} T_i^c (as a set),  T := ∏_{i∈I} T_i^c (as an ordered product)  and  λ̄ : G → ∏_{i∈I} T_i^c, g ↦ {λ̄_i g}_{i∈I}
is a pattern structure, too.

Construction 6 presents a way to describe a random forest through a pattern structure. Not only does this novel perspective enable a deeper understanding of random forests, but it also proves useful in real-world applications, as will be pointed out in the next section.

5 Real World Application

In this section, we first describe the public data set we used and then apply Construction 6 to a typical data-mining classification problem.

5.1 Red Wine Data Set

To apply our previous results to a public data set, we chose the red wine data set from [4]. It contains 1599 examples of wines, described by 11 numerical attributes. The inputs include objective tests (e.g. fixed acidity, sulphates, pH value, residual sugar, chlorides, density, alcohol, ...), and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). For our purpose, we established a binary distinction where every wine with a quality score above 5 is classified as "good" and all others as "bad". This led to a set of 855 positive and 744 negative examples. We split the data into a training set (75% of the examples) and a test set (25% of the examples) and trained a random forest of one hundred trees, with at least 10 examples per leaf and a maximal depth of 3, on the training set. We verified our model on the test set and obtained the following confusion matrix:

Fig. 2. Model of the random forest on the test set (confusion matrix).
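The setup of Section 5.1 can be reproduced roughly as follows. This is a sketch under the stated assumptions (75/25 split, one hundred trees, at least 10 examples per leaf, maximal depth 3); the authors' exact split, random seed and file layout are not given in the paper, so the resulting confusion matrix may differ from Fig. 2.

 # Sketch of the Section 5.1 setup (assumes pandas and scikit-learn, and that
 # winequality-red.csv from [4] is available locally with the usual UCI columns).
 import pandas as pd
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.metrics import confusion_matrix
 from sklearn.model_selection import train_test_split
 
 data = pd.read_csv("winequality-red.csv", sep=";")
 X = data.drop(columns="quality")
 y = (data["quality"] > 5).astype(int)        # "good" iff quality score above 5
 
 X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.25, random_state=0)    # 75% training, 25% test
 
 forest = RandomForestClassifier(
     n_estimators=100,        # one hundred trees
     min_samples_leaf=10,     # at least 10 examples per leaf
     max_depth=3,             # maximal depth of 3
     random_state=0).fit(X_train, y_train)
 
 print(confusion_matrix(y_test, forest.predict(X_test)))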
As this example shows, random forests are a useful tool, but it is hard to comprehend how a decision tree arrives at its ruling. Interval pattern structures, introduced in [8], can be helpful, as the next section shows.

5.2 From Evaluation Map to Interval Pattern Structures

Many data-mining relevant data sets (like the red wine data set) can be described by an evaluation matrix:

Definition 11 (evaluation map). Let G be a finite set, M a set of attributes, and W_m := (W_m, ≤_m) a complete lattice for every attribute m ∈ M. Further, let
 W := ∏_{m∈M} W_m (as a set)  and  W := ∏_{m∈M} W_m (as an ordered product).
Then a map
 α : G → W, g ↦ {α_m g}_{m∈M}
such that α_m : G → W_m, g ↦ w_m is called an evaluation map. We call E := (G, M, W, α) an evaluation setup.

Example 1. In the wine data set of [4] it is possible to interpret the wines as the set G, the describing attributes as the set M, and W_m as the numerical range of attribute m with its natural order.

In the above example, the evaluation map α : G → W assigns to every wine a vector with the values of all attributes m ∈ M.

Construction 7. Let E := (G, M, W, α) be an evaluation setup and, further, let T^s := (G, T, T^c, λ, λ̄, τ) be a decision tree setup. Then
 ϕ : T^c → IntW|ϕT^c, t ↦ [inf_W α(λ^{-1}τt), sup_W α(λ^{-1}τt)]
assigns an interval to every node of the decision tree. The intervals are spanned by the objects in the respective node. It is possible to create a pattern structure from this map, as the following theorem shows.

Theorem 2. Let E := (G, M, W, α) be an evaluation setup and T^s := (G, T, T^c, λ, λ̄, τ) a decision tree setup. Then P := (G, IntW|ϕT^c, ϕ ∘ λ̄) with
 ϕ : T^c → IntW|ϕT^c, t ↦ [inf_W α(λ^{-1}τt), sup_W α(λ^{-1}τt)]
is a pattern structure.

Proof. We want to apply Theorem 1. Since the identity map is clearly surjective, we have to show that (id, ϕ) is a pattern morphism; more precisely, we have to prove that ϕ is a residual map. This can be achieved by applying Definition 1: the preimage of a principal filter in IntW|ϕT^c under ϕ is always a principal filter in T^c, since for every x ∈ IntW|ϕT^c
 ϕ^{-1}{s ∈ IntW|ϕT^c | x ≤ s} = {t ∈ T^c | ϕ^{-1}x ≤ t}
holds.

To lift the previous theorem up to a random forest, we use Construction 5.

Construction 8. Let E := (G, M, W, α) be an evaluation setup, I an index set, and F := ∏_{i∈I} T_i a random forest with corresponding decision tree setups T_i^s := (G, T_i, T_i^c, λ_i, λ̄_i, τ_i). Then P_i := (G, IntW|ϕ_i T_i^c, ϕ_i ∘ λ̄_i) with
 ϕ_i : T_i^c → IntW|ϕ_i T_i^c, t_i ↦ [inf_W α(λ_i^{-1}τ_i t_i), sup_W α(λ_i^{-1}τ_i t_i)]
is a pattern structure and P := (G, D, δ) with
 D := ∏_{i∈I} IntW|ϕ_i T_i^c (as a set),  D := ∏_{i∈I} IntW|ϕ_i T_i^c (as an ordered product)  and  δ : G → ∏_{i∈I} IntW|ϕ_i T_i^c, g ↦ {ϕ_i ∘ λ̄_i g}_{i∈I}
is a pattern structure, too.

This construction provides a set of intervals for every g ∈ G. The intersection of these intervals is not empty, since g lies in every one of them. Every g is assigned a predicted class probability by each decision tree, which is a purity measure: it is calculated as the ratio of good examples to all examples in a leaf. For the random forest, the average of this ratio over all leaves containing g is used to calculate a score. We took the best scored g of our training set and looked at the union of the intervals constructed by the random forest. We obtained the following result:

Fig. 3. Interval 1: built from the best scored wine (features scaled to the range [0, 1]).

The diagram shows all 11 attributes (fixed acidity, volatile acidity, ...) of the wines. Displayed is the interval constructed from the best scored wine of the random forest. This pattern contains 330 wines of the training set, 297 of them good and 33 bad. As mentioned before, the training set consists of 75% of the data, i.e. 1199 wines; 652 of them are good and 547 bad.

To choose a collection of intervals for a prediction model, we calculate a score on a pattern p via
 score(p) = (good wines in p) / (all good wines) + ([ratio_p · 100] − 50) / 50.

To improve interpretability, we selected only the five best scored intervals listed below and added them to the first interval to build our model. Again we looked at the union of the intervals and scaled the range of the features to [0, 1].

Fig. 4. Interval 2: contains altogether 331 wines, 295 good and 36 bad.

Fig. 5. Interval 3: contains altogether 466 wines, 383 good and 83 bad.

Fig. 6. Interval 4: contains altogether 366 wines, 318 good and 48 bad.

Fig. 7. Interval 5: contains altogether 466 wines, 452 good and 142 bad.

Fig. 8. Interval 6: contains altogether 288 wines, 261 good and 27 bad.

Combining these 6 intervals to predict the classes of the test set leads to the following confusion matrix:

Fig. 9. Combination of the 6 intervals (confusion matrix on the test set).

Our model lost some prediction quality, but its high interpretability makes our algorithm useful. For example, fixed acidity, the first attribute, seems not to be important for the quality of a wine, since all patterns have a huge range in this feature. The range of the attribute sulphates, on the other hand, is comparatively small. This suggests that sulphates are an important feature for classifying a wine.
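The interval patterns of Constructions 7 and 8 and the score above can be computed from a fitted forest. The sketch below is an illustration only: helper names such as leaf_intervals and score are hypothetical, [·] is read as rounding down, and forest, X_train (a pandas DataFrame) and y_train are assumed to come from the Section 5.1 sketch.

 # Sketch of Construction 7 (restricted to leaves) and of the score formula.
 import numpy as np
 
 def leaf_intervals(tree, X, y):
     """Map each leaf t of one fitted tree to the interval
     [inf α(λ⁻¹τt), sup α(λ⁻¹τt)] spanned by the training objects in t,
     together with the counts needed for the score. X is a DataFrame."""
     leaves = tree.apply(X)                    # λ : G -> LeT as leaf indices
     out = {}
     for leaf in np.unique(leaves):
         mask = leaves == leaf                 # λ⁻¹(leaf) ⊆ G
         block = X[mask]
         out[leaf] = (block.min(axis=0).to_numpy(),   # inf over the block
                      block.max(axis=0).to_numpy(),   # sup over the block
                      int(y[mask].sum()),             # good wines in the block
                      len(block))                     # all wines in the block
     return out
 
 def score(good_in_p, all_good, ratio_p):
     """score(p) = good wines in p / all good wines + ([ratio_p·100] − 50)/50."""
     return good_in_p / all_good + (int(ratio_p * 100) - 50) / 50
 
 # Usage, continuing the Section 5.1 sketch:
 #   per_tree = [leaf_intervals(t, X_train, y_train) for t in forest.estimators_]
 #   lower, upper, good, n = next(iter(per_tree[0].values()))
 #   print(score(good, int(y_train.sum()), good / n))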
6 Conclusion

We introduced a novel framework for the application of pattern structures. The paper presents a new way of building a pattern structure from a given decision tree. This link offers an interesting perspective and, via a visualization, leads to a better understanding of how a decision tree arrives at its ruling. Furthermore, our paper shows that products of pattern structures are again pattern structures, and so we extend the link between pattern structures and decision trees to random forests, since a random forest is a product of decision trees. We also introduced a model to predict classes of red wines; this model is an abstraction of a random forest. The prediction of the random forest itself is better, but the high interpretability makes our algorithm valuable. This is just a first useful example of our method; further investigations on other data sets have to prove the importance of our algorithm. Besides, we made some theoretical progress in the field of pattern structures by introducing embedded pattern structures. They provide a framework in which it is possible to construct a pattern structure out of a given pattern setup.

References

1. L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone (1984), Classification and Regression Trees, Chapman & Hall.
2. L. Breiman (2001), Random Forests. Machine Learning, Vol. 45(1), pp. 5-32.
3. A. Buzmakov, S. O. Kuznetsov, A. Napoli (2015), Revisiting Pattern Structure Projections. Formal Concept Analysis. Lecture Notes in Artificial Intelligence (Springer), Vol. 9113, pp. 200-215.
4. P. Cortez, A. Cerdeira, F. Almeida, T. Matos, J. Reis (2009), Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47(4), pp. 547-553.
5. B. Ganter, S. O. Kuznetsov (2001), Pattern Structures and Their Projections. Proc. 9th Int. Conf. on Conceptual Structures, ICCS'01, G. Stumme and H. Delugach (Eds.). Lecture Notes in Artificial Intelligence (Springer), Vol. 2120, pp. 129-142.
6. S. Guillas, K. Bertet, J. M. Ogier, N. Girard (2008), Some links between decision tree and dichotomic lattice. CLA 2008, pp. 193-205.
7. T. B. Kaiser, S. E. Schmidt (2011), Some remarks on the relation between annotated ordered sets and pattern structures. Pattern Recognition and Machine Intelligence. Lecture Notes in Computer Science (Springer), Vol. 6744, pp. 43-48.
8. M. Kaytoue, S. O. Kuznetsov, A. Napoli, S. Duplessis (2011), Mining gene expression data with pattern structures in formal concept analysis. Information Sciences (Elsevier), Vol. 181, pp. 1989-2001.
9. S. O. Kuznetsov (2009), Pattern structures for analyzing complex data. In H. Sakai et al. (Eds.): Proceedings of the 12th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC'09). Lecture Notes in Artificial Intelligence (Springer), Vol. 5908, pp. 33-44.
10. S. O. Kuznetsov (2013), Scalable Knowledge Discovery in Complex Data with Pattern Structures. In P. Maji, A. Ghosh, M. N. Murty, K. Ghosh, S. K. Pal (Eds.): Proc. 5th International Conference on Pattern Recognition and Machine Intelligence (PReMI'2013). Lecture Notes in Computer Science (Springer), Vol. 8251, pp. 30-41.
11. L. Lumpe, S. E. Schmidt (2015), A Note on Pattern Structures and Their Projections. Formal Concept Analysis. Lecture Notes in Artificial Intelligence (Springer), Vol. 9113, pp. 145-150.
12. L. Lumpe, S. E. Schmidt (2016), Morphisms Between Pattern Structures and Their Impact on Concept Lattices. FCA4AI@ECAI 2016, pp. 25-34.
13. L. Lumpe, S. E. Schmidt (2015), Pattern Structures and Their Morphisms. CLA 2015, pp. 171-179.
14. L. Lumpe, S. E. Schmidt (2016), Viewing Morphisms Between Pattern Structures via Their Concept Lattices and via Their Representations. International Symposium on Methodologies for Intelligent Systems (ISMIS 2017), pp. 597-608.