=Paper=
{{Paper
|id=Vol-2668/paper10
|storemode=property
|title=A Link between Pattern Structures and Random Forests
|pdfUrl=https://ceur-ws.org/Vol-2668/paper10.pdf
|volume=Vol-2668
|authors=Tilo Krause,Lars Lumpe,Stefan E. Schmidt
|dblpUrl=https://dblp.org/rec/conf/cla/KrauseLS20
}}
==A Link between Pattern Structures and Random Forests==
Tilo Krause, Lars Lumpe and Stefan E. Schmidt
Institut für Algebra, Technische Universität Dresden
tilokrause@hotmail.de, larslumpe@gmail.com, midt1@msn.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in Francisco J. Valverde-Albacete, Martin Trnecka (Eds.): Proceedings of the 15th International Conference on Concept Lattices and Their Applications, CLA 2020, pp. 131-143, 2020.

Abstract. Our paper shows that decision trees and random forests can be described via pattern structures. This new perspective leads to a better understanding of random forests and delivers a new way of analyzing complex data with the help of pattern structures.

1 Introduction

Pattern structures within the framework of formal concept analysis were introduced in [5]. Since then, they have been the subject of further investigation, as in [3, 11, 12, 13, 14], and have turned out to be an important tool for analyzing various real-world applications (cf. [5, 7-10]). Previous investigations such as [6] show the connection between decision trees and formal concept analysis. In this paper we present a new application and a link between pattern structures and random forests. First, we introduce embedded pattern structures, which enable us to create pattern structures from pattern setups. Subsequently, we apply our findings to a data set [4] and train a random forest to predict the quality of red wines. Our approach leads to a better understanding of pattern structures and random forests. We also provide a useful visualization to interpret a random forest and construct a model to predict classes of red wines.

2 Preliminaries

We start with a few general definitions.

Definition 1 (Adjunction). Let P = (P, ≤) and L = (L, ≤) be posets; furthermore let f : P → L and g : L → P be maps.
(1) The pair (f, g) is an adjunction w.r.t. (P, L) if fx ≤ y is equivalent to x ≤ gy for all x ∈ P and y ∈ L. In this case, we refer to (P, L, f, g) as a poset adjunction.
(2) g is residual from L to P if the preimage of a principal filter in P under g is always a principal filter in L, that is, for every x ∈ P there exists y ∈ L s.t.
 g^{-1}{s ∈ P | x ≤ s} = {t ∈ L | y ≤ t}.

Definition 2 (restriction). Let P := (P, ≤_P) be a poset; then, for every set U, the poset P|U := (U, ≤_P ∩ (U × U)) is called the restriction of P onto U.

Definition 3. Let f : A → B be a surjective map; then IP_f : B → 2^A, b ↦ f^{-1}b is a partition of A. We call IP_f an f-induced partition of A.

Note: We write f : A ↠ B for a surjective map f from A to B. If we consider a poset of patterns, it often arises as the dual of a given poset.

Definition 4 (Dedekind-MacNeille completion). Let D := (D, ≤) be an ordered set. For every subset A ⊆ D we call A↑ := {d ∈ D | a ≤ d for all a ∈ A} the set of all upper bounds of A. The set of all lower bounds of A is defined dually and denoted by A↓. A cut of D is a pair (A, B) with A, B ⊆ D, A↑ = B and A = B↓. It is well known that these cuts, ordered by
 (A1, B1) ≤ (A2, B2) :⇔ A1 ⊆ A2 (⇔ B2 ⊆ B1),
form a complete lattice, the Dedekind-MacNeille completion. It is the smallest complete lattice containing a subset that is order isomorphic with D, and it is denoted by DMN(D).

Definition 5 (Interval Poset). Let P := (P, ≤_P) be a poset. Then we call IntP := {[p, q]_P | p, q ∈ P} with [p, q]_P := {t ∈ P | p ≤_P t ≤_P q} the set of all intervals of P, and we refer to IntP := (IntP, ⊇) as the interval poset of P.

Remark:
(i) Let P be a poset. Then IntP is an upper bounded poset, that is, ∅ is the greatest element of IntP, provided P has at least 2 elements. Furthermore, if P is a (complete) lattice, then so is IntP.
(ii) If A is a set of attributes, then (ℝ^A, ≤) := (ℝ, ≤)^A is a lattice. With (i) it follows that Int(ℝ^A, ≤) is a lower bounded lattice.
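To make Remark (ii) concrete: in the interval poset of (ℝ^A, ≤) the infimum of two intervals is the smallest interval containing both, taken coordinate-wise. The following minimal Python sketch is an illustration only (the helper names are not from the paper); it encodes intervals of ℝ^A as pairs of bound vectors, the order of IntP as reverse inclusion, and this infimum.

 # Minimal sketch: intervals of R^A as pairs of coordinate-wise bounds,
 # the order of IntP (reverse inclusion), and the infimum of two intervals.
 def leq(x, y):
     """x <= y coordinate-wise in R^A (x, y are tuples of equal length)."""
     return all(a <= b for a, b in zip(x, y))
 
 def contains(i, j):
     """[p, q] contains [r, s] iff p <= r and s <= q."""
     (p, q), (r, s) = i, j
     return leq(p, r) and leq(s, q)
 
 def interval_leq(i, j):
     """Order of IntP = (IntP, ⊇): i <= j iff i ⊇ j."""
     return contains(i, j)
 
 def interval_inf(i, j):
     """Infimum in (IntP, ⊇): the smallest interval containing i and j."""
     (p, q), (r, s) = i, j
     return (tuple(min(a, b) for a, b in zip(p, r)),
             tuple(max(a, b) for a, b in zip(q, s)))
 
 # Two interval patterns over the attribute set A = {m1, m2}:
 i = ((4.0, 10.0), (6.0, 12.0))
 j = ((5.0, 11.0), (7.0, 14.0))
 print(interval_inf(i, j))                   # ((4.0, 10.0), (7.0, 14.0))
 print(interval_leq(interval_inf(i, j), i))  # True: the infimum lies below both

This coordinate-wise "smallest enclosing interval" is the similarity operation on interval patterns used in [8].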
The main definitions of FCA are based on a binary relation I between a set of so-called objects G and a set of so-called attributes M. However, in many real-world knowledge discovery problems, researchers have to deal with data sets that are much more complex than binary data tables. In our case, there is a set of numerical attributes, such as the amount of acetic acid, the density, the amount of salt, etc., describing the quality of a red wine. To deal with this kind of data, pattern structures are a useful tool.

Definition 6 (pattern setup, pattern structure). A triple P = (G, D, δ) is a pattern setup if G is a set, D = (D, ⊑) is a poset, and δ : G → D is a map. In case every subset of δG := {δg | g ∈ G} has an infimum in D, we refer to P as a pattern structure.

In the sequel we need Definition 5 and an application of Theorem 1 from [13]; both are listed below.

Definition 7 (pattern morphism). If G = (G, D, δ) and H = (H, E, ε) are pattern setups, then a pair (f, ϕ) forms a pattern morphism from G to H if f : G → H is a map and ϕ is a residual map from D to E satisfying ϕ ∘ δ = ε ∘ f, that is, the following diagram commutes:

          f
    G ────────→ H
    │           │
   δ│           │ε
    ↓           ↓
    D ────────→ E
          ϕ

Theorem 1. Let G be a pattern structure and H be a pattern setup. If (f, ϕ) is a pattern morphism from G to H with f being surjective, then H is also a pattern structure.

In this paper we want to present a connection between pattern structures and random forests. Decision trees are an important data mining tool, used to predict classes of data sets. Many applications of decision trees build a model on a so-called training set and then try to predict classes of unseen data (test sets). For an introduction to the subject, we recommend [1]. Before we give an abstract definition of a decision tree, we want to present a way to construct one.

Construction 1 (decision tree). We start at the root node and split the data on the feature that results in the best splitting measure (e.g. least Gini impurity). In an iterative process, we repeat this splitting procedure at each child node until the nodes are pure (contain only one class) or a termination rule is fulfilled. Nodes that do not split the data any further are called leaves.

Definition 8 (decision tree). A finite tree is a triple T := (T, ≤_T, 0_T) consisting of a finite poset (T, ≤_T) with least element 0_T in which every principal downset is totally ordered. The least element is called the root node. The maximal elements of T are called the leaves of T; let LeT denote the set of leaves of T. A triple T := (G, T, λ) is called a finite decision tree, or decision tree for short, if G is a finite set, T := (T, ≤_T, 0_T) is a finite tree and λ : G ↠ LeT is a surjective map. The map
 τ : T → 2^{LeT}, t ↦ {b ∈ LeT | t ≤_T b}
assigns to every node the set of leaves greater than or equal to that node (in the sense of ≤_T). We call T^c := (T^c, ≤_T, 0_T) with T^c := T ∪ {1_T} the tree completion of T and T^c a complete tree. This leads to a map λ̄ : G → T^c, g ↦ λg. In the present context we refer to T^s := (G, T, T^c, λ, λ̄, τ) as a decision tree setup.

Remark:
(i) A tree completion T^c is the smallest complete lattice which contains T.
(ii) The λ-induced partition of G is given by IP_λ : LeT → 2^G, b ↦ λ^{-1}b.

The following example illustrates the definition of a decision tree.

 G    m1   m2   class
 w0    4   14     0
 w1    5   12     0
 w2    6   11     1
 w3    7   10     1
 w4    8   13     0
 w5    9   15     0

Fig. 1. Decision tree example. Left: the data; right: the constructed tree, whose root {w0, ..., w5} is split into the nodes {w0, w1, w2, w3} and {w4, w5}, and whose node {w0, w1, w2, w3} is split further into the leaves {w0, w1} and {w2, w3}.
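The decision tree of Fig. 1 and the maps of Definition 8 can be illustrated in a few lines of Python. The sketch below is not code from the paper and assumes scikit-learn; a greedy learner may choose different splits than the tree drawn in Fig. 1, but the point is the surjective leaf map λ : G ↠ LeT (read off with apply) and the induced partition IP_λ of G.

 # Sketch of Definition 8 on the data of Fig. 1 (assumes scikit-learn).
 from collections import defaultdict
 from sklearn.tree import DecisionTreeClassifier
 
 G = ["w0", "w1", "w2", "w3", "w4", "w5"]
 X = [[4, 14], [5, 12], [6, 11], [7, 10], [8, 13], [9, 15]]  # attributes m1, m2
 y = [0, 0, 1, 1, 0, 0]                                      # classes
 
 tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
 
 # λ : G ↠ LeT — every object is mapped to the leaf it ends up in.
 leaf_of = dict(zip(G, tree.apply(X)))
 
 # IP_λ : LeT → 2^G — the λ-induced partition of G into one block per leaf.
 partition = defaultdict(set)
 for g, leaf in leaf_of.items():
     partition[leaf].add(g)
 
 print(leaf_of)          # leaf index for every object
 print(dict(partition))  # the blocks of the partition (pure on this toy data)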
In [2] Leo Breiman introduced random forests, which lead to significant improvements in classification accuracy by growing an ensemble of decision trees and letting them vote for the most popular class. In this paper we use the following definition of a random forest.

Definition 9 (random forest). Let T_i^s := (G, T_i, T_i^c, λ_i, λ̄_i, τ_i) be a decision tree setup for every i ∈ I. Then we call F := (G, T_prod, λ) with
 T_prod := ∏_{i∈I} T_i (as a set),  T_prod := ∏_{i∈I} T_i (as an ordered product)  and  λ : G → ∏_{i∈I} T_i, g ↦ {λ_i g}_{i∈I}
a random forest.
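As a small illustration of Definition 9 (again not code from the paper, assuming scikit-learn): the map λ of a random forest sends every object g to the family {λ_i g}_{i∈I} of leaves it reaches in the individual trees, and RandomForestClassifier.apply returns exactly this leaf vector, one index per tree.

 # Sketch of the map λ : G → ∏_{i∈I} T_i of Definition 9 (assumes scikit-learn).
 from sklearn.ensemble import RandomForestClassifier
 
 G = ["w0", "w1", "w2", "w3", "w4", "w5"]
 X = [[4, 14], [5, 12], [6, 11], [7, 10], [8, 13], [9, 15]]  # attributes m1, m2
 y = [0, 0, 1, 1, 0, 0]
 
 forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
 
 # apply() returns, for every object, the leaf it reaches in every tree,
 # i.e. the leaf vector {λ_i g}_{i∈I} of the product construction.
 leaf_vectors = forest.apply(X)   # shape: (|G|, |I|)
 for g, vec in zip(G, leaf_vectors):
     print(g, tuple(vec))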
3 Embedded Pattern Structures

The following definition establishes the connection between pattern setups and pattern structures.

Definition 10 (embedded pattern structure). Let L be a complete lattice and P := (G, D, δ) a pattern setup with D := L|D. Then we call P^e := (G, L, D, δ̄) with δ̄ : G → L, g ↦ δg the embedded pattern structure. We say the pattern setup P is embedded in the pattern structure (G, L, δ̄).

The following two constructions demonstrate how to build an embedded pattern structure from a pattern setup.

Construction 2. Let G be a set and let D := (D, ≤) be a poset. Then, for every map ϱ : G → D, the elementary pattern structure is given by P_ϱ := (G, L, ε) with L := (2^D, ⊇) and ε : G → 2^D, g ↦ {ϱg}. Hence, the pattern setup (G, D, ϱ) is embedded in the pattern structure P_ϱ.

Construction 3. For every pattern setup P := (G, D, δ), a pattern structure is given through the Dedekind-MacNeille completion as (G, DMN(D), δ̄) with δ̄ : G → DMN(D), g ↦ δg.

4 Making the Link

In this section we present a way to construct a pattern structure from a given decision tree. The starting point is the following pattern setup.

Construction 4. Let T^s := (G, T, T^c, λ, λ̄, τ) be a decision tree setup. Then P := (G, T, λ) is a pattern setup. We embed this pattern setup in the embedded pattern structure P^e := (G, T^c, T, λ̄). This leads to the pattern structure (G, T^c, λ̄).

This construction provides a connection between decision trees and pattern structures. Below we show the link between pattern structures and random forests. For this purpose we have to look at the product of pattern structures.

Construction 5. Let I be a finite set and P_i := (G, D_i, δ_i) a pattern structure for every i ∈ I. Then P := (G, D, δ) with
 D := ∏_{i∈I} D_i (as a set),  D := ∏_{i∈I} D_i (as an ordered product)  and  δ : G → ∏_{i∈I} D_i, g ↦ {δ_i g}_{i∈I}
is a pattern structure, too.

Our construction shows that a product of pattern structures is again a pattern structure. A decision tree can be described via a pattern structure and, since a random forest is a product of decision trees, a random forest is characterized by a pattern structure, too. The following construction lays this out.

Construction 6. Let I be an index set, T_i^s := (G, T_i, T_i^c, λ_i, λ̄_i, τ_i) a decision tree setup for every i ∈ I, and F := (G, T_prod, λ) the corresponding random forest. Construction 4 provides us with the embedded pattern structures P_i^e := (G, T_i^c, T_i, λ̄_i) for every tree T_i. Then P := (G, T, λ̄) with
 T := ∏_{i∈I} T_i^c (as a set),  T := ∏_{i∈I} T_i^c (as an ordered product)  and  λ̄ : G → ∏_{i∈I} T_i^c, g ↦ {λ̄_i g}_{i∈I}
is a pattern structure, too.

Construction 6 presents a way to describe a random forest through a pattern structure. Not only does this novel perspective enable a deeper understanding of random forests, but it also proves useful in real-world applications, as will be pointed out in the next section.

5 Real World Application

In this section, we first describe the public data set we used and then apply Construction 6 to a typical data-mining classification problem.

5.1 Red Wine Data Set

To apply our previous results to a public data set, we chose the red wine data set from [4]. It contains 1599 examples of wines, described by 11 numerical attributes. The inputs include objective tests (e.g. fixed acidity, sulphates, pH value, residual sugar, chlorides, density, alcohol, ...), and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). For our purpose, we established a binary distinction where every wine with a quality score above 5 is classified as "good" and all others as "bad". This led to a set of 855 positive and 744 negative examples. We split the data into a training set (75% of the examples) and a test set (25% of the examples) and trained a random forest of one hundred trees, with at least 10 examples per leaf and a maximal depth of 3, on the training set. We verified our model on the test set and obtained the following confusion matrix:

Fig. 2. Model of the random forest on the test set (confusion matrix).
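The setup of Section 5.1 can be reproduced roughly as follows. This is a sketch under the stated assumptions (75/25 split, one hundred trees, at least 10 examples per leaf, maximal depth 3); the authors' exact split, random seed and file layout are not given in the paper, so the resulting confusion matrix may differ from Fig. 2.

 # Sketch of the Section 5.1 setup (assumes pandas and scikit-learn, and that
 # winequality-red.csv from [4] is available locally with the usual UCI columns).
 import pandas as pd
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.metrics import confusion_matrix
 from sklearn.model_selection import train_test_split
 
 data = pd.read_csv("winequality-red.csv", sep=";")
 X = data.drop(columns="quality")
 y = (data["quality"] > 5).astype(int)        # "good" iff quality score above 5
 
 X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.25, random_state=0)    # 75% training, 25% test
 
 forest = RandomForestClassifier(
     n_estimators=100,        # one hundred trees
     min_samples_leaf=10,     # at least 10 examples per leaf
     max_depth=3,             # maximal depth of 3
     random_state=0).fit(X_train, y_train)
 
 print(confusion_matrix(y_test, forest.predict(X_test)))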
As this example shows, random forests are a useful tool, but it is hard to comprehend how a decision tree arrives at its ruling. Interval pattern structures, introduced in [8], can be helpful, as the next section shows.

5.2 From Evaluation Map to Interval Pattern Structures

Many data-mining relevant data sets (like the red wine data set) can be described by an evaluation matrix:

Definition 11 (evaluation map). Let G be a finite set, M a set of attributes, and W_m := (W_m, ≤_m) a complete lattice for every attribute m ∈ M. Further, let
 W := ∏_{m∈M} W_m (as a set)  and  W := ∏_{m∈M} W_m (as an ordered product).
Then a map
 α : G → W, g ↦ {α_m g}_{m∈M}
such that α_m : G → W_m, g ↦ w_m is called an evaluation map. We call E := (G, M, W, α) an evaluation setup.

Example 1. In the wine data set of [4] it is possible to interpret the wines as the set G, the describing attributes as the set M, and W_m as the numerical range of attribute m with its natural order.

In the above example, the evaluation map α : G → W assigns to every wine a vector with the values of all attributes m ∈ M.

Construction 7. Let E := (G, M, W, α) be an evaluation setup and, further, let T^s := (G, T, T^c, λ, λ̄, τ) be a decision tree setup. Then
 ϕ : T^c → IntW|ϕT^c, t ↦ [inf_W α(λ^{-1}τt), sup_W α(λ^{-1}τt)]
assigns an interval to every node of the decision tree. The intervals are spanned by the objects in the respective node. It is possible to create a pattern structure from this map, as the following theorem shows.

Theorem 2. Let E := (G, M, W, α) be an evaluation setup and T^s := (G, T, T^c, λ, λ̄, τ) a decision tree setup. Then P := (G, IntW|ϕT^c, ϕ ∘ λ̄) with
 ϕ : T^c → IntW|ϕT^c, t ↦ [inf_W α(λ^{-1}τt), sup_W α(λ^{-1}τt)]
is a pattern structure.

Proof. We want to apply Theorem 1. Since the identity map is clearly surjective, we have to show that (id, ϕ) is a pattern morphism; more precisely, we have to prove that ϕ is a residual map. This can be achieved by applying Definition 1: the preimage of a principal filter in IntW|ϕT^c under ϕ is always a principal filter in T^c, since for every x ∈ IntW|ϕT^c
 ϕ^{-1}{s ∈ IntW|ϕT^c | x ≤ s} = {t ∈ T^c | ϕ^{-1}x ≤ t}
holds.

To lift the previous theorem up to a random forest, we use Construction 5.

Construction 8. Let E := (G, M, W, α) be an evaluation setup, I an index set, and F := ∏_{i∈I} T_i a random forest with corresponding decision tree setups T_i^s := (G, T_i, T_i^c, λ_i, λ̄_i, τ_i). Then P_i := (G, IntW|ϕ_i T_i^c, ϕ_i ∘ λ̄_i) with
 ϕ_i : T_i^c → IntW|ϕ_i T_i^c, t_i ↦ [inf_W α(λ_i^{-1}τ_i t_i), sup_W α(λ_i^{-1}τ_i t_i)]
is a pattern structure and P := (G, D, δ) with
 D := ∏_{i∈I} IntW|ϕ_i T_i^c (as a set),  D := ∏_{i∈I} IntW|ϕ_i T_i^c (as an ordered product)  and  δ : G → ∏_{i∈I} IntW|ϕ_i T_i^c, g ↦ {ϕ_i ∘ λ̄_i g}_{i∈I}
is a pattern structure, too.

This construction provides a set of intervals for every g ∈ G. The intersection of these intervals is not empty, since g lies in every one of them. Every g is assigned a predicted class probability by each decision tree, which is a purity measure: it is calculated as the ratio of good examples to all examples in a leaf. For the random forest, the average of this ratio over all leaves containing g is used to calculate a score. We took the best scored g of our training set and looked at the union of the intervals constructed by the random forest. We obtained the following result:

Fig. 3. Interval 1: built from the best scored wine (features scaled to the range [0, 1]).

The diagram shows all 11 attributes (fixed acidity, volatile acidity, ...) of the wines. Displayed is the interval constructed from the best scored wine of the random forest. This pattern contains 330 wines of the training set, 297 of them good and 33 bad. As mentioned before, the training set consists of 75% of the data, i.e. 1199 wines; 652 of them are good and 547 bad.

To choose a collection of intervals for a prediction model, we calculate a score on a pattern p via
 score(p) = (good wines in p) / (all good wines) + ([ratio_p · 100] − 50) / 50.

To improve interpretability, we selected only the five best scored intervals listed below and added them to the first interval to build our model. Again we looked at the union of the intervals and scaled the range of the features to [0, 1].

Fig. 4. Interval 2: contains altogether 331 wines, 295 good and 36 bad.

Fig. 5. Interval 3: contains altogether 466 wines, 383 good and 83 bad.

Fig. 6. Interval 4: contains altogether 366 wines, 318 good and 48 bad.

Fig. 7. Interval 5: contains altogether 466 wines, 452 good and 142 bad.

Fig. 8. Interval 6: contains altogether 288 wines, 261 good and 27 bad.

Combining these 6 intervals to predict the classes of the test set leads to the following confusion matrix:

Fig. 9. Combination of the 6 intervals (confusion matrix on the test set).

Our model lost some prediction quality, but its high interpretability makes our algorithm useful. For example, fixed acidity, the first attribute, seems not to be important for the quality of a wine, since all patterns have a huge range in this feature. The range of the attribute sulphates, on the other hand, is comparatively small. This suggests that sulphates are an important feature for classifying a wine.
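The interval patterns of Constructions 7 and 8 and the score above can be computed from a fitted forest. The sketch below is an illustration only: helper names such as leaf_intervals and score are hypothetical, [·] is read as rounding down, and forest, X_train (a pandas DataFrame) and y_train are assumed to come from the Section 5.1 sketch.

 # Sketch of Construction 7 (restricted to leaves) and of the score formula.
 import numpy as np
 
 def leaf_intervals(tree, X, y):
     """Map each leaf t of one fitted tree to the interval
     [inf α(λ⁻¹τt), sup α(λ⁻¹τt)] spanned by the training objects in t,
     together with the counts needed for the score. X is a DataFrame."""
     leaves = tree.apply(X)                    # λ : G -> LeT as leaf indices
     out = {}
     for leaf in np.unique(leaves):
         mask = leaves == leaf                 # λ⁻¹(leaf) ⊆ G
         block = X[mask]
         out[leaf] = (block.min(axis=0).to_numpy(),   # inf over the block
                      block.max(axis=0).to_numpy(),   # sup over the block
                      int(y[mask].sum()),             # good wines in the block
                      len(block))                     # all wines in the block
     return out
 
 def score(good_in_p, all_good, ratio_p):
     """score(p) = good wines in p / all good wines + ([ratio_p·100] − 50)/50."""
     return good_in_p / all_good + (int(ratio_p * 100) - 50) / 50
 
 # Usage, continuing the Section 5.1 sketch:
 #   per_tree = [leaf_intervals(t, X_train, y_train) for t in forest.estimators_]
 #   lower, upper, good, n = next(iter(per_tree[0].values()))
 #   print(score(good, int(y_train.sum()), good / n))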
6 Conclusion

We introduced a novel framework for the application of pattern structures. The paper presents a new way of building a pattern structure from a given decision tree. This link offers an interesting perspective and, via a visualization, leads to a better understanding of how a decision tree arrives at its ruling. Furthermore, our paper shows that products of pattern structures are again pattern structures, and so we extend the link between pattern structures and decision trees to random forests, since a random forest is a product of decision trees. We also introduced a model to predict classes of red wines; this model is an abstraction of a random forest. The prediction of the random forest itself is better, but the high interpretability makes our algorithm valuable. This is just a first useful example of our method; further investigations on other data sets have to prove the importance of our algorithm. Besides, we made some theoretical progress in the field of pattern structures by introducing embedded pattern structures. They provide a framework in which it is possible to construct a pattern structure out of a given pattern setup.

References

1. L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone (1984), Classification and Regression Trees, Chapman & Hall.
2. L. Breiman (2001), Random Forests. Machine Learning, Vol. 45(1), pp. 5-32.
3. A. Buzmakov, S. O. Kuznetsov, A. Napoli (2015), Revisiting Pattern Structure Projections. Formal Concept Analysis. Lecture Notes in Artificial Intelligence (Springer), Vol. 9113, pp. 200-215.
4. P. Cortez, A. Cerdeira, F. Almeida, T. Matos, J. Reis (2009), Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47(4), pp. 547-553.
5. B. Ganter, S. O. Kuznetsov (2001), Pattern Structures and Their Projections. Proc. 9th Int. Conf. on Conceptual Structures, ICCS'01, G. Stumme and H. Delugach (Eds.). Lecture Notes in Artificial Intelligence (Springer), Vol. 2120, pp. 129-142.
6. S. Guillas, K. Bertet, J. M. Ogier, N. Girard (2008), Some links between decision tree and dichotomic lattice. CLA 2008, pp. 193-205.
7. T. B. Kaiser, S. E. Schmidt (2011), Some remarks on the relation between annotated ordered sets and pattern structures. Pattern Recognition and Machine Intelligence. Lecture Notes in Computer Science (Springer), Vol. 6744, pp. 43-48.
8. M. Kaytoue, S. O. Kuznetsov, A. Napoli, S. Duplessis (2011), Mining gene expression data with pattern structures in formal concept analysis. Information Sciences (Elsevier), Vol. 181, pp. 1989-2001.
9. S. O. Kuznetsov (2009), Pattern structures for analyzing complex data. In H. Sakai et al. (Eds.): Proceedings of the 12th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC'09). Lecture Notes in Artificial Intelligence (Springer), Vol. 5908, pp. 33-44.
10. S. O. Kuznetsov (2013), Scalable Knowledge Discovery in Complex Data with Pattern Structures. In P. Maji, A. Ghosh, M. N. Murty, K. Ghosh, S. K. Pal (Eds.): Proc. 5th International Conference on Pattern Recognition and Machine Intelligence (PReMI'2013). Lecture Notes in Computer Science (Springer), Vol. 8251, pp. 30-41.
11. L. Lumpe, S. E. Schmidt (2015), A Note on Pattern Structures and Their Projections. Formal Concept Analysis. Lecture Notes in Artificial Intelligence (Springer), Vol. 9113, pp. 145-150.
12. L. Lumpe, S. E. Schmidt (2016), Morphisms Between Pattern Structures and Their Impact on Concept Lattices. FCA4AI@ECAI 2016, pp. 25-34.
13. L. Lumpe, S. E. Schmidt (2015), Pattern Structures and Their Morphisms. CLA 2015, pp. 171-179.
14. L. Lumpe, S. E. Schmidt (2016), Viewing Morphisms Between Pattern Structures via Their Concept Lattices and via Their Representations. International Symposium on Methodologies for Intelligent Systems (ISMIS 2017), pp. 597-608.