=Paper=
{{Paper
|id=Vol-2972/paper9
|storemode=property
|title=Summation of Decision Trees
|pdfUrl=https://ceur-ws.org/Vol-2972/paper9.pdf
|volume=Vol-2972
|authors=Egor Dudyrev,Sergei O. Kuznetsov
|dblpUrl=https://dblp.org/rec/conf/ijcai/DudyrevK21
}}
==Summation of Decision Trees==
Egor Dudyrev [0000-0002-2144-3308] and Sergei O. Kuznetsov [0000-0003-3284-9001]
National Research University Higher School of Economics, Moscow, Russia

Abstract. Ensembles of decision trees, such as random forests, are efficient machine learning models with state-of-the-art prediction quality. However, their predictions are much less transparent than those of a single decision tree. In this paper, we describe a prediction model based on a single decision tree in terms of Formal Concept Analysis. We define a differential way of describing a decision rule. We conclude by presenting an approach to summing an ensemble of decision trees into a single decision semilattice with the same predictions.

Keywords: Ensembles of Decision Trees · Formal Concept Analysis · Supervised Machine Learning.

===1 Introduction===

A decision tree [4] is a popular machine learning model. It can help face the challenge of interpretable machine learning. However, a single tree is usually too simplistic to show good learning performance. Ensembles of decision trees show better learning quality; some of them, such as random forests [3] and gradient boosting [7], are considered state-of-the-art. However, ensembles lack the high interpretability of a single decision tree.

Formal Concept Analysis (FCA) [8] is a mathematically well-founded theory aimed at data analysis. In [1], [2], [9], [10], researchers show the connection between decision trees and FCA.

This paper continues our study of the connection between FCA and decision trees started in [6]. In that paper, we presented the following pipeline. First, we convert a decision tree into a concept lattice. Second, we fuse an ensemble of concept lattices into a single concept lattice. Third, we convert the concept lattice into a decision (semi)lattice: a supervised machine learning model with prediction quality non-inferior to that of ensembles of decision trees.

In what follows, we present a method for constructing a decision semilattice that outputs the same predictions as an ensemble of decision trees. We propose a differential way of describing a decision rule and, consequently, a decision tree and a decision semilattice. We finish by summing an ensemble of decision trees into a single decision semilattice.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===2 Basic definitions===

For standard definitions of FCA and decision trees, we refer the reader to [8] and [4], respectively. Here we use binary attributes to describe the algorithms. In the experimental section, we extend the algorithm to process numerical data with interval pattern structures [11].

The standard FCA framework operates with a set M of binary attributes. In what follows we often replace the set of attributes M by a set M∗ that consists of both the attributes m ∈ M and their complements ¬m ("not m"):

M∗ = M ∪ {¬m | m ∈ M}    (1)

===3 The proposed approach===

====3.1 Decision tree and decision semilattice====

Definition 1. A decision rule (p, t) is a pair of a subset of attributes p ⊆ M∗, called a premise, and a real number t ∈ R, called a target. The attributes in the premise p are non-complementary, i.e. ∀m ∈ M∗: if m ∈ p then ¬m ∉ p.

Given a description x ⊆ M∗, a decision rule can be read as "if x contains p (p ⊆ x), then predict t".
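For illustration, Definition 1 can be encoded in a few lines of Python. The representation below (attribute literals as strings, complements as ("not", m) pairs, and the helper names complement, is_valid_premise, rule_applies) is only a sketch, not the FCApy implementation.

<pre>
# A minimal sketch of Definition 1 (illustrative only, not the authors' FCApy code).
# Attributes from M are strings; complements from M* are modelled as ("not", m).
from typing import FrozenSet, Tuple, Union

Attribute = Union[str, Tuple[str, str]]            # "m" or ("not", "m")
DecisionRule = Tuple[FrozenSet[Attribute], float]  # (premise p, target t)

def complement(m: Attribute) -> Attribute:
    """Map m to ("not", m) and back."""
    return m[1] if isinstance(m, tuple) else ("not", m)

def is_valid_premise(p: FrozenSet[Attribute]) -> bool:
    """A premise must not contain an attribute together with its complement."""
    return all(complement(m) not in p for m in p)

def rule_applies(rule: DecisionRule, x: FrozenSet[Attribute]) -> bool:
    """'If x contains p (p is a subset of x), then predict t'."""
    premise, _target = rule
    return premise <= x

# Example: the rule fires on any description containing its premise.
rule = (frozenset({"old", ("not", "renovated")}), 4.2)
x = frozenset({"old", ("not", "renovated"), "downtown"})
assert is_valid_premise(rule[0]) and rule_applies(rule, x)
</pre>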
We order decision rules (p, t), (p̃, t̃) by the reverse inclusion of their premises:

(p, t) < (p̃, t̃) ⇔ p ⊃ p̃    (2)

A single decision rule cannot be applied to every possible description x ⊆ M∗. Therefore, we should use a set of decision rules. A popular way of structuring decision rules into a set is a decision tree DT.

Definition 2. A decision tree DT is an ordered set of decision rules satisfying the following properties: (a) each premise in DT is unique, (b) DT contains a root decision rule with the empty premise, (c) each non-root decision rule in DT has exactly one direct bigger neighbour ("parent") and one direct smaller neighbour of that parent ("sibling") which differs from it by one complementary attribute:

a) ∀(p, t) ∈ DT: ∄ t̃ ∈ R, t̃ ≠ t : (p, t̃) ∈ DT    (3)

b) ∃ t ∈ R : (∅, t) ∈ DT    (4)

c) ∀(p, t) ∈ DT, p ≠ ∅, ∃! (p_par, t_par), (p_sib, t_sib) ∈ DT, m ∈ p :
(p_par, t_par) ≻ (p, t), (p_par, t_par) ≻ (p_sib, t_sib), p_sib ≠ p,
p_par = p \ {m}, p_sib = (p \ {m}) ∪ {¬m}    (5)

We propose a more general type of ordered set of decision rules: a decision semilattice DSL. To define it, we relax property (c) of a decision tree DT.

Definition 3. A decision semilattice DSL is an ordered set of decision rules satisfying properties (a)-(b) (eq. 3-4) from Definition 2.

A decision tree DT is a special case of a decision semilattice DSL. Thus, any operation defined for a decision semilattice can also be applied to a decision tree.

We define a "prediction" function φ(DSL, x) as a function outputting a single target prediction for a description x ⊆ M∗ based on a decision semilattice DSL:

φ(DSL, x) = (1 / |DSL^x_min|) · Σ_{(p,t) ∈ DSL^x_min} t    (6)

where

DSL^x_min = {(p, t) ∈ DSL^x | ∄ (p̃, t̃) ∈ DSL^x : (p̃, t̃) < (p, t)}    (7)

DSL^x = {(p, t) ∈ DSL | p ⊆ x}    (8)

====3.2 Differential decision tree====

In this subsection we define a "differential" way of describing a decision rule: (given a prior prediction ŷ ∈ R) "if x contains p (p ⊆ x), then add t to the prediction ŷ". We define a function φ_Δ(DSL, x) which outputs a single target prediction for a description x ⊆ M∗ based on a decision semilattice DSL and the differential approach:

φ_Δ(DSL, x) = Σ_{(p,t) ∈ DSL^x} t    (9)

It is not obvious how to construct "differential" decision trees and semilattices directly. We suggest a solution to the former task: to construct a differential decision tree, one can construct a decision tree DT and then "differentiate" it with a function δ:

δ(DT) = {(p, t − t̃) | (p, t), (p̃, t̃) ∈ DT : (p, t) ≺ (p̃, t̃)} ∪ {(∅, t) ∈ DT}    (10)

Proposition 1. For a decision tree DT, the prediction φ(DT, x) matches the prediction φ_Δ(δ(DT), x) for any x.

Proof. The proof is derived from two facts: (i) a decision tree DT always uses only one decision rule to make a final prediction: |DT^x_min| = 1, ∀x ⊆ M∗; (ii) each target of a decision rule in δ(DT) represents the difference between the target of the corresponding decision rule in DT and the target of its parent.
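The following sketch implements equations (6)-(10) under the same (premise, target) rule representation as the previous snippet; the function names applicable, minimal, phi, phi_delta and differentiate are ours, not FCApy's. On a proper decision tree dt it also lets one check Proposition 1 directly: phi(dt, x) == phi_delta(differentiate(dt), x).

<pre>
# A sketch of equations (6)-(10); a DSL/DT is a set of (premise, target) pairs
# with frozenset premises, as in the previous snippet. Illustrative only.
from statistics import mean

def applicable(dsl, x):
    """DSL^x (eq. 8): rules whose premise is contained in the description x."""
    return {(p, t) for (p, t) in dsl if p <= x}

def minimal(rules):
    """DSL^x_min (eq. 7): applicable rules with no applicable rule below them,
    i.e. no other applicable rule with a strictly larger premise."""
    return {(p, t) for (p, t) in rules if not any(q > p for (q, _) in rules)}

def phi(dsl, x):
    """Prediction phi (eq. 6): average target of the minimal applicable rules."""
    return mean(t for (_, t) in minimal(applicable(dsl, x)))

def phi_delta(dsl, x):
    """Differential prediction phi_delta (eq. 9): sum of targets of all applicable rules."""
    return sum(t for (_, t) in applicable(dsl, x))

def differentiate(dt):
    """delta(DT) (eq. 10): replace each non-root target by its difference with
    the parent's target; the root rule (empty premise) is kept as-is."""
    targets = {p: t for (p, t) in dt}   # premises are unique (property a)
    diff = set()
    for p, t in dt:
        if not p:                       # root rule
            diff.add((p, t))
            continue
        # the parent premise differs from p by exactly one attribute (property c)
        parent = next(p - {m} for m in p if (p - {m}) in targets)
        diff.add((p, t - targets[parent]))
    return diff
</pre>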
====3.3 Summation of differential decision semilattices====

We define an addition operation on decision semilattices in the following way:

DSL_1 + DSL_2 = {(p, t_1 + t_2) | (p, t_1) ∈ DSL_1, (p, t_2) ∈ DSL_2}
∪ {(p, t_1) ∈ DSL_1 | ∀ t_2 ∈ R : (p, t_2) ∉ DSL_2}
∪ {(p, t_2) ∈ DSL_2 | ∀ t_1 ∈ R : (p, t_1) ∉ DSL_1}    (11)

The addition operation leads to an important proposition:

Proposition 2. Given a set of n decision semilattices {DSL_i}_{i=1}^{n}, the "differential" prediction of the sum of the decision semilattices matches the sum of the "differential" predictions of the summand decision semilattices:

φ_Δ(Σ_{i=1}^{n} DSL_i, x) = Σ_{i=1}^{n} φ_Δ(DSL_i, x),  ∀x ⊆ M∗    (12)

Proof. The proof follows from the definitions of the addition operation (eq. 11) and the function φ_Δ (eq. 9).

The summation of several identical decision semilattices can be represented as multiplication by a real number:

DSL ∗ k = Σ_{i=1}^{k} DSL = {(p, t ∗ k) | (p, t) ∈ DSL},  ∀k ∈ R    (13)

====3.4 Ensembles of decision trees as decision semilattices====

Random forest RF and gradient boosting GB are state-of-the-art ensembles of decision trees. Both operate with a set of decision trees {DT_i}_{i=1}^{n} and, optionally, real-valued hyperparameters. Although the two ensembles construct their sets of decision trees differently, their prediction functions φ_RF and φ_GB are similar, as both sum the predictions of the underlying decision trees:

φ_RF({DT_i}_{i=1}^{n}, x) = (1/n) · Σ_{i=1}^{n} φ(DT_i, x)    (14)

φ_GB(({DT_i}_{i=1}^{n}, α, λ), x) = α + λ · Σ_{i=1}^{n} φ(DT_i, x),  α, λ ∈ R    (15)

Proposition 3. Given a set of n decision trees {DT_i}_{i=1}^{n} and real numbers α, λ ∈ R, there is (i) a decision semilattice DSL_RF such that the prediction φ_Δ(DSL_RF, x) matches the prediction φ_RF({DT_i}_{i=1}^{n}, x) for any description x ⊆ M∗, and (ii) a decision semilattice DSL_GB such that the prediction φ_Δ(DSL_GB, x) matches the prediction φ_GB(({DT_i}_{i=1}^{n}, α, λ), x) for any description x ⊆ M∗:

1) ∀x ⊆ M∗: φ_Δ(DSL_RF, x) = φ_RF({DT_i}_{i=1}^{n}, x)    (16)

DSL_RF = (1/n) · Σ_{i=1}^{n} δ(DT_i)    (17)

2) ∀x ⊆ M∗: φ_Δ(DSL_GB, x) = φ_GB(({DT_i}_{i=1}^{n}, α, λ), x)    (18)

DSL_GB = {(∅, α)} + λ · Σ_{i=1}^{n} δ(DT_i)    (19)

Proof. (i) By Proposition 1, for any decision tree DT_i there is a differential decision tree δ(DT_i) such that φ(DT_i, x) = φ_Δ(δ(DT_i), x), ∀x ⊆ M∗; (ii) by Proposition 2, one can sum a set of differential decision trees into a single differential decision semilattice keeping the predictions unchanged.
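To illustrate equations (11), (13), (17) and (19), the sketch below continues the previous snippets; add, scale, rf_to_semilattice and gb_to_semilattice are hypothetical helper names of ours, not FCApy API. With them, Proposition 3 can be checked on toy data: phi_delta(rf_to_semilattice(trees), x) coincides (up to floating-point error) with the average of phi(dt, x) over the trees.

<pre>
# A sketch of the addition (eq. 11), scaling (eq. 13) and the semilattices
# DSL_RF (eq. 17) and DSL_GB (eq. 19). Illustrative only; builds on the
# differentiate() function from the previous snippet.
from collections import defaultdict
from functools import reduce

def add(dsl1, dsl2):
    """DSL1 + DSL2 (eq. 11): sum the targets of rules sharing a premise,
    keep the remaining rules of both semilattices unchanged."""
    targets = defaultdict(float)
    for p, t in list(dsl1) + list(dsl2):   # premises are unique within each DSL
        targets[p] += t
    return {(p, t) for p, t in targets.items()}

def scale(dsl, k):
    """DSL * k (eq. 13): multiply every target by the real number k."""
    return {(p, t * k) for (p, t) in dsl}

def rf_to_semilattice(trees):
    """DSL_RF (eq. 17): the average of the differentiated trees of a random forest."""
    summed = reduce(add, (differentiate(dt) for dt in trees))
    return scale(summed, 1 / len(trees))

def gb_to_semilattice(trees, alpha, lam):
    """DSL_GB (eq. 19): base score alpha plus lam times the summed differentiated trees."""
    summed = reduce(add, (differentiate(dt) for dt in trees))
    return add({(frozenset(), alpha)}, scale(summed, lam))
</pre>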
===4 Experiments===

This section presents empirical evidence that a decision semilattice can produce the same predictions as ensembles of decision trees. The experiments are run with the FCApy Python package (https://github.com/EgorDudyrev/FCApy).

The experimental setup is as follows. First, we construct the "base" models: a decision tree, a random forest and a gradient boosting from the scikit-learn package [12], and a gradient boosting from the XGBoost package [5]. Then we convert each decision tree of these models into the unified decision tree format used in FCApy. Finally, we aggregate the unified decision trees of the ensemble models into a decision semilattice as defined in equations 17 and 19.

We use three real-world regression datasets to compare the models: the Boston Housing Data ("Bost.", https://archive.ics.uci.edu/ml/machine-learning-databases/housing), the California Housing dataset ("Cal.", https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset), and the Diabetes Data ("Diab.", https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). To construct each decision semilattice in less than a minute (on average), we limit each ensemble model to only ten decision trees with a maximum depth of six. The single decision tree models are limited to a maximum depth of ten.

Table 1 shows the weighted average percentage error (WAPE) of the decision semilattices copying the predictions of the base models on both the train and test parts of each dataset. The error does not exceed 1.9%. The slight difference in the errors comes from the real-valued nature of the datasets. The premises of decision trees built on such data are of the form "m ≤ θ" or "m > θ", where m is a real-valued attribute and θ ∈ R. These premises are sensitive to the precision of θ. They also use both closed and open intervals, while our FCA-based implementation supports only the former. We therefore replace each premise of the form "m > θ" by the premise "m ≥ θ + 10^-9".

Table 1. WAPE (in %) of the decision semilattices copying the predictions of the base models, reported as train error / test error per dataset

 Base model         Bost.         Cal.          Diab.
 DecisionTree       0.00 / 0.02   0.00 / 0.01   0.00 / 0.25
 GradientBoosting   0.44 / 0.63   0.00 / 0.00   0.35 / 0.31
 RandomForest       0.88 / 0.84   1.75 / 1.88   0.10 / 0.30
 XGBoost            0.00 / 0.22   0.00 / 0.03   0.00 / 0.59
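The paper performs this conversion through FCApy's own decision tree format. As a rough indication of what extracting rules from a base model involves, the following sketch (ours, not the FCApy conversion code) walks a fitted scikit-learn DecisionTreeRegressor and emits one (premise, target) rule per node, applying the "m > θ → m ≥ θ + 10^-9" replacement described above. Note that matching such interval premises against a numeric description requires the interval pattern structure semantics of [11], not the plain set containment used in the binary sketches earlier.

<pre>
# A sketch (ours) of extracting (premise, target) rules from a fitted
# sklearn DecisionTreeRegressor. Premises are sets of interval conditions
# (feature_index, operator, threshold); every node, not only the leaves,
# contributes a rule whose target is the node's mean target value.
from sklearn.tree import DecisionTreeRegressor

def sklearn_tree_to_rules(reg: DecisionTreeRegressor, eps: float = 1e-9):
    tree = reg.tree_
    rules = set()

    def walk(node, premise):
        target = float(tree.value[node][0][0])   # mean target at this node
        rules.add((frozenset(premise), target))
        if tree.children_left[node] == -1:       # leaf: no further conditions
            return
        feat, thr = int(tree.feature[node]), float(tree.threshold[node])
        walk(tree.children_left[node],  premise | {(feat, "<=", thr)})
        # "feature > thr" is rewritten as "feature >= thr + eps" (closed interval)
        walk(tree.children_right[node], premise | {(feat, ">=", thr + eps)})

    walk(0, frozenset())
    return rules
</pre>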
===5 Conclusion===

In this paper, we have introduced a method for summing an ensemble of decision trees into a single decision semilattice model with the same predictions. To do so, we have presented a "differential" way to describe decision rules and a function for differentiating a single decision tree.

In future work, we plan to extend this approach to decision semilattices. We also plan to study the application of decision semilattices to improving the interpretability of ensembles of decision trees.

===Acknowledgments===

The work of Sergei O. Kuznetsov on the paper was carried out at the St. Petersburg Department of the Steklov Mathematical Institute of the Russian Academy of Sciences and supported by the Russian Science Foundation grant no. 17-11-01276.

===References===

1. Assaghir, Z., Kaytoue, M., Meira Jr., W., Villerd, J.: Extracting decision trees from interval pattern concept lattices. In: Napoli, A., Vychodil, V. (eds.) Proc. 8th Int. Conf. on Concept Lattices and Their Applications, Nancy, France, October 17-20, 2011. CEUR Workshop Proc., vol. 959, pp. 319–332. CEUR-WS.org (2011)

2. Belohlávek, R., De Baets, B., Outrata, J., Vychodil, V.: Inducing decision trees via concept lattices. Int. J. of General Systems 38(4), 455–467 (2009)

3. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)

4. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth (1984)

5. Chen, T., Guestrin, C.: XGBoost: A scalable tree boosting system. In: Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (2016)

6. Dudyrev, E., Kuznetsov, S.O.: Decision concept lattice vs. decision trees and random forests. In: Braud, A., Buzmakov, A., Hanika, T., Le Ber, F. (eds.) Formal Concept Analysis - 16th Int. Conf., ICFCA 2021, Strasbourg, France, June 29 - July 2, 2021, Proc. LNCS, vol. 12733, pp. 252–260. Springer (2021)

7. Friedman, J.: Greedy function approximation: A gradient boosting machine. Annals of Statistics 29, 1189–1232 (2001)

8. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. Springer, Berlin Heidelberg (1999)

9. Krause, T., Lumpe, L., Schmidt, S.E.: A link between pattern structures and random forests. In: Valverde-Albacete, F.J., Trnecka, M. (eds.) Proc. 15th Int. Conf. on Concept Lattices and Their Applications, Tallinn, Estonia, June 29 - July 1, 2020. CEUR Workshop Proc., vol. 2668, pp. 131–143. CEUR-WS.org (2020)

10. Kuznetsov, S.O.: Machine learning and formal concept analysis. In: Eklund, P.W. (ed.) Concept Lattices, 2nd Int. Conf. on Formal Concept Analysis, ICFCA 2004, Sydney, Australia, February 23-26, 2004, Proc. LNCS, vol. 2961, pp. 287–312. Springer (2004)

11. Kuznetsov, S.O.: Pattern structures for analyzing complex data. In: Sakai, H., Chakraborty, M.K., Hassanien, A.E., Slezak, D., Zhu, W. (eds.) Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, 12th Int. Conf., RSFDGrC 2009, Delhi, India, December 15-18, 2009, Proc. LNCS, vol. 5908, pp. 33–44. Springer (2009)

12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. J. of Machine Learning Research 12, 2825–2830 (2011)