Histogram-based Probabilistic Rule Lists for Numeric Targets

Lincen Yang*, Tim Opdam and Matthijs van Leeuwen
LIACS, Leiden University, The Netherlands

Abstract
While most rule learning methods focus on categorical targets for tasks such as classification and subgroup discovery, rule learning for numeric targets is under-studied, especially using probabilistic rules: the only existing probabilistic rule list method for numeric targets, named SSD++, is based on Gaussian parametric models. To extend this method, we adopt an adaptive histogram model to estimate the probability distribution of the target variable. We formalize rule list learning as a model selection problem, which we tackle with the minimum description length principle. We demonstrate that our method produces rule lists with better predictive performance than SSD++.

Keywords
Rule list, Regression rule, MDL model selection

1. Introduction and Related Work

Learning rules from data is a long-studied problem in inductive reasoning, data mining, and machine learning. It has been widely used in practice, as rules are directly readable by analysts and domain experts, revealing actionable insights for real-world data-driven tasks. However, rule learning for numeric targets is under-studied, especially the task of inducing a rule-based global model that describes the whole dataset. Classic rule learning algorithms often follow the separate-and-conquer strategy: learn a single rule from the dataset, remove the covered instances, and repeat the process until some stopping criterion is met. As a result, the separate-and-conquer strategy leads to an ordered list of rules, also referred to as a decision list or rule list. While it has proven effective for classification tasks [1], the extension from categorical to numeric targets is non-trivial.
KDID 2022: 20th anniversary of international workshop on knowledge discovery in inductive databases Workshop.
* Corresponding author.
Email: l.yang@liacs.leidenuniv.nl (L. Yang); timopdam@hotmail.com (T. Opdam); m.van.leeuwen@liacs.leidenuniv.nl (M. van Leeuwen)
ORCID: 0000-0003-1936-2784 (L. Yang); 0000-0002-9459-2646 (T. Opdam); 0000-0002-0510-3549 (M. van Leeuwen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

The criteria used in rule lists for categorical targets are often based on probabilistic estimates (e.g., precision, false positive rate), including 1) heuristics for evaluating individual rules as local patterns, 2) global evaluation metrics for a rule list model, and 3) statistics used for statistical significance testing in order to prevent overfitting [2]. However, for numeric targets, probabilistic density estimation can be computationally expensive, especially when performed a large number of times, i.e., each time a rule is assessed. Specifically, standard approaches like kernel density estimation (KDE) suffer from a high computational cost (mainly due to the cross-validation needed for choosing hyper-parameters), which makes them unsuitable for learning rule lists. Parametric models, in contrast, are fast but may lead to mis-specification. One existing method, named SSD++ [3], assumes a Gaussian distribution for the numeric targets of instances covered by a given individual rule, but this is theoretically sub-optimal and leads to self-contradiction. This is because any individual rule can be regarded as the "union" of its refinements: for instance, the set of data points satisfying some condition $(A_1)$ is the union of the data points satisfying $(A_1 \wedge A_2)$ and those satisfying $(A_1 \wedge \neg A_2)$.
However, the target variable within the previous three rules cannot be assumed to be Gaussian at the same time, as the first is by definition a mixture of the other two. To strike a balance between computational cost and model expressiveness, we adopt adaptive histograms for density estimation when learning rules for numeric targets. Our contributions are as follows: 1) we formalize the problem of learning rule lists as a model selection task; 2) by showing that the model selection criterion of a rule list can be decomposed into the sum of local criteria for the individual rules, we demonstrate that our model selection criterion is compatible with the separate-and-conquer strategy; 3) we empirically show that the rule lists obtained by our method outperform those of SSD++ in terms of prediction accuracy, measured by the mean squared error.

Related Work. Far fewer rule learning methods exist for numeric targets than for categorical targets. Besides SSD++, PRIM [4] sequentially searches for rules that deviate from the global mean, which leads to a non-probabilistic rule list. Further, DSSD [5] learns a diverse set of non-probabilistic rules for numeric targets. Next, instead of learning rules directly from numeric targets, an alternative approach is to reduce regression rule learning to classification rule learning by dynamically discretizing the targets [6]. Also, Meeng and Knobbe thoroughly discuss dealing with numeric targets by discretization in rule learning for subgroup discovery [7]. Last, RuleFit [8] can generate rules for numeric targets to construct an ensemble model, which is less interpretable than a single rule list. Note that some of these methods were developed for the subgroup discovery task rather than for predictive rule learning.
Specifically, our competitor SSD++ was originally proposed for subgroup discovery; however, as pointed out in its original paper, it can be used for regression by formalizing the rule list as a probabilistic model in a slightly different way.

2. Probabilistic Rule Lists for Numeric Targets

Adaptive histogram for numeric targets. Consider a one-dimensional random variable $Y$ taking values in $\mathbb{R}$. A histogram model for $Y$ is a set of cut points (including boundaries), denoted as $\vec{c} = (c_1, \ldots, c_{K+1})$, where $K$ represents the number of bins. For any value $y$ in the support of $Y$, the associated probability distribution, denoted by $P_h(\cdot)$, has a density function defined and denoted as
$$p_h(Y = y) = \sum_{j=1}^{K} \mathbf{1}_{[c_j, c_{j+1})}(y)\, \beta_j,$$
where $\mathbf{1}(\cdot)$ is the indicator function and $\vec{\beta} = (\beta_1, \ldots, \beta_K)$ is the parameter vector to be estimated from data; i.e., $\vec{\beta}$ determines the probability of $Y$ taking values in each of the bins. In practice, $\vec{\beta}$ can be estimated by the maximum likelihood estimator
$$\hat{\beta}_j = \frac{|\{y \in S : c_j \leq y < c_{j+1}\}|}{|S|\,(c_{j+1} - c_j)},$$
where $|\cdot|$ denotes cardinality.

Histogram-based rule. Consider the $d$-dimensional feature variables $X = (X_1, \ldots, X_d)$, where each $X_i$ is a single dimension of $X$, and a target variable $Y \in \mathbb{R}$. A histogram-based rule is written as $(X_1 \in R_1 \wedge X_2 \in R_2 \wedge \ldots) \rightarrow P_{h,S}(Y)$, where each $X_i \in R_i$ is called a literal of the rule's condition. Specifically, each $R_i$ is a closed interval or a set of categorical levels. A rule in this form describes a subset $S$ of the full sample space of $X$, such that for any $x \in S$, the conditional distribution $P(Y \mid X = x)$ is approximated by the distribution of $Y$ conditioned on the event $\{X \in S\}$, i.e., $P(Y \mid X \in S)$, which is modelled by an adaptive histogram model associated with $S$ and hence denoted as $P_{h,S}(Y)$. Thus, a histogram-based rule is a local probabilistic model for all instances that satisfy the rule.
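The maximum likelihood estimator above can be sketched in a few lines; the function name and example values below are ours, purely for illustration.

```python
import numpy as np

def fit_histogram_density(y, cuts):
    """Maximum-likelihood estimate of the bin densities for fixed cut points:
    beta_hat[j] = |{y_i : c_j <= y_i < c_{j+1}}| / (|S| * (c_{j+1} - c_j))."""
    y = np.asarray(y, dtype=float)
    cuts = np.asarray(cuts, dtype=float)
    counts, _ = np.histogram(y, bins=cuts)   # instances falling in each bin
    widths = np.diff(cuts)                   # c_{j+1} - c_j
    return counts / (len(y) * widths)

# Example: 5 target values, 2 bins of width 0.5
beta = fit_histogram_density([0.1, 0.2, 0.25, 0.8, 0.9], cuts=[0.0, 0.5, 1.0])
# beta = [1.2, 0.8]; the density integrates to one: sum(beta * widths) == 1
```

Dividing by the bin width makes $\hat{\beta}$ a proper density rather than a bin probability, so histograms with different cut points are directly comparable by likelihood.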
To simplify notation, when clear from the context we use $S$ to refer both to the rule itself and to the subset of all instances covered by the rule.

Histogram-based rule list. Precisely, a rule list is an ordered list of rules connected by "If...ElseIf...Else..." statements. For instance, a rule list containing only two rules $S_1$ and $S_2$ can be written as "IF $x \in S_1$, THEN $P_{h,S_1}$; ELSE IF $x \in S_2$, THEN $P_{h,S_2}$; ELSE $P_{h,S_0}$". Note that we use $P_{h,S_0}$ to denote the histogram associated with the instances not covered by any rule, and refer to $S_0$ as the "else rule"¹. Therefore, a rule list connects multiple rules and hence becomes a global model for the whole dataset: given any instance $(x, y)$, we can calculate the probability (density) by first going through the rule list until $x$ satisfies the condition of a certain rule, and then predicting the density of $y$ conditioned on $x$ using the histogram model associated with that rule.

3. Learning Rule Lists as a Model Selection Task

We formalize the task of learning a rule list as a probabilistic model selection task, for which we adopt the minimum description length (MDL) principle [9]. MDL-based model selection has theoretical roots in information theory and can be regarded as an extension of Bayesian model selection. Further, it has a long history of being successfully applied in rule learning, including in classic methods such as C4.5 and RIPPER [10, 11]. In practice, the MDL-based model selection criterion can be viewed as a form of penalized maximum likelihood, as we elaborate next. We start with the criterion for optimal histograms of individual rules and then discuss the global criterion for rule lists.

3.1. Preliminaries: learning adaptive histograms for individual rules

Consider an individual rule $S$ (with a histogram with fixed cut points) and all the $m$ instances satisfying the condition of $S$, denoted as $(x^m, y^m)$.
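The prediction procedure of a rule list can be sketched as a simple traversal; the representation of rules as (condition, histogram) pairs and all names below are illustrative, not the paper's implementation.

```python
import numpy as np

def histogram_density(cuts, beta, y):
    """Density of a single histogram at value y (0 outside its support)."""
    j = int(np.searchsorted(cuts, y, side="right")) - 1
    return float(beta[j]) if 0 <= j < len(beta) else 0.0

def predict_density(rule_list, else_hist, x, y):
    """Go through the ordered rules; the first rule whose condition x
    satisfies supplies the histogram used to predict p(y | x).
    If no rule fires, fall through to the else rule's histogram."""
    for condition, (cuts, beta) in rule_list:
        if condition(x):
            return histogram_density(cuts, beta, y)
    cuts, beta = else_hist
    return histogram_density(cuts, beta, y)

# "IF x[0] <= 2 THEN P_{h,S1}; ELSE IF x[1] > 0 THEN P_{h,S2}; ELSE P_{h,S0}"
rule_list = [
    (lambda x: x[0] <= 2, ([0.0, 1.0], [1.0])),
    (lambda x: x[1] > 0,  ([0.0, 0.5, 1.0], [1.6, 0.4])),
]
else_hist = ([0.0, 1.0], [1.0])
density = predict_density(rule_list, else_hist, x=[3, 1], y=0.25)  # second rule fires
```

The ordering matters: an instance is scored only by the first rule it satisfies, which is exactly what makes the list's likelihood decompose over rules.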
Formally, the optimal adaptive histogram $h^*$ among all possible histograms $\mathcal{H}$ is defined as
$$h^* = \arg\min_{h \in \mathcal{H}} \left( -\log \hat{p}_{h,S}(Y^m = y^m) + \mathcal{R}(m, K) \right);$$
note that 1) $\hat{p}_{h,S}$ is the estimated probability density function with the maximum likelihood estimator for $\vec{\beta}$; 2) $\hat{p}_{h,S}(y)$ is extended to $\hat{p}_{h,S}(y^m)$ under the i.i.d. assumption; and 3) $\mathcal{R}(m, K)$ is the so-called regret, the "penalty term" of the MDL model selection criterion, which is a function of the sample size $m$ and the number of bins $K$ of the histogram [12]. By definition, $\mathcal{R}(m, K) = \log \int_{z^m} \hat{p}_{h,S}(Y^m = z^m)\, \mathrm{d}z^m$ [9]. The seminal work in this research line [12] developed a dynamic programming optimization algorithm with quadratic time complexity, despite the expensive numeric integral in $\mathcal{R}$.

¹ A related and widely used notion is the "default rule". The difference lies in whether the parameters associated with $S_0$ are estimated from all instances (default rule) or from the instances not covered by any rule in the rule list (else rule).

3.2. Model selection for learning histogram-based rule lists

The task of learning a rule list needs to simultaneously select the cut points on the features (for constructing the rules' conditions) and on the target (for constructing the adaptive histograms).
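The per-rule criterion is thus a penalized negative log-likelihood. A caveat before the sketch below: computing the exact regret $\mathcal{R}(m, K)$ requires the dynamic programming of [12]; our code substitutes a crude asymptotic stand-in, $(K-1)/2 \cdot \log m$, purely for illustration, and both function names are ours.

```python
import numpy as np

def penalized_score(y, cuts):
    """-log-likelihood of the ML histogram plus a complexity penalty.
    NOTE: the paper uses the exact NML regret R(m, K) computed via
    dynamic programming [12]; (K - 1)/2 * log(m) is only a stand-in."""
    y = np.asarray(y, dtype=float)
    m, K = len(y), len(cuts) - 1
    counts, _ = np.histogram(y, bins=cuts)
    beta = counts / (m * np.diff(cuts))   # ML bin densities
    nz = counts > 0                       # empty bins contribute zero likelihood terms
    neg_loglik = -np.sum(counts[nz] * np.log(beta[nz]))
    return neg_loglik + 0.5 * (K - 1) * np.log(m)

def best_cuts(y, candidate_cut_sets):
    """Model selection over a handful of candidate cut-point sets; the paper
    instead optimizes over all cut sets with dynamic programming."""
    return min(candidate_cut_sets, key=lambda cuts: penalized_score(y, cuts))

# Strongly skewed targets justify the extra bin despite the penalty
y = [0.05] * 18 + [0.95] * 2
cuts = best_cuts(y, [[0.0, 1.0], [0.0, 0.5, 1.0]])  # -> [0.0, 0.5, 1.0]
```

The trade-off is visible here: the finer histogram pays a larger penalty but is selected because its likelihood gain on the skewed data outweighs it.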
Formally, denote the (training) dataset as $D = (x^n, y^n)$; then the best histogram-based rule list $M^*$ among all possible rule lists $\mathcal{M}$ is defined as
$$M^* = \arg\min_{M \in \mathcal{M}} L(D, M) = \arg\min_{M \in \mathcal{M}} \left( -\log \hat{P}(y^n \mid x^n) + \mathcal{R}(M) + L(M) \right), \quad (1)$$
where 1) $\hat{P}(y^n \mid x^n)$ is the likelihood of the data, based on the maximum likelihood estimators of the parameters $\vec{\beta}$ of all histograms; 2) $\mathcal{R}(M)$ is the MDL regret term for the rule list as a global model, defined as $\mathcal{R}(M) = \log \int_{z^n} \hat{P}(Y^n = z^n \mid x^n)\, \mathrm{d}z^n$, which can be proved to equal the sum of the regret terms $\mathcal{R}$ of all individual rules [13]; and 3) $L(M)$ is the code length, in bits, needed to encode the rule list as a model [9]. Note that for a fixed $M \in \mathcal{M}$, not only the conditions of the rules but also the cut points of the associated histograms are fixed. To calculate $L(M)$, we sum up the code lengths needed to encode the following: the number of rules in $M$, the number of literals of each individual rule in $M$, and the exact variable and value used in the condition of each literal [9]. The formula for calculating this length is the same as in SSD++ [3], and hence we omit the mathematical details here.

3.3. Algorithm: separate and conquer

As exhaustive search in rule learning is known to be computationally prohibitive [1], we resort to the separate-and-conquer strategy: we iteratively search for the next rule, add it to the rule list, and remove the covered instances, until adding a new rule brings no further decrease in $L(D, M)$. Since 1) the regret term $\mathcal{R}$ of the rule list can be written as the sum of the regret terms of the individual rules, 2) $L(M)$ decomposes over individual rules by definition, and 3) the log-likelihood decomposes into the likelihoods of the individual rules, the global model selection criterion $L(D, M)$ can be decomposed into the sum of local criteria of the individual rules.
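This decomposition is what makes a greedy search sound. A sketch of the separate-and-conquer loop follows, using the normalized-gain heuristic of SSD++ [3] (criterion decrease per newly covered instance) as the selection score; `model_score` and the candidate-rule representation are illustrative stand-ins for the full MDL criterion.

```python
def learn_rule_list(data, candidates, model_score):
    """Separate and conquer: repeatedly pick the candidate rule with the
    largest positive normalized gain, append it, and remove the instances
    it covers.  model_score(instances) stands in for the part of the MDL
    criterion L(D, M) attributed to the not-yet-covered instances; each
    candidate is (predicate, rule_score) with rule_score its local criterion."""
    rule_list, remaining = [], list(data)
    while True:
        best, best_gain = None, 0.0
        for predicate, rule_score in candidates:
            covered = [d for d in remaining if predicate(d)]
            if not covered:
                continue
            rest = [d for d in remaining if not predicate(d)]
            gain = (model_score(remaining)
                    - rule_score(covered) - model_score(rest)) / len(covered)
            if gain > best_gain:            # steepest descent per covered instance
                best, best_gain = (predicate, rule_score), gain
        if best is None:                    # no rule decreases the criterion
            return rule_list
        rule_list.append(best)
        remaining = [d for d in remaining if not best[0](d)]

# Toy run: two candidate rules partition the data; both are eventually added
candidates = [
    (lambda d: d < 5,  lambda cov: 0.2 * len(cov)),
    (lambda d: d >= 5, lambda cov: 0.9 * len(cov)),
]
rules = learn_rule_list(range(10), candidates, model_score=lambda ds: float(len(ds)))
# len(rules) == 2
```

Note that dividing by the cover size prefers a high-gain small rule only if its per-instance improvement beats larger rules, which is the intuition behind the "steepest direction" discussed next.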
Hence, the separate-and-conquer strategy can be viewed as a greedy manner of optimizing the global criterion. To search for the next rule given the current (incomplete) rule list, we grow the rule by iteratively adding literals to an empty rule. Intuitively, we should not search for the next rule by minimizing $L(D, M)$ directly, as our goal is not to minimize $L(D, M)$ by adding only one more rule. Instead, we should search for the next rule such that, by adding it to the rule list, we take a step towards our final rule list in the "steepest direction". Informally, the "steepest direction" can be thought of as the direction that reduces $L(D, M)$ most per extra covered instance. That is, we use the heuristic named "normalized gain" [3]. Similar to SSD++, we use beam search when growing individual rules to avoid (some) local optima. The rule growing procedure continues as long as the normalized gain is positive, i.e., as long as $L(D, M)$ keeps decreasing. Further, the rule list learning stops when adding a new rule no longer decreases the global criterion.

4. Empirical Performance

We benchmark our method against SSD++ on widely used UCI datasets for regression tasks, to investigate the predictive performance improvement obtained by adopting the non-parametric histogram models. We also include the results of CART [14] as a baseline interpretable model, for which we use the scikit-learn implementation with post-pruning. We report the mean squared error (MSE) and the total number of literals in the rule list in Table 1; the results are obtained over 10 random train/test splits, in which 80% of the instances are used for training. Our histogram-based approach outperforms SSD++ on most datasets. Thus, the Gaussian assumption does indeed lead to sub-optimal results for rule learning.
Further, we observe that CART outperforms our method on about half of the datasets, but CART in general produces more complicated models than our method in terms of the number of literals in each rule (the path from root to leaf). Next, we observe that SSD++ produces much simpler models than our method. This can be explained as follows: SSD++ and our method both use the MDL principle to control model complexity, and hence a rule can grow only when the "gain" in likelihood exceeds the "cost" in the regret term and model complexity. Since adaptive histograms are much more expressive than Gaussian parametric models, it is "easier" for a rule to obtain a substantial "gain" in likelihood, which drives rules to grow longer and hence cover smaller subsets. As a result, more rules are needed to cover the whole dataset, and hence our method produces longer rules and a larger number of rules.

5. Conclusion and Discussion

We developed a rule list method for numeric targets based on the adaptive histogram model, and we demonstrated that its predictive performance is superior to that of the existing method based on parametric Gaussian models. The limitation of our method is scalability, due to the quadratic complexity of searching for the optimal adaptive histograms: for datasets with tens of features and thousands of samples, the training process can take a few minutes. Future work may focus on developing methods for calculating the "score" of each rule list approximately but efficiently.

Acknowledgments

This work is part of the research programme 'Human-Guided Data Science by Interactive Model Selection' with project number 612.001.804, which is (partly) financed by the Dutch Research Council (NWO).

Table 1
Predictive performance measured by MSE, and model complexity measured by the total number of literals in each model. The results are obtained over 10 random train/test splits for each dataset.

                             MSE                             # total literals
data          Hist. Rule List      SSD++       CART   Hist. Rule List   SSD++      CART
cholesterol           3287.72    3286.12    4081.33             12.17    1.92     26.85
autoMPG8                12.26      12.35      13.14             29.64   14.16    163.23
dee                      0.64       0.89       0.26             24.85    4.62     63.62
ele-1               458790.85  481793.78  536584.54             26.25   13.36     51.99
forestFires           6520.97    7147.07    4447.27             85.19   61.31     15.29
concrete                72.19      74.86      50.66            126.88   53.98   2215.52
abalone                  6.09       6.31       5.69            121.29   62.86    131.84

References

[1] J. Fürnkranz, D. Gamberger, N. Lavrač, Foundations of Rule Learning, Springer Science & Business Media, 2012.
[2] J. Fürnkranz, P. A. Flach, ROC 'n' rule learning—towards a better understanding of covering algorithms, Machine Learning 58 (2005) 39–77.
[3] H. M. Proença et al., Discovering outstanding subgroup lists for numeric targets using MDL, in: ECML PKDD 2020, Springer, 2020.
[4] J. H. Friedman, N. I. Fisher, Bump hunting in high-dimensional data, Statistics and Computing 9 (1999) 123–143.
[5] M. van Leeuwen, A. Knobbe, Diverse subgroup set discovery, Data Mining and Knowledge Discovery 25 (2012) 208–242.
[6] F. Janssen, J. Fürnkranz, Heuristic rule-based regression via dynamic reduction to classification, in: Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
[7] M. Meeng, A. Knobbe, For real: a thorough look at numeric attributes in subgroup discovery, Data Mining and Knowledge Discovery 35 (2021) 158–212.
[8] J. H. Friedman, B. E. Popescu, Predictive learning via rule ensembles, The Annals of Applied Statistics (2008) 916–954.
[9] P. D. Grünwald, The Minimum Description Length Principle, MIT Press, 2007.
[10] J. R. Quinlan, C4.5: Programs for Machine Learning, Elsevier, 2014.
[11] W. W. Cohen, Fast effective rule induction, in: Machine Learning Proceedings 1995, Elsevier, 1995, pp. 115–123.
[12] P. Kontkanen, P. Myllymäki, MDL histogram density estimation, in: Artificial Intelligence and Statistics, PMLR, 2007, pp. 219–226.
[13] L. Yang, M. van Leeuwen, Truly unordered probabilistic rule sets for multi-class classification, arXiv preprint arXiv:2206.08804 (2022).
[14] L. Breiman, J. Friedman, C. J. Stone, R. A. Olshen, Classification and Regression Trees, CRC Press, 1984.