Histogram-based Probabilistic Rule Lists for Numeric Targets

Lincen Yang*, Tim Opdam and Matthijs van Leeuwen
LIACS, Leiden University, The Netherlands

Abstract
While most rule learning methods focus on categorical targets for tasks such as classification and subgroup discovery, rule learning for numeric targets is under-studied, especially using probabilistic rules: the only existing probabilistic rule list method for numeric targets, named SSD++, is based on Gaussian parametric models. To extend this method, we adopt an adaptive histogram model to estimate the probability distribution of the target variable. We formalize rule list learning as a model selection problem, which we tackle with the minimum description length principle. We demonstrate that our method produces rule lists with better predictive performance than SSD++.

Keywords
Rule list, Regression rule, MDL model selection

1. Introduction and Related Work

Learning rules from data is a long-studied problem in inductive reasoning, data mining, and machine learning. It has been widely used in practice, as rules are directly readable by analysts and domain experts, revealing actionable insights for real-world data-driven tasks. However, rule learning for numeric targets is under-studied, especially the task of inducing a rule-based global model that describes the whole dataset. Classic rule learning algorithms often follow the separate-and-conquer strategy: learn a single rule from the dataset, remove the covered instances, and repeat the process until some stopping criterion is met. As a result, the separate-and-conquer strategy leads to an ordered list of rules, also referred to as a decision list or rule list. While it has proven effective for classification tasks [1], the extension from categorical to numeric targets is non-trivial.
KDID 2022: 20th anniversary of international workshop on knowledge discovery in inductive databases Workshop.
* Corresponding author.
Email: l.yang@liacs.leidenuniv.nl (L. Yang); timopdam@hotmail.com (T. Opdam); m.van.leeuwen@liacs.leidenuniv.nl (M. van Leeuwen)
ORCID: 0000-0003-1936-2784 (L. Yang); 0000-0002-9459-2646 (T. Opdam); 0000-0002-0510-3549 (M. van Leeuwen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

The criteria used in rule lists for categorical targets are often based on probabilistic estimates (e.g., precision, false positive rate), including 1) heuristics for evaluating individual rules as local patterns, 2) global evaluation metrics for a rule list model, and 3) statistics used for statistical significance testing in order to prevent overfitting [2]. However, for numeric targets, probabilistic density estimation can be computationally expensive, especially when performed a large number of times, i.e., each time a rule is assessed. Specifically, standard approaches like kernel density estimation (KDE) suffer from a high computational cost (mainly due to the cross-validation needed for choosing hyper-parameters), which makes them unsuitable for learning rule lists. Parametric models, in contrast, are fast but may lead to mis-specification. One existing method, named SSD++ [3], assumes a Gaussian distribution for the numeric targets of instances covered by a given individual rule, but this is theoretically sub-optimal and leads to self-contradiction. This is because any individual rule can be regarded as the "union" of its refinements: for instance, the set of data points satisfying some condition $(A_1)$ is the union of the data points satisfying $(A_1 \wedge A_2)$ and those satisfying $(A_1 \wedge \neg A_2)$.
However, the target variable within the previous three rules cannot be assumed to be Gaussian at the same time, as the first is by definition a mixture of the other two. To strike a balance between computational cost and model expressiveness, we adopt adaptive histograms for density estimation when learning rules for numeric targets. Our contributions are as follows: 1) we formalize the problem of learning rule lists as a model selection task; 2) by showing that the model selection criterion of a rule list can be decomposed into the sum of local criteria for the individual rules, we demonstrate that our model selection criterion is compatible with the separate-and-conquer strategy; 3) we empirically show that the rule lists obtained by our method outperform those of SSD++ in terms of prediction accuracy, measured by the mean squared error.

Related Work. Far fewer rule learning methods exist for numeric targets than for categorical targets. Besides SSD++, PRIM [4] sequentially searches for rules that deviate from the global mean, which leads to a non-probabilistic rule list. Further, DSSD [5] learns a diverse set of non-probabilistic rules for numeric targets. Next, instead of learning rules directly from numeric targets, an alternative approach is to reduce regression rule learning to classification rule learning by dynamically discretizing the targets [6]. Also, Meeng and Knobbe thoroughly discuss dealing with numeric targets by discretization in rule learning for subgroup discovery [7]. Last, RuleFit [8] can generate rules for numeric targets to construct an ensemble model, which is less interpretable than a single rule list. Note that some of these methods were developed for the subgroup discovery task rather than for predictive rule learning.
Specifically, our competitor SSD++ was originally proposed for subgroup discovery; however, as pointed out in its original paper, it can be used for regression by formalizing the rule list as a probabilistic model in a slightly different way.

2. Probabilistic Rule Lists for Numeric Targets

Adaptive histogram for numeric targets. Consider a one-dimensional random variable $Y$ taking values in $\mathbb{R}$. A histogram model for $Y$ is a set of cut points (including boundaries), denoted as $\vec{c} = (c_1, \ldots, c_{K+1})$, where $K$ represents the number of bins. For any value $y$ in the support of $Y$, the associated probability distribution, denoted by $P_h(\cdot)$, has a density function defined and denoted as
$$p_h(Y = y) = \sum_{j=1}^{K} \mathbf{1}_{[c_j, c_{j+1})}(y)\, \beta_j,$$
where $\mathbf{1}(\cdot)$ is the indicator function and $\vec{\beta} = (\beta_1, \ldots, \beta_K)$ is the parameter vector to be estimated from data; i.e., $\vec{\beta}$ determines the probability of $Y$ taking values in each of the bins. In practice, $\vec{\beta}$ can be estimated by the maximum likelihood estimator
$$\hat{\beta}_j = \frac{|\{y \in S : c_j \leq y < c_{j+1}\}|}{|S|\,(c_{j+1} - c_j)},$$
where $|\cdot|$ denotes cardinality.

Histogram-based rule. Consider the $d$-dimensional feature variables $X = (X_1, \ldots, X_d)$, where each $X_i$ is a single dimension of $X$, and a target variable $Y \in \mathbb{R}$. A histogram-based rule is written as $(X_1 \in R_1 \wedge X_2 \in R_2 \wedge \ldots) \rightarrow P_{h,S}(Y)$, where each $X_i \in R_i$ is called a literal of the rule's condition. Specifically, each $R_i$ is a closed interval or a set of categorical levels. A rule in this form describes a subset $S$ of the full sample space of $X$, such that for any $x \in S$, the conditional distribution $P(Y \mid X = x)$ is approximated by the distribution of $Y$ conditioned on the event $\{X \in S\}$, i.e., $P(Y \mid X \in S)$, which is modelled by an adaptive histogram model associated with $S$ and hence denoted as $P_{h,S}(Y)$. Thus, a histogram-based rule is a local probabilistic model for all instances that satisfy the rule.
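The maximum likelihood estimator above can be sketched in a few lines; the function name and example values below are ours, purely for illustration.

```python
import numpy as np

def fit_histogram_density(y, cuts):
    """Maximum-likelihood estimate of the bin densities for fixed cut points:
    beta_hat[j] = |{y_i : c_j <= y_i < c_{j+1}}| / (|S| * (c_{j+1} - c_j))."""
    y = np.asarray(y, dtype=float)
    cuts = np.asarray(cuts, dtype=float)
    counts, _ = np.histogram(y, bins=cuts)   # instances falling in each bin
    widths = np.diff(cuts)                   # c_{j+1} - c_j
    return counts / (len(y) * widths)

# Example: 5 target values, 2 bins of width 0.5
beta = fit_histogram_density([0.1, 0.2, 0.25, 0.8, 0.9], cuts=[0.0, 0.5, 1.0])
# beta = [1.2, 0.8]; the density integrates to one: sum(beta * widths) == 1
```

Dividing by the bin width makes $\hat{\beta}$ a proper density rather than a bin probability, so histograms with different cut points are directly comparable by likelihood.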
To simplify notation, when clear from the context we use $S$ to refer both to the rule itself and to the subset of all instances covered by the rule.

Histogram-based rule list. Precisely, a rule list is an ordered list of rules connected by "If...ElseIf...Else..." statements. For instance, a rule list containing only two rules $S_1$ and $S_2$ can be written as "IF $x \in S_1$, THEN $P_{h,S_1}$; ELSE IF $x \in S_2$, THEN $P_{h,S_2}$; ELSE $P_{h,S_0}$". Note that we use $P_{h,S_0}$ to denote the histogram associated with the instances not covered by any rule, and refer to $S_0$ as the "else rule"¹. Therefore, a rule list connects multiple rules and hence becomes a global model for the whole dataset: given any instance $(x, y)$, we can calculate the probability (density) by first going through the rule list until $x$ satisfies the condition of a certain rule, and then predicting the density of $y$ conditioned on $x$ using the histogram model associated with that rule.

3. Learning Rule Lists as a Model Selection Task

We formalize the task of learning a rule list as a probabilistic model selection task, for which we adopt the minimum description length (MDL) principle [9]. MDL-based model selection has theoretical roots in information theory and can be regarded as an extension of Bayesian model selection. Further, it has a long history of being successfully applied in rule learning, including in classic methods such as C4.5 and RIPPER [10, 11]. In practice, the MDL-based model selection criterion can be viewed as a form of penalized maximum likelihood, as we elaborate next. We start with the criterion for optimal histograms of individual rules and then discuss the global criterion for rule lists.

3.1. Preliminaries: learning adaptive histograms for individual rules

Consider an individual rule $S$ (with a histogram with fixed cut points) and all the $m$ instances satisfying the condition of $S$, denoted as $(x^m, y^m)$.
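The prediction procedure of a rule list can be sketched as a simple traversal; the representation of rules as (condition, histogram) pairs and all names below are illustrative, not the paper's implementation.

```python
import numpy as np

def histogram_density(cuts, beta, y):
    """Density of a single histogram at value y (0 outside its support)."""
    j = int(np.searchsorted(cuts, y, side="right")) - 1
    return float(beta[j]) if 0 <= j < len(beta) else 0.0

def predict_density(rule_list, else_hist, x, y):
    """Go through the ordered rules; the first rule whose condition x
    satisfies supplies the histogram used to predict p(y | x).
    If no rule fires, fall through to the else rule's histogram."""
    for condition, (cuts, beta) in rule_list:
        if condition(x):
            return histogram_density(cuts, beta, y)
    cuts, beta = else_hist
    return histogram_density(cuts, beta, y)

# "IF x[0] <= 2 THEN P_{h,S1}; ELSE IF x[1] > 0 THEN P_{h,S2}; ELSE P_{h,S0}"
rule_list = [
    (lambda x: x[0] <= 2, ([0.0, 1.0], [1.0])),
    (lambda x: x[1] > 0,  ([0.0, 0.5, 1.0], [1.6, 0.4])),
]
else_hist = ([0.0, 1.0], [1.0])
density = predict_density(rule_list, else_hist, x=[3, 1], y=0.25)  # second rule fires
```

The ordering matters: an instance is scored only by the first rule it satisfies, which is exactly what makes the list's likelihood decompose over rules.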
Formally, the optimal adaptive histogram $h^*$ among all possible histograms $\mathcal{H}$ is defined as
$$h^* = \arg\min_{h \in \mathcal{H}} \left( -\log \hat{p}_{h,S}(Y^m = y^m) + \mathcal{R}(m, K) \right);$$
note that 1) $\hat{p}_{h,S}$ is the estimated probability density function with the maximum likelihood estimator for $\vec{\beta}$; 2) $\hat{p}_{h,S}(y)$ is extended to $\hat{p}_{h,S}(y^m)$ under the i.i.d. assumption; and 3) $\mathcal{R}(m, K)$ is the so-called regret, the "penalty term" of the MDL model selection criterion, which is a function of the sample size $m$ and the number of bins $K$ of the histogram [12]. By definition, $\mathcal{R}(m, K) = \log \int_{z^m} \hat{p}_{h,S}(Y^m = z^m)\, \mathrm{d}z^m$ [9]. The seminal work in this research line [12] developed a dynamic programming optimization algorithm with quadratic time complexity, despite the expensive numeric integral in $\mathcal{R}$.

¹ A related and widely used notion is the "default rule". The difference lies in whether the parameters associated with $S_0$ are estimated from all instances (default rule) or from the instances not covered by any rule in the rule list (else rule).

3.2. Model selection for learning histogram-based rule lists

The task of learning a rule list needs to simultaneously select the cut points on the features (for constructing the rules' conditions) and on the target (for constructing the adaptive histograms).
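The per-rule criterion is thus a penalized negative log-likelihood. A caveat before the sketch below: computing the exact regret $\mathcal{R}(m, K)$ requires the dynamic programming of [12]; our code substitutes a crude asymptotic stand-in, $(K-1)/2 \cdot \log m$, purely for illustration, and both function names are ours.

```python
import numpy as np

def penalized_score(y, cuts):
    """-log-likelihood of the ML histogram plus a complexity penalty.
    NOTE: the paper uses the exact NML regret R(m, K) computed via
    dynamic programming [12]; (K - 1)/2 * log(m) is only a stand-in."""
    y = np.asarray(y, dtype=float)
    m, K = len(y), len(cuts) - 1
    counts, _ = np.histogram(y, bins=cuts)
    beta = counts / (m * np.diff(cuts))   # ML bin densities
    nz = counts > 0                       # empty bins contribute zero likelihood terms
    neg_loglik = -np.sum(counts[nz] * np.log(beta[nz]))
    return neg_loglik + 0.5 * (K - 1) * np.log(m)

def best_cuts(y, candidate_cut_sets):
    """Model selection over a handful of candidate cut-point sets; the paper
    instead optimizes over all cut sets with dynamic programming."""
    return min(candidate_cut_sets, key=lambda cuts: penalized_score(y, cuts))

# Strongly skewed targets justify the extra bin despite the penalty
y = [0.05] * 18 + [0.95] * 2
cuts = best_cuts(y, [[0.0, 1.0], [0.0, 0.5, 1.0]])  # -> [0.0, 0.5, 1.0]
```

The trade-off is visible here: the finer histogram pays a larger penalty but is selected because its likelihood gain on the skewed data outweighs it.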
Formally, denote the (training) dataset as $D = (x^n, y^n)$; then the best histogram-based rule list $M^*$ among all possible rule lists $\mathcal{M}$ is defined as
$$M^* = \arg\min_{M \in \mathcal{M}} L(D, M) = \arg\min_{M \in \mathcal{M}} \left( -\log \hat{P}(y^n \mid x^n) + \mathcal{R}(M) + L(M) \right), \quad (1)$$
where 1) $\hat{P}(y^n \mid x^n)$ is the likelihood of the data, based on the maximum likelihood estimators of the parameters $\vec{\beta}$ of all histograms; 2) $\mathcal{R}(M)$ is the MDL regret term for the rule list as a global model, defined as $\mathcal{R}(M) = \log \int_{z^n} \hat{P}(Y^n = z^n \mid x^n)\, \mathrm{d}z^n$, which can be proved to equal the sum of the regret terms $\mathcal{R}$ of all individual rules [13]; and 3) $L(M)$ is the code length, in bits, needed to encode the rule list as a model [9]. Note that for a fixed $M \in \mathcal{M}$, not only the conditions of the rules but also the cut points of the associated histograms are fixed. To calculate $L(M)$, we sum up the code lengths needed to encode the following: the number of rules in $M$, the number of literals of each individual rule in $M$, and the exact variable and value used in the condition of each literal [9]. The formula for calculating this length is the same as in SSD++ [3], and hence we omit the mathematical details here.

3.3. Algorithm: separate and conquer

As exhaustive search in rule learning is known to be computationally prohibitive [1], we resort to the separate-and-conquer strategy: we iteratively search for the next rule, add it to the rule list, and remove the covered instances, until adding a new rule brings no further decrease in $L(D, M)$. Since 1) the regret term $\mathcal{R}$ of the rule list can be written as the sum of the regret terms of the individual rules, 2) $L(M)$ decomposes over individual rules by definition, and 3) the log-likelihood decomposes into the likelihoods of the individual rules, the global model selection criterion $L(D, M)$ can be decomposed into the sum of local criteria of the individual rules.
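This decomposition is what makes a greedy search sound. A sketch of the separate-and-conquer loop follows, using the normalized-gain heuristic of SSD++ [3] (criterion decrease per newly covered instance) as the selection score; `model_score` and the candidate-rule representation are illustrative stand-ins for the full MDL criterion.

```python
def learn_rule_list(data, candidates, model_score):
    """Separate and conquer: repeatedly pick the candidate rule with the
    largest positive normalized gain, append it, and remove the instances
    it covers.  model_score(instances) stands in for the part of the MDL
    criterion L(D, M) attributed to the not-yet-covered instances; each
    candidate is (predicate, rule_score) with rule_score its local criterion."""
    rule_list, remaining = [], list(data)
    while True:
        best, best_gain = None, 0.0
        for predicate, rule_score in candidates:
            covered = [d for d in remaining if predicate(d)]
            if not covered:
                continue
            rest = [d for d in remaining if not predicate(d)]
            gain = (model_score(remaining)
                    - rule_score(covered) - model_score(rest)) / len(covered)
            if gain > best_gain:            # steepest descent per covered instance
                best, best_gain = (predicate, rule_score), gain
        if best is None:                    # no rule decreases the criterion
            return rule_list
        rule_list.append(best)
        remaining = [d for d in remaining if not best[0](d)]

# Toy run: two candidate rules partition the data; both are eventually added
candidates = [
    (lambda d: d < 5,  lambda cov: 0.2 * len(cov)),
    (lambda d: d >= 5, lambda cov: 0.9 * len(cov)),
]
rules = learn_rule_list(range(10), candidates, model_score=lambda ds: float(len(ds)))
# len(rules) == 2
```

Note that dividing by the cover size prefers a high-gain small rule only if its per-instance improvement beats larger rules, which is the intuition behind the "steepest direction" discussed next.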
Hence, the separate-and-conquer strategy can be viewed as a greedy manner of optimizing the global criterion. To search for the next rule given the current (incomplete) rule list, we grow the rule by iteratively adding literals to an empty rule. Intuitively, we should not search for the next rule by minimizing $L(D, M)$ directly, as our goal is not to minimize $L(D, M)$ by adding only one more rule. Instead, we should search for the next rule such that, by adding it to the rule list, we take a step towards our final rule list in the "steepest direction". Informally, the "steepest direction" can be thought of as the direction that reduces $L(D, M)$ most per extra covered instance. That is, we use the heuristic named "normalized gain" [3]. Similar to SSD++, we use beam search when growing individual rules to avoid (some) local optima. The rule growing procedure continues as long as the normalized gain is positive, i.e., as long as $L(D, M)$ keeps decreasing. Further, the rule list learning stops when adding a new rule no longer decreases the global criterion.

4. Empirical Performance

We benchmark our method against SSD++ on widely used UCI datasets for regression tasks, to investigate the predictive performance improvement obtained by adopting the non-parametric histogram models. We also include the results of CART [14] as a baseline interpretable model, for which we use the scikit-learn implementation with post-pruning. We report the mean squared error (MSE) and the total number of literals in the rule list in Table 1; the results are obtained over 10 random train/test splits, in which 80% of the instances are used for training. Our histogram-based approach outperforms SSD++ on most datasets. Thus, the Gaussian assumption does indeed lead to sub-optimal results for rule learning.
Further, we observe that CART outperforms our method on about half of the datasets, but CART in general produces more complicated models than our method in terms of the number of literals in each rule (the path from root to leaf). Next, we observe that SSD++ produces much simpler models than our method. This can be explained as follows: SSD++ and our method both use the MDL principle to control model complexity, and hence a rule can grow only when the "gain" in likelihood exceeds the "cost" in the regret term and model complexity. Since adaptive histograms are much more expressive than Gaussian parametric models, it is "easier" for a rule to obtain a substantial "gain" in likelihood, which drives rules to grow longer and hence cover smaller subsets. As a result, more rules are needed to cover the whole dataset, and hence our method produces longer rules and a larger number of rules.

5. Conclusion and Discussion

We developed a rule list method for numeric targets based on the adaptive histogram model, and we demonstrated that its predictive performance is superior to that of the existing method based on parametric Gaussian models. The limitation of our method is scalability, due to the quadratic complexity of searching for the optimal adaptive histograms: for datasets with tens of features and thousands of samples, the training process can take a few minutes. Future work may focus on developing methods for calculating the "score" of each rule list approximately but efficiently.

Acknowledgments

This work is part of the research programme 'Human-Guided Data Science by Interactive Model Selection' with project number 612.001.804, which is (partly) financed by the Dutch Research Council (NWO).

Table 1
Predictive performance measured by MSE, and model complexity measured by the total number of literals in each model. The results are obtained over 10 random train/test splits for each dataset.

                             MSE                             # total literals
data          Hist. Rule List      SSD++       CART   Hist. Rule List   SSD++      CART
cholesterol           3287.72    3286.12    4081.33             12.17    1.92     26.85
autoMPG8                12.26      12.35      13.14             29.64   14.16    163.23
dee                      0.64       0.89       0.26             24.85    4.62     63.62
ele-1               458790.85  481793.78  536584.54             26.25   13.36     51.99
forestFires           6520.97    7147.07    4447.27             85.19   61.31     15.29
concrete                72.19      74.86      50.66            126.88   53.98   2215.52
abalone                  6.09       6.31       5.69            121.29   62.86    131.84

References

[1] J. Fürnkranz, D. Gamberger, N. Lavrač, Foundations of Rule Learning, Springer Science & Business Media, 2012.
[2] J. Fürnkranz, P. A. Flach, ROC 'n' rule learning—towards a better understanding of covering algorithms, Machine Learning 58 (2005) 39–77.
[3] H. M. Proença et al., Discovering outstanding subgroup lists for numeric targets using MDL, in: ECML PKDD 2020, Springer, 2020.
[4] J. H. Friedman, N. I. Fisher, Bump hunting in high-dimensional data, Statistics and Computing 9 (1999) 123–143.
[5] M. van Leeuwen, A. Knobbe, Diverse subgroup set discovery, Data Mining and Knowledge Discovery 25 (2012) 208–242.
[6] F. Janssen, J. Fürnkranz, Heuristic rule-based regression via dynamic reduction to classification, in: Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
[7] M. Meeng, A. Knobbe, For real: a thorough look at numeric attributes in subgroup discovery, Data Mining and Knowledge Discovery 35 (2021) 158–212.
[8] J. H. Friedman, B. E. Popescu, Predictive learning via rule ensembles, The Annals of Applied Statistics (2008) 916–954.
[9] P. D. Grünwald, The Minimum Description Length Principle, MIT Press, 2007.
[10] J. R. Quinlan, C4.5: Programs for Machine Learning, Elsevier, 2014.
[11] W. W. Cohen, Fast effective rule induction, in: Machine Learning Proceedings 1995, Elsevier, 1995, pp. 115–123.
[12] P. Kontkanen, P. Myllymäki, MDL histogram density estimation, in: Artificial Intelligence and Statistics, PMLR, 2007, pp. 219–226.
[13] L. Yang, M. van Leeuwen, Truly unordered probabilistic rule sets for multi-class classification, arXiv preprint arXiv:2206.08804 (2022).
[14] L. Breiman, J. Friedman, C. J. Stone, R. A. Olshen, Classification and Regression Trees, CRC Press, 1984.