<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Histogram-based Probabilistic Rule Lists for Numeric Targets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lincen Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Opdam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthijs van Leeuwen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIACS, Leiden University</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>While most rule learning methods focus on categorical targets for tasks including classification and subgroup discovery, rule learning for numeric targets is under-studied, especially using probabilistic rules: the only existing probabilistic rule list method for numeric targets, named SSD++, is based on Gaussian parametric models. To extend this method, we adopt an adaptive histogram model to estimate the probability distribution of the target variable. We formalize rule list learning as a model selection problem, which we tackle with the minimum description length principle. We demonstrate that our method produces rule lists with better predictive performance than SSD++.</p>
      </abstract>
      <kwd-group>
        <kwd>Rule list</kwd>
        <kwd>Regression rule</kwd>
        <kwd>MDL model selection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Related Work</title>
      <p>Learning rules from data is a long-studied problem in inductive reasoning, data mining, and
machine learning. It has been widely used in practice as rules are directly readable by analysts
and domain experts, revealing actionable insights for real-world data-driven tasks. However,
rule learning for numeric targets is under-studied, especially the task of inducing a rule-based
global model for describing the whole dataset.</p>
      <p>
        Classic rule learning algorithms often follow the separate and conquer strategy: learn a single
rule from the dataset, remove the covered instances, and repeat the process until some stopping
criterion is met. As a result, the separate and conquer strategy leads to an ordered list of rules,
also referred to as a decision list or rule list. While it has proven efficient in classification
tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the extension from categorical to numeric targets is non-trivial.
      </p>
      <p>
        The criteria used in rule lists for categorical targets are often based on probabilistic estimates (e.g., precision, false positive rate), including 1) heuristics for evaluating individual rules as local patterns, 2) global evaluation metrics for a rule list model, and 3) statistics used for statistical significance testing in order to prevent overfitting [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, for numeric targets, performing probabilistic density estimation can be computationally expensive, especially when doing this a large number of times, i.e., each time a rule is assessed. Specifically, standard approaches like kernel density estimation (KDE) suffer from a high computational cost (mainly due to the cross-validation needed for choosing hyper-parameters), which makes them unsuitable for learning rule lists. Parametric models, in contrast, are fast but may lead to mis-specification.
One existing method named SSD++ [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] assumes a Gaussian distribution for the numeric targets of the instances covered by each individual rule, but this is theoretically sub-optimal and leads to self-contradiction. This is because any individual rule can be regarded as the “union” of its refinements: for instance, the data points satisfying some condition c_1 form the union of the data points satisfying c_1 ∧ c_2 and c_1 ∧ ¬c_2. However, the target variable within these three rules cannot be assumed to be Gaussian at the same time, as the first is by definition a mixture of the other two.
      </p>
      <p>To strike a balance between computational cost and model expressiveness, we adopt adaptive histograms for density estimation when learning rules for numeric targets. Our contributions are as follows: 1) we formalize the problem of learning rule lists as a model selection task; 2) by showing that the model selection criterion of a rule list can be decomposed into the sum of local criteria for the individual rules, we demonstrate that our model selection criterion is compatible with the separate and conquer strategy; 3) we empirically show that the rule lists obtained by our method outperform those of SSD++ in terms of prediction accuracy, measured by the mean squared error.</p>
      <p>
        Related Work. Far fewer rule learning methods exist for numeric targets than for categorical targets. Besides SSD++, PRIM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposes to sequentially search for rules that deviate from the global mean, which leads to a non-probabilistic rule list. Further, DSSD [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposes to learn a diverse set of non-probabilistic rules for numeric targets. Next, instead of learning rules directly from numeric targets, an alternative approach is to transform regression rule learning into classification rule learning by dynamically discretizing the targets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Also, Meeng and Knobbe thoroughly discuss dealing with numeric targets by discretization in rule learning for subgroup discovery [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Last, RuleFit [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] can generate rules for numeric targets for constructing an ensemble model, which is less interpretable than a single rule list. Note that some of these methods were developed for the subgroup discovery task instead of predictive rule learning. Specifically, our competitor SSD++ was originally proposed for subgroup discovery; however, as pointed out in its original paper, it can be used for regression by formalizing the rule list as a probabilistic model in a slightly different way.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Probabilistic Rule Lists for Numeric Targets</title>
      <p>Adaptive histogram for numeric targets. Consider a one-dimensional random variable Y taking values in ℝ. A histogram model for Y is a set of cut points (including the boundaries), denoted as c⃗ = (c_1, . . . , c_{K+1}), where K represents the number of bins. For any value y in the support of Y, the associated probability distribution, denoted by h(·), has density function defined as h(Y = y) = ∑_{k=1}^{K} 1_{[c_k, c_{k+1})}(y) θ_k, where 1(·) is the indicator function and θ = (θ_1, . . . , θ_K) is the parameter vector to be estimated from data; i.e., θ_k (multiplied by the width of the k-th bin) represents the probability of Y taking values in that bin. In practice, θ can be estimated by the maximum likelihood estimator: θ̂_k = |{y ∈ Y : c_k ≤ y &lt; c_{k+1}}| / (|Y| (c_{k+1} − c_k)), where |·| is the cardinality function.</p>
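      <p>To make the estimator concrete, here is a minimal Python/NumPy sketch (the function names are ours, purely for illustration) that computes the maximum likelihood density heights for fixed cut points and evaluates the resulting density:</p>
      <preformat>
import numpy as np

def fit_histogram_density(y, cuts):
    """MLE for a histogram with fixed cut points: for each bin k,
    theta_hat_k = |{y : c_k &lt;= y &lt; c_{k+1}}| / (|Y| * (c_{k+1} - c_k))."""
    counts, _ = np.histogram(y, bins=cuts)   # instances per bin
    widths = np.diff(cuts)                   # bin widths c_{k+1} - c_k
    return counts / (len(y) * widths)        # density height per bin

def density_at(y_new, cuts, theta_hat):
    """Evaluate the histogram density h(Y = y); zero outside the support."""
    k = np.clip(np.searchsorted(cuts, y_new, side="right") - 1,
                0, len(theta_hat) - 1)
    inside = (y_new >= cuts[0]) &amp; (cuts[-1] >= y_new)
    return np.where(inside, theta_hat[k], 0.0)
      </preformat>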
      <sec id="sec-2-1">
        <title>Histogram-based rule.</title>
        <p>Consider the -dimensional feature variables  = (1, . . . , ),
where each  a single dimension of , and a target variable  ∈ R. A histogram-based rule is
written as (1 ∈ 1 ∧ 2 ∈ 2 ∧ . . .) → ℎ,( ), where each  ∈  is called a literal of
the condition of the rule. Specifically, each  is a closed interval or a set of categorical levels. A
rule in this form describes a subset  of the full sample space of , such that for any  ∈ ,
the conditional distribution  ( | = ) is approximated by the probability distribution of
 conditioned on the event { ∈ }, i.e.,  ( | ∈ ), which is modelled by an adaptive
histogram model, associated with  and hence denoted as ℎ,( ). Thus, a histogram-based
rule is a local probabilistic model for all instances that satisfy the rule. To simplify the notation,
when clear from the context we use  to refer to the rule itself and the subset of all instances
covered by the rule.</p>
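        <p>As an illustration of this definition, a minimal sketch of how such a rule could be represented (class names and fields are our own, covering numeric features with closed intervals only):</p>
        <preformat>
from dataclasses import dataclass

@dataclass
class Literal:
    """A literal X_j in R_j, with R_j a closed interval [lo, hi]."""
    feature: int   # index j of the feature
    lo: float
    hi: float

    def holds(self, x):
        return self.lo &lt;= x[self.feature] &lt;= self.hi

@dataclass
class Rule:
    """Condition (conjunction of literals) plus the histogram h_{Y,S}."""
    literals: list
    cuts: object = None    # histogram cut points over Y, fit on covered data
    theta: object = None   # MLE density heights (see fit_histogram_density)

    def covers(self, x):
        return all(lit.holds(x) for lit in self.literals)
        </preformat>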
      </sec>
      <sec id="sec-2-2">
        <title>Histogram-based rule list.</title>
        <p>Precisely, a rule list is an ordered list of rules connected by “If...ElseIf...Else...” statements. For instance, a rule list containing only two rules S_1 and S_2 can be written as “IF X ∈ S_1, THEN h_{Y,S_1}; ELSE IF X ∈ S_2, THEN h_{Y,S_2}; ELSE: h_{Y,S_0}”. Note that we use h_{Y,S_0} to represent the histogram associated with the instances not covered by any rule, and refer to S_0 as the “else rule”. (A related and widely used notion is the “default rule”; the difference lies in whether the parameters associated with S_0 are estimated from all instances (default rule) or only from the instances not covered by any rule in the rule list (else rule).) Therefore, a rule list connects multiple rules and hence becomes a global model for the whole dataset: given any instance (x, y), we can calculate the probability (density) by first going over the rule list until x satisfies the condition of a certain rule, and then predicting the density of y conditioned on x by the histogram model associated with that rule.</p>
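        <p>Continuing the illustrative sketch from above (predict_density and density_at are our names, not from the paper), prediction for a single instance walks down the list until a condition fires, falling back to the else rule:</p>
        <preformat>
import numpy as np

def predict_density(rules, else_rule, x, y):
    """Density of y given x under a rule list: the first rule whose
    condition x satisfies supplies the histogram; else the else rule."""
    for rule in rules:
        if rule.covers(x):
            return float(density_at(np.asarray(y), rule.cuts, rule.theta))
    return float(density_at(np.asarray(y), else_rule.cuts, else_rule.theta))
        </preformat>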
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Learning Rule Lists as a Model Selection Task</title>
      <p>
        We formalize the task of learning a rule list as a probabilistic model selection task, which we tackle with the minimum description length (MDL) principle [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. MDL-based model selection has theoretical roots in information theory and can be regarded as an extension
of Bayesian model selection. Further, it has a long history of being successfully applied in rule
learning, including classic methods like C4.5 and RIPPER [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. In practice, the MDL-based
model selection criterion can be viewed as a form of penalized maximum likelihood, as we will
elaborate next. We start with the criterion for optimal histograms of individual rules and then
discuss the global criterion for rule lists.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Preliminaries: learning adaptive histograms for individual rules</title>
        <p>Consider an individual rule S (with a histogram with fixed cut points) and all the n_S instances satisfying the condition of S, denoted as (X_S, Y_S).</p>
        <p>
          Formally, the optimal adaptive histogram h* among all possible histograms ℋ is defined as h* = arg min_{h ∈ ℋ} ( −log P̂_h(Y_S = y_S) + ℛ(n_S, K) ); note that 1) P̂_h is the estimated probability density function with the maximum likelihood estimator for the parameter vector θ, 2) P̂_h(y) is extended to P̂_h(y_S) under the i.i.d. assumption, and 3) ℛ(n_S, K) is the so-called regret, the “penalty term” of the MDL model selection criterion, which is a function of the sample size n_S and the number of bins K of the histogram [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. By definition, ℛ(n_S, K) = log ∫ P̂_h(Y_S = y_S) dy_S, where the integral ranges over all possible samples of size n_S [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The seminal work in this research line [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] developed a dynamic programming optimization algorithm with quadratic time complexity, despite the expensive numeric integral in ℛ.
        </p>
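        <p>The following sketch shows how one candidate histogram could be scored under this criterion, reusing fit_histogram_density and density_at from Section 2. Since the exact regret of [12] requires numeric integration, we substitute a crude BIC-style penalty as an assumed stand-in, not the penalty actually used:</p>
        <preformat>
import numpy as np

def neg_log_likelihood(y, cuts, theta_hat):
    """-log P(Y_S = y_S) under the MLE: sum of negative log-densities."""
    dens = density_at(np.asarray(y), cuts, theta_hat)
    return -np.sum(np.log(np.maximum(dens, 1e-300)))  # guard against log(0)

def histogram_score(y, cuts, regret_fn):
    """Local MDL criterion of one candidate histogram: NLL + R(n_S, K)."""
    theta_hat = fit_histogram_density(y, cuts)
    K = len(cuts) - 1
    return neg_log_likelihood(y, cuts, theta_hat) + regret_fn(len(y), K)

# Illustrative stand-in only; the paper uses the exact regret of [12],
# which requires numeric integration.
bic_regret = lambda n, K: 0.5 * (K - 1) * np.log(n)
        </preformat>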
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model selection for learning histogram-based rule lists</title>
        <p>
          The task of learning a rule list requires simultaneously selecting the cut points on the features (for constructing the rules’ conditions) and on the target (for constructing the adaptive histograms). Formally, denote the (training) dataset as D = (X, Y); then the best histogram-based rule list M*, among all possible rule lists ℳ, is defined as

          M* = arg min_{M ∈ ℳ} L(M, D) = arg min_{M ∈ ℳ} ( −log P̂_M(Y | X) + ℛ(M) + L(M) ),   (1)

          where 1) P̂_M(Y | X) is the likelihood of the data, based on the maximum likelihood estimators of the parameter vectors θ of all histograms; 2) ℛ(M) is the MDL regret term for the rule list as a global model, defined as ℛ(M) = log ∫ P̂_M(Y = y | X) dy, which can be proved to be the sum of the regret terms ℛ of all individual rules [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]; and 3) L(M) is the code length, in bits, needed to encode the rule list as a model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Note that for a fixed M ∈ ℳ, not only the conditions of the rules but also the cut points of the associated histograms are fixed. To calculate L(M), we sum up the code lengths needed to encode the following: the number of rules in M, the number of literals of each individual rule in M, and the exact variable and value used in the condition of each literal [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The formula for calculating this length is the same as that of SSD++ [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], and hence we omit the mathematical details here.
        </p>
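        <p>To illustrate how the criterion of Eq. (1) decomposes per rule, here is a hedged sketch building on histogram_score above; model_code_length is a simplified stand-in for L(M), as the exact encoding (shared with SSD++) also covers the literals' values:</p>
        <preformat>
import math

def model_code_length(rules, n_features):
    """Simplified L(M) in bits: the number of rules, the number of literals
    of each rule, and which feature each literal uses (the encoding of the
    literals' values is omitted here for brevity)."""
    bits = math.log2(1 + len(rules))                     # number of rules
    for r in rules:
        bits += math.log2(1 + n_features)                # number of literals
        bits += len(r.literals) * math.log2(n_features)  # feature per literal
    return bits

def rule_list_score(rules, else_rule, ys_per_rule, n_features, regret_fn):
    """Global criterion L(M, D): sum of the local histogram scores over the
    rules (ys_per_rule[i] holds the targets covered by the i-th rule, the
    last entry belonging to the else rule) plus the model code length."""
    data_cost = sum(histogram_score(ys, r.cuts, regret_fn)
                    for r, ys in zip(rules + [else_rule], ys_per_rule))
    return data_cost + model_code_length(rules, n_features)
        </preformat>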
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Algorithm: separate and conquer</title>
        <p>
          As exhaustive search in rule learning is known to be computationally prohibitive [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], we resort to the separate and conquer strategy: we iteratively search for the next rule, add it to the rule list, and remove the covered instances, until adding a new rule brings no further decrease of L(M, D). Since 1) the regret term ℛ of the rule list can be written as the sum of the regret terms of the individual rules, 2) L(M) decomposes over the individual rules by definition, and 3) the log-likelihood can be decomposed into the likelihood of each rule, the global model selection criterion L(M, D) can be decomposed into the sum of local criteria of the individual rules. Hence, the separate and conquer strategy can be viewed as a greedy way of optimizing the global criterion.
        </p>
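        <p>A minimal sketch of the separate-and-conquer loop just described; find_next_rule stands for the rule search of the next paragraph and is assumed rather than implemented here:</p>
        <preformat>
import numpy as np

def learn_rule_list(X, y, find_next_rule):
    """Separate and conquer: add the best next rule, remove the instances
    it covers, and stop once no rule decreases the global criterion."""
    rules = []
    remaining = np.ones(len(y), dtype=bool)
    while remaining.any():
        rule, gain = find_next_rule(X[remaining], y[remaining])
        if rule is None or gain &lt;= 0:   # no further decrease of L(M, D)
            break
        rules.append(rule)
        covered = np.array([rule.covers(x) for x in X])
        remaining = remaining &amp; ~covered   # "separate" the covered part
    return rules
        </preformat>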
        <p>
          To search for the next rule given the current (incomplete) rule list, we grow the rule by iteratively adding literals to an empty rule. Intuitively, we should not search for the next rule by minimizing L(M, D) directly, as our goal is not to minimize L(M, D) by adding one more rule only. Instead, we should search for the next rule such that, by adding this rule to the rule list, we take a step towards our final rule list in the “steepest direction”. Informally, the “steepest direction” can be thought of as the direction that reduces L(M, D) most per extra covered instance; that is, we use the heuristic named “normalized gain” [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Similar to SSD++, we use beam search when growing individual rules to avoid (some) local optima. The rule growing procedure continues as long as the normalized gain is positive, i.e., as long as L(M, D) keeps decreasing. Further, the rule list learning stops when adding a new rule would not decrease the global criterion further.
        </p>
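        <p>The normalized gain heuristic itself is straightforward; the sketch below is our own formulation of the idea (criterion decrease divided by the number of newly covered instances):</p>
        <preformat>
def normalized_gain(score_before, score_after, n_covered):
    """Decrease of L(M, D) per extra covered instance when a candidate
    rule is appended to the current (incomplete) rule list."""
    if n_covered == 0:
        return 0.0
    return (score_before - score_after) / n_covered
        </preformat>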
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Empirical Performance</title>
      <p>
        We benchmark our method against SSD++ on widely used UCI regression datasets, to investigate the improvement in predictive performance obtained by adopting non-parametric histogram models. We also include the results of CART [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] as a baseline interpretable model, for which
we use the implementation from scikit-learn with post-pruning.
      </p>
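      <p>For reference, here is a sketch of how such a CART baseline with post-pruning can be set up in scikit-learn; the synthetic placeholder data and the cross-validated choice of ccp_alpha are our assumptions, as the exact pruning configuration is not specified here:</p>
      <preformat>
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))          # placeholder data; the paper uses
y = X[:, 0] + rng.normal(size=500)     # UCI regression datasets instead

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8,
                                          random_state=0)

# Cost-complexity post-pruning: candidate alphas come from the pruning
# path; the final alpha is chosen by cross-validation.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(
    X_tr, y_tr)
cart = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    {"ccp_alpha": path.ccp_alphas}, cv=5).fit(X_tr, y_tr)
print("test MSE:", mean_squared_error(y_te, cart.predict(X_te)))
      </preformat>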
      <p>We report the mean squared error (MSE) and the total number of literals in the rule list in Table 1; the results are obtained over 10 random train/test splits, in which 80% of the instances are used for training. We show that our histogram-based approach outperforms SSD++ on most datasets; thus, the Gaussian assumption indeed leads to sub-optimal results for rule learning. Further, we observe that CART outperforms our method on about half of the datasets, but CART in general produces more complicated models than our method in terms of the number of literals in each rule (path from root to leaf).</p>
      <p>[Table 1: MSE and total number of literals per method on the datasets cholesterol, autoMPG8, dee, ele-1, forestFires, concrete, and abalone.]</p>
      <p>Next, we observe that SSD++ produces much simpler models than our method. This can
be explained as follows: SSD++ and our method both use the MDL principle to control the
model complexity, and hence the rule can grow only when the “gain" in likelihood exceeds
the “cost" in the regret term and model complexity. Since adaptive histograms are much more
expressive than Gaussian parametric models, it is “easier" for the rule to obtain substantial
“gain" in likelihoods, which drives the rules to grow longer and hence cover smaller subsets.
As a result, more rules are needed to cover the whole dataset and hence our method produces
longer rules and a larger number of rules.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Discussion</title>
      <p>We developed a rule list method for numeric targets based on the adaptive histogram model, and we demonstrated that its predictive performance is superior to that of the existing method based on parametric Gaussian models. The limitation of our method is scalability, due to the quadratic time complexity of searching for the optimal adaptive histograms: for datasets with tens of features and sample sizes in the thousands, the training process can take a few minutes. Future work may focus on developing methods for calculating the “score” of each rule list approximately but efficiently.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is part of the research programme ‘Human-Guided Data Science by Interactive Model
Selection’ with project number 612.001.804, which is (partly) financed by the Dutch Research
Council (NWO).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gamberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lavrač</surname>
          </string-name>
          , Foundations of rule learning,
          <source>Springer Science &amp; Business Media</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Flach</surname>
          </string-name>
          ,
          <article-title>ROC 'n' rule learning - towards a better understanding of covering algorithms</article-title>
          ,
          <source>Machine learning 58</source>
          (
          <year>2005</year>
          )
          <fpage>39</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Proença</surname>
          </string-name>
          , et al.,
          <article-title>Discovering outstanding subgroup lists for numeric targets using MDL</article-title>
          ,
          <source>in: ECMLPKDD 2020</source>
          , Springer,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. I.</given-names>
            <surname>Fisher</surname>
          </string-name>
          ,
          <article-title>Bump hunting in high-dimensional data</article-title>
          ,
          <source>Statistics and computing 9</source>
          (
          <year>1999</year>
          )
          <fpage>123</fpage>
          -
          <lpage>143</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Van Leeuwen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Knobbe</surname>
          </string-name>
          ,
          <article-title>Diverse subgroup set discovery</article-title>
          ,
          <source>Data Mining and Knowledge Discovery</source>
          <volume>25</volume>
          (
          <year>2012</year>
          )
          <fpage>208</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Janssen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          ,
          <article-title>Heuristic rule-based regression via dynamic reduction to classification</article-title>
          , in: Twenty-Second
          <source>International Joint Conference on Artificial Intelligence</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Meeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Knobbe</surname>
          </string-name>
          ,
          <article-title>For real: a thorough look at numeric attributes in subgroup discovery</article-title>
          ,
          <source>Data Mining and Knowledge Discovery</source>
          <volume>35</volume>
          (
          <year>2021</year>
          )
          <fpage>158</fpage>
          -
          <lpage>212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <article-title>Predictive learning via rule ensembles</article-title>
          ,
          <source>The annals of applied statistics</source>
          (
          <year>2008</year>
          )
          <fpage>916</fpage>
          -
          <lpage>954</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Grünwald</surname>
          </string-name>
          ,
          <article-title>The minimum description length principle</article-title>
          , MIT press,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          ,
          <article-title>C4.5: programs for machine learning</article-title>
          ,
          <source>Elsevier</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <article-title>Fast effective rule induction</article-title>
          ,
          <source>in: Machine learning proceedings 1995, Elsevier</source>
          ,
          <year>1995</year>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>123</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kontkanen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Myllymäki</surname>
          </string-name>
          ,
          <article-title>Mdl histogram density estimation</article-title>
          ,
          <source>in: Artificial intelligence and statistics</source>
          , PMLR,
          <year>2007</year>
          , pp.
          <fpage>219</fpage>
          -
          <lpage>226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Leeuwen</surname>
          </string-name>
          ,
          <article-title>Truly unordered probabilistic rule sets for multi-class classification</article-title>
          ,
          <source>arXiv preprint arXiv:2206.08804</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Olshen</surname>
          </string-name>
          ,
          <article-title>Classification and regression trees</article-title>
          , CRC press,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>