<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>When characteristic rule-based models should be preferred over discriminative ones</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Florian Beck</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Fürnkranz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Van Quoc Phuong Huynh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johannes Kepler University Linz, LIT Artificial Intelligence Lab / Institute for Application-oriented Knowledge Processing (FAW)</institution>
          ,
          <addr-line>Altenberger Straße 66b/69, 4040 Linz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In recent years, the interpretability of machine learning models has gained interest. White-box approaches like rule-based models serve as an interpretable alternative or as surrogate models of black-box approaches. Among these, more compact rule-based models are considered easier to interpret. In addition, they often generalize better and thus provide higher predictive accuracies than their overfitting complex counterparts. In this paper, we argue that more complex, “characteristic” rule-based models are a genuine alternative to more compact, “discriminative” ones. We discuss why characteristic models should not be considered as less interpretable, and that more included features can actually strengthen the model both in terms of robustness and predictive accuracy. For this, we evaluate the effects on the decision boundary for models of different complexity, and also modify a recently developed Boolean pattern tree learner to compare a characteristic and a discriminative version on five UCI data sets. We show that the more complex models are indeed more robust to missing data, and that they sometimes even improve the predictive accuracy on the original data.</p>
      </abstract>
      <kwd-group>
<kwd>characteristic rules</kwd>
        <kwd>discriminative rules</kwd>
        <kwd>decision boundaries</kwd>
        <kwd>interpretability</kwd>
        <kwd>robustness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the rise of neural network models in many machine
learning applications, the need has grown to actually
understand what these black-box approaches learn. This
has brought rule-based models back into the spotlight,
as they can be used as interpretable surrogates of
neural network approaches, e.g., by extracting rules from
the whole network [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or with the focus on explaining
decision boundaries [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Independent of whether rule-based models are used as
surrogates of neural networks or as a stand-alone model,
usually the principle of Occam’s Razor [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is followed,
which loosely states that the simplest
explanation is the best one. Consequently, discriminative
rules which discriminate an object of one category from
objects of other categories are preferred over
characteristic rules which try to capture all properties that
are common to the objects of the target class [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This
principle is also supported by the observation that longer
explanations tend to overfit the training data, leading
to worse performance on test data. Hence, most rule
learners use some kind of pruning policy [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], resulting
in learning short discriminative rules instead of longer
characteristic ones.
      </p>
      <p>However, there is a fine line between avoiding
overfitting and learning too general theories. Consider the
sample dataset in Table 1, consisting of six countries, three
belonging to the class Europe and three to the class South
America. For each country, the values of the three
numeric attributes Size, Age and CO2 are provided.</p>
      <p>
        Traditional rule learners like, e.g., Ripper [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] strive for discriminative rules, i.e., rules that minimize the number
of used attributes when describing the classes. In this
case, such a perfect, minimal description of the training
data could be learned with a single rule r1 only
considering the first attribute Size, together with the corresponding default
rule r0 for the other class:
      </p>
      <p>r1 : class = Europe ← Size &lt; 184 (1)
(r0 : class = SouthAmerica ← ⊤).</p>
      <p>Rule r1 covers the three examples Austria, Czechia
and Slovakia because these examples fulfill the condition
Size &lt; 184. Bolivia, Brazil and Ecuador are not covered
by r1 but only by the most general rule r0, and are thus classified
as South America. While these rules perfectly describe the
training examples, they fail to correctly classify the test
example Germany, which is not covered by r1 and hence
misclassified as South America by r0. Vice versa, Uruguay
is covered by r1 and hence misclassified as Europe. Note
that these misclassifications could have been avoided if a
different feature had been picked, such as, e.g., the median age
in rule r2 or the CO2 emissions in rule r3 (Equations 2 and 3):
European and South American countries do not only
differ in size, but also in median age and CO2 emissions.</p>
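<p>The three features discussed here (Size &lt; 184, Age ≥ 36.7 and CO2 ≥ 4.2 for class Europe, and their complements for class South America) can be combined into per-class condition lists and evaluated by counting satisfied conditions, as the paper later does when assigning an example to its closest rule. A minimal sketch; the attribute values of the example are illustrative placeholders, not the actual Table 1 figures:</p>

```python
# "Closest rule" evaluation: instead of requiring that all conditions of a
# rule hold, assign an example to the rule with the most satisfied
# conditions. Rule conditions follow the text; attribute values below are
# illustrative placeholders.

RULES = {
    "Europe": [          # rE: Size < 184, Age >= 36.7, CO2 >= 4.2
        lambda x: x["size"] < 184,
        lambda x: x["age"] >= 36.7,
        lambda x: x["co2"] >= 4.2,
    ],
    "SouthAmerica": [    # rS: Size >= 184, Age < 36.7, CO2 < 4.2
        lambda x: x["size"] >= 184,
        lambda x: x["age"] < 36.7,
        lambda x: x["co2"] < 4.2,
    ],
}

def closest_rule_class(example):
    """Return the class whose rule has the most satisfied conditions."""
    scores = {cls: sum(cond(example) for cond in conds)
              for cls, conds in RULES.items()}
    return max(scores, key=scores.get)

# A Germany-like example: too large for the Size condition, but old and
# with high CO2 emissions, so the Europe rule still wins two to one.
print(closest_rule_class({"size": 357, "age": 45.0, "co2": 8.0}))
```

<p>Counting satisfied conditions rather than requiring all of them is what makes the longer, characteristic rules robust: a test example can violate one condition and still be assigned to the correct class.</p>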
      <p>The rest of the paper is organized as follows: Section 2
further specifies the problem of finding good decision
boundaries and presents characteristic models of
non-rule-based classifiers as an inspiration for adaptation in the
rule-based setting, presented in Section 3. We modify a
rule-based learner in Section 4 accordingly and evaluate
a discriminative and a characteristic version in Section 5
in terms of predictive accuracy and robustness. Section 6
concludes and takes a brief look at the remaining
challenges.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Decision boundaries</title>
      <p>As alternatives to the feature Size used in rule r1, the median age or
the CO2 emissions can be picked to describe class Europe:</p>
      <p>r2 : class = Europe ← Age ≥ 36.7 (2)
r3 : class = Europe ← CO2 ≥ 4.2.</p>
      <p>However, r2 does not cover the test example Kosovo,
and r3 does not cover Albania, so that none of the three
rules alone would be sufficient to classify all four test examples
correctly; only a combined rule set of r2 and r3 would
do so. Similarly, the three suggested features Size &lt; 184,
Age ≥ 36.7 and CO2 ≥ 4.2 can also be connected by
conjunctions to a single rule rE for class Europe, while the
respectively contrasting features form rule rS for class
South America:</p>
      <p>rE : class = Europe ← Size &lt; 184 ∧ Age ≥ 36.7 ∧ CO2 ≥ 4.2
rS : class = SouthAmerica ← Size ≥ 184 ∧ Age &lt; 36.7 ∧ CO2 &lt; 4.2. (3)</p>
      <p>
        While neither of the two rules covers any of the test
examples, a slight modification of their semantics allows
us to use them as reliable classifiers. Instead of requiring
that all conditions of a rule are satisfied, we
assign an example to its closest rule, a method
that is reminiscent of rule stretching [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or nearest hyper-rectangle classification [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In our example, the first three test examples are assigned to class Europe, since each
of them satisfies two out of three conditions of rE and
only one out of three of rS. Analogously, test example
Uruguay is correctly classified as South America.
      </p>
      <p>Independent of using conjunctions or disjunctions as
the connector, we notice that the more characteristic rule
theories in Equations 2 and 3 are able to classify all four test
examples correctly, while the discriminative rule theory in
Equation 1 is not able to do so. Moreover, the inclusion
of more features in the characteristic concepts might not
only lead to a better performance but also arguably provide
more interesting and interpretable models, stating
that European and South American countries do not only
differ in size, but also in median age and CO2 emissions.
(Country data retrieved 2024/07/04 from
https://ourworldindata.org/age-structure and
https://ourworldindata.org/co2-and-greenhouse-gas-emissions.)</p>
      <p>As depicted in the introduction, in contrast to long
characteristic rules being prone to overfitting, short
discriminative rules come with the risk of providing too simplistic
theories that overgeneralize. This can also be illustrated
by the decision boundary of the country dataset rules,
see Figure 1. For a better visualization, we omit the third
attribute CO2 to obtain a two-dimensional feature space,
using the attribute Size in logarithmic scale on the x-axis
and Age on the y-axis. The raw data are shown in
Figure 1a: the six training examples as points and the four
test examples as circles, colored in blue for class Europe
and in red for class South America, respectively.</p>
      <p>Figure 1: (a) Original data, (b) Discriminative rule, (c) Characteristic rule,
(d) Support Vector Machine, (e) Scoring system, (f) Hyper-rectangles.</p>
      <p>We see that the training examples are quite easily
separable from each other, while the test examples complicate
finding a good decision boundary. Figure 1b shows a
discriminative rule class = SouthAmerica ← Age &lt; 36, covering all four
South American countries along with one European country in
the light-red area of the feature space. The light-blue
area (classified by the default rule class = Europe ← ⊤) contains
five true negatives. By adding the condition Size &gt; 140,
the rule can be made more characteristic, leading to a
perfect classification of all examples, see Figure 1c.</p>
      <p>
        Still, the decision boundary provided in Figure 1c can
be considered suboptimal when compared with
non-rule-based models. Figure 1d illustrates an arguably better
decision boundary which other methods like, e.g., support
vector machines [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], logistic regression [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and naive
Bayes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] can find. All these approaches have in common that
they usually consider all attributes in the feature space
and rely on continuous coefficients to build their models;
in this case: class = SouthAmerica ← Age − 10 · log10(Size) ≤ 15.
In comparison, conventional rule learners only use combinations of
attribute-value conditions for the splits of their classes. As a
consequence, one of the main limitations of rule
learning is arguably its restriction to axis-parallel decision
boundaries. Though, the last two subfigures show two
ways how rule-based methods can still mimic decision
boundaries like in Figure 1d.
      </p>
      <p>
        In Figure 1e, we see multiple steps in the decision
boundary. Trivially, this behavior can be achieved by
learning one rule for each step. While this is straightforward
in this example, it is too hard to maintain in a high-dimensional
feature space with an exponentially increasing
number of combinations. Scoring systems [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] scale
better by assigning low integer scores to attribute-value
combinations, hereby providing a trade-off between rules
and linear models. In the special case that all weights
are binary, the scoring system converts into an m-of-n
concept. With scores assigned per attribute by a scheme of the form
s(x) = 3 if x ≥ 1100, 2 if x ≥ 140, 1 if x ≥ 20, 0 else,
and a threshold of 4 for class South America, all
examples are classified correctly while providing a more
customized decision boundary compared to Figure 1c.
      </p>
      <p>
        Finally, Figure 1f is the illustration of two characteristic
rules similar to Equation 3: We describe both classes
Europe and South America without using a default rule.
Obviously, the learned rules of the two classes can overlap
or, as in this case, leave wide areas of the feature
space uncovered, so that a pure Boolean evaluation of
the rules is not sufficient anymore. One way to handle
these uncovered areas are nearest hyper-rectangles
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The decision boundary between the two classes
can be shaped arbitrarily if enough hyper-rectangles, i.e.,
rules, are learned (and is actually not quite straight
in Figure 1f either). Obviously, distances for nominal attributes
can not be defined as straightforwardly as for numerical
attributes, as is discussed in the following section.
In this work, we aim to expand rule-based approaches
to reach the stronger expressiveness shown in the last
three subfigures while still retaining the properties that
make them interpretable, i.e., without including all features
and interactions and, most notably, without
continuous coefficients like in SVMs, logistic regression or
naive Bayes, for which purpose characteristic rules are preferable.
      </p>
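<p>The stepped decision boundary of Figure 1e can be sketched as a tiny scoring system. The cut-points 1100/140/20 and the decision threshold of 4 are taken from the text; how the second attribute is scored in the paper's figure is not fully recoverable, so the Age-based score below is an illustrative assumption:</p>

```python
# Minimal scoring-system classifier: integer scores per attribute, class
# predicted by comparing the summed score to a threshold (4 for class
# South America, as in the text). The Age-based score is an assumption.

def step_score(value, cuts=(1100, 140, 20)):
    """Return 3/2/1/0 depending on the highest threshold reached."""
    for score, cut in zip((3, 2, 1), cuts):
        if value >= cut:
            return score
    return 0

def classify(size, age, threshold=4):
    # Large countries and young populations both add to the score
    # (assumed scheme, not the paper's exact score table).
    total = step_score(size) + step_score(40 - age, cuts=(15, 10, 5))
    return "SouthAmerica" if total >= threshold else "Europe"

print(classify(size=8516, age=29))  # large and young
print(classify(size=84, age=44))    # small and old
```

<p>Each integer cut-point contributes one axis-parallel step to the decision boundary, which is how a scoring system interpolates between a single rule and a linear model.</p>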
    </sec>
    <sec id="sec-3">
      <title>3. Characteristic rule learning</title>
      <p>
        So far we discussed why characteristic rules can be beneficial
both in terms of interpretability and performance,
but observed as well that almost no rule-based methods
learn such concepts. To understand why conventional
rule learners prefer discriminative rules, we first briefly
introduce the coverage space and related heuristics.
Subsequently, we reveal potential issues with the latter and
identify properties which should be taken into
consideration when developing a characteristic rule-based learner.
      </p>
      <p>3.1. Coverage space and heuristics</p>
      <p>
        Traditionally, rules are gradually refined by adding
individual conditions, whereby conjunctive refinements
specialize a rule (afterwards it can never cover more
examples than before the refinement), whereas disjunctive
refinements generalize a rule (afterwards it can never
cover fewer examples than before the refinement). This
can be visualized in coverage space, a non-normalized
ROC space, where the x-axis shows the covered negative
and the y-axis the covered positive examples [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For example, Figure 2 shows a path that gradually refines
an initially universal rule (covering all P positive and N
negative examples, upper right corner of the coverage
space) into the rule + ← A ∧ B.
      </p>
      <p>Figure 2: Rule refinement in coverage space</p>
      <p>Apparently, a rule refined to the upper left corner can
be considered perfect, since it covers only positive
examples and no negatives. In most scenarios such a rule
can not be found, so that a trade-off must be found
between the importance of covering all positives (completeness)
and not covering any negatives (consistency). For
this purpose, heuristics are defined as functions h(p, n),
where p (0 ≤ p ≤ P) and n (0 ≤ n ≤ N) are the numbers of positive
and negative examples covered by a rule [14].</p>
      <p>
        In previous studies it was found that most regular
heuristics (in particular those striving for consistency)
lead to the learning of discriminative rules, so that in
this context so-called inverted heuristics are suggested,
which better reflect the top-down nature of the
rule refinement process by originating from
the other side of the coverage space [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Because of their typical focus on completeness, inverted heuristics can
often "delay" the choice of too specific features, hence
resulting in characteristic rules built of multiple more
general features.
      </p>
      <p>3.2. Limitations</p>
      <p>
        Even though it has been shown empirically for some
datasets that inverted heuristics result in characteristic
rules [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], it is not inherent that they lead to
characteristic rules. As a counterexample, consider learning a
rule for the class Europe using all examples of the
country dataset except Brazil. The best single condition
is Age ≥ 29.1, covering all six examples of class Europe
as well as Uruguay. This false positive can not be
excluded by further (single-cut) conditions on Size or CO2
without losing coverage of at least one true positive, so
that the inverted heuristic stops with a rule consisting
of a single condition. Interestingly enough, in this case,
regular heuristics would even learn longer rules than
inverted heuristics, since they typically prefer the trade
of removing a false positive at the cost of a false negative.
      </p>
      <p>Most importantly though, traditional rule learners
have the severe limitation of focusing only on the
coverage of the learned rules, but not on how (well) they cover
the examples. We already noticed in Table 1 that rule r1
in Equation 1 can be expanded to rE in Equation 3 by
features considering the age and CO2 emissions of a country without
covering more positive or fewer negative examples. Hence,
both r1 and rE correspond to the same point in the
coverage space in the top left corner, covering all positive
and no negative training examples. As a consequence,
independent of the chosen heuristic, conventional rule
learners are not able to learn rE if a refinement requires
improving the heuristic.</p>
      <p>Even if the heuristic improves by adding a new
condition to the original rule, a similar issue can occur. Assume
a new rule r4 learned on all ten examples in Table 1,
which focuses on covering the example Germany based
on the condition Size ≥ 316. This rule still covers
Bolivia and Brazil as well and could therefore be refined to
rules r5 and r6, both considering the Age attribute:</p>
      <p>r4 : class = Europe ← Size ≥ 316
r5 : class = Europe ← Size ≥ 316 ∧ Age ≥ 36.3 (4)
r6 : class = Europe ← Size ≥ 316 ∧ Age ≥ 44.5.</p>
      <p>While r5 and r6 both correspond to the same point
in the coverage space (covering one positive example
and no negatives), their coverage on unseen examples
might vary crucially, since they cover different areas in
the feature space. Arguably, r5 should be preferred over
r6 because the added condition Age ≥ 36.3 covers four
additional positive examples (and still no negative)
compared to Age ≥ 44.5. So to say, while having the same
"global" concept, we should choose the rule with the
better "local" condition. Note that this is not limited to
numeric attributes.</p>
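<p>The core limitation discussed in this section can be made concrete in a few lines: any heuristic h(p, n) that sees only the number of covered positives and negatives must assign identical values to r1 and rE, since both cover p = 3 positives and n = 0 negatives on the training data. A sketch with three common heuristics (precision, the linear cost metric and the accuracy metric used later in the paper); the function names are illustrative:</p>

```python
# Any heuristic defined purely on the coverage counts (p, n) gives the
# same value to the short rule r1 and the characteristic rule rE, because
# both cover p = 3 positives and n = 0 negatives on the training data.

P, N = 3, 3  # total positive (Europe) and negative (South America) examples

def precision(p, n):
    return p / (p + n) if p + n else 0.0

def linear_cost(p, n, c=0.5):
    return c * p - (1 - c) * n

def accuracy(p, n):
    return (p + (N - n)) / (P + N)

r1_coverage = (3, 0)  # Size < 184
rE_coverage = (3, 0)  # adding the Age and CO2 conditions changes nothing

for h in (precision, linear_cost, accuracy):
    assert h(*r1_coverage) == h(*rE_coverage)  # tie for every heuristic
print("r1 and rE are indistinguishable by any h(p, n)")
```

<p>This is why the choice between the two rules has so far been made by the tie-breaking policy (prefer the shorter explanation) rather than by the heuristic itself.</p>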
      <p>
        To summarize, characteristic rules are usually not
learned because the learners rely on heuristics that only
take the number of covered positive and negative
examples into account, instead of separating positive and
negative examples with a variety of rules and conditions.
Particularly, adding a condition without changing the
covered examples results in the same heuristic value, in
which case so far the shorter explanation is used and the
search usually stops. Additionally, the mere focus on the
global coverage can lead to suboptimal "local" conditions
if ties are not handled appropriately.
      </p>
      <p>4. Boolean Pattern Trees</p>
      <p>
        For the experimental comparison of discriminative and
characteristic concepts, we use two versions of an
alternating Boolean pattern tree (aBpt) learner recently
developed in our group [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The task of learning an aBpt is
quite similar to learning a rule: For every specific class
c ∈ Y, a tree t : c ← E is learned, where E is a logical
expression defined over the input features, which can
be much more flexible than for rules. In contrast to rule
learners, which use either conjunctions or disjunctions,
aBpts can connect binary features by conjunctions and
disjunctions in any arbitrary order. This also complicates
the iterative learning of E, as there are multiple insertion
options per feature. E.g., inserting a disjunction with
feature C into E = A ∧ B can result in C ∨ (A ∧ B), (A ∨ C) ∧ B
or A ∧ (B ∨ C), which are all logically different. These
insertions are repeated until the maximum number of
iterations k (= number of features in the pattern tree) is
reached. We refer to [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] for further details of the
algorithm and focus on two adjustments for the experiments
in the following.
      </p>
      <p>First, we notice that the standard version of aBpt
already uses multiple heuristics for the evaluation of a tree
extension, focusing on consistency and
completeness in different search branches. By using various cost
ratios c in the linear cost metric (h(p, n) = c · p − (1 − c) · n),
aBpt is capable of learning both the models preferred by
regular and those preferred by inverted heuristics. Though, Section 3 also presented
problems that can not be fixed solely by the
heuristic. To choose characteristic models instead of
discriminative ones in case of ties, the learner picks the
tree learned in a later iteration, i.e., the one using more
conditions (and vice versa for discriminative models). This way we do not stop in local
optima, but always use all k iterations. Furthermore,
all conditions are sorted based on the accuracy metric
(h(p, n) = (p + (N − n)) / (P + N)) in the first iteration, so that in
subsequent iterations always the best "local" condition can
be picked, as discussed in Section 3.</p>
      <p>
        Second, the handling of multiple pattern trees is crucial
for the decision boundary. In the original aBpt classifier,
one pattern tree for each class c ∈ Y is learned. Since in
a Boolean context the output of the Boolean expression
represented by the pattern tree can only be true or false
for the features of a test example, ties occur if a test
example is matched by multiple pattern trees; these ties
can be broken by a fixed order of the pattern trees in a
decision list. An alternative that is used in fuzzy pattern tree
classifiers [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is to evaluate all pattern trees in a
probabilistic way, whereby the highest probability decides about
the class prediction. A straightforward way to achieve
this behavior in aBpt is using a constant uncertainty
factor ε, resulting in probabilities P(f) = 1 − ε for
fulfilled features f and P(f) = ε otherwise, which are then
aggregated bottom-up over the respective child nodes
c of a node n as P(n) = ∏_{c ∈ n} P(c) for conjunctive and P(n) =
1 − ∏_{c ∈ n} (1 − P(c)) for disjunctive nodes. However, this
comes with the two inconveniences of (a) weighing all child
nodes the same, independent of their importance, and (b)
penalizing all conjunctive conditions (always decreasing
P(n)) and rewarding all disjunctive conditions (always
increasing P(n)) independent of their quality, which
particularly affects characteristic models negatively.
      </p>
      <p>To address (a), we determine P(f) flexibly in the range [ε, 1 − ε] as
P(f) = ε + (1 − 2ε) · p / P
for fulfilled features (≃ probability that a positive example
fulfills the feature) and
P(f) = ε + (1 − 2ε) · (N − n) / ((P − p) + (N − n))
otherwise (≃ probability that an example not fulfilling the
feature is negative). Additionally, for (b) we relax P(n)
for the interior tree nodes as
P(n) = 1/2 · (1/|n| · ∑_{c ∈ n} P(c) + min_{c ∈ n} P(c))
for conjunctive nodes and as
P(n) = 1/2 · (1/|n| · ∑_{c ∈ n} P(c) + max_{c ∈ n} P(c))
for disjunctive nodes, resulting in a probability less
dependent on the number of child nodes, so that models
with different numbers of conjunctive and disjunctive
nodes can be compared better.</p>
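<p>The relaxed bottom-up evaluation can be sketched as follows. This is a simplified illustration with a constant uncertainty factor; the flexible per-feature probabilities based on p/P described in the text are omitted, and the tiny tree is a hypothetical example, not a learned aBpt:</p>

```python
# Probabilistic evaluation of a small pattern tree with constant
# uncertainty factor EPS and the relaxed aggregation: interior nodes
# average the mean child probability with the min (conjunction) or max
# (disjunction) child probability. The tree itself is hypothetical.

EPS = 0.1

def evaluate(node, example):
    """node is a feature name (leaf) or a tuple (op, children)."""
    if isinstance(node, str):
        return 1 - EPS if example.get(node) else EPS
    op, children = node
    probs = [evaluate(child, example) for child in children]
    mean = sum(probs) / len(probs)
    extreme = min(probs) if op == "and" else max(probs)
    return 0.5 * (mean + extreme)

# Mammal-like concept: milk AND (eggs_false OR hair)
tree = ("and", ["milk", ("or", ["eggs_false", "hair"])])

dog = {"milk": True, "hair": True, "eggs_false": True}
trout = {"milk": False, "hair": False, "eggs_false": False}
print(evaluate(tree, dog), evaluate(tree, trout))
```

<p>Averaging the mean with the min (or max) keeps the result in the range spanned by the child probabilities, so trees with many conjunctive or disjunctive children are not systematically penalized or rewarded.</p>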
    </sec>
    <sec id="sec-4">
      <title>5. Experiments</title>
      <p>In the experiments we analyze two diferent aspects of
the aBpt learner, using the default configuration of
k = 20 iterations, and accuracy and seven different
values for the linear cost as metrics. First, we compare a
discriminative version preferring smaller trees in case
of tied heuristics and a characteristic version preferring
bigger trees. Second, we evaluate both versions not only
in a Boolean setting but also in a probabilistic setting, as
suggested in the end of Section 4.</p>
      <p>
        For the experiments, we choose five UCI [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] datasets
where most features are not used in the discriminative
models and therefore the characteristic models could
differ remarkably: labor, mushroom, soybean, vote and
zoo. On some datasets (labor, soybean, zoo) the
otherwise inferior naive Bayes classifier even outperforms
rule learners like Ripper, indicating potential for
improvement of the decision boundary in the probabilistic setting.
      </p>
      <p>The learners are not only applied to the original
datasets but also to an "incomplete" version of the dataset,
where 30% of the values are replaced by missing values,
and finally a "mixed" version, where only the test data is
"incomplete". This way we can analyze the robustness of
the learned models, where characteristic models might
be expected to perform better, since additional features
are used as a fallback option.</p>
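<p>The "incomplete" setting can be reproduced with a few lines. Only the 30% fraction comes from the text; the list-of-dicts dataset layout and the fixed seed are illustrative assumptions:</p>

```python
# Build the "incomplete" dataset variant: replace roughly 30% of all
# attribute values with None (missing), using a seeded generator for
# reproducibility.

import random

def make_incomplete(examples, fraction=0.3, seed=42):
    rng = random.Random(seed)
    return [{key: (None if rng.random() < fraction else value)
             for key, value in example.items()}
            for example in examples]

data = [{"milk": True, "hair": True, "legs": 4} for _ in range(100)]
incomplete = make_incomplete(data)
missing = sum(value is None for ex in incomplete for value in ex.values())
print(missing, "of 300 values removed")
```

<p>In the "mixed" setting, the same corruption would be applied to the test folds only, leaving the training data complete.</p>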
      <p>The predictive accuracies of a 10-fold-cross-validation
are shown in Table 2. Each table shows a
head-tohead comparison of the discriminative and characteristic
learner in a given setting of dataset type and used
evaluation. Overall, we see that the discriminative and
characteristic learner perform roughly equally well, while the
effects of missing values vary considerably between the
datasets. Except for a few cases, the harder the setting (from
left to right), the lower the predictive accuracy.</p>
      <p>Independent of the dataset setting, the predictive
accuracy drops drastically for labor and mushroom when
changing from the Boolean to the probabilistic setting.
Though, it often increases for the other three datasets;
in particular, in the "mixed" setting, the accuracies can be
improved drastically using a probabilistic evaluation,
indicating that the adjusted decision boundaries can make
the model more robust to incomplete training data.</p>
      <p>Figure 3: Characteristic model for class=mammal on the zoo dataset:
c1 ← eggs=false.
c1 ← hair=true.
c2 ← toothed=true.
c2 ← catsize=true.
c2 ← legs=(1.0:4.0].
c3 ← backbone=true.
c3 ← airborne=false.
c3 ← aquatic=false.
c3 ← fins=false.
c3 ← tail=true.
c3 ← domestic=false.
c3 ← predator=false.
c3 ← predator=true.
c3 ← domestic=true.
c3 ← tail=false.
c3 ← catsize=false.
c3 ← fins=true.
c ← milk=true ∧ c1 ∧ c2 ∧ breathes=true ∧ feathers=false ∧ c3.</p>
      <p>As an example, consider the characteristic model for
the zoo dataset in Figure 3. In Boolean evaluation, the
missing values result in too many examples not being
covered by the model. However, the numerous conditions
still indicate the correct class with a probabilistic
evaluation. We also notice a big gap between the performances
of the discriminative and the characteristic model here,
indicating that the small trees with only up to four conditions
of the discriminative learner (e.g., for class=mammal it
only uses the condition milk=true) are not as robust as
the trees of the characteristic counterpart.</p>
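<p>The robustness gap described here can be illustrated directly: a strict Boolean evaluation of a characteristic rule fails as soon as a single condition value is missing, while counting the verifiably satisfied conditions still points to the correct class. The condition list below is a subset of the zoo model's features; the counting scheme is a simplified stand-in for the probabilistic aBpt evaluation, not the exact method:</p>

```python
# Boolean vs. counting evaluation on an example with missing values. The
# conditions are a subset of the characteristic mammal model; the counting
# scheme is a simplified stand-in for aBpt's probabilistic evaluation.

CONDITIONS = [
    ("milk", True), ("breathes", True), ("feathers", False),
    ("hair", True), ("toothed", True), ("backbone", True),
]

def boolean_covers(example):
    """Strict coverage: every condition must hold (missing => not covered)."""
    return all(example.get(attr) == val for attr, val in CONDITIONS)

def satisfied_fraction(example):
    """Fraction of satisfied conditions among those with a known value."""
    known = [(attr, val) for attr, val in CONDITIONS
             if example.get(attr) is not None]
    if not known:
        return 0.0
    return sum(example[attr] == val for attr, val in known) / len(known)

# A mammal with two attribute values missing (None):
bear = {"milk": True, "breathes": None, "feathers": False,
        "hair": True, "toothed": None, "backbone": True}
print(boolean_covers(bear))      # False: missing values break coverage
print(satisfied_fraction(bear))  # 1.0: all known conditions match
```

<p>A discriminative tree with a single condition loses all information as soon as that one value is missing, whereas the many conditions of the characteristic model act as a fallback.</p>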
      <p>As Figure 3 also shows, the characteristic models are
usually considerably larger. From an interpretability
perspective, this can be preferred, since in this case we do not
only discover that mammals give milk but also that they breathe,
do not have feathers, either do not lay eggs or have hair,
and either are toothed, catsized or have 1-4 legs. We also
notice that the last concept c3 (which was also the last
to be added to the model) is not helpful at all and indeed
can be reduced to true because of tautologies. This
indicates that in a characteristic setting, stopping criteria
or pruning techniques are needed as well to preserve
interpretability.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>In this paper, we look at the possible advantages that
characteristic models, which are rarely learned in
conventional rule learning algorithms, can provide. While
previous work on characteristic rules usually focused
on the interpretability aspects, we have shown that the
inclusion of additional features, both via conjunction and
disjunction, can additionally help to find better decision
boundaries, resulting in more robust models. We also
discussed that for learning characteristic rules a mere
focus on coverage is insufficient, so that both regular and
inverted heuristics can not guarantee learning
characteristic rules. Finding a suitable distance metric to separate
positive and negative examples remains as an open
question.</p>
      <p>To analyze the effects of characteristic rule-based
models empirically, we implemented a characteristic version
of the aBpt learner and compared it with the original
discriminative version on five UCI datasets. The
experiments did not show a clear advantage for any of the
learners in terms of predictive accuracy, indicating that
smaller models are not necessarily overgeneralizing and
larger models do not inevitably lead to overfitting. In a
robustness check using incomplete test data, characteristic
models outperformed discriminative models.
Furthermore, characteristic models slightly outperformed
discriminative models when combined with probabilistic
evaluation.</p>
      <p>We also see multiple paths to further develop the
potential of characteristic models in future work: Most
importantly, the decision boundary artificially moved by
a probabilistic evaluation (which certainly provides room
for improvement as well) should not only be considered
during classification but also in the learning phase. In this
regard, the definition of heuristics not only considering
coverage but also the "quality" of the coverage, connected
to the distance between the example and the decision
boundary, is crucial. This way rule learners could not
only deliver a prediction but also determine how certain
the prediction is, and optionally abstain from making a
prediction.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Andrews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Diederich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Tickle</surname>
          </string-name>
          ,
          <article-title>Survey and critique of techniques for extracting rules from trained artificial neural networks</article-title>
          ,
          <source>Knowledge-Based Systems 8</source>
          (
          <year>1995</year>
          )
          <fpage>373</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monreale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruggieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Turini</surname>
          </string-name>
          ,
          <article-title>Factual and counterfactual explanations for black box decision making</article-title>
          ,
          <source>IEEE Intelligent Systems</source>
          <volume>34</volume>
          (
          <year>2019</year>
          )
          <fpage>14</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Blumer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ehrenfeucht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Haussler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Warmuth</surname>
          </string-name>
          ,
          <article-title>Occam's razor</article-title>
          ,
          <source>Information processing letters</source>
          <volume>24</volume>
          (
          <year>1987</year>
          )
          <fpage>377</fpage>
          -
          <lpage>380</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Michalski</surname>
          </string-name>
          ,
          <article-title>A theory and methodology of inductive learning</article-title>
          ,
          <source>in: Machine learning, Elsevier</source>
          ,
          <year>1983</year>
          , pp.
          <fpage>83</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          ,
          <article-title>Pruning algorithms for rule learning</article-title>
          ,
          <source>Machine Learning</source>
          <volume>27</volume>
          (
          <year>1997</year>
          )
          <fpage>139</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <article-title>Fast effective rule induction</article-title>
          ,
          <source>in: Machine learning proceedings 1995, Elsevier</source>
          ,
          <year>1995</year>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>123</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eineborg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Boström</surname>
          </string-name>
          ,
          <article-title>Classifying uncovered examples by rule stretching</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Rouveirol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sebag</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP-01)</source>
          , Springer Verlag, Strasbourg, France,
          <year>2001</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Salzberg</surname>
          </string-name>
          ,
          <article-title>A nearest hyperrectangle learning method</article-title>
          ,
          <source>Machine Learning</source>
          <volume>6</volume>
          (
          <year>1991</year>
          )
          <fpage>251</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Cristianini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          ,
          <article-title>An introduction to support vector machines and other kernel-based learning methods</article-title>
          , Cambridge university press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cramer</surname>
          </string-name>
          ,
          <article-title>The origins of logistic regression</article-title>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>I.</given-names>
            <surname>Rish</surname>
          </string-name>
          , et al.,
          <article-title>An empirical study of the naive Bayes classifier</article-title>
          ,
          <source>in: IJCAI 2001 workshop on empirical methods in artificial intelligence</source>
          , volume
          <volume>3</volume>
          , Seattle, WA, USA,
          <year>2001</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ustun</surname>
          </string-name>
          ,
          <article-title>Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice</article-title>
          ,
          <source>Interfaces</source>
          <volume>48</volume>
          (
          <year>2018</year>
          )
          <fpage>449</fpage>
          -
          <lpage>466</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Salzberg</surname>
          </string-name>
          ,
          <article-title>A nearest hyperrectangle learning method</article-title>
          ,
          <source>Machine Learning</source>
          <volume>6</volume>
          (
          <year>1991</year>
          )
          <fpage>251</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Flach</surname>
          </string-name>
          ,
          <article-title>ROC 'n' rule learning: towards a better understanding of covering algorithms</article-title>
          ,
          <source>Machine Learning</source>
          <volume>58</volume>
          (
          <year>2005</year>
          )
          <fpage>39</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Stecher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Janssen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          ,
          <article-title>Shorter rules are better, aren't they?</article-title>
          ,
          <source>in: Discovery Science: 19th International Conference, DS 2016, Bari, Italy, October 19-21, 2016, Proceedings</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>279</fpage>
          -
          <lpage>294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>F.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Q. P.</given-names>
            <surname>Huynh</surname>
          </string-name>
          ,
          <article-title>Learning deep rule concepts as alternating Boolean pattern trees</article-title>
          ,
          <source>in: Discovery Science: 27th International Conference, DS 2024, Pisa, Italy, October 14-16, 2024, Proceedings</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Senge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hüllermeier</surname>
          </string-name>
          ,
          <article-title>Top-down induction of fuzzy pattern trees</article-title>
          ,
          <source>IEEE Transactions on Fuzzy Systems</source>
          <volume>19</volume>
          (
          <year>2010</year>
          )
          <fpage>241</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Longjohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nottingham</surname>
          </string-name>
          ,
          <source>The UCI machine learning repository</source>
          ,
          <year>2024</year>
          . URL: https://archive.ics.uci.edu.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>