When characteristic rule-based models should be preferred over discriminative ones

Florian Beck¹, Johannes Fürnkranz¹ and Van Quoc Phuong Huynh¹

¹ Johannes Kepler University Linz, LIT Artificial Intelligence Lab / Institute for Application-oriented Knowledge Processing (FAW), Altenberger Straße 66b/69, 4040 Linz, Austria

Abstract
In recent years, the interpretability of machine learning models has gained interest. White-box approaches like rule-based models serve as an interpretable alternative to or as surrogate models of black-box approaches. Among these, more compact rule-based models are considered easier to interpret. In addition, they often generalize better and thus provide higher predictive accuracies than their overfitting complex counterparts. In this paper, we argue that more complex, "characteristic" rule-based models are a genuine alternative to more compact, "discriminative" ones. We discuss why characteristic models should not be considered less interpretable, and why including more features can actually strengthen a model both in terms of robustness and predictive accuracy. To this end, we evaluate the effects on the decision boundary for models of different complexity, and also modify a recently developed Boolean pattern tree learner to compare a characteristic and a discriminative version on five UCI data sets. We show that the more complex models are indeed more robust to missing data, and that they sometimes even improve the predictive accuracy on the original data.

Keywords
characteristic rules, discriminative rules, decision boundaries, interpretability, robustness

1. Introduction

With the rise of neural network models in many machine learning applications, the need has grown to actually understand what these black-box approaches learn. This has brought rule-based models back into the spotlight, which can be used as interpretable surrogates of neural network approaches, e.g., by extracting rules from the whole network [1] or with a focus on explaining decision boundaries [2].

Independent of whether rule-based models are used as surrogates of neural networks or as stand-alone models, usually the principle of Occam's Razor [3] is followed, which can be loosely translated as "the simplest explanation is the best one". Consequently, discriminative rules, which discriminate an object of one category from objects of other categories, are preferred over characteristic rules, which try to capture all properties that are common to the objects of the target class [4]. This principle is also supported by the observation that longer explanations tend to overfit the training data, leading to worse performance on test data. Hence, most rule learners use some kind of pruning policy [5], resulting in short discriminative rules instead of longer characteristic ones.

However, there is a fine line between avoiding overfitting and learning too general theories.

Table 1
A small country dataset with three numeric attributes Size (in 1,000 km²), Age (median; in years) and CO2 (emissions per capita and year; in tons)¹. It is split into six training examples (three for each of the classes Europe and South America) and four test examples of unknown class.

            Size    Age   CO2   Class
Austria       84   42.8   6.9   Europe
Bolivia     1099   23.9   1.8   South America
Brazil      8515   32.8   2.2   South America
Czechia       79   42.6   9.3   Europe
Ecuador      284   27.6   2.3   South America
Slovakia      49   40.6   6.1   Europe
Albania       29   37.3   1.7   ?
Germany      357   44.9   8.0   ?
Kosovo        11   30.5   4.8   ?
Uruguay      176   35.2   2.3   ?
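The training half of Table 1 can be encoded and probed directly. The following sketch (with our own shorthand attribute indices; not code from the paper) checks, for each attribute, whether a single threshold already separates the two classes of the six training examples:

```python
# Training examples from Table 1: (size in 1,000 km^2, median age, CO2 per capita, class).
TRAIN = {
    "Austria":  (84, 42.8, 6.9, "Europe"),
    "Bolivia":  (1099, 23.9, 1.8, "South America"),
    "Brazil":   (8515, 32.8, 2.2, "South America"),
    "Czechia":  (79, 42.6, 9.3, "Europe"),
    "Ecuador":  (284, 27.6, 2.3, "South America"),
    "Slovakia": (49, 40.6, 6.1, "Europe"),
}

def separable_by(attr_index):
    """True if one threshold on this attribute splits the two training classes."""
    europe = [v[attr_index] for v in TRAIN.values() if v[3] == "Europe"]
    south = [v[attr_index] for v in TRAIN.values() if v[3] == "South America"]
    return max(europe) < min(south) or max(south) < min(europe)

for name, idx in [("size", 0), ("age", 1), ("co2", 2)]:
    print(name, separable_by(idx))  # each attribute separates the training data on its own
```

Every single attribute perfectly separates the six training examples, which is exactly why a one-condition discriminative rule suffices on the training data; whether it generalizes to the test examples is the question pursued next.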
Consider the sample dataset in Table 1, consisting of six countries, three belonging to the class Europe and three to the class South America. For each country, the values of the three numeric attributes Size, Age and CO2 are provided.

Traditional rule learners like, e.g., Ripper [6] strive for discriminative rules, i.e., rules that minimize the number of attributes used when describing the classes. In this case, such a perfect, minimal description of the training data could be learned with a single rule r1, only considering the first attribute Size, together with the corresponding default rule r0 for the other class²:

    r1: c = e ← size < 184
    (r0: c = sa ← ⊤).                                          (1)

Rule r1 covers the three examples Austria, Czechia and Slovakia because these examples fulfill the condition size < 184. Bolivia, Brazil and Ecuador are not covered by r1 but only by the most general rule r0, and are thus classified as South America. While these rules perfectly describe the training examples, they fail to correctly classify the test example Germany, which is not covered by r1 and hence misclassified as South America by r0. Vice versa, Uruguay is covered by r1 and hence misclassified as Europe. Note that these misclassifications could have been avoided if a different feature had been picked, such as, e.g., in rules r2 and r3:

    r2: c = e ← age ≥ 36.7
    r3: c = e ← CO2 ≥ 4.2.                                     (2)

However, r2 does not cover the test example Kosovo, and r3 does not cover Albania, so that none of the three rules alone would be sufficient to classify all four test examples correctly, but only a combined rule set of r2 and r3 would do so. Similarly, the three suggested features size < 184, age ≥ 36.7 and CO2 ≥ 4.2 can also be connected by conjunctions into a single rule re for class Europe, while the respectively contrasting features form rule rs for class South America:

    re: c = e ← size < 184 ∧ age ≥ 36.7 ∧ CO2 ≥ 4.2
    rs: c = sa ← size ≥ 184 ∧ age < 36.7 ∧ CO2 < 4.2.          (3)

While neither of the two rules covers any of the test examples, a slight modification of their semantics allows us to use them as reliable classifiers. Instead of requiring that all conditions of a rule are satisfied, we assign an example to its closest rule, a method that is reminiscent of rule stretching [7] or nearest hyper-rectangle classification [8]. In our example, the first three test examples are assigned to class Europe, since for each of them two out of three conditions of re are satisfied and only one out of three of rs. Analogously, test example Uruguay is correctly classified as South America.

Independent of using conjunctions or disjunctions as the connector, we notice that the more characteristic rule theories in Equations 2 and 3 are able to classify all four test examples correctly, while the discriminative rule theory in Equation 1 is not able to do so. Moreover, the inclusion of more features in the characteristic concepts might not only lead to a better performance but also arguably provide more interesting and interpretable models, stating that European and South American countries do not only differ in size, but also in median age and CO2 emissions.

The rest of the paper is organized as follows: Section 2 further specifies the problem of finding good decision boundaries and presents characteristic models of non-rule-based classifiers as an inspiration for adaptation in the rule-based setting, presented in Section 3. We modify a rule-based learner in Section 4 accordingly and evaluate a discriminative and a characteristic version in Section 5 in terms of predictive accuracy and robustness. Section 6 concludes the results and takes a brief look at the remaining challenges.

$ fbeck@faw.jku.at (F. Beck); juffi@faw.jku.at (J. Fürnkranz); vqphuynh@faw.jku.at (V. Q. P. Huynh)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

¹ Retrieved 2024/07/04 from https://ourworldindata.org/age-structure and https://ourworldindata.org/co2-and-greenhouse-gas-emissions.
² In the following, class = europe is abbreviated as c = e and class = south_america as c = sa.

2. Decision boundaries

As depicted in the introduction, in contrast to long characteristic rules being prone to overfitting, short discriminative rules come with the risk of providing too simplistic theories that overgeneralize. This can also be illustrated by the decision boundaries for the country dataset rules, see Figure 1. For a better visualization, we omit the third attribute CO2 to obtain a two-dimensional feature space, using the attribute Size in logarithmic scale on the x-axis and Age on the y-axis. The raw data are shown in Figure 1a: the six training examples as points and the four test examples as circles, colored in blue for class Europe and in red for class South America, respectively.

We see that the training examples are quite easily separable from each other, while the test examples complicate finding a good decision boundary. Figure 1b shows a discriminative rule c = sa ← age < 36, covering all four South American countries along with one European country in the light-red area of the feature space. The light-blue area (classified by the default rule c = e ← ⊤) contains five true negatives. By adding the condition size > 140, the rule can be made more characteristic, leading to a perfect classification of all examples, see Figure 1c.

Still, the decision boundary provided in Figure 1c can be considered suboptimal when compared with non-rule-based models. Figure 1d illustrates an arguably better decision boundary which other methods like, e.g., support vector machines [9], logistic regression [10] and naive Bayes [11] can find. All these approaches have in common that they usually consider all attributes of the feature space and rely on continuous coefficients to build their models; in this case:

    c = sa ← age − 10 · log10(size) ≤ 15.

Figure 1: Different decision boundaries of various learning approaches for the country dataset reduced to the attributes Size (x-axis, logarithmic) and Age (y-axis). (a) Original data: the six training examples as points and the four test examples as circles, colored in blue for class Europe and in red for class South America. The remaining subfigures add a dotted decision boundary for various learners, showing predictions as Europe in light-blue and as South America in light-red: (b) Discriminative rule, (c) Characteristic rule, (d) Support Vector Machine, (e) Scoring system, (f) Hyper-rectangles.

In comparison to the methods just mentioned, conventional rule learners only use combinations of attribute-value conditions to split their classes. As a consequence, one of the main limitations of rule learning is arguably its restriction to axis-parallel decision boundaries. Though, the last two subfigures show two ways how rule-based methods can still mimic decision boundaries like in Figure 1d.
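The closest-rule semantics described above, assigning an example to the rule with the largest fraction of satisfied conditions, can be sketched in a few lines of Python. This is a minimal illustration using the rules of Equation 3 and the test examples of Table 1, not the authors' implementation:

```python
# Nearest-rule classification for the country dataset (Table 1).
# Each rule is a list of conditions; an example is assigned to the class
# whose rule has the largest fraction of satisfied conditions (Equation 3).
RULES = {
    "Europe": [
        lambda ex: ex["size"] < 184,
        lambda ex: ex["age"] >= 36.7,
        lambda ex: ex["co2"] >= 4.2,
    ],
    "South America": [
        lambda ex: ex["size"] >= 184,
        lambda ex: ex["age"] < 36.7,
        lambda ex: ex["co2"] < 4.2,
    ],
}

def classify(example):
    """Return the class whose rule has the most satisfied conditions."""
    scores = {
        cls: sum(cond(example) for cond in conds) / len(conds)
        for cls, conds in RULES.items()
    }
    return max(scores, key=scores.get)

tests = {
    "Albania": {"size": 29, "age": 37.3, "co2": 1.7},
    "Germany": {"size": 357, "age": 44.9, "co2": 8.0},
    "Kosovo":  {"size": 11, "age": 30.5, "co2": 4.8},
    "Uruguay": {"size": 176, "age": 35.2, "co2": 2.3},
}
for name, ex in tests.items():
    print(name, "->", classify(ex))
```

As in the text, the first three test examples satisfy two of three conditions of the Europe rule and only one of the South America rule, and vice versa for Uruguay, so all four are classified correctly.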
In Figure 1e, we see multiple steps in the decision boundary. Trivially, this behavior can be achieved by learning one rule for each step. While this is straightforward in this example, it is too hard to maintain in a high-dimensional feature space with an exponentially increasing number of combinations. Scoring systems [12] scale better by assigning low integer scores to attribute-value combinations, hereby providing a trade-off between rules and linear models. In the special case that all weights are binary, the scoring system converts into an m-of-n concept. With the scores assigned by the following scheme and a threshold of 4 for class South America, all examples are classified correctly while providing a more customized decision boundary compared to Figure 1c:

    age:  3 if < 28         size: 3 if ≥ 1100
          2 if < 36               2 if ≥ 140
          1 if < 40               1 if ≥ 20
          0 else                  0 else.

Finally, Figure 1f is an illustration of two characteristic rules similar to Equation 3: we describe both classes Europe and South America without using a default rule. Obviously, the learned rules of the two classes can overlap or, as in this case, leave wide areas of the feature space uncovered, so that a purely Boolean evaluation of the rules is not sufficient anymore. One way to handle these uncovered areas are nearest hyper-rectangles [13]. The decision boundary between the two classes can be shaped arbitrarily if enough hyper-rectangles, i.e., rules, are learned (and is actually not quite straight in Figure 1f either). Obviously, distances for nominal attributes can not be defined as straightforwardly as for numerical attributes, as is discussed in the following section.

In this work, we aim to expand rule-based approaches to reach the stronger expressiveness shown in the last three subfigures while still retaining the properties that make them interpretable, i.e., without including all features instead of interactions and, most notably, without continuous coefficients like in SVMs, logistic regression or naive Bayes, which is where characteristic rules are preferable.

3. Characteristic rule learning

So far we discussed why characteristic rules can be beneficial both in terms of interpretability and performance, but observed as well that almost no rule-based methods learn such concepts. To understand why conventional rule learners prefer discriminative rules, we first briefly introduce the coverage space and related heuristics. Subsequently, we reveal potential issues with the latter and identify properties which should be taken into consideration when developing a characteristic rule-based learner.

3.1. Coverage space and heuristics

Traditionally, rules are gradually refined by adding individual conditions, whereby conjunctive refinements specialize a rule (afterwards it can never cover more examples than before the refinement), whereas disjunctive refinements generalize a rule (afterwards it can never cover fewer examples than before the refinement). This can be visualized in coverage space, a non-normalized ROC space, where the x-axis shows the covered negative and the y-axis the covered positive examples [14]. For example, Figure 2 shows a path that gradually refines an initially universal rule (covering all P positive and N negative examples, upper right corner of the coverage space) into the rule + ← c ∧ b.

Figure 2: Rule refinement in coverage space

Apparently, a rule refined to the upper left corner can be considered perfect, since it covers only positive examples and no negatives. In most scenarios such a rule can not be found, so that a trade-off must be made between the importance of covering all positives (completeness) and of not covering any negatives (consistency). For this purpose, heuristics are defined as functions h(p, n), where 0 ≤ p ≤ P (0 ≤ n ≤ N) is the number of positive (negative) examples covered by a rule [14].

In previous studies it was found that most regular heuristics (in particular those striving for consistency) lead to the learning of discriminative rules, so that in this context, so-called inverted heuristics h̄(p, n) are suggested, which better reflect the top-down nature of the rule refinement process in theory by originating from the other side of the coverage space [15]. Because of their typical focus on completeness, inverted heuristics can often "delay" the choice of too specific features, hence resulting in characteristic rules built of multiple more general features.

3.2. Limitations

Even though it has been shown empirically for some datasets that inverted heuristics result in characteristic rules [15], it is not inherent that they lead to characteristic rules. As a counterexample, consider learning a rule for the class Europe using all examples of the country dataset except Brazil. The best single condition is age ≥ 29.1, covering all six examples of class Europe as well as Uruguay. This false positive can not be excluded by further (single-cut) conditions on Size or CO2 without losing coverage of at least one true positive, so that the inverted heuristic stops with a rule consisting of a single condition. Interestingly enough, in this case, regular heuristics would even learn longer rules than inverted heuristics, since they typically prefer the trade of removing a false positive at the cost of a false negative.

Most importantly though, traditional rule learners have the severe limitation of focusing only on the coverage of the learned rules but not on how (well) they cover the examples. We already noticed in Table 1 that rule r1 in Equation 1 can be expanded to re in Equation 3 by features considering the age and CO2 of a country without covering more positive or fewer negative examples. Hence, both r1 and re correspond to the same point in the coverage space, in the top left corner, covering all positive and no negative training examples. As a consequence, independent of the chosen heuristic, conventional rule learners are not able to learn re if a refinement requires improving the heuristic.

Even if the heuristic improves by adding a new condition to the original rule, a similar issue can occur. Assume a new rule r4 learned on all ten examples in Table 1, which focuses on covering the example Germany based on the condition size ≥ 316. This rule still covers Bolivia and Brazil as well and could therefore be refined to rules r5 and r6, both considering the Age attribute:

    r4: c = e ← size ≥ 316
    r5: c = e ← size ≥ 316 ∧ age ≥ 36.3                        (4)
    r6: c = e ← size ≥ 316 ∧ age ≥ 44.5.

While r5 and r6 both correspond to the same point in the coverage space (covering one positive example and no negatives), their coverage on unseen examples might vary crucially, since they cover different areas of the feature space. Arguably, r5 should be preferred over r6 because the added condition age ≥ 36.3 covers four additional positive examples (and still no negative) compared to age ≥ 44.5. So to say, while having the same "global" concept, we should choose the rule with the better "local" condition. Note that this is not limited to numeric attributes.

To summarize, characteristic rules are usually not learned because the learners rely on heuristics that only take the number of covered positive and negative examples into account instead of separating positive and negative examples with a variety of rules and conditions. In particular, adding a condition without changing the covered examples results in the same heuristic value, in which case so far the shorter explanation is used, and the search usually stops. Additionally, the mere focus on the global coverage can lead to suboptimal "local" conditions if ties are not handled appropriately.

4. Boolean Pattern Trees

For the experimental comparison of discriminative and characteristic concepts, we use two versions of an alternating Boolean pattern tree (aBpt) learner recently developed in our group [16]. The task of learning an aBpt is quite similar to learning a rule: for every specific class y_j ∈ Y, a tree t: y_j ← B is learned, where B is a logical expression defined over the input features, which can be much more flexible than for rules. In contrast to rule learners, which use either conjunctions or disjunctions, aBpts can connect binary features by conjunctions and disjunctions in any arbitrary order. This also complicates the iterative learning of B, as there are multiple insertion options per feature. E.g., inserting a disjunction with feature c in B = a ∧ b can result in c ∨ (a ∧ b), (a ∨ c) ∧ b or a ∧ (b ∨ c), which are all logically different. These insertions are repeated until the maximum number of iterations k (= number of features in the pattern tree) is reached. We refer to [16] for further details of the algorithm and focus on two adjustments for the experiments in the following.

First, we notice that the standard version of aBpt already uses multiple heuristics for the evaluation of a tree extension, focusing on consistency and completeness in different search branches. By using various cost ratios in the linear cost metric (h_lc(t) = c·p − (1−c)·n), aBpt is capable of learning both the models preferred by regular heuristics and those preferred by inverted ones. Though, Section 3 presented as well problems that can not be fixed solely by the heuristic. To choose characteristic models instead of discriminative ones in case of ties, the learner picks the tree learned in a later iteration, i.e., the one using more conditions (and vice versa). This way we do not stop in local optima, but always use all k iterations. Furthermore, all conditions are sorted based on the accuracy metric (h_acc(t) = (p + N − n) / (P + N)) in the first iteration, so that in subsequent iterations always the best "local" condition can be picked, as discussed in Section 3.

Second, the handling of multiple pattern trees is crucial for the decision boundary. In the original aBpt classifier, one pattern tree for each class y ∈ Y is learned. Since, in a Boolean context, the output of the Boolean expression represented by the pattern tree can only be true or false for the features of the test example, ties occur if a test example is matched by multiple pattern trees, which can be broken by a fixed order of the pattern trees in a decision list.

An alternative that is used in fuzzy pattern tree classifiers [17] is evaluating all pattern trees in a probabilistic way, whereby the highest probability decides about the class prediction. A straightforward way to achieve this behavior in aBpt is using a constant uncertainty factor u, resulting in probabilities p(f) = 1 − u for fulfilled features and p(f) = u otherwise, which are then aggregated bottom-up over the respective child nodes C as p(n) = ∏_{i∈C} p(i) for conjunctive and p(n) = 1 − ∏_{i∈C} (1 − p(i)) for disjunctive nodes. However, this comes with the two inconveniences of (a) weighing all child nodes the same, independent of their importance, and (b) penalizing all conjunctive conditions (always decreasing p(n)) and rewarding all disjunctive conditions (always increasing p(n)) independent of their quality, which particularly affects characteristic models negatively. To address (a), we determine p(f) flexibly in the range [u, 1 − u] as

    p(f) = u + (1 − 2u) · p / P

for fulfilled features (≃ probability that a positive example fulfills the feature) and

    p(f) = u + (1 − 2u) · (P − p) / (P − p + N − n)

otherwise (≃ probability that an example not fulfilling the feature is positive). Additionally, for (b) we relax p(n) for the interior tree nodes as

    p(n) = 1/2 · (1/|C| · Σ_{i∈C} p(i) + min_{i∈C} p(i))

for conjunctive nodes and as

    p(n) = 1/2 · (1/|C| · Σ_{i∈C} p(i) + max_{i∈C} p(i))

Table 2
Predictive accuracies of the aBpt learner on five UCI datasets for six different settings using 10-fold cross-validation. In the first row a Boolean evaluation and in the second row a probabilistic evaluation of the pattern tree is used. The first column shows results on the original dataset, the second on an incomplete version of the dataset where 30% of the values are replaced by missing values, and the third a combination using the original data for training and the incomplete data for testing.

(a) Original + Boolean          (b) Incomplete + Boolean        (c) Mixed + Boolean
           discr.   char.                  discr.   char.                  discr.   char.
labor       87.72   84.21       labor       77.19   78.95       labor       78.95   80.70
mushroom   100.00  100.00       mushroom    96.91   96.91       mushroom    84.00   88.00
soybean     92.68   92.53       soybean     66.03   66.33       soybean     19.77   46.71
vote        94.48   94.71       vote        88.28   88.74       vote        78.39   78.16
zoo         89.11   86.14       zoo         76.24   76.24       zoo         41.58   65.35

(d) Original + Probabilistic    (e) Incomplete + Probabilistic  (f) Mixed + Probabilistic
           discr.   char.                  discr.   char.                  discr.   char.
labor       66.67   64.91       labor       64.91   64.91       labor       64.91   64.91
mushroom    83.83   87.64       mushroom    49.93   49.93       mushroom    79.60   75.49
soybean     90.04   90.48       soybean     67.64   68.52       soybean     52.86   57.25
vote        95.40   95.40       vote        88.51   88.51       vote        87.13   87.13
zoo         89.11   92.08       zoo         77.23   75.25       zoo         62.38   85.15
5. Experiments

In the experiments we analyze two different aspects of the aBpt learner, using the default configuration of k = 20 iterations and accuracy as well as seven different values of the linear cost as metrics. First, we compare a discriminative version, preferring smaller trees in case of tied heuristics, and a characteristic version, preferring bigger trees. Second, we evaluate both versions not only in a Boolean setting but also in a probabilistic setting, as suggested at the end of Section 4.

For the experiments, we choose five UCI [18] datasets where most features are not used in the discriminative models and the characteristic models could therefore differ remarkably: labor, mushroom, soybean, vote and zoo. On some datasets (labor, soybean, zoo) the otherwise inferior naive Bayes classifier even outperforms rule learners like Ripper, indicating potential for improvement of the decision boundary in the probabilistic setting.

The learners are not only applied to the original datasets but also to an "incomplete" version of each dataset, where 30% of the values are replaced by missing values, and finally a "mixed" version, where only the test data is "incomplete". This way we can analyze the robustness of the learned models, where characteristic models might be expected to perform better, since additional features can be used as a fallback option.

The predictive accuracies of a 10-fold cross-validation are shown in Table 2. Each subtable shows a head-to-head comparison of the discriminative and the characteristic learner in a given setting of dataset type and used evaluation. Overall, we see that the discriminative and the characteristic learner perform roughly equally well, while the effects of missing values vary considerably between the datasets. Except for a few cases, the harder the setting (from left to right), the lower the predictive accuracy.

Independent of the dataset setting, the predictive accuracy drops drastically for labor and mushroom when changing from the Boolean to the probabilistic setting. Though, it often increases for the other three datasets; in particular, in the "mixed" setting, the accuracies can be improved drastically using a probabilistic evaluation, indicating that the adjusted decision boundaries can make the model more robust to incomplete training data.

    c1 ← eggs=false.
    c1 ← hair=true.
    c2 ← toothed=true.
    c2 ← catsize=true.
    c2 ← legs=(1.0:4.0].
    c3 ← backbone=true.
    c3 ← airborne=false.
    c3 ← aquatic=false.
    c3 ← fins=false.
    c3 ← tail=true.
    c3 ← domestic=false.
    c3 ← predator=false.
    c3 ← predator=true.
    c3 ← domestic=true.
    c3 ← tail=false.
    c3 ← catsize=false.
    c3 ← fins=true.
    c ← milk=true ∧ c1 ∧ c2 ∧ breathes=true ∧ feathers=false ∧ c3.

Figure 3: Model learned by characteristic aBpt on the zoo dataset for class = mammal when transformed into a set of conjunctive rules.

As an example, consider the characteristic model for the zoo dataset in Figure 3. In Boolean evaluation, the missing values result in too many examples not being covered by the model. However, the numerous conditions still indicate the correct class with a probabilistic evaluation. We also notice a big gap between the performances of the discriminative and the characteristic model here, indicating that the small trees of the discriminative learner with only up to four conditions (e.g., for class=mammal it only uses the condition milk=true) are not as robust as the trees of the characteristic counterpart.

As Figure 3 also shows, the characteristic models are usually considerably larger. From an interpretability perspective, this can be preferred, since in this case we do not only discover that mammals yield milk but also that they breathe, do not wear feathers, either do not lay eggs or have hair, and either are toothed, catsized or have 1-4 legs. We also notice that the last concept c3 (which was also the last to be added to the model) is not helpful at all and indeed can be reduced to true because of tautologies. This indicates that in a characteristic setting, stopping criteria or pruning techniques are needed as well to preserve interpretability.

6. Conclusion

In this paper, we look at the possible advantages that characteristic models, which are rarely learned by conventional rule learning algorithms, can provide. While previous work on characteristic rules usually focused on the interpretability aspects, we have shown that the inclusion of additional features, both via conjunction and disjunction, can additionally help to find better decision boundaries, resulting in more robust models. We also discussed that for learning characteristic rules a mere focus on coverage is insufficient, so that both regular and inverted heuristics can not guarantee learning characteristic rules. Finding a suitable distance metric to separate positive and negative examples remains an open question.

To analyze the effects of characteristic rule-based models empirically, we implemented a characteristic version of the aBpt learner and compared it with the original discriminative version on five UCI datasets. The experiments did not show a clear advantage for either of the learners in terms of predictive accuracy, indicating that smaller models are not necessarily overgeneralizing and larger models do not inevitably lead to overfitting. In a robustness check using incomplete test data, characteristic models outperformed discriminative models. Furthermore, characteristic models slightly outperformed discriminative models when combined with probabilistic evaluation.

We also see multiple paths to further develop the potential of characteristic models in future work: most importantly, the decision boundary artificially moved by a probabilistic evaluation (which certainly provides room for improvement as well) should not only be considered during classification but also in the learning phase. In this regard, the definition of heuristics that consider not only coverage but also the "quality" of the coverage, connected to the distance between the example and the decision boundary, is crucial. This way rule learners could not only deliver a prediction but also determine how certain the prediction is, and optionally abstain from making a prediction.

References

[1] R. Andrews, J. Diederich, A. B. Tickle, Survey and critique of techniques for extracting rules from trained artificial neural networks, Knowledge-Based Systems 8 (1995) 373–389.
[2] R. Guidotti, A. Monreale, F. Giannotti, D. Pedreschi, S. Ruggieri, F. Turini, Factual and counterfactual explanations for black box decision making, IEEE Intelligent Systems 34 (2019) 14–23.
[3] A. Blumer, A. Ehrenfeucht, D. Haussler, M. K. Warmuth, Occam's razor, Information Processing Letters 24 (1987) 377–380.
[4] R. S. Michalski, A theory and methodology of inductive learning, in: Machine Learning, Elsevier, 1983, pp. 83–134.
[5] J. Fürnkranz, Pruning algorithms for rule learning, Machine Learning 27 (1997) 139–172.
[6] W. W. Cohen, Fast effective rule induction, in: Machine Learning Proceedings 1995, Elsevier, 1995, pp. 115–123.
[7] M. Eineborg, H. Boström, Classifying uncovered examples by rule stretching, in: C. Rouveirol, M. Sebag (Eds.), Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP-01), Springer Verlag, Strasbourg, France, 2001, pp. 41–50.
[8] S. Salzberg, A nearest hyperrectangle learning method, Machine Learning 6 (1991) 251–276.
[9] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[10] J. S. Cramer, The origins of logistic regression (2002).
[11] I. Rish, et al., An empirical study of the naive Bayes classifier, in: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, volume 3, Seattle, WA, USA, 2001, pp. 41–46.
[12] C. Rudin, B. Ustun, Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice, Interfaces 48 (2018) 449–466.
[13] S. Salzberg, A nearest hyperrectangle learning method, Machine Learning 6 (1991) 251–276.
[14] J. Fürnkranz, P. A. Flach, ROC 'n' rule learning – towards a better understanding of covering algorithms, Machine Learning 58 (2005) 39–77.
[15] J. Stecher, F. Janssen, J. Fürnkranz, Shorter rules are better, aren't they?, in: Discovery Science: 19th International Conference, DS 2016, Bari, Italy, October 19–21, 2016, Proceedings, Springer, 2016, pp. 279–294.
[16] F. Beck, J. Fürnkranz, V. Q. P. Huynh, Learning deep rule concepts as alternating Boolean pattern trees, in: Discovery Science: 27th International Conference, DS 2024, Pisa, Italy, October 14–16, 2024, Proceedings, Springer, 2024.
[17] R. Senge, E. Hüllermeier, Top-down induction of fuzzy pattern trees, IEEE Transactions on Fuzzy Systems 19 (2010) 241–252.
[18] M. Kelly, R. Longjohn, K. Nottingham, The UCI Machine Learning Repository, 2024. URL: https://archive.ics.uci.edu.