When characteristic rule-based models should be preferred over discriminative ones

Florian Beck¹, Johannes Fürnkranz¹ and Van Quoc Phuong Huynh¹

¹ Johannes Kepler University Linz, LIT Artificial Intelligence Lab / Institute for Application-oriented Knowledge Processing (FAW), Altenberger Straße 66b/69, 4040 Linz, Austria

fbeck@faw.jku.at (F. Beck); juffi@faw.jku.at (J. Fürnkranz); vqphuynh@faw.jku.at (V. Q. P. Huynh)


Abstract
In recent years, the interpretability of machine learning models has gained interest. White-box approaches like rule-based models serve as an interpretable alternative or as surrogate models of black-box approaches. Among these, more compact rule-based models are considered easier to interpret. In addition, they often generalize better and thus provide higher predictive accuracies than their overfitting complex counterparts. In this paper, we argue that more complex, “characteristic” rule-based models are a genuine alternative to more compact, “discriminative” ones. We discuss why characteristic models should not be considered less interpretable, and that including more features can actually strengthen the model in terms of both robustness and predictive accuracy. For this, we evaluate the effects on the decision boundary for models of different complexity, and also modify a recently developed Boolean pattern tree learner to compare a characteristic and a discriminative version on five UCI data sets. We show that the more complex models are indeed more robust to missing data, and that they sometimes even improve the predictive accuracy on the original data.

Keywords
characteristic rules, discriminative rules, decision boundaries, interpretability, robustness



1. Introduction

With the rise of neural network models in many machine learning applications, the need has grown to actually understand what these black-box approaches learn. This has brought rule-based models back into the spotlight, as they can be used as interpretable surrogates of neural network approaches, e.g., by extracting rules from the whole network [1] or with a focus on explaining decision boundaries [2].

Independent of whether rule-based models are used as surrogates of neural networks or as stand-alone models, usually the principle of Occam's Razor [3] is followed, which can be loosely paraphrased as: the simplest explanation is the best one. Consequently, discriminative rules, which discriminate an object of one category from objects of other categories, are preferred over characteristic rules, which try to capture all properties that are common to the objects of the target class [4]. This principle is also supported by the observation that longer explanations tend to overfit the training data, leading to worse performance on test data. Hence, most rule learners use some kind of pruning policy [5], resulting in learning short discriminative rules instead of longer characteristic ones.

However, there is a fine line between avoiding overfitting and learning too general theories. Consider the sample dataset in Table 1, consisting of six countries, three belonging to the class Europe and three to the class South America. For each country, the values of the three numeric attributes Size, Age and 𝐶𝑂2 are provided.

Table 1
A small country dataset with three numeric attributes Size (in 1,000 km²), Age (median; in years) and 𝐶𝑂2 (emission per capita and year; in tons). It is split into six training examples (three for each of the classes Europe and South America) and four test examples of unknown class. Data retrieved 2024/07/04 from https://ourworldindata.org/age-structure and https://ourworldindata.org/co2-and-greenhouse-gas-emissions.

              Size     Age    𝐶𝑂2    Class
  Austria       84    42.8    6.9    Europe
  Bolivia     1099    23.9    1.8    South America
  Brazil      8515    32.8    2.2    South America
  Czechia       79    42.6    9.3    Europe
  Ecuador      284    27.6    2.3    South America
  Slovakia      49    40.6    6.1    Europe
  Albania       29    37.3    1.7    ?
  Germany      357    44.9    8.0    ?
  Kosovo        11    30.5    4.8    ?
  Uruguay      176    35.2    2.3    ?

Traditional rule learners like, e.g., Ripper [6] strive for discriminative rules, i.e., rules that minimize the number of used attributes when describing the classes. In this case, such a perfect, minimal description of the training data could be learned with a single rule 𝑟1 only considering the first attribute Size, and the corresponding default rule 𝑟0 for the other class (in the following, 𝑐𝑙𝑎𝑠𝑠 = 𝑒𝑢𝑟𝑜𝑝𝑒 is abbreviated as 𝑐 = 𝑒 and 𝑐𝑙𝑎𝑠𝑠 = 𝑠𝑜𝑢𝑡ℎ_𝑎𝑚𝑒𝑟𝑖𝑐𝑎 as 𝑐 = 𝑠𝑎):
                𝑟1 : 𝑐 = 𝑒 ← 𝑠𝑖𝑧𝑒 < 184
                (𝑟0 : 𝑐 = 𝑠𝑎 ← ⊤).                                    (1)

Rule 𝑟1 covers the three examples Austria, Czechia and Slovakia because these examples fulfill the condition 𝑠𝑖𝑧𝑒 < 184. Bolivia, Brazil and Ecuador are not covered by 𝑟1 but only by the most general rule 𝑟0 and are thus classified as South America. While these rules perfectly describe the training examples, they fail to correctly classify the test example Germany, which is not covered by 𝑟1 and hence misclassified as South America by 𝑟0. Vice versa, Uruguay is covered by 𝑟1 and hence misclassified as Europe. Note that these misclassifications could have been avoided if a different feature had been picked, such as, e.g., in rules 𝑟2 and 𝑟3:

                𝑟2 : 𝑐 = 𝑒 ← 𝑎𝑔𝑒 ≥ 36.7
                𝑟3 : 𝑐 = 𝑒 ← 𝐶𝑂2 ≥ 4.2.                               (2)

However, 𝑟2 does not cover the test example Kosovo, and 𝑟3 does not cover Albania, so that none of the three rules alone would be sufficient to classify all four test examples correctly; only a combined rule set of 𝑟2 and 𝑟3 would do so. Similarly, the three suggested features 𝑠𝑖𝑧𝑒 < 184, 𝑎𝑔𝑒 ≥ 36.7 and 𝐶𝑂2 ≥ 4.2 can also be connected by conjunctions to a single rule 𝑟𝑒 for class Europe, while the respectively contrasting features form rule 𝑟𝑠 for class South America:

    𝑟𝑒 : 𝑐 = 𝑒 ← 𝑠𝑖𝑧𝑒 < 184 ∧ 𝑎𝑔𝑒 ≥ 36.7 ∧ 𝐶𝑂2 ≥ 4.2
    𝑟𝑠 : 𝑐 = 𝑠𝑎 ← 𝑠𝑖𝑧𝑒 ≥ 184 ∧ 𝑎𝑔𝑒 < 36.7 ∧ 𝐶𝑂2 < 4.2.                (3)

While neither of the two rules covers any of the test examples, a slight modification of their semantics allows us to use them as reliable classifiers. Instead of requiring that all conditions of a rule need to be satisfied, we instead assign an example to its closest rule, a method that is reminiscent of rule stretching [7] or nearest hyperrectangle classification [8]. In our example, the first three test examples are assigned to class Europe, since for each of them two out of three conditions of 𝑟𝑒 are satisfied and only one out of three of 𝑟𝑠. Analogously, test example Uruguay is correctly classified as South America.
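As an illustration of this closest-rule assignment, the following minimal Python sketch (our own illustration, not taken from the paper's implementation) counts how many conditions of 𝑟𝑒 and 𝑟𝑠 each test example satisfies and predicts the class of the rule with the higher count:

    # Minimal sketch: classify the four test countries by their "closest" rule,
    # i.e., the rule whose conditions they satisfy most often (cf. rule stretching).
    r_e = [lambda x: x["size"] < 184, lambda x: x["age"] >= 36.7, lambda x: x["co2"] >= 4.2]
    r_s = [lambda x: x["size"] >= 184, lambda x: x["age"] < 36.7, lambda x: x["co2"] < 4.2]

    test = {
        "Albania": {"size": 29, "age": 37.3, "co2": 1.7},
        "Germany": {"size": 357, "age": 44.9, "co2": 8.0},
        "Kosovo": {"size": 11, "age": 30.5, "co2": 4.8},
        "Uruguay": {"size": 176, "age": 35.2, "co2": 2.3},
    }

    for name, x in test.items():
        hits_e = sum(cond(x) for cond in r_e)   # satisfied conditions of r_e
        hits_s = sum(cond(x) for cond in r_s)   # satisfied conditions of r_s
        print(name, "Europe" if hits_e > hits_s else "South America")

This reproduces the assignment described above: Europe for Albania, Germany and Kosovo, and South America for Uruguay (ties would need an additional tie-breaking rule).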
Independent of using conjunctions or disjunctions as the connector, we notice that the more characteristic rule theories in Equations 2 and 3 are able to classify all four test examples correctly, while the discriminative rule theory in Equation 1 is not able to do so. Moreover, the inclusion of more features in the characteristic concepts might not only lead to a better performance but also arguably provide more interesting and interpretable models, stating that European and South American countries do not only differ in size, but also in median age and CO2 emissions.

The rest of the paper is organized as follows: Section 2 further specifies the problem of finding good decision boundaries and presents characteristic models of non-rule-based classifiers as an inspiration for adaptation in the rule-based setting, presented in Section 3. We modify a rule-based learner in Section 4 accordingly and evaluate a discriminative and a characteristic version in Section 5 in terms of predictive accuracy and robustness. Section 6 concludes the results and takes a brief look at the remaining challenges.

2. Decision boundaries

As depicted in the introduction, in contrast to long characteristic rules being prone to overfitting, short discriminative rules come with the risk of providing too simplistic theories that overgeneralize. This can also be illustrated by the decision boundary of the country dataset rules, see Figure 1. For a better visualization, we omit the third attribute CO2 to obtain a two-dimensional feature space, using the attribute Size in logarithmic scale on the 𝑥-axis and Age on the 𝑦-axis. The raw data are shown in Figure 1a: the six training examples as points and the four test examples as circles, colored in blue for class Europe and in red for class South America.

Figure 1: Different decision boundaries of various learning approaches for the country dataset reduced to the attributes Size (𝑥-axis, logarithmic) and Age (𝑦-axis). (a) shows the six training examples as points and the four test examples as circles, colored in blue for class Europe and in red for class South America. The remaining subfigures (b)-(f) add a dotted decision boundary for various learners, showing predictions as Europe in light-blue and as South America in light-red. Panels: (a) Original data, (b) Discriminative rule, (c) Characteristic rule, (d) Support Vector Machine, (e) Scoring system, (f) Hyper-rectangles.

We see that the training examples are quite easily separable from each other, while the test examples complicate finding a good decision boundary. Figure 1b shows a discriminative rule 𝑐 = 𝑠𝑎 ← 𝑎𝑔𝑒 < 36, covering all four South American countries along with one European country in the light-red area of the feature space. The light-blue area (classified by the default rule 𝑐 = 𝑒 ← ⊤) contains five true negatives. By adding the condition 𝑠𝑖𝑧𝑒 > 140, the rule can be made more characteristic, leading to a perfect classification of all examples, see Figure 1c.

Still, the decision boundary provided in Figure 1c can be considered suboptimal when compared with non-rule-based models. Figure 1d illustrates an arguably better decision boundary which other methods like, e.g., support vector machines [9], logistic regression [10] and naive Bayes [11] can find. All these approaches have in common that they usually consider all attributes in the feature space and rely on continuous coefficients to build their models; in this case:

            𝑐 = 𝑠𝑎 ← 𝑎𝑔𝑒 − 10 · log10 (𝑠𝑖𝑧𝑒) ≤ 15.


In comparison to the methods just mentioned, conventional rule learners only use combinations of attribute-value conditions for the splits of their classes. As a consequence, one of the main limitations of rule learning is arguably its restriction to axis-parallel decision boundaries. However, the last two subfigures show two ways how rule-based methods can still mimic decision boundaries like the one in Figure 1d.
In Figure 1e, we see multiple steps in the decision boundary. Trivially, this behavior can be achieved by learning one rule for each step. While this is straightforward in this example, it is too hard to maintain in a high-dimensional feature space with an exponentially increasing number of combinations. Scoring systems [12] scale better by assigning low integer scores to attribute-value combinations, hereby providing a trade-off between rules and linear models. In the special case that all weights are binary, the scoring system converts into an m-of-n concept. With the scores assigned by the following scheme (the first matching case applies) and a threshold of 4 for class South America, all examples are classified correctly while providing a more customized decision boundary compared to Figure 1c:

    𝑎𝑔𝑒 :   3 if 𝑎𝑔𝑒 < 28,    2 if 𝑎𝑔𝑒 < 36,    1 if 𝑎𝑔𝑒 < 40,    0 else
    𝑠𝑖𝑧𝑒 :  3 if 𝑠𝑖𝑧𝑒 ≥ 1100,  2 if 𝑠𝑖𝑧𝑒 ≥ 140,  1 if 𝑠𝑖𝑧𝑒 ≥ 20,   0 else.
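The following Python sketch (our own illustration, not taken from the paper's implementation) applies this integer scoring scheme to the countries of Table 1 and predicts South America whenever the total score reaches the threshold of 4:

    # Minimal sketch of the scoring system of Figure 1e: integer scores per
    # attribute, summed up and compared against a threshold for South America.
    def age_score(age):
        if age < 28: return 3
        if age < 36: return 2
        if age < 40: return 1
        return 0

    def size_score(size):
        if size >= 1100: return 3
        if size >= 140: return 2
        if size >= 20: return 1
        return 0

    countries = {  # name: (size in 1,000 km2, median age)
        "Austria": (84, 42.8), "Bolivia": (1099, 23.9), "Brazil": (8515, 32.8),
        "Czechia": (79, 42.6), "Ecuador": (284, 27.6), "Slovakia": (49, 40.6),
        "Albania": (29, 37.3), "Germany": (357, 44.9), "Kosovo": (11, 30.5),
        "Uruguay": (176, 35.2),
    }

    for name, (size, age) in countries.items():
        score = age_score(age) + size_score(size)
        print(name, score, "South America" if score >= 4 else "Europe")

All six training countries and all four test countries end up in the expected class, e.g., Uruguay with a score of 2 + 2 = 4 and Germany with a score of 0 + 2 = 2.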
Finally, Figure 1f is the illustration of two characteristic rules similar to Equation 3: we describe both classes Europe and South America without using a default rule. Obviously, the learned rules of the two classes can overlap or, as in this case, leave wide areas of the feature space uncovered, so that a pure Boolean evaluation of the rules is not sufficient anymore. One way to handle these uncovered areas is nearest hyper-rectangles [13]. The decision boundary between the two classes can be shaped arbitrarily if enough hyper-rectangles, i.e., rules, are learned (and is actually not quite straight in Figure 1f either). Obviously, distances for nominal attributes cannot be defined as straightforwardly as for numerical attributes, as is discussed in the following section.
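The geometric idea behind nearest hyper-rectangles can be sketched in a few lines of Python (our own illustration, not the method of [13]; it uses the logarithm of Size, matching the log-scaled axis of Figure 1, and leaves aside the question of attribute scaling, which in practice strongly influences which rectangle is nearest):

    import math

    # Distance of an example to the axis-parallel hyper-rectangle defined by a
    # rule's interval conditions; 0 if the example lies inside the rectangle.
    def rect_distance(example, conditions):
        d2 = 0.0
        for attr, low, high in conditions:
            gap = max(low - example[attr], example[attr] - high, 0.0)
            d2 += gap * gap
        return math.sqrt(d2)

    # Rectangles of the two characteristic rules, restricted to age and log10(size).
    rect_e = [("age", 36.7, math.inf), ("log_size", -math.inf, math.log10(184))]
    rect_s = [("age", -math.inf, 36.7), ("log_size", math.log10(184), math.inf)]

    uruguay = {"age": 35.2, "log_size": math.log10(176)}
    d_e, d_s = rect_distance(uruguay, rect_e), rect_distance(uruguay, rect_s)
    print("Europe" if d_e < d_s else "South America")  # Uruguay: closer to the SA rectangle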
In this work, we aim to expand rule-based approaches to reach the stronger expressiveness shown in the last three subfigures while still retaining the properties that make them interpretable, i.e., without including all features instead of interactions and, most notably, without continuous coefficients as in SVMs, logistic regression or naive Bayes, which is why characteristic rules are preferable.
3. Characteristic rule learning

So far, we have discussed why characteristic rules can be beneficial both in terms of interpretability and performance, but we have also observed that almost no rule-based methods learn such concepts. To understand why conventional rule learners prefer discriminative rules, we first briefly introduce the coverage space and related heuristics. Subsequently, we reveal potential issues with the latter and identify properties which should be taken into consideration when developing a characteristic rule-based learner.

3.1. Coverage space and heuristics

Traditionally, rules are gradually refined by adding individual conditions, whereby conjunctive refinements specialize a rule (afterwards it can never cover more examples than before the refinement), whereas disjunctive refinements generalize a rule (afterwards it can never cover fewer examples than before the refinement). This can be visualized in coverage space, a non-normalized ROC space, where the 𝑥-axis shows the covered negative and the 𝑦-axis the covered positive examples [14]. For example, Figure 2 shows a path that gradually refines an initially universal rule (covering all 𝑃 positive and 𝑁 negative examples, upper right corner of the coverage space) into the rule + ← 𝑐 ∧ 𝑏.

Figure 2: Rule refinement in coverage space

Apparently, a rule refined to the upper left corner can be considered perfect, since it covers only positive examples and no negatives. In most scenarios such a rule cannot be found, so that a trade-off must be made between the importance of covering all positives (completeness) and of not covering any negatives (consistency). For this purpose, heuristics are defined as functions ℎ(𝑝, 𝑛), where 0 ≤ 𝑝 ≤ 𝑃 (0 ≤ 𝑛 ≤ 𝑁) is the number of positive (negative) examples covered by a rule [14].

In previous studies, it was found that most regular heuristics (in particular those striving for consistency) lead to the learning of discriminative rules, so that in this context so-called inverted heuristics were suggested, which better reflect the top-down nature of the rule refinement process in theory by originating from the other side of the coverage space [15]. Because of their typical focus on completeness, inverted heuristics can often "delay" the choice of too specific features, hence resulting in characteristic rules built of multiple more general features.

3.2. Limitations

Even though it has been shown empirically for some datasets that inverted heuristics result in characteristic rules [15], it is not inherent that they lead to characteristic rules. As a counterexample, consider learning a rule for the class Europe using all examples of the country dataset except Brazil. The best single condition is 𝑎𝑔𝑒 ≥ 29.1, covering all six examples of class Europe as well as Uruguay. This false positive cannot be excluded by further (single-cut) conditions on Size or CO2 without losing coverage of at least one true positive, so that the inverted heuristic stops with a rule consisting of a single condition. Interestingly enough, in this case regular heuristics would even learn longer rules than inverted heuristics, since they typically prefer the trade of removing a false positive at the cost of a false negative.

Most importantly though, traditional rule learners have the severe limitation of focusing only on the coverage of the learned rules but not on how (well) they cover the examples. We already noticed in Table 1 that rule 𝑟1 in Equation 1 can be expanded to 𝑟𝑒 in Equation 3 by features considering the age and CO2 of a country without covering more positive or fewer negative examples. Hence, both 𝑟1 and 𝑟𝑒 correspond to the same point in the coverage space in the top left corner, covering all positive and no negative training examples. As a consequence, independent of the chosen heuristic, conventional rule learners are not able to learn 𝑟𝑒 if a refinement requires improving the heuristic.
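This tie can be made explicit with a small Python sketch (our own illustration): computing the coverage counts 𝑝 and 𝑛 of 𝑟1 and 𝑟𝑒 on the six training examples shows that any coverage-based heuristic ℎ(𝑝, 𝑛), e.g., precision 𝑝/(𝑝 + 𝑛), assigns both rules exactly the same value:

    # r1 and r_e cover exactly the same training examples, so no heuristic
    # that only depends on (p, n) can distinguish between them.
    train = [  # (name, size, age, co2, is_europe)
        ("Austria",   84, 42.8, 6.9, True),  ("Bolivia", 1099, 23.9, 1.8, False),
        ("Brazil",  8515, 32.8, 2.2, False), ("Czechia",   79, 42.6, 9.3, True),
        ("Ecuador",  284, 27.6, 2.3, False), ("Slovakia",  49, 40.6, 6.1, True),
    ]

    r1  = lambda s, a, c: s < 184
    r_e = lambda s, a, c: s < 184 and a >= 36.7 and c >= 4.2

    def coverage(rule):
        p = sum(1 for _, s, a, c, pos in train if rule(s, a, c) and pos)
        n = sum(1 for _, s, a, c, pos in train if rule(s, a, c) and not pos)
        return p, n

    precision = lambda p, n: p / (p + n)
    print(coverage(r1), coverage(r_e))                          # (3, 0) (3, 0)
    print(precision(*coverage(r1)), precision(*coverage(r_e)))  # 1.0 1.0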
Even if the heuristic improves by adding a new condition to the original rule, a similar issue can occur. Assume a new rule 𝑟4 learned on all ten examples in Table 1, which focuses on covering the example Germany based on the condition 𝑠𝑖𝑧𝑒 ≥ 316. This rule still covers Bolivia and Brazil as well and could therefore be refined to rules 𝑟5 and 𝑟6, both considering the Age attribute:

    𝑟4 : 𝑐 = 𝑒 ← 𝑠𝑖𝑧𝑒 ≥ 316
    𝑟5 : 𝑐 = 𝑒 ← 𝑠𝑖𝑧𝑒 ≥ 316 ∧ 𝑎𝑔𝑒 ≥ 36.3                              (4)
    𝑟6 : 𝑐 = 𝑒 ← 𝑠𝑖𝑧𝑒 ≥ 316 ∧ 𝑎𝑔𝑒 ≥ 44.5.

While 𝑟5 and 𝑟6 both correspond to the same point in the coverage space (covering one positive example and no negatives), their coverage on unseen examples
might vary crucially since they cover different areas of the feature space. Arguably, 𝑟5 should be preferred over 𝑟6 because the added condition 𝑎𝑔𝑒 ≥ 36.3 covers four additional positive examples (and still no negative one) compared to 𝑎𝑔𝑒 ≥ 44.5. So to say, while having the same "global" concept, we should choose the rule with the better "local" condition. Note that this is not limited to numeric attributes.

To summarize, characteristic rules are usually not learned because the learners rely on heuristics that only take the number of covered positive and negative examples into account instead of separating positive and negative examples with a variety of rules and conditions. In particular, adding a condition without changing the covered examples results in the same heuristic value, in which case so far the shorter explanation is used and the search usually stops. Additionally, the mere focus on the global coverage can lead to suboptimal "local" conditions if ties are not handled appropriately.

4. Boolean Pattern Trees

For the experimental comparison of discriminative and characteristic concepts, we use two versions of an alternating Boolean pattern tree (aBpt) learner recently developed in our group [16]. The task of learning an aBpt is quite similar to learning a rule: for every specific class 𝑦𝑗 ∈ Y, a tree 𝑡 : 𝑦𝑗 ← 𝐵 is learned, where 𝐵 is a logical expression defined over the input features, which can be much more flexible than for rules. In contrast to rule learners, which use either conjunctions or disjunctions, aBpts can connect binary features by conjunctions and disjunctions in any arbitrary order. This also complicates the iterative learning of 𝐵, as there are multiple insertion options per feature. E.g., inserting a disjunction with feature 𝑐 in 𝐵 = 𝑎 ∧ 𝑏 can result in 𝑐 ∨ (𝑎 ∧ 𝑏), (𝑎 ∨ 𝑐) ∧ 𝑏 or 𝑎 ∧ (𝑏 ∨ 𝑐), which are all logically different (see the sketch below). These insertions are repeated until the maximum number of iterations 𝑘 (= number of features in the pattern tree) is reached. We refer to [16] for further details of the algorithm and focus on two adjustments for the experiments in the following.
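The following Python sketch (our own illustration, independent of the actual aBpt implementation) enumerates these logically different insertion positions for a new feature in a small pattern tree represented as nested tuples:

    # Enumerate the positions where a new feature can be attached to a Boolean
    # pattern tree via a disjunction. Trees are nested tuples like ("and", ["a", "b"]);
    # leaves are plain feature names.
    def insertions(tree, feature, op="or"):
        variants = [(op, [tree, feature])]           # attach above the current node
        if isinstance(tree, tuple):
            node_op, children = tree
            for i, child in enumerate(children):     # recurse into every child
                for variant in insertions(child, feature, op):
                    variants.append((node_op, children[:i] + [variant] + children[i + 1:]))
        return variants

    for v in insertions(("and", ["a", "b"]), "c"):
        print(v)
    # ('or', [('and', ['a', 'b']), 'c'])    ->  c OR (a AND b)
    # ('and', [('or', ['a', 'c']), 'b'])    ->  (a OR c) AND b
    # ('and', ['a', ('or', ['b', 'c'])])    ->  a AND (b OR c)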
First, we notice that in the standard version of aBpt, multiple heuristics for the evaluation of a tree extension are already used, focusing on consistency and completeness in different search branches. By using various cost ratios in the linear cost metric (ℎ𝑙𝑐(𝑡) = 𝑐 · 𝑝 − (1 − 𝑐) · 𝑛), aBpt is capable of learning both the models preferred by regular and those preferred by inverted heuristics. However, Section 3 also presented problems that cannot be fixed solely by the heuristic. To choose characteristic models instead of discriminative ones in case of ties, the learner picks the tree learned in a later iteration, i.e., using more conditions (and vice versa). This way, we do not stop in local optima, but always use all 𝑘 iterations. Furthermore, all conditions are sorted based on the accuracy metric (ℎ𝑎𝑐𝑐(𝑡) = (𝑝 + 𝑁 − 𝑛)/(𝑃 + 𝑁)) in the first iteration, so that in subsequent iterations always the best "local" condition can be picked, as discussed in Section 3.

Second, the handling of multiple pattern trees is crucial for the decision boundary. In the original aBpt classifier, one pattern tree for each class 𝑦 ∈ Y is learned. Since, in a Boolean context, the output of the Boolean expression represented by the pattern tree can only be true or false for the features of a test example, ties occur if a test example is matched by multiple pattern trees; they can be broken by a fixed order of the pattern trees in a decision list.

An alternative that is used in fuzzy pattern tree classifiers [17] is evaluating all pattern trees in a probabilistic way, whereby the highest probability decides about the class prediction. A straightforward way to achieve this behavior in aBpt is using a constant uncertainty factor 𝑢, resulting in probabilities 𝑝(𝑓) = 1 − 𝑢 for fulfilled features and 𝑝(𝑓) = 𝑢 otherwise, which are then aggregated bottom-up over the respective child nodes 𝐶 as 𝑝(𝑛) = ∏𝑖∈𝐶 𝑝(𝑖) for conjunctive and 𝑝(𝑛) = 1 − ∏𝑖∈𝐶 (1 − 𝑝(𝑖)) for disjunctive nodes. However, this comes with the two inconveniences of (a) weighing all child nodes the same, independent of their importance, and (b) penalizing all conjunctive conditions (always decreasing 𝑝(𝑛)) and rewarding all disjunctive conditions (always increasing 𝑝(𝑛)) independent of their quality, which particularly affects characteristic models negatively. To address (a), we determine 𝑝(𝑓) flexibly in the range [𝑢, 1 − 𝑢] as

    𝑝(𝑓) = 𝑢 + (1 − 2𝑢) · 𝑝/𝑃

for fulfilled features (≃ the probability that a positive example fulfills the feature) and

    𝑝(𝑓) = 𝑢 + (1 − 2𝑢) · (𝑃 − 𝑝)/(𝑃 − 𝑝 + 𝑁 − 𝑛)

otherwise (≃ the probability that an example not fulfilling the feature is positive). Additionally, for (b), we relax 𝑝(𝑛) for the interior tree nodes to

    𝑝(𝑛) = 1/2 · (1/|𝐶| · ∑𝑖∈𝐶 𝑝(𝑖) + min𝑖∈𝐶 𝑝(𝑖))

for conjunctive nodes and to

    𝑝(𝑛) = 1/2 · (1/|𝐶| · ∑𝑖∈𝐶 𝑝(𝑖) + max𝑖∈𝐶 𝑝(𝑖))

for disjunctive nodes, resulting in a probability that is less dependent on the number of child nodes, so that models with different numbers of conjunctive and disjunctive nodes can be compared better.
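The probabilistic evaluation can be made concrete with the following Python sketch (our own rendering of the formulas above; the example trees, feature names and statistics are purely illustrative), which scores a test example against the pattern tree of each class and predicts the class with the highest probability:

    # Probabilistic pattern tree evaluation as described above. A tree is a nested
    # tuple ("and"/"or", [children]); leaves are feature names. stats[f] = (p, n)
    # gives the positive/negative training examples covered by feature f.
    U = 0.1  # constant uncertainty factor u

    def leaf_prob(fulfilled, p, n, P, N):
        if fulfilled:                                        # p(f) = u + (1-2u) * p/P
            return U + (1 - 2 * U) * p / P
        return U + (1 - 2 * U) * (P - p) / (P - p + N - n)   # non-fulfilled feature

    def tree_prob(node, example, stats, P, N):
        if isinstance(node, str):                            # leaf: a single feature
            return leaf_prob(example[node], *stats[node], P, N)
        op, children = node
        probs = [tree_prob(c, example, stats, P, N) for c in children]
        avg = sum(probs) / len(probs)
        extreme = min(probs) if op == "and" else max(probs)
        return 0.5 * (avg + extreme)                         # relaxed AND/OR aggregation

    trees = {"europe": ("and", ["small", "old"]), "south_america": ("or", ["big", "young"])}
    stats = {"europe": {"small": (3, 0), "old": (3, 0)},
             "south_america": {"big": (2, 0), "young": (3, 0)}}
    totals = {"europe": (3, 3), "south_america": (3, 3)}     # (P, N) per class

    example = {"small": True, "old": False, "big": False, "young": False}
    scores = {y: tree_prob(trees[y], example, stats[y], *totals[y]) for y in trees}
    print(max(scores, key=scores.get), scores)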
Table 2
Predictive accuracies of the aBpt learner on five UCI datasets for six different settings using 10-fold cross-validation. In the first row a Boolean evaluation and in the second row a probabilistic evaluation of the pattern tree is used. The first column shows results on the original dataset, the second on an incomplete version of the dataset where 30% of the values are replaced by missing values, and the third a combination using the original data for training and the incomplete data for testing.

  (a) Original + Boolean          (b) Incomplete + Boolean        (c) Mixed + Boolean
             discr.    char.                 discr.    char.                 discr.    char.
  labor       87.72    84.21      labor       77.19    78.95      labor       78.95    80.70
  mushroom   100.00   100.00      mushroom    96.91    96.91      mushroom    84.00    88.00
  soybean     92.68    92.53      soybean     66.03    66.33      soybean     19.77    46.71
  vote        94.48    94.71      vote        88.28    88.74      vote        78.39    78.16
  zoo         89.11    86.14      zoo         76.24    76.24      zoo         41.58    65.35

  (d) Original + Probabilistic    (e) Incomplete + Probabilistic  (f) Mixed + Probabilistic
             discr.    char.                 discr.    char.                 discr.    char.
  labor       66.67    64.91      labor       64.91    64.91      labor       64.91    64.91
  mushroom    83.83    87.64      mushroom    49.93    49.93      mushroom    79.60    75.49
  soybean     90.04    90.48      soybean     67.64    68.52      soybean     52.86    57.25
  vote        95.40    95.40      vote        88.51    88.51      vote        87.13    87.13
  zoo         89.11    92.08      zoo         77.23    75.25      zoo         62.38    85.15


5. Experiments

In the experiments, we analyze two different aspects of the aBpt learner, using the default configuration of 𝑘 = 20 iterations and, as metrics, accuracy as well as seven different values for the linear cost. First, we compare a discriminative version preferring smaller trees in case of tied heuristics and a characteristic version preferring bigger trees. Second, we evaluate both versions not only in a Boolean setting but also in a probabilistic setting, as suggested at the end of Section 4.

For the experiments, we choose five UCI [18] datasets where most features are not used in the discriminative models and where the characteristic models could therefore differ remarkably: labor, mushroom, soybean, vote and zoo. On some datasets (labor, soybean, zoo), the otherwise inferior naive Bayes classifier even outperforms rule learners like Ripper, indicating potential for improvement of the decision boundary in the probabilistic setting.

The learners are not only applied to the original datasets but also to an "incomplete" version of each dataset, where 30% of the values are replaced by missing values, and finally to a "mixed" version, where only the test data is "incomplete". This way, we can analyze the robustness of the learned models, where characteristic models might be expected to perform better, since additional features are used as a fallback option.
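Such an incomplete version can be generated, e.g., by randomly masking entries; the following sketch (our own illustration, not necessarily the exact procedure used in the experiments) replaces 30% of the entries of a numeric data matrix with missing values:

    import numpy as np

    # Replace a fixed fraction of the entries of a data matrix (examples x attributes)
    # with missing values (np.nan), as in the "incomplete" dataset versions.
    def make_incomplete(X, fraction=0.3, seed=0):
        rng = np.random.default_rng(seed)
        X = X.astype(float).copy()
        mask = rng.random(X.shape) < fraction   # True for roughly 30% of all entries
        X[mask] = np.nan
        return X

    X = np.array([[84, 42.8, 6.9], [1099, 23.9, 1.8], [8515, 32.8, 2.2]])
    print(make_incomplete(X))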
The predictive accuracies of a 10-fold cross-validation are shown in Table 2. Each sub-table shows a head-to-head comparison of the discriminative and the characteristic learner in a given setting of dataset type and used evaluation. Overall, we see that the discriminative and the characteristic learner perform roughly equally well, while the effects of missing values vary considerably between the datasets. Except for a few cases, the harder the setting (from left to right), the lower the predictive accuracy.

Independent of the dataset setting, the predictive accuracy drops drastically for labor and mushroom when changing from the Boolean to the probabilistic setting. However, it often increases for the other three datasets; in particular, in the "mixed" setting, the accuracies can be improved drastically using a probabilistic evaluation, indicating that the adjusted decision boundaries can make the model more robust to incomplete test data.

    c1 ← eggs=false.
    c1 ← hair=true.
    c2 ← toothed=true.
    c2 ← catsize=true.
    c2 ← legs=(1.0:4.0].
    c3 ← backbone=true.
    c3 ← airborne=false.
    c3 ← aquatic=false.
    c3 ← fins=false.
    c3 ← tail=true.
    c3 ← domestic=false.
    c3 ← predator=false.
    c3 ← predator=true.
    c3 ← domestic=true.
    c3 ← tail=false.
    c3 ← catsize=false.
    c3 ← fins=true.
    c  ← milk=true ∧ c1 ∧ c2 ∧ breathes=true ∧ feathers=false ∧ c3.

Figure 3: Model learned by characteristic aBpt on the zoo dataset for 𝑐𝑙𝑎𝑠𝑠 = 𝑚𝑎𝑚𝑚𝑎𝑙 when being transformed into a set of conjunctive rules.
As an example, consider the characteristic model for the zoo dataset in Figure 3. In Boolean evaluation, the missing values result in too many examples not being covered by the model. However, the numerous conditions still indicate the correct class with a probabilistic evaluation. We also notice a big gap between the performances of the discriminative and the characteristic model here, indicating that the small trees of the discriminative learner with only up to four conditions (e.g., for class=mammal it only uses the condition milk=true) are not as robust as the trees of the characteristic counterpart.

As Figure 3 also shows, the characteristic models are usually considerably larger. From an interpretability perspective, this can be preferred, since in this case we do not only discover that mammals yield milk, but also that they breathe, do not wear feathers, either do not lay eggs or have hair, and either are toothed, catsized or have 1-4 legs. We also notice that the last concept c3 (which was also the last to be added to the model) is not helpful at all and indeed can be reduced to true because of tautologies. This indicates that in a characteristic setting, stopping criteria or pruning techniques are needed as well to preserve interpretability.

6. Conclusion

In this paper, we look at the possible advantages that characteristic models, which are rarely learned by conventional rule learning algorithms, can provide. While previous work on characteristic rules usually focused on the interpretability aspects, we have shown that the inclusion of additional features, both via conjunction and disjunction, can additionally help to find better decision boundaries, resulting in more robust models. We also discussed that for learning characteristic rules a mere focus on coverage is insufficient, so that both regular and inverted heuristics cannot guarantee learning characteristic rules. Finding a suitable distance metric to separate positive and negative examples remains an open question.

To analyze the effects of characteristic rule-based models empirically, we implemented a characteristic version of the aBpt learner and compared it with the original discriminative version on five UCI datasets. The experiments did not show a clear advantage for either of the learners in terms of predictive accuracy, indicating that smaller models are not necessarily overgeneralizing and larger models do not inevitably lead to overfitting. In a robustness check using incomplete test data, characteristic models outperformed discriminative models. Furthermore, characteristic models slightly outperformed discriminative models when combined with probabilistic evaluation.

We also see multiple paths to further develop the potential of characteristic models in future work. Most importantly, the decision boundary artificially moved by a probabilistic evaluation (which certainly provides room for improvement as well) should not only be considered during classification but also in the learning phase. In this regard, the definition of heuristics that consider not only the coverage but also the "quality" of the coverage, connected to the distance between the example and the decision boundary, is crucial. This way, rule learners could not only deliver a prediction but also determine how certain the prediction is, and optionally abstain from making a prediction.

References

[1] R. Andrews, J. Diederich, A. B. Tickle, Survey and critique of techniques for extracting rules from trained artificial neural networks, Knowledge-Based Systems 8 (1995) 373–389.
[2] R. Guidotti, A. Monreale, F. Giannotti, D. Pedreschi, S. Ruggieri, F. Turini, Factual and counterfactual explanations for black box decision making, IEEE Intelligent Systems 34 (2019) 14–23.
[3] A. Blumer, A. Ehrenfeucht, D. Haussler, M. K. Warmuth, Occam's razor, Information Processing Letters 24 (1987) 377–380.
[4] R. S. Michalski, A theory and methodology of inductive learning, in: Machine Learning, Elsevier, 1983, pp. 83–134.
[5] J. Fürnkranz, Pruning algorithms for rule learning, Machine Learning 27 (1997) 139–172.
[6] W. W. Cohen, Fast effective rule induction, in: Machine Learning Proceedings 1995, Elsevier, 1995, pp. 115–123.
[7] M. Eineborg, H. Boström, Classifying uncovered examples by rule stretching, in: C. Rouveirol, M. Sebag (Eds.), Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP-01), Springer Verlag, Strasbourg, France, 2001, pp. 41–50.
[8] S. Salzberg, A nearest hyperrectangle learning method, Machine Learning 6 (1991) 251–276.
[9] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[10] J. S. Cramer, The origins of logistic regression (2002).
[11] I. Rish, et al., An empirical study of the naive Bayes classifier, in: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, volume 3, Seattle, WA, USA, 2001, pp. 41–46.
[12] C. Rudin, B. Ustun, Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice, Interfaces 48 (2018) 449–466.
[13] S. Salzberg, A nearest hyperrectangle learning method, Machine Learning 6 (1991) 251–276.
[14] J. Fürnkranz, P. A. Flach, Roc'n'rule learning—towards a better understanding of covering algorithms, Machine Learning 58 (2005) 39–77.
[15] J. Stecher, F. Janssen, J. Fürnkranz, Shorter rules are better, aren't they?, in: Discovery Science: 19th International Conference, DS 2016, Bari, Italy, October 19–21, 2016, Proceedings 19, Springer, 2016, pp. 279–294.
[16] F. Beck, J. Fürnkranz, V. Q. P. Huynh, Learning deep rule concepts as alternating Boolean pattern trees, in: Discovery Science: 27th International Conference, DS 2024, Pisa, Italy, October 14–16, 2024, Proceedings 27, Springer, 2024.
[17] R. Senge, E. Hüllermeier, Top-down induction of fuzzy pattern trees, IEEE Transactions on Fuzzy Systems 19 (2010) 241–252.
[18] M. Kelly, R. Longjohn, K. Nottingham, The UCI machine learning repository, 2024. URL: https://archive.ics.uci.edu.