When characteristic rule-based models should be preferred over discriminative ones

Florian Beck¹, Johannes Fürnkranz¹ and Van Quoc Phuong Huynh¹

¹ Johannes Kepler University Linz, LIT Artificial Intelligence Lab / Institute for Application-oriented Knowledge Processing (FAW), Altenberger Straße 66b/69, 4040 Linz, Austria

Abstract
In recent years, the interpretability of machine learning models has gained interest. White-box approaches like rule-based models serve as an interpretable alternative to or as surrogate models of black-box approaches. Among these, more compact rule-based models are considered easier to interpret. In addition, they often generalize better and thus provide higher predictive accuracies than their overfitting complex counterparts. In this paper, we argue that more complex, "characteristic" rule-based models are a genuine alternative to more compact, "discriminative" ones. We discuss why characteristic models should not be considered less interpretable, and why including more features can actually strengthen a model both in terms of robustness and predictive accuracy. To this end, we evaluate the effects on the decision boundary for models of different complexity, and also modify a recently developed Boolean pattern tree learner to compare a characteristic and a discriminative version on five UCI data sets. We show that the more complex models are indeed more robust to missing data, and that they sometimes even improve the predictive accuracy on the original data.

Keywords
characteristic rules, discriminative rules, decision boundaries, interpretability, robustness

1. Introduction

With the rise of neural network models in many machine learning applications, the need has grown to actually understand what these black-box approaches learn. This has brought rule-based models back into the spotlight, which can be used as interpretable surrogates of neural network approaches, e.g., by extracting rules from the whole network [1] or with a focus on explaining decision boundaries [2].

Independent of whether rule-based models are used as surrogates of neural networks or as stand-alone models, usually the principle of Occam's Razor [3] is followed, which can be loosely translated as "the simplest explanation is the best one". Consequently, discriminative rules, which discriminate an object of one category from objects of other categories, are preferred over characteristic rules, which try to capture all properties that are common to the objects of the target class [4]. This principle is also supported by the observation that longer explanations tend to overfit the training data, leading to worse performance on test data. Hence, most rule learners use some kind of pruning policy [5], resulting in short discriminative rules instead of longer characteristic ones.

However, there is a fine line between avoiding overfitting and learning too general theories.

Table 1
A small country dataset with three numeric attributes Size (in 1,000 km²), Age (median; in years) and CO2 (emissions per capita and year; in tons)¹. It is split into six training examples (three for each of the classes Europe and South America) and four test examples of unknown class.

            Size    Age   CO2   Class
Austria       84   42.8   6.9   Europe
Bolivia     1099   23.9   1.8   South America
Brazil      8515   32.8   2.2   South America
Czechia       79   42.6   9.3   Europe
Ecuador      284   27.6   2.3   South America
Slovakia      49   40.6   6.1   Europe
Albania       29   37.3   1.7   ?
Germany      357   44.9   8.0   ?
Kosovo        11   30.5   4.8   ?
Uruguay      176   35.2   2.3   ?
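The training half of Table 1 can be encoded and probed directly. The following sketch (with our own shorthand attribute indices; not code from the paper) checks, for each attribute, whether a single threshold already separates the two classes of the six training examples:

```python
# Training examples from Table 1: (size in 1,000 km^2, median age, CO2 per capita, class).
TRAIN = {
    "Austria":  (84, 42.8, 6.9, "Europe"),
    "Bolivia":  (1099, 23.9, 1.8, "South America"),
    "Brazil":   (8515, 32.8, 2.2, "South America"),
    "Czechia":  (79, 42.6, 9.3, "Europe"),
    "Ecuador":  (284, 27.6, 2.3, "South America"),
    "Slovakia": (49, 40.6, 6.1, "Europe"),
}

def separable_by(attr_index):
    """True if one threshold on this attribute splits the two training classes."""
    europe = [v[attr_index] for v in TRAIN.values() if v[3] == "Europe"]
    south = [v[attr_index] for v in TRAIN.values() if v[3] == "South America"]
    return max(europe) < min(south) or max(south) < min(europe)

for name, idx in [("size", 0), ("age", 1), ("co2", 2)]:
    print(name, separable_by(idx))  # each attribute separates the training data on its own
```

Every single attribute perfectly separates the six training examples, which is exactly why a one-condition discriminative rule suffices on the training data; whether it generalizes to the test examples is the question pursued next.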
Consider the sample dataset in Table 1, consisting of six countries, three belonging to the class Europe and three to the class South America. For each country, the values of the three numeric attributes Size, Age and CO2 are provided.

Traditional rule learners like, e.g., Ripper [6] strive for discriminative rules, i.e., rules that minimize the number of attributes used when describing the classes. In this case, such a perfect, minimal description of the training data could be learned with a single rule r1, only considering the first attribute Size, together with the corresponding default rule r0 for the other class²:

    r1: c = e ← size < 184
    (r0: c = sa ← ⊤).                                          (1)

Rule r1 covers the three examples Austria, Czechia and Slovakia because these examples fulfill the condition size < 184. Bolivia, Brazil and Ecuador are not covered by r1 but only by the most general rule r0, and are thus classified as South America. While these rules perfectly describe the training examples, they fail to correctly classify the test example Germany, which is not covered by r1 and hence misclassified as South America by r0. Vice versa, Uruguay is covered by r1 and hence misclassified as Europe. Note that these misclassifications could have been avoided if a different feature had been picked, such as, e.g., in rules r2 and r3:

    r2: c = e ← age ≥ 36.7
    r3: c = e ← CO2 ≥ 4.2.                                     (2)

However, r2 does not cover the test example Kosovo, and r3 does not cover Albania, so that none of the three rules alone would be sufficient to classify all four test examples correctly, but only a combined rule set of r2 and r3 would do so. Similarly, the three suggested features size < 184, age ≥ 36.7 and CO2 ≥ 4.2 can also be connected by conjunctions into a single rule re for class Europe, while the respectively contrasting features form rule rs for class South America:

    re: c = e ← size < 184 ∧ age ≥ 36.7 ∧ CO2 ≥ 4.2
    rs: c = sa ← size ≥ 184 ∧ age < 36.7 ∧ CO2 < 4.2.          (3)

While neither of the two rules covers any of the test examples, a slight modification of their semantics allows us to use them as reliable classifiers. Instead of requiring that all conditions of a rule are satisfied, we assign an example to its closest rule, a method that is reminiscent of rule stretching [7] or nearest hyper-rectangle classification [8]. In our example, the first three test examples are assigned to class Europe, since for each of them two out of three conditions of re are satisfied and only one out of three of rs. Analogously, test example Uruguay is correctly classified as South America.

Independent of using conjunctions or disjunctions as the connector, we notice that the more characteristic rule theories in Equations 2 and 3 are able to classify all four test examples correctly, while the discriminative rule theory in Equation 1 is not able to do so. Moreover, the inclusion of more features in the characteristic concepts might not only lead to a better performance but also arguably provide more interesting and interpretable models, stating that European and South American countries do not only differ in size, but also in median age and CO2 emissions.

The rest of the paper is organized as follows: Section 2 further specifies the problem of finding good decision boundaries and presents characteristic models of non-rule-based classifiers as an inspiration for adaptation in the rule-based setting, presented in Section 3. We modify a rule-based learner in Section 4 accordingly and evaluate a discriminative and a characteristic version in Section 5 in terms of predictive accuracy and robustness. Section 6 concludes the results and takes a brief look at the remaining challenges.

$ fbeck@faw.jku.at (F. Beck); juffi@faw.jku.at (J. Fürnkranz); vqphuynh@faw.jku.at (V. Q. P. Huynh)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

¹ Retrieved 2024/07/04 from https://ourworldindata.org/age-structure and https://ourworldindata.org/co2-and-greenhouse-gas-emissions.
² In the following, class = europe is abbreviated as c = e and class = south_america as c = sa.

2. Decision boundaries

As depicted in the introduction, in contrast to long characteristic rules being prone to overfitting, short discriminative rules come with the risk of providing too simplistic theories that overgeneralize. This can also be illustrated by the decision boundaries for the country dataset rules, see Figure 1. For a better visualization, we omit the third attribute CO2 to obtain a two-dimensional feature space, using the attribute Size in logarithmic scale on the x-axis and Age on the y-axis. The raw data are shown in Figure 1a: the six training examples as points and the four test examples as circles, colored in blue for class Europe and in red for class South America, respectively.

We see that the training examples are quite easily separable from each other, while the test examples complicate finding a good decision boundary. Figure 1b shows a discriminative rule c = sa ← age < 36, covering all four South American countries along with one European country in the light-red area of the feature space. The light-blue area (classified by the default rule c = e ← ⊤) contains five true negatives. By adding the condition size > 140, the rule can be made more characteristic, leading to a perfect classification of all examples, see Figure 1c.

Still, the decision boundary provided in Figure 1c can be considered suboptimal when compared with non-rule-based models. Figure 1d illustrates an arguably better decision boundary which other methods like, e.g., support vector machines [9], logistic regression [10] and naive Bayes [11] can find. All these approaches have in common that they usually consider all attributes of the feature space and rely on continuous coefficients to build their models; in this case:

    c = sa ← age − 10 · log10(size) ≤ 15.

Figure 1: Different decision boundaries of various learning approaches for the country dataset reduced to the attributes Size (x-axis, logarithmic) and Age (y-axis). (a) Original data: the six training examples as points and the four test examples as circles, colored in blue for class Europe and in red for class South America. The remaining subfigures add a dotted decision boundary for various learners, showing predictions as Europe in light-blue and as South America in light-red: (b) Discriminative rule, (c) Characteristic rule, (d) Support Vector Machine, (e) Scoring system, (f) Hyper-rectangles.

In comparison to the methods just mentioned, conventional rule learners only use combinations of attribute-value conditions to split their classes. As a consequence, one of the main limitations of rule learning is arguably its restriction to axis-parallel decision boundaries. Though, the last two subfigures show two ways how rule-based methods can still mimic decision boundaries like in Figure 1d.
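The closest-rule semantics described above, assigning an example to the rule with the largest fraction of satisfied conditions, can be sketched in a few lines of Python. This is a minimal illustration using the rules of Equation 3 and the test examples of Table 1, not the authors' implementation:

```python
# Nearest-rule classification for the country dataset (Table 1).
# Each rule is a list of conditions; an example is assigned to the class
# whose rule has the largest fraction of satisfied conditions (Equation 3).
RULES = {
    "Europe": [
        lambda ex: ex["size"] < 184,
        lambda ex: ex["age"] >= 36.7,
        lambda ex: ex["co2"] >= 4.2,
    ],
    "South America": [
        lambda ex: ex["size"] >= 184,
        lambda ex: ex["age"] < 36.7,
        lambda ex: ex["co2"] < 4.2,
    ],
}

def classify(example):
    """Return the class whose rule has the most satisfied conditions."""
    scores = {
        cls: sum(cond(example) for cond in conds) / len(conds)
        for cls, conds in RULES.items()
    }
    return max(scores, key=scores.get)

tests = {
    "Albania": {"size": 29, "age": 37.3, "co2": 1.7},
    "Germany": {"size": 357, "age": 44.9, "co2": 8.0},
    "Kosovo":  {"size": 11, "age": 30.5, "co2": 4.8},
    "Uruguay": {"size": 176, "age": 35.2, "co2": 2.3},
}
for name, ex in tests.items():
    print(name, "->", classify(ex))
```

As in the text, the first three test examples satisfy two of three conditions of the Europe rule and only one of the South America rule, and vice versa for Uruguay, so all four are classified correctly.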
In Figure 1e, we see multiple steps in the decision boundary. Trivially, this behavior can be achieved by learning one rule for each step. While this is straightforward in this example, it is too hard to maintain in a high-dimensional feature space with an exponentially increasing number of combinations. Scoring systems [12] scale better by assigning low integer scores to attribute-value combinations, hereby providing a trade-off between rules and linear models. In the special case that all weights are binary, the scoring system converts into an m-of-n concept. With the scores assigned by the following scheme and a threshold of 4 for class South America, all examples are classified correctly while providing a more customized decision boundary compared to Figure 1c:

    age:  3 if < 28         size: 3 if ≥ 1100
          2 if < 36               2 if ≥ 140
          1 if < 40               1 if ≥ 20
          0 else                  0 else.

Finally, Figure 1f is an illustration of two characteristic rules similar to Equation 3: we describe both classes Europe and South America without using a default rule. Obviously, the learned rules of the two classes can overlap or, as in this case, leave wide areas of the feature space uncovered, so that a purely Boolean evaluation of the rules is not sufficient anymore. One way to handle these uncovered areas are nearest hyper-rectangles [13]. The decision boundary between the two classes can be shaped arbitrarily if enough hyper-rectangles, i.e., rules, are learned (and is actually not quite straight in Figure 1f either). Obviously, distances for nominal attributes can not be defined as straightforwardly as for numerical attributes, as is discussed in the following section.

In this work, we aim to expand rule-based approaches to reach the stronger expressiveness shown in the last three subfigures while still retaining the properties that make them interpretable, i.e., without including all features instead of interactions and, most notably, without continuous coefficients like in SVMs, logistic regression or naive Bayes, which is where characteristic rules are preferable.

3. Characteristic rule learning

So far we discussed why characteristic rules can be beneficial both in terms of interpretability and performance, but observed as well that almost no rule-based methods learn such concepts. To understand why conventional rule learners prefer discriminative rules, we first briefly introduce the coverage space and related heuristics. Subsequently, we reveal potential issues with the latter and identify properties which should be taken into consideration when developing a characteristic rule-based learner.

3.1. Coverage space and heuristics

Traditionally, rules are gradually refined by adding individual conditions, whereby conjunctive refinements specialize a rule (afterwards it can never cover more examples than before the refinement), whereas disjunctive refinements generalize a rule (afterwards it can never cover fewer examples than before the refinement). This can be visualized in coverage space, a non-normalized ROC space, where the x-axis shows the covered negative and the y-axis the covered positive examples [14]. For example, Figure 2 shows a path that gradually refines an initially universal rule (covering all P positive and N negative examples, upper right corner of the coverage space) into the rule + ← c ∧ b.

Figure 2: Rule refinement in coverage space

Apparently, a rule refined to the upper left corner can be considered perfect, since it covers only positive examples and no negatives. In most scenarios such a rule can not be found, so that a trade-off must be made between the importance of covering all positives (completeness) and of not covering any negatives (consistency). For this purpose, heuristics are defined as functions h(p, n), where 0 ≤ p ≤ P (0 ≤ n ≤ N) is the number of positive (negative) examples covered by a rule [14].

In previous studies it was found that most regular heuristics (in particular those striving for consistency) lead to the learning of discriminative rules, so that in this context, so-called inverted heuristics h̄(p, n) are suggested, which better reflect the top-down nature of the rule refinement process in theory by originating from the other side of the coverage space [15]. Because of their typical focus on completeness, inverted heuristics can often "delay" the choice of too specific features, hence resulting in characteristic rules built of multiple more general features.

3.2. Limitations

Even though it has been shown empirically for some datasets that inverted heuristics result in characteristic rules [15], it is not inherent that they lead to characteristic rules. As a counterexample, consider learning a rule for the class Europe using all examples of the country dataset except Brazil. The best single condition is age ≥ 29.1, covering all six examples of class Europe as well as Uruguay. This false positive can not be excluded by further (single-cut) conditions on Size or CO2 without losing coverage of at least one true positive, so that the inverted heuristic stops with a rule consisting of a single condition. Interestingly enough, in this case, regular heuristics would even learn longer rules than inverted heuristics, since they typically prefer the trade of removing a false positive at the cost of a false negative.

Most importantly though, traditional rule learners have the severe limitation of focusing only on the coverage of the learned rules but not on how (well) they cover the examples. We already noticed in Table 1 that rule r1 in Equation 1 can be expanded to re in Equation 3 by features considering the age and CO2 of a country without covering more positive or fewer negative examples. Hence, both r1 and re correspond to the same point in the coverage space, in the top left corner, covering all positive and no negative training examples. As a consequence, independent of the chosen heuristic, conventional rule learners are not able to learn re if a refinement requires improving the heuristic.

Even if the heuristic improves by adding a new condition to the original rule, a similar issue can occur. Assume a new rule r4 learned on all ten examples in Table 1, which focuses on covering the example Germany based on the condition size ≥ 316. This rule still covers Bolivia and Brazil as well and could therefore be refined to rules r5 and r6, both considering the Age attribute:

    r4: c = e ← size ≥ 316
    r5: c = e ← size ≥ 316 ∧ age ≥ 36.3                        (4)
    r6: c = e ← size ≥ 316 ∧ age ≥ 44.5.

While r5 and r6 both correspond to the same point in the coverage space (covering one positive example and no negatives), their coverage on unseen examples might vary crucially, since they cover different areas of the feature space. Arguably, r5 should be preferred over r6 because the added condition age ≥ 36.3 covers four additional positive examples (and still no negative) compared to age ≥ 44.5. So to say, while having the same "global" concept, we should choose the rule with the better "local" condition. Note that this is not limited to numeric attributes.

To summarize, characteristic rules are usually not learned because the learners rely on heuristics that only take the number of covered positive and negative examples into account instead of separating positive and negative examples with a variety of rules and conditions. In particular, adding a condition without changing the covered examples results in the same heuristic value, in which case so far the shorter explanation is used, and the search usually stops. Additionally, the mere focus on the global coverage can lead to suboptimal "local" conditions if ties are not handled appropriately.

4. Boolean Pattern Trees

For the experimental comparison of discriminative and characteristic concepts, we use two versions of an alternating Boolean pattern tree (aBpt) learner recently developed in our group [16]. The task of learning an aBpt is quite similar to learning a rule: for every specific class y_j ∈ Y, a tree t: y_j ← B is learned, where B is a logical expression defined over the input features, which can be much more flexible than for rules. In contrast to rule learners, which use either conjunctions or disjunctions, aBpts can connect binary features by conjunctions and disjunctions in any arbitrary order. This also complicates the iterative learning of B, as there are multiple insertion options per feature. E.g., inserting a disjunction with feature c in B = a ∧ b can result in c ∨ (a ∧ b), (a ∨ c) ∧ b or a ∧ (b ∨ c), which are all logically different. These insertions are repeated until the maximum number of iterations k (= number of features in the pattern tree) is reached. We refer to [16] for further details of the algorithm and focus on two adjustments for the experiments in the following.

First, we notice that the standard version of aBpt already uses multiple heuristics for the evaluation of a tree extension, focusing on consistency and completeness in different search branches. By using various cost ratios in the linear cost metric (h_lc(t) = c·p − (1−c)·n), aBpt is capable of learning both the models preferred by regular heuristics and those preferred by inverted ones. Though, Section 3 presented as well problems that can not be fixed solely by the heuristic. To choose characteristic models instead of discriminative ones in case of ties, the learner picks the tree learned in a later iteration, i.e., the one using more conditions (and vice versa). This way we do not stop in local optima, but always use all k iterations. Furthermore, all conditions are sorted based on the accuracy metric (h_acc(t) = (p + N − n) / (P + N)) in the first iteration, so that in subsequent iterations always the best "local" condition can be picked, as discussed in Section 3.

Second, the handling of multiple pattern trees is crucial for the decision boundary. In the original aBpt classifier, one pattern tree for each class y ∈ Y is learned. Since, in a Boolean context, the output of the Boolean expression represented by the pattern tree can only be true or false for the features of the test example, ties occur if a test example is matched by multiple pattern trees, which can be broken by a fixed order of the pattern trees in a decision list.

An alternative that is used in fuzzy pattern tree classifiers [17] is evaluating all pattern trees in a probabilistic way, whereby the highest probability decides about the class prediction. A straightforward way to achieve this behavior in aBpt is using a constant uncertainty factor u, resulting in probabilities p(f) = 1 − u for fulfilled features and p(f) = u otherwise, which are then aggregated bottom-up over the respective child nodes C as p(n) = ∏_{i∈C} p(i) for conjunctive and p(n) = 1 − ∏_{i∈C} (1 − p(i)) for disjunctive nodes. However, this comes with the two inconveniences of (a) weighing all child nodes the same, independent of their importance, and (b) penalizing all conjunctive conditions (always decreasing p(n)) and rewarding all disjunctive conditions (always increasing p(n)) independent of their quality, which particularly affects characteristic models negatively. To address (a), we determine p(f) flexibly in the range [u, 1 − u] as

    p(f) = u + (1 − 2u) · p / P

for fulfilled features (≃ probability that a positive example fulfills the feature) and

    p(f) = u + (1 − 2u) · (P − p) / (P − p + N − n)

otherwise (≃ probability that an example not fulfilling the feature is positive). Additionally, for (b) we relax p(n) for the interior tree nodes as

    p(n) = 1/2 · (1/|C| · Σ_{i∈C} p(i) + min_{i∈C} p(i))

for conjunctive nodes and as

    p(n) = 1/2 · (1/|C| · Σ_{i∈C} p(i) + max_{i∈C} p(i))

Table 2
Predictive accuracies of the aBpt learner on five UCI datasets for six different settings using 10-fold cross-validation. In the first row a Boolean evaluation and in the second row a probabilistic evaluation of the pattern tree is used. The first column shows results on the original dataset, the second on an incomplete version of the dataset where 30% of the values are replaced by missing values, and the third a combination using the original data for training and the incomplete data for testing.

(a) Original + Boolean          (b) Incomplete + Boolean        (c) Mixed + Boolean
           discr.   char.                  discr.   char.                  discr.   char.
labor       87.72   84.21       labor       77.19   78.95       labor       78.95   80.70
mushroom   100.00  100.00       mushroom    96.91   96.91       mushroom    84.00   88.00
soybean     92.68   92.53       soybean     66.03   66.33       soybean     19.77   46.71
vote        94.48   94.71       vote        88.28   88.74       vote        78.39   78.16
zoo         89.11   86.14       zoo         76.24   76.24       zoo         41.58   65.35

(d) Original + Probabilistic    (e) Incomplete + Probabilistic  (f) Mixed + Probabilistic
           discr.   char.                  discr.   char.                  discr.   char.
labor       66.67   64.91       labor       64.91   64.91       labor       64.91   64.91
mushroom    83.83   87.64       mushroom    49.93   49.93       mushroom    79.60   75.49
soybean     90.04   90.48       soybean     67.64   68.52       soybean     52.86   57.25
vote        95.40   95.40       vote        88.51   88.51       vote        87.13   87.13
zoo         89.11   92.08       zoo         77.23   75.25       zoo         62.38   85.15
5. Experiments

In the experiments we analyze two different aspects of the aBpt learner, using the default configuration of k = 20 iterations and accuracy as well as seven different values of the linear cost as metrics. First, we compare a discriminative version, preferring smaller trees in case of tied heuristics, and a characteristic version, preferring bigger trees. Second, we evaluate both versions not only in a Boolean setting but also in a probabilistic setting, as suggested at the end of Section 4.

For the experiments, we choose five UCI [18] datasets where most features are not used in the discriminative models and the characteristic models could therefore differ remarkably: labor, mushroom, soybean, vote and zoo. On some datasets (labor, soybean, zoo) the otherwise inferior naive Bayes classifier even outperforms rule learners like Ripper, indicating potential for improvement of the decision boundary in the probabilistic setting.

The learners are not only applied to the original datasets but also to an "incomplete" version of each dataset, where 30% of the values are replaced by missing values, and finally a "mixed" version, where only the test data is "incomplete". This way we can analyze the robustness of the learned models, where characteristic models might be expected to perform better, since additional features can be used as a fallback option.

The predictive accuracies of a 10-fold cross-validation are shown in Table 2. Each subtable shows a head-to-head comparison of the discriminative and the characteristic learner in a given setting of dataset type and used evaluation. Overall, we see that the discriminative and the characteristic learner perform roughly equally well, while the effects of missing values vary considerably between the datasets. Except for a few cases, the harder the setting (from left to right), the lower the predictive accuracy.

Independent of the dataset setting, the predictive accuracy drops drastically for labor and mushroom when changing from the Boolean to the probabilistic setting. Though, it often increases for the other three datasets; in particular, in the "mixed" setting, the accuracies can be improved drastically using a probabilistic evaluation, indicating that the adjusted decision boundaries can make the model more robust to incomplete training data.

    c1 ← eggs=false.
    c1 ← hair=true.
    c2 ← toothed=true.
    c2 ← catsize=true.
    c2 ← legs=(1.0:4.0].
    c3 ← backbone=true.
    c3 ← airborne=false.
    c3 ← aquatic=false.
    c3 ← fins=false.
    c3 ← tail=true.
    c3 ← domestic=false.
    c3 ← predator=false.
    c3 ← predator=true.
    c3 ← domestic=true.
    c3 ← tail=false.
    c3 ← catsize=false.
    c3 ← fins=true.
    c ← milk=true ∧ c1 ∧ c2 ∧ breathes=true ∧ feathers=false ∧ c3.

Figure 3: Model learned by characteristic aBpt on the zoo dataset for class = mammal when transformed into a set of conjunctive rules.

As an example, consider the characteristic model for the zoo dataset in Figure 3. In Boolean evaluation, the missing values result in too many examples not being covered by the model. However, the numerous conditions still indicate the correct class with a probabilistic evaluation. We also notice a big gap between the performances of the discriminative and the characteristic model here, indicating that the small trees of the discriminative learner with only up to four conditions (e.g., for class=mammal it only uses the condition milk=true) are not as robust as the trees of the characteristic counterpart.

As Figure 3 also shows, the characteristic models are usually considerably larger. From an interpretability perspective, this can be preferred, since in this case we do not only discover that mammals yield milk but also that they breathe, do not wear feathers, either do not lay eggs or have hair, and either are toothed, catsized or have 1-4 legs. We also notice that the last concept c3 (which was also the last to be added to the model) is not helpful at all and indeed can be reduced to true because of tautologies. This indicates that in a characteristic setting, stopping criteria or pruning techniques are needed as well to preserve interpretability.

6. Conclusion

In this paper, we look at the possible advantages that characteristic models, which are rarely learned by conventional rule learning algorithms, can provide. While previous work on characteristic rules usually focused on the interpretability aspects, we have shown that the inclusion of additional features, both via conjunction and disjunction, can additionally help to find better decision boundaries, resulting in more robust models. We also discussed that for learning characteristic rules a mere focus on coverage is insufficient, so that both regular and inverted heuristics can not guarantee learning characteristic rules. Finding a suitable distance metric to separate positive and negative examples remains an open question.

To analyze the effects of characteristic rule-based models empirically, we implemented a characteristic version of the aBpt learner and compared it with the original discriminative version on five UCI datasets. The experiments did not show a clear advantage for either of the learners in terms of predictive accuracy, indicating that smaller models are not necessarily overgeneralizing and larger models do not inevitably lead to overfitting. In a robustness check using incomplete test data, characteristic models outperformed discriminative models. Furthermore, characteristic models slightly outperformed discriminative models when combined with probabilistic evaluation.

We also see multiple paths to further develop the potential of characteristic models in future work: most importantly, the decision boundary artificially moved by a probabilistic evaluation (which certainly provides room for improvement as well) should not only be considered during classification but also in the learning phase. In this regard, the definition of heuristics that consider not only coverage but also the "quality" of the coverage, connected to the distance between the example and the decision boundary, is crucial. This way rule learners could not only deliver a prediction but also determine how certain the prediction is, and optionally abstain from making a prediction.

References

[1] R. Andrews, J. Diederich, A. B. Tickle, Survey and critique of techniques for extracting rules from trained artificial neural networks, Knowledge-Based Systems 8 (1995) 373–389.
[2] R. Guidotti, A. Monreale, F. Giannotti, D. Pedreschi, S. Ruggieri, F. Turini, Factual and counterfactual explanations for black box decision making, IEEE Intelligent Systems 34 (2019) 14–23.
[3] A. Blumer, A. Ehrenfeucht, D. Haussler, M. K. Warmuth, Occam's razor, Information Processing Letters 24 (1987) 377–380.
[4] R. S. Michalski, A theory and methodology of inductive learning, in: Machine Learning, Elsevier, 1983, pp. 83–134.
[5] J. Fürnkranz, Pruning algorithms for rule learning, Machine Learning 27 (1997) 139–172.
[6] W. W. Cohen, Fast effective rule induction, in: Machine Learning Proceedings 1995, Elsevier, 1995, pp. 115–123.
[7] M. Eineborg, H. Boström, Classifying uncovered examples by rule stretching, in: C. Rouveirol, M. Sebag (Eds.), Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP-01), Springer Verlag, Strasbourg, France, 2001, pp. 41–50.
[8] S. Salzberg, A nearest hyperrectangle learning method, Machine Learning 6 (1991) 251–276.
[9] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[10] J. S. Cramer, The origins of logistic regression (2002).
[11] I. Rish, et al., An empirical study of the naive Bayes classifier, in: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, volume 3, Seattle, WA, USA, 2001, pp. 41–46.
[12] C. Rudin, B. Ustun, Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice, Interfaces 48 (2018) 449–466.
[13] S. Salzberg, A nearest hyperrectangle learning method, Machine Learning 6 (1991) 251–276.
[14] J. Fürnkranz, P. A. Flach, ROC 'n' rule learning – towards a better understanding of covering algorithms, Machine Learning 58 (2005) 39–77.
[15] J. Stecher, F. Janssen, J. Fürnkranz, Shorter rules are better, aren't they?, in: Discovery Science: 19th International Conference, DS 2016, Bari, Italy, October 19–21, 2016, Proceedings, Springer, 2016, pp. 279–294.
[16] F. Beck, J. Fürnkranz, V. Q. P. Huynh, Learning deep rule concepts as alternating Boolean pattern trees, in: Discovery Science: 27th International Conference, DS 2024, Pisa, Italy, October 14–16, 2024, Proceedings, Springer, 2024.
[17] R. Senge, E. Hüllermeier, Top-down induction of fuzzy pattern trees, IEEE Transactions on Fuzzy Systems 19 (2010) 241–252.
[18] M. Kelly, R. Longjohn, K. Nottingham, The UCI Machine Learning Repository, 2024. URL: https://archive.ics.uci.edu.