<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>When characteristic rule-based models should be preferred over discriminative ones</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Florian Beck</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Fürnkranz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Van Quoc Phuong Huynh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johannes Kepler University Linz, LIT Artificial Intelligence Lab / Institute for Application-oriented Knowledge Processing (FAW)</institution>
          ,
          <addr-line>Altenberger Straße 66b/69, 4040 Linz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In recent years, the interpretability of machine learning models has gained interest. White-box approaches like rule-based models serve as an interpretable alternative or as surrogate models of black-box approaches. Among these, more compact rule-based models are considered easier to interpret. In addition, they often generalize better and thus provide higher predictive accuracies than their overfitting complex counterparts. In this paper, we argue that more complex, “characteristic” rule-based models are a genuine alternative to more compact, “discriminative” ones. We discuss why characteristic models should not be considered as less interpretable, and that more included features can actually strengthen the model both in terms of robustness and predictive accuracy. For this, we evaluate the effects on the decision boundary for models of different complexity, and also modify a recently developed Boolean pattern tree learner to compare a characteristic and a discriminative version on five UCI data sets. We show that the more complex models are indeed more robust to missing data, and that they sometimes even improve the predictive accuracy on the original data.</p>
      </abstract>
      <kwd-group>
<kwd>characteristic rules</kwd>
        <kwd>discriminative rules</kwd>
        <kwd>decision boundaries</kwd>
        <kwd>interpretability</kwd>
        <kwd>robustness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the rise of neural network models in many machine
learning applications, the need has grown to actually
understand what these black-box approaches learn. This
has brought rule-based models back into the spotlight,
as they can be used as interpretable surrogates of
neural network approaches, e.g., by extracting rules from
the whole network [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or with the focus on explaining
decision boundaries [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Independent of whether rule-based models are used as
surrogates of neural networks or as a stand-alone model,
usually the principle of Occam’s Razor [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is followed,
which loosely states that the simplest
explanation is the best one. Consequently, discriminative
rules which discriminate an object of one category from
objects of other categories are preferred over
characteristic rules which try to capture all properties that
are common to the objects of the target class [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This
principle is also supported by the observation that longer
explanations tend to overfit the training data, leading
to worse performance on test data. Hence, most rule
learners use some kind of pruning policy [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], resulting
in learning short discriminative rules instead of longer
characteristic ones.
      </p>
      <p>However, there is a fine line between avoiding
overfitting and learning too general theories. Consider the
sample dataset in Table 1, consisting of six countries, three
belonging to the class Europe and three to the class South
America. For each country, the values of the three
numeric attributes Size, Age and CO2 are provided.</p>
      <p>
        Traditional rule learners like, e.g., Ripper [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] strive for discriminative rules, i.e., rules that minimize the number
of used attributes when describing the classes. In this
case, such a perfect, minimal description of the training
data could be learned with a single rule r1 only
considering the first attribute Size, together with the corresponding default
rule r0 for the other class:
      </p>
      <p>r1 : class = Europe ← Size &lt; 184 (1)
(r0 : class = SouthAmerica ← ⊤).</p>
      <p>Rule r1 covers the three examples Austria, Czechia
and Slovakia because these examples fulfill the condition
Size &lt; 184. Bolivia, Brazil and Ecuador are not covered
by r1 but only by the most general rule r0, and are thus classified
as South America. While these rules perfectly describe the
training examples, they fail to correctly classify the test
example Germany, which is not covered by r1 and hence
misclassified as South America by r0. Vice versa, Uruguay
is covered by r1 and hence misclassified as Europe. Note
that these misclassifications could have been avoided if a
different feature had been picked, such as, e.g., the median age
in rule r2 or the CO2 emissions in rule r3 (Equations 2 and 3):
European and South American countries do not only
differ in size, but also in median age and CO2 emissions.</p>
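<p>The three features discussed here (Size &lt; 184, Age ≥ 36.7 and CO2 ≥ 4.2 for class Europe, and their complements for class South America) can be combined into per-class condition lists and evaluated by counting satisfied conditions, as the paper later does when assigning an example to its closest rule. A minimal sketch; the attribute values of the example are illustrative placeholders, not the actual Table 1 figures:</p>

```python
# "Closest rule" evaluation: instead of requiring that all conditions of a
# rule hold, assign an example to the rule with the most satisfied
# conditions. Rule conditions follow the text; attribute values below are
# illustrative placeholders.

RULES = {
    "Europe": [          # rE: Size < 184, Age >= 36.7, CO2 >= 4.2
        lambda x: x["size"] < 184,
        lambda x: x["age"] >= 36.7,
        lambda x: x["co2"] >= 4.2,
    ],
    "SouthAmerica": [    # rS: Size >= 184, Age < 36.7, CO2 < 4.2
        lambda x: x["size"] >= 184,
        lambda x: x["age"] < 36.7,
        lambda x: x["co2"] < 4.2,
    ],
}

def closest_rule_class(example):
    """Return the class whose rule has the most satisfied conditions."""
    scores = {cls: sum(cond(example) for cond in conds)
              for cls, conds in RULES.items()}
    return max(scores, key=scores.get)

# A Germany-like example: too large for the Size condition, but old and
# with high CO2 emissions, so the Europe rule still wins two to one.
print(closest_rule_class({"size": 357, "age": 45.0, "co2": 8.0}))
```

<p>Counting satisfied conditions rather than requiring all of them is what makes the longer, characteristic rules robust: a test example can violate one condition and still be assigned to the correct class.</p>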
      <p>The rest of the paper is organized as follows: Section 2
further specifies the problem of finding good decision
boundaries and presents characteristic models of
non-rule-based classifiers as an inspiration for adaptation in the
rule-based setting, presented in Section 3. We modify a
rule-based learner in Section 4 accordingly and evaluate
a discriminative and a characteristic version in Section 5
in terms of predictive accuracy and robustness. Section 6
concludes and takes a brief look at the remaining
challenges.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Decision boundaries</title>
      <p>As alternatives to the feature Size used in rule r1, the median age or
the CO2 emissions can be picked to describe class Europe:</p>
      <p>r2 : class = Europe ← Age ≥ 36.7 (2)
r3 : class = Europe ← CO2 ≥ 4.2.</p>
      <p>However, r2 does not cover the test example Kosovo,
and r3 does not cover Albania, so that none of the three
rules alone would be sufficient to classify all four test examples
correctly; only a combined rule set of r2 and r3 would
do so. Similarly, the three suggested features Size &lt; 184,
Age ≥ 36.7 and CO2 ≥ 4.2 can also be connected by
conjunctions to a single rule rE for class Europe, while the
respectively contrasting features form rule rS for class
South America:</p>
      <p>rE : class = Europe ← Size &lt; 184 ∧ Age ≥ 36.7 ∧ CO2 ≥ 4.2
rS : class = SouthAmerica ← Size ≥ 184 ∧ Age &lt; 36.7 ∧ CO2 &lt; 4.2. (3)</p>
      <p>
        While neither of the two rules covers any of the test
examples, a slight modification of their semantics allows
us to use them as reliable classifiers. Instead of requiring
that all conditions of a rule are satisfied, we
assign an example to its closest rule, a method
that is reminiscent of rule stretching [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or nearest hyper-rectangle classification [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In our example, the first three test examples are assigned to class Europe, since each
of them satisfies two out of three conditions of rE and
only one out of three of rS. Analogously, test example
Uruguay is correctly classified as South America.
      </p>
      <p>Independent of using conjunctions or disjunctions as
the connector, we notice that the more characteristic rule
theories in Equations 2 and 3 are able to classify all four test
examples correctly, while the discriminative rule theory in
Equation 1 is not able to do so. Moreover, the inclusion
of more features in the characteristic concepts might not
only lead to a better performance but also arguably provide
more interesting and interpretable models, stating
that European and South American countries do not only
differ in size, but also in median age and CO2 emissions.
(Country data retrieved 2024/07/04 from
https://ourworldindata.org/age-structure and
https://ourworldindata.org/co2-and-greenhouse-gas-emissions.)</p>
      <p>As depicted in the introduction, in contrast to long
characteristic rules being prone to overfitting, short
discriminative rules come with the risk of providing too simplistic
theories that overgeneralize. This can also be illustrated
by the decision boundary of the country dataset rules,
see Figure 1. For a better visualization, we omit the third
attribute CO2 to obtain a two-dimensional feature space,
using the attribute Size in logarithmic scale on the x-axis
and Age on the y-axis. The raw data are shown in
Figure 1a: the six training examples as points and the four
test examples as circles, colored in blue for class Europe
and in red for class South America, respectively.</p>
      <p>Figure 1: (a) Original data, (b) Discriminative rule, (c) Characteristic rule,
(d) Support Vector Machine, (e) Scoring system, (f) Hyper-rectangles.</p>
      <p>We see that the training examples are quite easily
separable from each other, while the test examples complicate
finding a good decision boundary. Figure 1b shows a
discriminative rule class = SouthAmerica ← Age &lt; 36, covering all four
South American countries along with one European country in
the light-red area of the feature space. The light-blue
area (classified by the default rule class = Europe ← ⊤) contains
five true negatives. By adding the condition Size &gt; 140,
the rule can be made more characteristic, leading to a
perfect classification of all examples, see Figure 1c.</p>
      <p>
        Still, the decision boundary provided in Figure 1c can
be considered suboptimal when compared with
non-rule-based models. Figure 1d illustrates an arguably better
decision boundary which other methods like, e.g., support
vector machines [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], logistic regression [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and naive
Bayes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] can find. All these approaches have in common that
they usually consider all attributes in the feature space
and rely on continuous coefficients to build their models;
in this case: class = SouthAmerica ← Age − 10 · log10(Size) ≤ 15.
In comparison, conventional rule learners only use combinations of
attribute-value conditions for the splits of their classes. As a
consequence, one of the main limitations of rule
learning is arguably its restriction to axis-parallel decision
boundaries. Though, the last two subfigures show two
ways how rule-based methods can still mimic decision
boundaries like in Figure 1d.
      </p>
      <p>
        In Figure 1e, we see multiple steps in the decision
boundary. Trivially, this behavior can be achieved by
learning one rule for each step. While this is straightforward
in this example, it is too hard to maintain in a high-dimensional
feature space with an exponentially increasing
number of combinations. Scoring systems [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] scale
better by assigning low integer scores to attribute-value
combinations, hereby providing a trade-off between rules
and linear models. In the special case that all weights
are binary, the scoring system converts into an m-of-n
concept. With scores assigned per attribute by a scheme of the form
s(x) = 3 if x ≥ 1100, 2 if x ≥ 140, 1 if x ≥ 20, 0 else,
and a threshold of 4 for class South America, all
examples are classified correctly while providing a more
customized decision boundary compared to Figure 1c.
      </p>
      <p>
        Finally, Figure 1f is the illustration of two characteristic
rules similar to Equation 3: We describe both classes
Europe and South America without using a default rule.
Obviously, the learned rules of the two classes can overlap
or, as in this case, leave wide areas of the feature
space uncovered, so that a pure Boolean evaluation of
the rules is not sufficient anymore. One way to handle
these uncovered areas are nearest hyper-rectangles
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The decision boundary between the two classes
can be shaped arbitrarily if enough hyper-rectangles, i.e.,
rules, are learned (and is actually not quite straight
in Figure 1f either). Obviously, distances for nominal attributes
can not be defined as straightforwardly as for numerical
attributes, as is discussed in the following section.
In this work, we aim to expand rule-based approaches
to reach the stronger expressiveness shown in the last
three subfigures while still retaining the properties that
make them interpretable, i.e., without including all features
and interactions and, most notably, without
continuous coefficients like in SVMs, logistic regression or
naive Bayes, for which purpose characteristic rules are preferable.
      </p>
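<p>The stepped decision boundary of Figure 1e can be sketched as a tiny scoring system. The cut-points 1100/140/20 and the decision threshold of 4 are taken from the text; how the second attribute is scored in the paper's figure is not fully recoverable, so the Age-based score below is an illustrative assumption:</p>

```python
# Minimal scoring-system classifier: integer scores per attribute, class
# predicted by comparing the summed score to a threshold (4 for class
# South America, as in the text). The Age-based score is an assumption.

def step_score(value, cuts=(1100, 140, 20)):
    """Return 3/2/1/0 depending on the highest threshold reached."""
    for score, cut in zip((3, 2, 1), cuts):
        if value >= cut:
            return score
    return 0

def classify(size, age, threshold=4):
    # Large countries and young populations both add to the score
    # (assumed scheme, not the paper's exact score table).
    total = step_score(size) + step_score(40 - age, cuts=(15, 10, 5))
    return "SouthAmerica" if total >= threshold else "Europe"

print(classify(size=8516, age=29))  # large and young
print(classify(size=84, age=44))    # small and old
```

<p>Each integer cut-point contributes one axis-parallel step to the decision boundary, which is how a scoring system interpolates between a single rule and a linear model.</p>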
    </sec>
    <sec id="sec-3">
      <title>3. Characteristic rule learning</title>
      <p>
        So far we discussed why characteristic rules can be beneficial
both in terms of interpretability and performance,
but observed as well that almost no rule-based methods
learn such concepts. To understand why conventional
rule learners prefer discriminative rules, we first briefly
introduce the coverage space and related heuristics.
Subsequently, we reveal potential issues with the latter and
identify properties which should be taken into
consideration when developing a characteristic rule-based learner.
      </p>
      <p>3.1. Coverage space and heuristics</p>
      <p>
        Traditionally, rules are gradually refined by adding
individual conditions, whereby conjunctive refinements
specialize a rule (afterwards it can never cover more
examples than before the refinement), whereas disjunctive
refinements generalize a rule (afterwards it can never
cover fewer examples than before the refinement). This
can be visualized in coverage space, a non-normalized
ROC space, where the x-axis shows the covered negative
and the y-axis the covered positive examples [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For example, Figure 2 shows a path that gradually refines
an initially universal rule (covering all P positive and N
negative examples, upper right corner of the coverage
space) into the rule + ← A ∧ B.
      </p>
      <p>Figure 2: Rule refinement in coverage space</p>
      <p>Apparently, a rule refined to the upper left corner can
be considered perfect, since it covers only positive
examples and no negatives. In most scenarios such a rule
can not be found, so that a trade-off must be found
between the importance of covering all positives (completeness)
and not covering any negatives (consistency). For
this purpose, heuristics are defined as functions h(p, n),
where p (0 ≤ p ≤ P) and n (0 ≤ n ≤ N) are the numbers of positive
and negative examples covered by a rule [14].</p>
      <p>
        In previous studies it was found that most regular
heuristics (in particular those striving for consistency)
lead to the learning of discriminative rules, so that in
this context so-called inverted heuristics are suggested,
which better reflect the top-down nature of the
rule refinement process by originating from
the other side of the coverage space [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Because of their typical focus on completeness, inverted heuristics can
often "delay" the choice of too specific features, hence
resulting in characteristic rules built of multiple more
general features.
      </p>
      <p>3.2. Limitations</p>
      <p>
        Even though it has been shown empirically for some
datasets that inverted heuristics result in characteristic
rules [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], it is not inherent that they lead to
characteristic rules. As a counterexample, consider learning a
rule for the class Europe using all examples of the
country dataset except Brazil. The best single condition
is Age ≥ 29.1, covering all six examples of class Europe
as well as Uruguay. This false positive can not be
excluded by further (single-cut) conditions on Size or CO2
without losing coverage of at least one true positive, so
that the inverted heuristic stops with a rule consisting
of a single condition. Interestingly enough, in this case,
regular heuristics would even learn longer rules than
inverted heuristics, since they typically prefer the trade
of removing a false positive at the cost of a false negative.
      </p>
      <p>Most importantly though, traditional rule learners
have the severe limitation of focusing only on the
coverage of the learned rules, but not on how (well) they cover
the examples. We already noticed in Table 1 that rule r1
in Equation 1 can be expanded to rE in Equation 3 by
features considering the age and CO2 emissions of a country without
covering more positive or fewer negative examples. Hence,
both r1 and rE correspond to the same point in the
coverage space in the top left corner, covering all positive
and no negative training examples. As a consequence,
independent of the chosen heuristic, conventional rule
learners are not able to learn rE if a refinement requires
improving the heuristic.</p>
      <p>Even if the heuristic improves by adding a new
condition to the original rule, a similar issue can occur. Assume
a new rule r4 learned on all ten examples in Table 1,
which focuses on covering the example Germany based
on the condition Size ≥ 316. This rule still covers
Bolivia and Brazil as well and could therefore be refined to
rules r5 and r6, both considering the Age attribute:</p>
      <p>r4 : class = Europe ← Size ≥ 316
r5 : class = Europe ← Size ≥ 316 ∧ Age ≥ 36.3 (4)
r6 : class = Europe ← Size ≥ 316 ∧ Age ≥ 44.5.</p>
      <p>While r5 and r6 both correspond to the same point
in the coverage space (covering one positive example
and no negatives), their coverage on unseen examples
might vary crucially, since they cover different areas in
the feature space. Arguably, r5 should be preferred over
r6 because the added condition Age ≥ 36.3 covers four
additional positive examples (and still no negative)
compared to Age ≥ 44.5. So to say, while having the same
"global" concept, we should choose the rule with the
better "local" condition. Note that this is not limited to
numeric attributes.</p>
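<p>The core limitation discussed in this section can be made concrete in a few lines: any heuristic h(p, n) that sees only the number of covered positives and negatives must assign identical values to r1 and rE, since both cover p = 3 positives and n = 0 negatives on the training data. A sketch with three common heuristics (precision, the linear cost metric and the accuracy metric used later in the paper); the function names are illustrative:</p>

```python
# Any heuristic defined purely on the coverage counts (p, n) gives the
# same value to the short rule r1 and the characteristic rule rE, because
# both cover p = 3 positives and n = 0 negatives on the training data.

P, N = 3, 3  # total positive (Europe) and negative (South America) examples

def precision(p, n):
    return p / (p + n) if p + n else 0.0

def linear_cost(p, n, c=0.5):
    return c * p - (1 - c) * n

def accuracy(p, n):
    return (p + (N - n)) / (P + N)

r1_coverage = (3, 0)  # Size < 184
rE_coverage = (3, 0)  # adding the Age and CO2 conditions changes nothing

for h in (precision, linear_cost, accuracy):
    assert h(*r1_coverage) == h(*rE_coverage)  # tie for every heuristic
print("r1 and rE are indistinguishable by any h(p, n)")
```

<p>This is why the choice between the two rules has so far been made by the tie-breaking policy (prefer the shorter explanation) rather than by the heuristic itself.</p>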
      <p>
        To summarize, characteristic rules are usually not
learned because the learners rely on heuristics that only
take the number of covered positive and negative
examples into account, instead of separating positive and
negative examples with a variety of rules and conditions.
Particularly, adding a condition without changing the
covered examples results in the same heuristic value, in
which case so far the shorter explanation is used and the
search usually stops. Additionally, the mere focus on the
global coverage can lead to suboptimal "local" conditions
if ties are not handled appropriately.
      </p>
      <p>4. Boolean Pattern Trees</p>
      <p>
        For the experimental comparison of discriminative and
characteristic concepts, we use two versions of an
alternating Boolean pattern tree (aBpt) learner recently
developed in our group [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The task of learning an aBpt is
quite similar to learning a rule: For every specific class
c ∈ Y, a tree t : c ← E is learned, where E is a logical
expression defined over the input features, which can
be much more flexible than for rules. In contrast to rule
learners, which use either conjunctions or disjunctions,
aBpts can connect binary features by conjunctions and
disjunctions in any arbitrary order. This also complicates
the iterative learning of E, as there are multiple insertion
options per feature. E.g., inserting a disjunction with
feature C into E = A ∧ B can result in C ∨ (A ∧ B), (A ∨ C) ∧ B
or A ∧ (B ∨ C), which are all logically different. These
insertions are repeated until the maximum number of
iterations k (= number of features in the pattern tree) is
reached. We refer to [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] for further details of the
algorithm and focus on two adjustments for the experiments
in the following.
      </p>
      <p>First, we notice that the standard version of aBpt
already uses multiple heuristics for the evaluation of a tree
extension, focusing on consistency and
completeness in different search branches. By using various cost
ratios c in the linear cost metric (h(p, n) = c · p − (1 − c) · n),
aBpt is capable of learning both the models preferred by
regular and those preferred by inverted heuristics. Though, Section 3 also presented
problems that can not be fixed solely by the
heuristic. To choose characteristic models instead of
discriminative ones in case of ties, the learner picks the
tree learned in a later iteration, i.e., the one using more
conditions (and vice versa for discriminative models). This way we do not stop in local
optima, but always use all k iterations. Furthermore,
all conditions are sorted based on the accuracy metric
(h(p, n) = (p + (N − n)) / (P + N)) in the first iteration, so that in
subsequent iterations always the best "local" condition can
be picked, as discussed in Section 3.</p>
      <p>
        Second, the handling of multiple pattern trees is crucial
for the decision boundary. In the original aBpt classifier,
one pattern tree for each class c ∈ Y is learned. Since in
a Boolean context the output of the Boolean expression
represented by the pattern tree can only be true or false
for the features of a test example, ties occur if a test
example is matched by multiple pattern trees; these ties
can be broken by a fixed order of the pattern trees in a
decision list. An alternative that is used in fuzzy pattern tree
classifiers [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is to evaluate all pattern trees in a
probabilistic way, whereby the highest probability decides about
the class prediction. A straightforward way to achieve
this behavior in aBpt is using a constant uncertainty
factor ε, resulting in probabilities P(f) = 1 − ε for
fulfilled features f and P(f) = ε otherwise, which are then
aggregated bottom-up over the respective child nodes
c of a node n as P(n) = ∏_{c ∈ n} P(c) for conjunctive and P(n) =
1 − ∏_{c ∈ n} (1 − P(c)) for disjunctive nodes. However, this
comes with the two inconveniences of (a) weighing all child
nodes the same, independent of their importance, and (b)
penalizing all conjunctive conditions (always decreasing
P(n)) and rewarding all disjunctive conditions (always
increasing P(n)) independent of their quality, which
particularly affects characteristic models negatively.
      </p>
      <p>To address (a), we determine P(f) flexibly in the range [ε, 1 − ε] as
P(f) = ε + (1 − 2ε) · p / P
for fulfilled features (≃ probability that a positive example
fulfills the feature) and
P(f) = ε + (1 − 2ε) · (N − n) / ((P − p) + (N − n))
otherwise (≃ probability that an example not fulfilling the
feature is negative). Additionally, for (b) we relax P(n)
for the interior tree nodes as
P(n) = 1/2 · (1/|n| · ∑_{c ∈ n} P(c) + min_{c ∈ n} P(c))
for conjunctive nodes and as
P(n) = 1/2 · (1/|n| · ∑_{c ∈ n} P(c) + max_{c ∈ n} P(c))
for disjunctive nodes, resulting in a probability less
dependent on the number of child nodes, so that models
with different numbers of conjunctive and disjunctive
nodes can be compared better.</p>
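<p>The relaxed bottom-up evaluation can be sketched as follows. This is a simplified illustration with a constant uncertainty factor; the flexible per-feature probabilities based on p/P described in the text are omitted, and the tiny tree is a hypothetical example, not a learned aBpt:</p>

```python
# Probabilistic evaluation of a small pattern tree with constant
# uncertainty factor EPS and the relaxed aggregation: interior nodes
# average the mean child probability with the min (conjunction) or max
# (disjunction) child probability. The tree itself is hypothetical.

EPS = 0.1

def evaluate(node, example):
    """node is a feature name (leaf) or a tuple (op, children)."""
    if isinstance(node, str):
        return 1 - EPS if example.get(node) else EPS
    op, children = node
    probs = [evaluate(child, example) for child in children]
    mean = sum(probs) / len(probs)
    extreme = min(probs) if op == "and" else max(probs)
    return 0.5 * (mean + extreme)

# Mammal-like concept: milk AND (eggs_false OR hair)
tree = ("and", ["milk", ("or", ["eggs_false", "hair"])])

dog = {"milk": True, "hair": True, "eggs_false": True}
trout = {"milk": False, "hair": False, "eggs_false": False}
print(evaluate(tree, dog), evaluate(tree, trout))
```

<p>Averaging the mean with the min (or max) keeps the result in the range spanned by the child probabilities, so trees with many conjunctive or disjunctive children are not systematically penalized or rewarded.</p>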
    </sec>
    <sec id="sec-4">
      <title>5. Experiments</title>
      <p>In the experiments we analyze two diferent aspects of
the aBpt learner, using the default configuration of
k = 20 iterations, and accuracy and seven different
values for the linear cost as metrics. First, we compare a
discriminative version preferring smaller trees in case
of tied heuristics and a characteristic version preferring
bigger trees. Second, we evaluate both versions not only
in a Boolean setting but also in a probabilistic setting, as
suggested in the end of Section 4.</p>
      <p>
        For the experiments, we choose five UCI [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] datasets
where most features are not used in the discriminative
models and therefore the characteristic models could
differ remarkably: labor, mushroom, soybean, vote and
zoo. On some datasets (labor, soybean, zoo) the
otherwise inferior naive Bayes classifier even outperforms
rule learners like Ripper, indicating potential for
improvement of the decision boundary in the probabilistic setting.
      </p>
      <p>The learners are not only applied to the original
datasets but also to an "incomplete" version of the dataset,
where 30% of the values are replaced by missing values,
and finally a "mixed" version, where only the test data is
"incomplete". This way we can analyze the robustness of
the learned models, where characteristic models might
be expected to perform better, since additional features
are used as a fallback option.</p>
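<p>The "incomplete" setting can be reproduced with a few lines. Only the 30% fraction comes from the text; the list-of-dicts dataset layout and the fixed seed are illustrative assumptions:</p>

```python
# Build the "incomplete" dataset variant: replace roughly 30% of all
# attribute values with None (missing), using a seeded generator for
# reproducibility.

import random

def make_incomplete(examples, fraction=0.3, seed=42):
    rng = random.Random(seed)
    return [{key: (None if rng.random() < fraction else value)
             for key, value in example.items()}
            for example in examples]

data = [{"milk": True, "hair": True, "legs": 4} for _ in range(100)]
incomplete = make_incomplete(data)
missing = sum(value is None for ex in incomplete for value in ex.values())
print(missing, "of 300 values removed")
```

<p>In the "mixed" setting, the same corruption would be applied to the test folds only, leaving the training data complete.</p>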
      <p>The predictive accuracies of a 10-fold-cross-validation
are shown in Table 2. Each table shows a
head-tohead comparison of the discriminative and characteristic
learner in a given setting of dataset type and used
evaluation. Overall, we see that the discriminative and
characteristic learner perform roughly equally well, while the
effects of missing values vary considerably between the
datasets. Except for a few cases, the harder the setting (from
left to right), the lower the predictive accuracy.</p>
      <p>Independent of the dataset setting, the predictive
accuracy drops drastically for labor and mushroom when
changing from the Boolean to the probabilistic setting.
Though, it often increases for the other three datasets;
in particular, in the "mixed" setting, the accuracies can be
improved drastically using a probabilistic evaluation,
indicating that the adjusted decision boundaries can make
the model more robust to incomplete training data.</p>
      <p>Figure 3: Characteristic model for class=mammal on the zoo dataset:
c1 ← eggs=false.
c1 ← hair=true.
c2 ← toothed=true.
c2 ← catsize=true.
c2 ← legs=(1.0:4.0].
c3 ← backbone=true.
c3 ← airborne=false.
c3 ← aquatic=false.
c3 ← fins=false.
c3 ← tail=true.
c3 ← domestic=false.
c3 ← predator=false.
c3 ← predator=true.
c3 ← domestic=true.
c3 ← tail=false.
c3 ← catsize=false.
c3 ← fins=true.
c ← milk=true ∧ c1 ∧ c2 ∧ breathes=true ∧ feathers=false ∧ c3.</p>
      <p>As an example, consider the characteristic model for
the zoo dataset in Figure 3. In Boolean evaluation, the
missing values result in too many examples not being
covered by the model. However, the numerous conditions
still indicate the correct class with a probabilistic
evaluation. We also notice a big gap between the performances
of the discriminative and the characteristic model here,
indicating that the small trees with only up to four conditions
of the discriminative learner (e.g., for class=mammal it
only uses the condition milk=true) are not as robust as
the trees of the characteristic counterpart.</p>
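<p>The robustness gap described here can be illustrated directly: a strict Boolean evaluation of a characteristic rule fails as soon as a single condition value is missing, while counting the verifiably satisfied conditions still points to the correct class. The condition list below is a subset of the zoo model's features; the counting scheme is a simplified stand-in for the probabilistic aBpt evaluation, not the exact method:</p>

```python
# Boolean vs. counting evaluation on an example with missing values. The
# conditions are a subset of the characteristic mammal model; the counting
# scheme is a simplified stand-in for aBpt's probabilistic evaluation.

CONDITIONS = [
    ("milk", True), ("breathes", True), ("feathers", False),
    ("hair", True), ("toothed", True), ("backbone", True),
]

def boolean_covers(example):
    """Strict coverage: every condition must hold (missing => not covered)."""
    return all(example.get(attr) == val for attr, val in CONDITIONS)

def satisfied_fraction(example):
    """Fraction of satisfied conditions among those with a known value."""
    known = [(attr, val) for attr, val in CONDITIONS
             if example.get(attr) is not None]
    if not known:
        return 0.0
    return sum(example[attr] == val for attr, val in known) / len(known)

# A mammal with two attribute values missing (None):
bear = {"milk": True, "breathes": None, "feathers": False,
        "hair": True, "toothed": None, "backbone": True}
print(boolean_covers(bear))      # False: missing values break coverage
print(satisfied_fraction(bear))  # 1.0: all known conditions match
```

<p>A discriminative tree with a single condition loses all information as soon as that one value is missing, whereas the many conditions of the characteristic model act as a fallback.</p>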
      <p>As Figure 3 also shows, the characteristic models are
usually considerably larger. From an interpretability
perspective, this can be preferred, since in this case we do not
only discover that mammals give milk but also that they breathe,
do not have feathers, either do not lay eggs or have hair,
and either are toothed, catsized or have 1-4 legs. We also
notice that the last concept c3 (which was also the last
to be added to the model) is not helpful at all and indeed
can be reduced to true because of tautologies. This
indicates that in a characteristic setting, stopping criteria
or pruning techniques are needed as well to preserve
interpretability.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>In this paper, we look at the possible advantages that
characteristic models, which are rarely learned in
conventional rule learning algorithms, can provide. While
previous work on characteristic rules usually focused
on the interpretability aspects, we have shown that the
inclusion of additional features, both via conjunction and
disjunction, can additionally help to find better decision
boundaries, resulting in more robust models. We also
discussed that for learning characteristic rules a mere
focus on coverage is insufficient, so that both regular and
inverted heuristics can not guarantee learning
characteristic rules. Finding a suitable distance metric to separate
positive and negative examples remains as an open
question.</p>
      <p>To analyze the effects of characteristic rule-based
models empirically, we implemented a characteristic version
of the aBpt learner and compared it with the original
discriminative version on five UCI datasets. The
experiments did not show a clear advantage for any of the
learners in terms of predictive accuracy, indicating that
smaller models are not necessarily overgeneralizing and
larger models do not inevitably lead to overfitting. In a
robustness check using incomplete test data, characteristic
models outperformed discriminative models.
Furthermore, characteristic models slightly outperformed
discriminative models when combined with probabilistic
evaluation.</p>
      <p>We also see multiple paths to further develop the
potential of characteristic models in future work: Most
importantly, the decision boundary artificially moved by
a probabilistic evaluation (which certainly provides room
for improvement as well) should not only be considered
during classification but also in the learning phase. In this
regard, the definition of heuristics not only considering
coverage but also the "quality" of the coverage, connected
to the distance between the example and the decision
boundary, is crucial. This way rule learners could not
only deliver a prediction but also determine how certain
the prediction is, and optionally abstain from making a
prediction.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Andrews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Diederich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Tickle</surname>
          </string-name>
          ,
          <article-title>Survey and critique of techniques for extracting rules from trained artificial neural networks</article-title>
          ,
          <source>Knowledge-Based Systems 8</source>
          (
          <year>1995</year>
          )
          <fpage>373</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monreale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruggieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Turini</surname>
          </string-name>
          ,
          <article-title>Factual and counterfactual explanations for black box decision making</article-title>
          ,
          <source>IEEE Intelligent Systems</source>
          <volume>34</volume>
          (
          <year>2019</year>
          )
          <fpage>14</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Blumer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ehrenfeucht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Haussler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Warmuth</surname>
          </string-name>
          ,
          <article-title>Occam's razor</article-title>
          ,
          <source>Information processing letters</source>
          <volume>24</volume>
          (
          <year>1987</year>
          )
          <fpage>377</fpage>
          -
          <lpage>380</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Michalski</surname>
          </string-name>
          ,
          <article-title>A theory and methodology of inductive learning</article-title>
          ,
          <source>in: Machine learning, Elsevier</source>
          ,
          <year>1983</year>
          , pp.
          <fpage>83</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          ,
          <article-title>Pruning algorithms for rule learning</article-title>
          ,
          <source>Machine Learning</source>
          <volume>27</volume>
          (
          <year>1997</year>
          )
          <fpage>139</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <article-title>Fast effective rule induction</article-title>
          ,
          <source>in: Machine learning proceedings 1995, Elsevier</source>
          ,
          <year>1995</year>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>123</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eineborg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Boström</surname>
          </string-name>
          ,
          <article-title>Classifying uncovered examples by rule stretching</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Rouveirol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sebag</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP-01)</source>
          , Springer Verlag, Strasbourg, France,
          <year>2001</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Salzberg</surname>
          </string-name>
          ,
          <article-title>A nearest hyperrectangle learning method</article-title>
          ,
          <source>Machine Learning</source>
          <volume>6</volume>
          (
          <year>1991</year>
          )
          <fpage>251</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Cristianini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          ,
          <article-title>An introduction to support vector machines and other kernel-based learning methods</article-title>
          , Cambridge university press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cramer</surname>
          </string-name>
          ,
          <article-title>The origins of logistic regression</article-title>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>I.</given-names>
            <surname>Rish</surname>
          </string-name>
          , et al.,
          <article-title>An empirical study of the naive Bayes classifier</article-title>
          ,
          <source>in: IJCAI 2001 workshop on empirical methods in artificial intelligence</source>
          , volume
          <volume>3</volume>
          , Seattle, WA, USA,
          <year>2001</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ustun</surname>
          </string-name>
          ,
          <article-title>Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice</article-title>
          ,
          <source>Interfaces</source>
          <volume>48</volume>
          (
          <year>2018</year>
          )
          <fpage>449</fpage>
          -
          <lpage>466</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Salzberg</surname>
          </string-name>
          ,
          <article-title>A nearest hyperrectangle learning method</article-title>
          ,
          <source>Machine Learning</source>
          <volume>6</volume>
          (
          <year>1991</year>
          )
          <fpage>251</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Flach</surname>
          </string-name>
          ,
          <article-title>ROC 'n' rule learning: towards a better understanding of covering algorithms</article-title>
          ,
          <source>Machine Learning</source>
          <volume>58</volume>
          (
          <year>2005</year>
          )
          <fpage>39</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Stecher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Janssen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          ,
          <article-title>Shorter rules are better, aren't they?</article-title>
          ,
          <source>in: Discovery Science: 19th International Conference, DS 2016, Bari, Italy, October 19-21, 2016, Proceedings</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>279</fpage>
          -
          <lpage>294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>F.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Q. P.</given-names>
            <surname>Huynh</surname>
          </string-name>
          ,
          <article-title>Learning deep rule concepts as alternating Boolean pattern trees</article-title>
          ,
          <source>in: Discovery Science: 27th International Conference, DS 2024, Pisa, Italy, October 14-16, 2024, Proceedings</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Senge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hüllermeier</surname>
          </string-name>
          ,
          <article-title>Top-down induction of fuzzy pattern trees</article-title>
          ,
          <source>IEEE Transactions on Fuzzy Systems</source>
          <volume>19</volume>
          (
          <year>2010</year>
          )
          <fpage>241</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Longjohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nottingham</surname>
          </string-name>
          ,
          <source>The UCI machine learning repository</source>
          ,
          <year>2024</year>
          . URL: https://archive.ics.uci.edu.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>