-

1613-0073

Interpretable classifiers for tabular data via feature selection and discretization

Reijo Jaakkola

0 1

Tomi Janhunen

0 1

Antti Kuusisto

0 1

Masood Feyzbakhsh Rankooh

0 1

Miikka Vilander

0 1

Interpretable AI, Boolean logic, Overfitting

0 Santiago de Compostela , Spain 1 Tampere University , Finland

We introduce a method for computing immediately human interpretable yet accurate classifiers from tabular data. The classifiers obtained are short Boolean formulas, computed via first discretizing the original data and then using feature selection coupled with a very fast algorithm for producing the best possible Boolean classifier for the setting. We demonstrate the approach via 12 experiments, obtaining results with accuracies comparable to ones obtained via random forests, XGBoost, and existing results for the same datasets in the literature. In most cases, the accuracy of our method is in fact similar to that of the reference methods, even though the main objective of our study is the immediate interpretability of our classifiers. We also prove a new result on the probability that the classifier we obtain from real-life data corresponds to the ideally best classifier with respect to the background distribution the data comes from.

CEUR ceur-ws.org

1. Introduction

Explainability and human interpretability are becoming an increasingly important part of research on machine learning. In addition to the immediate benefits of explanations and interpretability in scientific contexts, the capacity to provide explanations behind automated decisions has already been widely addressed also on the level of legislation. For example, the from the classifier itself, which contrasts with explaining the operation of an external, black-box classifier [ 7 ].1 For most of our experiments, the classifiers we obtain are globally interpretable, meaning that the classifier itself immediately reveals how it works on every possible input. Global interpretability stems from the classifiers being extremely short logical formulas and thereby directly intelligible. Now, globally interpretable classifiers obtained by our method are automatically also locally interpretable, meaning that they provide an easy way of explicating why any particular input was classified in a particular way. For a reasonably small part of our experiments, the classifiers obtained are only locally but not quite globally interpretable.

The classifiers we produce are given in the form of short Boolean DNF-formulas which can naturally be conceived as simply Boolean concepts. One of the key issues that makes our approach possible is the surprising power of successful feature selection. Indeed, in our formulas, we use very small numbers of attributes, making them short and thus interpretable. Already in [ 8 ], it was observed that by using a single attribute, one could get, for many of the 16 datasets studied in that paper, an accuracy not drastically diferent from the then state-of-the-art decision trees. We apply this “simplicity first” approach in our method, together with a very rough discretization of numerical attributes to Boolean form. While we focus on interpretability, we obtain also relatively accurate classifiers. Indeed, our classifiers perform surprisingly well in comparison to widely recognized methods despite the fact that generally, there exists an obvious trade-of between accuracy and interpretability.

1.1. Overview of our method and contributions

Our method works as follows. We have a tabular dataset with numerical and categorical attributes 1, … , and a binary target attribute . Our goal is to produce a classifier for based on 1, … , . Our method is based on the following steps; see Section 3 for more details. 1. We first discretize the data to Boolean form using a very rough method of chopping numerical predicates at the median. 2. Then, for increasing numbers ℓ, we use feature selection to choose ℓ suitable attributes.

After that, we compute the best possible Boolean formula for predicting . By “best possible” we mean the Boolean formula has the least percentage of misclassified points among all Boolean formulas using the ℓ chosen attributes. In other words, the formula has the least empirical error with respect to the 0-1 loss function. The formula is given in DNF-form. 3. We use nested cross-validation to test the accuracy of the formulas and find the most accurate formula using up to 10 attributes (the parameter 10 can be adjusted, but the choice 10 turned out suficient for our experiments: see Section 3 for a discussion). We look through the sequence of formulas obtained above for diferent numbers ℓ of features. Among the formulas with accuracy within one percentage point of the most accurate formula in the sequence, we select the one with the least number of features. We use the 1However, our method could of course also be used to explicate existing black-box models in a model agnostic way via first generating data with the black-box model to be explained and then using the method to produce a closely corresponding interpretable classifier.

attributes in this formula to train the final formula via the entire training data. Finally, we slightly simplify the final obtained formula—keeping it in DNF-form—via using a standard method from the literature.

To demonstrate the robustness of our method, the discretization step is performed very roughly. Also the feature selection procedures are not critical to our approach; we use three readily available ones and choose the best final formula.

Concerning step 2, for computing the best possible DNF-formula, we present a very eficient algorithm running in time (| || || |) , where | | is the number of rows in the data, | | the number of selected features and | | ≤ min(| |, 2 || ) the number of diferent row types realized in the data with the selected features. The algorithm is fixed-parameter linear when | | is bounded by a constant—which it is in our scenario. If | | was not constant, the algorithm would have running time (| | 2). Now, all our tests ran fast and smoothly on a laptop even with large datasets containing up to 423 680 rows. For six of the datasets, the runs took less than 30 seconds per split. Of the remaining datasets, five took less than 12 minutes per split. Even the two largest datasets had a runtime of less than an hour. The hyperparameter optimizations of random forests and XGBoost took systematically longer than running our method. For example, for the largest dataset, the hyperparameter optimization for XGBoost took roughly two and half hours per split.

As already indicated, we test our method on 12 tabular datasets: seven binary classification tasks from the UCI machine learning repository; a high-dimensional binary classification task with biological data; and four benchmark sets from [ 9 ]. See Section 4 for a detailed list. In each test, we compare the accuracy of our classifier to a result obtained by state-of-the art classifiers. Now, taking into account that tabular data is still a challenge to deep learning methods [ 9, 4 ], we use both XGBoost [ 10 ] and Sklearn’s implementation of random forests [ 11 ] as reference methods. These models are widely recognized for their performance on tabular data [ 3, 9 ]. For the UCI datasets we additionally compare to the method of [ 12 ] based on the length of formulas. This method is computationally too ineficient to use on the other datasets we consider. In addition to the reference methods, in relation to 10 of the experiments, we also report results found in the literature for the used datasets.

For 11 datasets we use standard ten-fold cross-validation and report the average accuracy as well as the standard deviation over the ten splits. For the high-dimensional dataset Colon with less than 100 data points we instead use leave-one-out cross-validation.

All formulas encountered in our experiments are small enough to be easily locally interpretable. In the local interpretation process, the interpreter has a row of classified data which is then compared to the conjunctions of the DNF-classifier our method produces. A positively classified row will match with precisely one conjunction of the DNF-classifier, and a negatively interpreted one will clash with at least one attribute of each conjunction.

In addition to local interpretability, most of the formulas produced in the experiments are in fact so simple that they can be readily globally interpreted, meaning that their behaviour on any input is clear simply by the form of the formula. Globally interpreting a short DNF-formula involves looking at each of the few conjunctions separately and interpreting the meaning of their particular combination of attributes. The behaviour of the entire formula is then given by the fact that it accepts the cases given by the conjunctions and rejects everything else.

As a concrete example of an interpretable classifier, we get the formula

¬ 1 ∨ (¬ 2 ∧ ¬ 3) from an experiment on a Breast Cancer dataset concerning the benignity of breast tumors. The attributes 1, 2 and 3 relate to measures of uniformity of cell size, bare nuclei and bland chromatin. The accuracy of this formula on the corresponding test data is 94.1 percent. The average accuracy of our method over ten splits of the Breast Cancer data is 95.9 percent, while XGBoost obtains 97.1 percent and random forests 97.4 percent. We stress that our formula is, indeed, highly interpretable, while the classifiers obtained by XGBoost and random forests are of very large and of a black box nature.

As another example, from a Colon dataset we get the formula

1 ∨ 2 from the majority of our tests. The attributes 1 and 2 have been selected from a total of 5997 Booleanized attributes related to genes. The accuracy of our method on this dataset is 80.6 percent, while random forests get 83.9 percent and XGBoost 72.6 percent.

In relation to accuracy, the experiments are summarized in Figure 1. In Table 1 we report the average number of attributes (features) used by the final formulas and the dimensions of the datasets.

In general, our method has similar accuracy to the reference methods. For the UCI datasets reported on the left in Figure 1, our method obtains accuracies on par with and sometimes better than the state-of-the-art reference methods. For example, on the Hepatitis dataset we obtain an average accuracy of 80.7 percent compared to the 78.2 percent of random forests, 79 percent of XGBoost and 79.4 percent of the formula-size method of [ 12 ]. The four benchmark datasets from [ 9 ] reported on the right in Figure 1 were the most dificult for our method, but even for these we obtained some surprisingly accurate and reasonably interpretable formulas. For example we obtain a formula of the form ( 1 ∧ 2 ∧ 3 ∧ ¬ 4) ∨ ( 2 ∧ ¬ 1 ∧ ¬ 5 ∧ ¬ 4)

∨ ( 4 ∧ 3 ∧ ¬ 1 ∧ ¬ 5 ∧ ¬ 2) from an experiment on the RoadSafety dataset. This formula is quite short and has a test accuracy of 73.2 percent compared to the average 83.2 percent achieved by black box classifiers.

Concerning our experiments, we stress once more that the main advantage of our method is interpretability. As Table 1 indicates, most of our classifiers use very small numbers of features, thus being interpretable. Some classifiers obtained in the tests are longer, but still within the bounds of local (while not necessarily global) interpretability.

Finally, a key insight behind our method is the idea of computing the ideal classifiers (i.e., the above mentioned best possible Boolean classifiers) and the fact that this can be done fast using the (| || || |) algorithm (where | | ≤ min(| |, 2 || )) when is small. Since small is often suficient—which is another key insight in the method—the approach indeed works quite accurately and fast. The ideal classifiers being central to the approach, we call the method the ideal classifier method . A possible alternative for this is would be the ideal DNF-method.

In addition to experiments, we also prove a novel theoretical sample bound result that can be used for estimating whether a Boolean classifier obtained from data is in fact an ideal Boolean classifier with respect to the background distribution the data comes from. See Section 3.1 for the related theorem and the Appendix for the proof. Our result is in flavour similar to the various results in statistics that estimate the sizes of samples needed for obtaining a given confidence interval; see, e.g., [ 13 ] for further details. Results of this form can indeed be useful for estimating if the classifiers we obtain from datasets are in fact best possible classifiers (for the given features) with respect to the underlying probability distribution the data originates from. In Section 3.1 we illustrate how our result can be used in practice in relation to the datasets of the current paper.

1.2. Further related work

Concerning further related work, while the literature on explainability is rather extensive, only a relatively small part of it is primarily based on logic. See [ 14 ] for a survey on logic-based explainability, called formal explainable AI (or FXAI) there. We mention here the two prominent works dealing with minimality notions for Boolean classifiers and pioneering much of the recent work in FXAI, [ 15 ] and [ 16 ].

Like the articles [ 15 ] and [ 16 ], in fact most of logic-based explainability difers significantly from the current paper. Firstly, most papers in the field concern local rather than global explanations, and also inherent interpretability (as opposed to explainability of existing classifiers) is rarely the main focus. However, there is one method that is close enough to ours to require an explicit and direct analysis in the current paper. That method—let us here call it the formula-size method (FSM)—is investigated in [ 12 ]. Just like the current work, FSM uses a validation-based approach to avoid overfitting and find Boolean formulas with a small error over real-life datasets. A major diference between our approach and FSM is that in our method, we investigate increasing numbers of features as opposed to increasing length bounds on formulas as in FSM. The algorithm of FSM searches through the space of all possible formulas of increasing lengths, making it impossible to use with larger datasets. This contrasts with the fixed-parameter linear (| |2 || ) algorithm we use with | | being a constant.2 Our algorithm outputs classifiers with the minimum error in relation to the set of input features used, whereas FSM optimizes with respect to formula size. Together with other experiments, we also describe in Section 4.2 tests we ran to compare FSM with our method.

Concerning yet further related work on interpretable AI, the articles [ 17, 18 ] investigate the use of sparse rule lists and scoring systems which are optimized with respect to their error and size. These models are sparse in the sense that they try to use a small number of features, which makes them interpretable. The empirical results reported in these papers also demonstrate the surprising efectiveness of these interpretable models on real-world tabular data. Using the methods proposed in the articles [ 17, 18 ] requires—as in the case of FSM in [ 12 ]—solving very resource-consuming combinatorial optimization problems, since in addition to their error, classifiers are also optimized with respect to their size. For yet further related work on interpretable AI, see [ 7 ]. 2In our experiments we constrain | | to be at most 10.

2. Preliminaries

denoted , ⊧ of . For , ∈ , ⊨

A vocabulary is a finite set of symbols referred to as proposition symbols. For a vocabulary = { 1, … , } the syntax of propositional logic PL[ ] over is given by the grammar ∶∶= ∣ ¬ ∣ ∧ ∣ ∨

, where ∈ . We also define the exclusive or ⊕ ∶= ( ∨ ) ∧ ¬( ∧ ) as an abbreviation. A formula ∈

PL[ ] is in disjunctive normal form (DNF) if = where each is a conjunction of literals (i.e., formulas or ¬ where ∈ ).

A -model is a structure = ( , )

, where is a finite non-empty set called the domain of and ∶ → ( ) is a valuation which assigns to each ∈ the set of points () ⊆ where is true. Such a valuation extends in the usual way into a valuation ∶ PL[ ] → ( ) with ∧, ∨ and ¬ corresponding to the intersection, union and complementation (with respect to ) operations. A formula ∈

PL[ ] is true in the point ∈ of a -model = ( , )

⋁ =1 , , if ∈ ()

. Note that thereby each formula corresponds to a subset () PL[ ] , we define ⊨ if for all models and all ∈ , , ⊨ implies

A -type is a conjunction such that for each ∈ , precisely one of the literals and ¬ is a conjunct of and has no other conjuncts. We assume some standard bracketing and ordering of literals so that there are exactly 2|| -types. We denote the set of -types by . Note that each point ∈ of a -model = ( , )

satisfies exactly one -type. Thus, -types induce a partition of the domain . On the other hand, from the truth table of a formula ∈ PL[ ] one can obtain an equivalent formula that is a disjunction of types and thus in DNF.

For a vocabulary , let ∶ ∪{}

→ [ 0, 1 ] be a probability distribution. Here is the separate target attribute of the classification task. For ∈ , we define () = ( ∧ ) + ( ∧ ¬) The distribution corresponds to the real-world phenomenon that gives rise to practical data. Thus, we generally assume that is unknown and define the true error (or risk) of a formula ∈ PL[ ] with respect to as , . err () ∶= Pr [ ⊧ ⊕ ] = ∼

∑ ∈ ∪{} ⊨⊕ ().

This is the probability that disagrees with . Our goal is to obtain formulas with a small true error. This is made dificult by the fact that is unknown. We can, however, estimate the true error via available data.

Let = ( , )

be a ∪ {} -model. For us, corresponds to the available tabular data. Given a propositional formula ∈

PL[ ] , we define the empirical error (or empirical risk) of with respect to as err () ∶= | ( ⊕ )| | | .

The empirical error err () is easily computable as the proportion of points where disagrees with . If is fairly sampled from , then by the law of large numbers err () → err () almost surely when | | → ∞ .

Given a distribution , the formula id ∶= ⋁ { ∈ ∣ ( ∧ ) () ≥ 1/2}, which we call the ideal classifier , has the smallest true error with respect to among the formulas in PL[ ] . This is a syntactic, logic-based representation of what is known as the Bayes classifier in the literature [ 19 ], not to be confused with naive Bayesian classifiers. A Bayes classifier always gives the best possible prediction, and clearly so does an ideal classifier. Now, as is unknown, the ideal classifier is again a theoretical goal for us to approximate.

Given a ∪ {} -model = ( , )

, the formula id ∶= ⋁ { ∈ ∣ | ( ∧ )| | ()| which we call the empirical ideal classifier , has the smallest empirical error with respect to among the formulas in PL[ ] . This formula is easily computable and an essential tool of our study.

3. Feature selection and overfitting

Our general goal is to use, for a suitable set of attributes, the empirical ideal classifier to approximate the ideal classifier. In this section, we specify our methodology and show bounds on suficient sample size to guarantee that the two classifiers are identical with high probability.

Let be a small set of promising attributes chosen from the initially possibly large set of all attributes. We describe a quadratic time algorithm to obtain the empirical ideal classifier. The pseudocode Algorithm 1 below describes a formal implementation of this algorithm. Basically, we scan the points ∈ once. For every -type that is realized in = ( , ) , we initiate and maintain two counters, and . The first counter counts how many times is realized in , while counts how many times ∧ is realized in . The number / is then the probability | ( ∧ )|/| ()|

. The empirical ideal classifier over all the types which are realized in and for which / ≥ 1/2. can be constructed by taking a disjunction Algorithm 1 Compute the ideal classifier

Input: a ( ∪ {}) -model = ( , ) 1: 2: for ∈ 3: 4: 5: 6: 7: 8: 9:

do ← the -type of if ∉ then

← ∪ {} , ← 0, 0 ← + 1 if ∈ ()

then ← + 1 if / ≥ 1/2 then

← ∨ 10:

12: 13:

← ⊥ 11: for ∈ do 14: return ← ∅ {All the -types realized in will be stored in the set }

It is clear that this algorithm runs in polynomial time with respect to the size of , the size being (| || |) . A more precise analysis shows that the running time of this algorithm is (| || || |), where | | counts the number of -types that are realized in . Since | | ≤ | | , this gives a worst case time complexity of (| | 2| |) . If has a fixed size bound, as in our experiments below, then this reduces to a linear time algorithm. Moreover, we note that clearly the size | | is (| || |) .

A full step-by-step description of our method follows: 1. We begin with a tabular dataset with features 1, … , , , where is a Boolean target attribute. We denote this full training data by 0. 2. We randomly separate 30 percent of 0 as validation data val. The remaining part we call the training data train. We then discretize (or Booleanize) both of these datasets, ending up with tabular datasets with strictly Boolean attributes 1, … , , . To this end, we simply chop the numerical attributes at the median value of the training data train, above median meaning “yes” and at most median corresponding to “no”. For categorical attributes we use one-hot encoding. 3. We iterate steps 4 and 5 for increasing numbers ℓ of features from 1 to 10. 4. We run three feature selection procedures on the training data, each selecting ℓ of the attributes 1, … , to be used for classification. 5. Suppose one procedure selected the set = { 1, … , ℓ} of features. We use Algorithm 1 to compute the empirical ideal classifier for on the data train using the attributes in . Note that this is the best possible classifier for over , given in DNF-form. We do this step for all three sets of features given by the procedures. We select the formula with the highest validation accuracy on the data val and record the tuple (ℓ, , ) for future steps. 6. Once step 5 has halted, we look back at the tuples (ℓ, , ) recorded. We select the feature set ℓ corresponding to the smallest number ℓ with validation accuracy ℓ within one percentage point of the best accuracy in the full sequence

(1, 1, 1), … , (10, 10, 10).

We run the Booleanization again using the median of the full training data 0 and compute the best possible classifier one last time using the selected set ℓ of features and the data 0. 7. Finally, we simplify the selected DNF-formula via simplify_logic from SymPy. This gives a potentially simpler logically equivalent DNF-formula.

To demonstrate the robustness of our method, the discretization steps are performed very roughly, using simply the medians. Also the specific feature selection procedures are not in any way custom-made for our method; we use three readily available ones (see Sect. 4 for details) and use the one with the best accuracy.

Regarding step 5, choosing the formula with the best validation accuracy is done to avoid overfitting to the training data. With the increase of the number of features, the validation accuracy generally first improves and at some point starts to decline. This decline is a sign of overfitting. As overfitting often occurs already for small numbers of selected attributes, we may regard overfitting as useful for the method, leading to short formulas. However, in some cases, the accuracy stagnates and the overfitting point seems hard to find. In these cases, we nevertheless stop at ten features, the maximum considered in the method. For some datasets it could be necessary to go further before the accuracy stagnates, but for our experiments, ten features was (more than) enough.

While the last formula obtained in step 5 can be quite long, step 6 helps to obtain shorter formulas. By choosing an earlier formula in the obtained sequence with almost the same accuracy, we can reduce the number of features used, drastically improving the interpretability of our formulas without sacrificing much accuracy.

The formula obtained from step 6 is often very short and in most cases even globally interpretable. Nevertheless, as a final step, a standard formula simplifying tool simplify_logic from SymPy is used, and this gives a potentially even shorter DNF-formula. We note that even without using simplify_logic, all formulas encountered in our experiments are readily locally interpretable. Recall from the Introduction that in the local interpretation process, a positively classified input will match with precisely one conjunction of the DNF-classifier, and a negatively interpreted one will clash with at least one attribute of each conjunction.

3.1. Bounds on sample size

As our method consists of using the empirical ideal classifier as an approximation of the ideal classifier, we would like to have some guarantees on when the two classifiers are the same. We next present a theorem which tells us how large samples we need in order to be confident that the empirical ideal classifier is also the true ideal classifier.

The lower bound provided by our theorem depends in a crucial way on how “dificult” the underlying distribution is. More formally, we say that a probability distribution ∶ ∪{} → [ 0, 1 ] -separates ∈ , if we have that |( ∧ ) − ( ∧ ¬)| ≥ . The larger the is, the easier it is to detect via sampling which of the types ∧ and ∧ ¬ has the higher probability of occurring.

The formal statement of the theorem is now as follows. See Appendix 6.1 for the proof. Theorem 3.1. Fix a vocabulary , a proposition symbol ∉ and a probability distribution ∶ ∪{} → [ 0, 1 ]. Let , > 0 and ≥

Then with probability at least 1 − , the empirical ideal classifier with respect to a sample of size agrees with the ideal classifier with respect to on every ∈ which is -separated by . In particular, if -separates every ∈ , then the empirical ideal classifier is the ideal classifier with probability at least 1 − .

Corollary 3.2. Fix a vocabulary , a proposition symbol ∉ and a probability distribution ∶ ∪{}

→ [ 0, 1 ]. Let , > 0 and Then with probability at least 1 − , we have that -separated by , then Proof. Fix , > 0 . Given > 0 we know that if agrees with on every ∈ which is err ( ) − err ( ) =

|( ∧ ) − ( ∧ ¬)| ≥ 22||+1 ln(2||+1 / ) 2

err ( ) < err ( ) + .

∑ ∈ does not -separate < | |.

Setting ∶= /| | and applying Theorem 3.1 gives us the desired result.

To illustrate the use of this latter bound, suppose that | | = 3, = 0.01 and = 0.05 . Corollary 3.2 shows that we need a sample

of size at least 18887 to know that with probability at least 0.99 the theoretical error of is less than that of in Section 4 contain three datasets (Covertype, Electricity, RoadSafety) that have at least 18887 data points. Thus we can be fairly confident that for these datasets any empirical ideal classifier that uses at most three features obtains an accuracy which is similar to that of the true ideal plus 0.05. The datasets that we consider classifier on those features.

4. Experiments 4.1. Experimental setup

We compared our method empirically to random forests and XGBoost. These two comparison methods are very commonly used and state-of-the-art for tabular data. We tested all three methods on 12 tabular datasets. The datasets can be categorized as follows. 1. 7 binary classification tasks from the UCI machine learning repository. Five of these were selected arbitrarily, while two further ones (BreastCancer and GermanCredit) were randomly chosen among the ones used in [ 12 ]. One of the 7 datasets (StudentDropout) was originally a ternary classification task; we converted it into a binary one. 2. 4 tabular-data benchmarks out of 7 binary classification benchmarks that were presented in the paper [ 9 ]. These datasets were also originally from the UCI machine learning repository. 3. a high-dimensional binary classification task containing biological data from an opensource feature selection repository presented in [ 20 ].

BankMarketing BreastCancer CongressionalVoting GermanCredit HeartDisease Hepatitis StudentDropout

also include a comparison to the formula size method. When available, we have also included accuracies reported in literature, though these are not directly comparable due to diferent technical particularities in the experiments.

All rows with missing values were removed from these datasets.3 Most of the datasets contained both categorical and numerical features. While our method includes a rough median-based Booleanization of the data as discussed in Section 3, random forests and XGBoost used the original non-Booleanized data. 3In the case of the Hepatitis dataset we removed two columns that had several missing values.

For all but one of the datasets we use ten-fold cross-validation for all methods. That is, we split the data into ten equal parts and for each such 10 percent part we input the remaining 90 percent into the method being tested, obtain a classifier (using the chosen 90 percent of the data) and record its accuracy on the 10 percent part. We report the averages and standard deviations of the accuracies over the ten 90/10 splits.

For the high-dimensional dataset Colon with less than 100 data points, we use leave-one-out cross-validation. That is, for each data point, we set the point aside and input the remaining data into the method. We test whether the obtained classifier classifies the omitted point correctly or not. We report the average accuracy over all data points. This is a well-known standard practice for small datasets.

As mentioned in Section 3, our method utilizes readily available feature selection methods. We used Scikit-learn’s SelectKBest, which returns features that have the highest scores based on a given univariate statistical test. SelectKBest supports three methods for calculating the scores for the features: F-test, mutual information, and 2-test. When generating formulas, we tested all three of these methods and selected the best in terms of validation accuracy. We emphasize that feature selection was performed after Booleanizing the data.

If the empirical ideal classifier is allowed to use too many features, it will most likely overfit on its training data. As already discussed, we used nested cross-validation to determine how many features the empirical ideal classifier can use without overfitting. For nested cross-validation we used a 70/30-split. In all of the experiments we allowed the empirical ideal classifier to use at most ten attributes. This is a hyperparameter one could optimize for each dataset separately. For simplicity we used the same limit for all datasets. In all of our experiments, the validation accuracy either declined or stagnated well before ten features, indicating that a larger number is not needed. On the other hand, for interpretability, ten features is perhaps even a bit excessive.

For random forests we used Scikit-learn’s implementation. For both random forests and XGBoost, nested cross-validation was used for tuning the hyperparameters. As for our method, we used a 70/30-split. For hyperparameter optimization, we used Optuna [ 21 ] with the number of trials being 100. The hyperparameter spaces used for random forests and XGBoost were the same as in [ 9 ]. For the readers convenience, we also report the used hyperparameter spaces in Appendix 6.2.

We also ran the method of [ 12 ] for some datasets. As in the Introduction, we refer to their approach as the formula-size method. This method is computationally resource consuming, so we only ran it for the seven UCI repository datasets. The high-dimensional biological data and benchmark data would be too dificult for the method.

For the formula-size method we used the openly available implementation from https:// github.com/asptools/benchmarks. The runs were conducted on a Linux cluster featuring Intel Xeon 2.40 GHz CPUs with 8 CPU cores per run, employing a timeout of 72 hours per instance and a memory limit of 64 GB.

4.2. Results

Figure 1 contains the average test accuracies and standard deviations that were obtained by using our new method, random forests and XGBoost using ten-fold cross-validation. The left side of Figure 1 also contains results obtained with the formula-size method of [ 12 ]. We also report accuracies that we found in the literature for these datasets. For the left side of Figure 1, we include the accuracies that were reported for these datasets in the UCI machine learning repository. For the right side of Figure 1, for the case of Colon, we used the result reported in [ 22 ], while for remaining datasets we reported the accuries given in the full-version of [ 9 ].

We emphasize that the accuracies found from the literature are not directly comparable to the accuracies that we obtained with our method. For example, in some cases a single 70/30-split was used instead of our ten-fold cross-validation. The conventions concerning the handling of missing values could also difer from ours. Furthermore, the accuracies in [ 9 ] are given to the nearest percentage point.

From Figure 1 we see that our method produces classifiers that mostly have accuracies similar to the ones obtained by the alternative methods. In particular, on medical data such as BreastCancer and Hepatitis, our method compares well to the reference methods. Perhaps the biggest gaps are found in the cases of CoverType and Electricity, two of the benchmark datasets. However, we emphasize that in [ 9 ], it was mentioned that all of the benchmark datasets were specifically selected to be “not too easy”. Hence, one should not perhaps expect that “simple” classifiers with focus on interpretability will perform too well with these datasets.

We list the average runtime of our method on a single split of the data in the Appendix 6.3. For six of the datasets, the average runtime was less than 30 seconds. Of the remaining datasets, four had runtimes of less than 12 minutes. Even for the two largest datasets, Covertype and RoadSafety, the runtime is still less than an hour and, notably, systematically less than that of the hyperparameter optimizations of random forests and XGBoost. The experiments on our method, random forests and XBGoost were run on a laptop.

We turn our attention to our main goal of interpretability. In Table 1 we report the average of the smallest number of features suficient to reach the reported accuracy. Table 1 also gives the number of rows and attributes in the Booleanized datasets as well as the original datasets. We see from Table 1 that for eight out of 12 datasets, the best accuracy was reached already with on average less than four features, leading to very short formulas. Such short DNF-formulas are globally interpretable as follows. To interpret a short DNF-formula, one looks at each of the few conjunctions in the formula individually to determine their meaning. This is easy as each conjunction has very few attributes. The meaning of the entire formula is then given by the fact that it accepts only the cases given by the conjunctions and rejects everything else.

Next we discuss the results obtained using the formula-size method of [ 12 ]. By the size of a propositional formula they mean the number of proposition symbols, conjunctions, disjunctions and negations in the formula. The results are given in Figure 1.

We see that for most datasets on the left side in Figure 1, the accuracies obtained by the formula-size method are similar to our method. A meaningful diference can be seen in the HeartDisease dataset, where both the accuracy of 73.6 percent and standard deviation of 12.2 percent are significantly behind other methods. The formulas obtained are short, with formula size mostly less than 10.

The downside of the formula-size method is computational resources. As described in the previous subsection, we used a Linux cluster featuring Intel Xeon 2.40 GHz CPUs with 8 CPU cores per run, employing a timeout of 72 hours per instance and a memory limit of 64 GB. For all but one dataset we ran the formula-size method on, the computation took more than 24 hours per split of the data with this high eficiency setup. Even for the easiest one, BreastCancer, the computation time was in the order of 12 hours per split. Furthermore, for the benchmark datasets and the high-dimensional biological dataset Colon, the formula-size method is unfeasible already for very small formula sizes and thus we do not report any results for those datasets.

In contrast to the formula-size method, for all but the two largest datasets, the runs of our method took less than 12 minutes per split, and in six cases only seconds. More detailed runtimes are reported in Appendix 6.3. We conclude that our method obtains similar or better results than the formula-size method with a fraction of the computational resources on datasets where both methods are feasible. Our method is also usable on larger datasets, where the formula-size method is is not.

4.3. Examples of obtained formulas

Here we present selected examples of formulas that were generated by our method. Firstly, we refer the reader back to the example formulas concerning the BreastCancer and Colon datasets presented in the Introduction. Both of these formulas were very short and had high accuracies.

Another example formula is from the HeartDisease dataset, where the task is to determine, whether a patient has a heart disease. We obtain the formula

( 1 ∧ 2) ∨ ( 1 ∧ ¬ 3) ∨ ( 2 ∧ ¬ 3), where the attributes 1, 2 and 3 relate to a color test of blood vessels, fixed thallium defects and chest pain, respectively. This formula is obtained from two diferent splits of the data and the respective accuracies are 76.7 percent and 86.7 percent. The average accuracy of our method on this dataset is 81.2 percent. Random forests average at 81.8 percent, XGBoost at 80.8 percent and the formula size method at 73.6 percent.

The Hepatitis dataset concerns the mortality of patients with hepatitis. The formula is obtained from five of the ten splits of the data. Here 1, 2 and 3 relate to albumin, bilirubin and a histology test respectively. The accuracies of this formula on the five diferent test datasets range from 61.5 percent to 92.3 percent. The average accuracy of our method on the Hepatitis dataset is 80.7 while random forests get 78.2 percent and XGBoost 79.0 percent. The high variance seems to be due to the data only consisting of 155 data points. This is supported by the fact that all of the tested methods obtain a high standard deviation of accuracy on this dataset.

In the Appendix 6.4, there are further examples of formulas obtained as outputs of our method. For each dataset, we list the formula obtained for the first 90/10 split of our ten-fold cross-validation. While some of these formulas are quite long, they should still be rather fast to use for local explanations explicating why a particular input was classified in a particular way (cf. the Introduction for details on local explanations). For these experiments, we used the same maximum number 10 of features and the same limit of 1 percentage point to choose an accurate enough formula. These are hyperparameters that one could optimize to obtain a better balance of interpretability and accuracy on any specific dataset. For some datasets, raising the number of maximum features could lead to more accurate, but longer, formulas. Raising the percentage threshold on the other hand would lead to shorter, more interpretable formulas at the cost of some accuracy.

5. Conclusions

We have introduced a new method to compute immediately interpretable Boolean classifiers for tabular data. While the main point is interpretability, even the accuracy of our formulas is similar to the ones obtained via widely recognized current methods. We have also established a theoretical result for estimating if the obtained formulas actually correspond to ideally accurate ones in relation to the background distribution. In the future, we will especially consider more custom-made procedures of discretization in the context of our method, as this time discretization was carried out in a very rough way to demonstrate the efectiveness and potential of our approach. We expect this to significantly improve our method over a notable class of tabular datasets.

Our approach need not limit to Boolean formulas only, as we can naturally extend our work to general relational data. We can use, e.g., description logics and compute concepts 1, … , and then perform our procedure using 1, … , , finding short Boolean combinations of concepts. This of course difers from the approach of computing empirical ideal classifiers in the original description logic, but can nevertheless be fruitful and interesting. We leave this for future work.

Acknowledgments

T. Janhunen, A. Kuusisto, M. F. Rankooh and M. Vilander were supported by the Academy of Finland consortium project Explaining AI via Logic (XAILOG), grant numbers 345633 (Janhunen) and 345612 (Kuusisto). A. Kuusisto and M. Vilander were also supported by the Academy of Finland project Theory of computational logics, grant numbers 352419, 352420, 353027, 324435, 328987.

6. Appendix. 6.1. Proof of Theorem 3.1

Let be a propositional vocabulary and ∉ the proposition symbol that we need to explain. Let + ∶= ∪ {} and fix a probability distribution ∶ + → [ 0, 1 ]. In what follows, for notational simplicity we will assume that ( ∧ ) > ( ∧ ¬) , for every -type . then and the number of times is realized in . The main idea of the proof is to show that if Let be a sample of size , i.e., a +-model of size . For each +-type we use J K to denote ≥ 2 , Pr [for every -type for which ( ∧ ) − ( ∧ ¬) ≥ we have that J∧ K > J∧¬ K ] ≥ 1−.

Indeed, the above clearly implies that also with probability at least 1 − we have that the empirical ideal classifier with respect to agrees with the ideal classifier with respect to all -types with ( ∧ ) − ( ∧ ¬) ≥ holds provided that Setting ∶= /2 , we have by assumption that

J ∧ K > J ∧ ¬ K ( ∧ ) − ( ∧ ¬) ≥ 2. ≥ Pr[(( ∧ ) − ) < Pr[J ∧ K ≥ J ∧ ¬ K ] ≥ 1 − (Pr[(( ∧ ) − ) ≥

J ∧ K and J ∧ ¬ K

< (( ∧ ¬) + )]

J ∧ K ] + Pr[J ∧ ¬ K ≥ (( ∧ ¬) + )]) For the second inequality we used the union bound. To get a lower bound of 1 − on the latter probability, it sufices to show that Hoefding’s inequalities [ 23 ] imply that these bounds hold provided that

Pr[(( ∧ ) − ) ≥

J ∧ K ] ≤ Pr[J ∧ ¬ K ≥ (( ∧ ¬) + )] ≤ 2 2 ≥ ln(2/) 2 2 = 2 ln(2/) 2

. 2 ln(2/) 2 , Thus, if we take sample of at least size then with probability at least 1 − we have that J ∧ K > J ∧ ¬ K . provided that In particular, we will then have that Pr [for every -type for which ( ∧ ) − ( ∧ ¬) ≥ which is what we wanted to show.

Pr[J ∧ K ≤ J ∧ K ] ≤ /2 || , ≥ ln(2||+1 /) 2 2 = 2 .

So far we focused on a single -type . Using union bound again we see that where the sum is performed with respect to -types for which ( ∧ ) − ( ∧ ¬) ≥ . Setting again ∶= /2 it follows from our previous calculations that for every ∈ for which ( ∧ ) − ( ∧ ¬) ≥ we have that

6.2. Hyperparameter spaces for Random Forest and XGBoost

Hyperparameters not listed here were kept at their default values. 6.2.1. Random Forest • max_depth: None, 2, 3, 4 • criterion: gini, entropy • n_estimators: Integer sampled from [9.5, 3000.5] using log domain • max_features: sqrt, log2, None, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 • min_samples_split: 2, 3 • min_samples_leaf: Integer sampled from [1.5, 50.5] • bootstrap: True, False • min_impurity_decrease: 0.0, 0.01, 0.02, 0.05 we have that J∧ K > J∧¬ K ] ≥ 1−, 6.2.2. XGBoost • max_depth: Integer sampled from [ 1,11 ] • n_estimators: Integer sampled from [100,5900] with step-size being 200. • min_child_weight: Float sampled from [1.0, 100.0] using log domain • subsample: Float sampled from [0.5, 1.0] • learning_rate: Float sampled from [1e-5, 0.7] using log domain • colsample_bylevel: Float sampled from [0.5, 1.0] • gamma: Float sampled from [1e-8, 7.0] using log domain • reg_lambda: Float sampled from [1.0, 4.0] using log domain • reg_alpha: Float sampled from [1e-8, 100.0] using log domain

6.3. Runtimes

We report the average runtime of our method over the diferent splits of the ten-fold (or in the case of Colon leave-one-out) cross-validation. The experiments were run on a standard laptop.

6.4. Example formula for each dataset

Here we report for each dataset the formula given by the first of the ten splits in our ten-fold cross-validation. 6.4.1. BankMarketing _ _ _ ∧ ¬ ¬ 6 . 4 . 3 .

C o n g r e s s i o n a l V o t i n

g 6 . 4 . 4 .

G e r m a n

C r e d i t ℎ _ _ _ _ _ _ _ _ _ _ _ ( _ 0 ∧ ℎ _ 2 ∧ ¬ _ _ ) _ _ ∨ ¬ ℎ 6 . 4 . 5 .

H e a r t D i s e a s e 6 . 4 . 6 .

H e p a t i t i s 6 . 4 . 7 .

S t u d e n t D r o p o u

C o v e r t y p

e _ _ ( _ _ 1 _ _ ∧ ¬

_ _ 2 _ 1 _ _ _ _ _ _ 2 _ _ 6 . 4 . 1 0 .

E y e

M o v e m e n

t ∧ ¬ . 4 . 1 1 .

R o a d

S a f e t y _ . 4 . 1 2 .

C o l o

n F o r t h e h i g h d i m e n s i o n a l d a t a s e t

C o l o n u s e d l e a v e o n e o u t c r o s s v a l i d a t i o n ; f o r m u l a o b t a i n e d f r o m l e a v i n g t ifr s t d a t a p o i n t o u t

[1] EU-GDPR , Regulation (EU) 2016/679 of the European parliament and of the council , 2016 . URL: https://www.privacy-regulation.eu/en/index.htm.

[2] California-OAG , CCPA regulations: Final regulation text , 2021 . URL: https://oag.ca.gov/ privacy/ccpa.

[3]

Shwartz-Ziv ,

Armon , Tabular data: Deep learning is not all you need , Inf. Fusion 81 ( 2022 ) 84 - 90 .

[4]

Borisov ,

Leemann ,

Seßler ,

Haug ,

Pawelczyk , G. Kasneci, Deep neural networks and tabular data: A survey , CoRR abs/2110 . 01889 ( 2021 ). URL: https://arxiv.org/abs/2110. 01889 .

[5]

Pawelczyk ,

Broelemann ,

Kasneci , Learning model-agnostic counterfactual explanations for tabular data , in: Y. Huang , I. King , T. Liu, M. van Steen (Eds.), WWW '20: The Web Conference 2020 , Taipei, Taiwan, April 20-24 , 2020 , ACM / IW3C2, 2020 , pp. 3126 - 3132 .

[6]

Sahakyan ,

Aung , T. Rahwan, Explainable artificial intelligence for tabular data: A survey , IEEE Access 9 ( 2021 ) 135392 - 135422 .

[7]

Rudin , Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead , Nature Machine Intelligence 1 ( 2019 ) 206 - 215 .

[8]

R. C.

Holte , Very simple classification rules perform well on most commonly used datasets , Mach. Learn . 11 ( 1993 ) 63 - 91 .

[9]

Grinsztajn , E. Oyallon, G. Varoquaux, Why do tree-based models still outperform deep learning on typical tabular data? , in: Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track , 2022 . URL: https://openreview.net/ forum?id=Fp7__phQszn.

[10]

Chen ,

Guestrin , Xgboost: A scalable tree boosting system , in: B. Krishnapuram , M.

Shah , A. J.

Smola , C. C.

Aggarwal , D.

Shen , R. Rastogi (Eds.), Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , San Francisco, CA, USA, August 13- 17 , 2016 , ACM, 2016 , pp. 785 - 794 .

[11]

Breiman , Random forests, Machine Learning 45 ( 2001 ) 5 - 32 . doi: 10 .1023/A: 1010933404324 .

[12]

Jaakkola ,

Janhunen ,

Kuusisto ,

M. F.

Rankooh ,

Vilander , Short Boolean formulas as explanations in practice, in: S. A . Gaggl , M. V. Martinez , M. Ortiz (Eds.), Logics in Artificial Intelligence - 18th European Conference, JELIA 2023 , Dresden, Germany, September 20-22 , 2023 , Proceedings, volume 14281 of Lecture Notes in Computer Science, Springer, 2023 , pp. 90 - 105 .

[13]

F. M.

Dekking ,

Kraaikamp ,

H. P.

Lopuhaä ,

L. E.

Meester , A Modern Introduction to Probability and Statistics: Understanding Why and How , Springer Texts in Statistics, Springer, 2005 .

[14]

Marques-Silva , Logic-based explainability in machine learning , in: L. E. Bertossi, G. Xiao (Eds.), Reasoning Web. Causality, Explanations and Declarative Knowledge - 18th International Summer School 2022 , Berlin, Germany, September 27-30 , 2022 , Tutorial Lectures, volume 13759 of Lecture Notes in Computer Science, Springer, 2022 , pp. 24 - 104 .

[15]

Shih ,

Choi ,

Darwiche , A symbolic approach to explaining Bayesian network classifiers , in: J. Lang (Ed.), IJCAI, 2018 , pp. 5103 - 5111 .

[16]

Ignatiev ,

Narodytska ,

Marques-Silva , Abduction-based explanations for machine learning models , in: The Thirty-Third AAAI Conference on Artificial Intelligence , AAAI 2019 , The Thirty-First Innovative Applications of Artificial Intelligence Conference , IAAI 2019 , The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019 , Honolulu, Hawaii, USA, January 27 - February 1, 2019 , AAAI Press, 2019 , pp. 1511 - 1519 .

[17]

Angelino ,

Larus-Stone ,

Alabi ,

Seltzer ,

Rudin , Learning certifiably optimal rule lists for categorical data , Journal of Machine Learning Research 18 ( 2018 ) 1 - 78 .

[18]

Rudin ,

Ustun , Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice , Interfaces 48 ( 2018 ) 449 - 466 . doi: 10 .1287/inte. 2018 . 0957 .

[19]

Devroye ,

Györfi , G. Lugosi, A Probabilistic Theory of Pattern Recognition , volume 31 of Stochastic Modelling and Applied Probability , Springer, 1996 .

[20]

Li , K. Cheng,

Wang ,

Morstatter ,

R. P.

Trevino ,

Tang , H. Liu, Feature selection: A data perspective , ACM Computing Surveys (CSUR) 50 ( 2018 ) 94 .

[21]

Akiba ,

Sano ,

Yanase ,

Ohta ,

Koyama , Optuna: A next-generation hyperparameter optimization framework , in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining , 2019 , p. 2623 - 2631 . doi: 10 .1145/3292500.3330701.

[22]

Yang ,

Lindenbaum ,

Kluger , Locally sparse neural networks for tabular biomedical data , in: International Conference on Machine Learning , 2021 .

[23]

Hoefding , Probability inequalities for sums of bounded random variables , Journal of the American Statistical Association 58 ( 1963 ) 13 - 30 . doi: 10 .1080/01621459. 1963 . 10500830 .