<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Interpretable classifiers for tabular data via feature selection and discretization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Reijo Jaakkola</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomi Janhunen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antti Kuusisto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Masood Feyzbakhsh Rankooh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miikka Vilander</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Interpretable AI, Boolean logic, Overfitting</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Santiago de Compostela</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tampere University</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We introduce a method for computing immediately human interpretable yet accurate classifiers from tabular data. The classifiers obtained are short Boolean formulas, computed via first discretizing the original data and then using feature selection coupled with a very fast algorithm for producing the best possible Boolean classifier for the setting. We demonstrate the approach via 12 experiments, obtaining results with accuracies comparable to ones obtained via random forests, XGBoost, and existing results for the same datasets in the literature. In most cases, the accuracy of our method is in fact similar to that of the reference methods, even though the main objective of our study is the immediate interpretability of our classifiers. We also prove a new result on the probability that the classifier we obtain from real-life data corresponds to the ideally best classifier with respect to the background distribution the data comes from.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Explainability and human interpretability are becoming an increasingly important part of
research on machine learning. In addition to the immediate benefits of explanations and
interpretability in scientific contexts, the capacity to provide explanations behind automated
decisions has already been widely addressed also on the level of legislation. For example, the
from the classifier itself, which contrasts with explaining the operation of an external, black-box
classifier [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].1 For most of our experiments, the classifiers we obtain are globally interpretable,
meaning that the classifier itself immediately reveals how it works on every possible input.
Global interpretability stems from the classifiers being extremely short logical formulas and
thereby directly intelligible. Now, globally interpretable classifiers obtained by our method are
automatically also locally interpretable, meaning that they provide an easy way of explicating
why any particular input was classified in a particular way. For a reasonably small part of our
experiments, the classifiers obtained are only locally but not quite globally interpretable.
      </p>
      <p>
        The classifiers we produce are given in the form of short Boolean DNF-formulas which
can naturally be conceived as simply Boolean concepts. One of the key issues that makes
our approach possible is the surprising power of successful feature selection. Indeed, in our
formulas, we use very small numbers of attributes, making them short and thus interpretable.
Already in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], it was observed that by using a single attribute, one could get, for many of the 16
datasets studied in that paper, an accuracy not drastically diferent from the then state-of-the-art
decision trees. We apply this “simplicity first” approach in our method, together with a very
rough discretization of numerical attributes to Boolean form. While we focus on interpretability,
we obtain also relatively accurate classifiers. Indeed, our classifiers perform surprisingly well
in comparison to widely recognized methods despite the fact that generally, there exists an
obvious trade-of between accuracy and interpretability.
      </p>
      <sec id="sec-2-1">
        <title>1.1. Overview of our method and contributions</title>
        <p>Our method works as follows. We have a tabular dataset  with numerical and categorical
attributes  1, … ,   and a binary target attribute  . Our goal is to produce a classifier for  based
on  1, … ,   . Our method is based on the following steps; see Section 3 for more details.
1. We first discretize the data to Boolean form using a very rough method of chopping
numerical predicates at the median.
2. Then, for increasing numbers ℓ, we use feature selection to choose ℓ suitable attributes.</p>
        <p>After that, we compute the best possible Boolean formula for predicting  . By “best
possible” we mean the Boolean formula has the least percentage of misclassified points
among all Boolean formulas using the ℓ chosen attributes. In other words, the formula
has the least empirical error with respect to the 0-1 loss function. The formula is given in
DNF-form.
3. We use nested cross-validation to test the accuracy of the formulas and find the most
accurate formula using up to 10 attributes (the parameter 10 can be adjusted, but the
choice 10 turned out suficient for our experiments: see Section 3 for a discussion). We
look through the sequence of formulas obtained above for diferent numbers ℓ of features.
Among the formulas with accuracy within one percentage point of the most accurate
formula in the sequence, we select the one with the least number of features. We use the
1However, our method could of course also be used to explicate existing black-box models in a model agnostic way
via first generating data with the black-box model to be explained and then using the method to produce a closely
corresponding interpretable classifier.</p>
        <p>attributes in this formula to train the final formula via the entire training data. Finally,
we slightly simplify the final obtained formula—keeping it in DNF-form—via using a
standard method from the literature.</p>
        <p>To demonstrate the robustness of our method, the discretization step is performed very
roughly. Also the feature selection procedures are not critical to our approach; we use three
readily available ones and choose the best final formula.</p>
        <p>Concerning step 2, for computing the best possible DNF-formula, we present a very eficient
algorithm running in time  (| || || |) , where | | is the number of rows in the data, | | the
number of selected features and | | ≤ min(| |, 2 || ) the number of diferent row types realized in
the data with the selected features. The algorithm is fixed-parameter linear when | | is bounded
by a constant—which it is in our scenario. If | | was not constant, the algorithm would have
running time  (| | 2). Now, all our tests ran fast and smoothly on a laptop even with large
datasets containing up to 423 680 rows. For six of the datasets, the runs took less than 30
seconds per split. Of the remaining datasets, five took less than 12 minutes per split. Even the
two largest datasets had a runtime of less than an hour. The hyperparameter optimizations of
random forests and XGBoost took systematically longer than running our method. For example,
for the largest dataset, the hyperparameter optimization for XGBoost took roughly two and
half hours per split.</p>
        <p>
          As already indicated, we test our method on 12 tabular datasets: seven binary classification
tasks from the UCI machine learning repository; a high-dimensional binary classification task
with biological data; and four benchmark sets from [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. See Section 4 for a detailed list. In each
test, we compare the accuracy of our classifier to a result obtained by state-of-the art classifiers.
Now, taking into account that tabular data is still a challenge to deep learning methods [
          <xref ref-type="bibr" rid="ref4 ref9">9, 4</xref>
          ],
we use both XGBoost [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and Sklearn’s implementation of random forests [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] as reference
methods. These models are widely recognized for their performance on tabular data [
          <xref ref-type="bibr" rid="ref3 ref9">3, 9</xref>
          ]. For
the UCI datasets we additionally compare to the method of [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] based on the length of formulas.
This method is computationally too ineficient to use on the other datasets we consider. In
addition to the reference methods, in relation to 10 of the experiments, we also report results
found in the literature for the used datasets.
        </p>
        <p>For 11 datasets we use standard ten-fold cross-validation and report the average accuracy as
well as the standard deviation over the ten splits. For the high-dimensional dataset Colon with
less than 100 data points we instead use leave-one-out cross-validation.</p>
        <p>All formulas encountered in our experiments are small enough to be easily locally
interpretable. In the local interpretation process, the interpreter has a row of classified data which is
then compared to the conjunctions of the DNF-classifier our method produces. A positively
classified row will match with precisely one conjunction of the DNF-classifier, and a negatively
interpreted one will clash with at least one attribute of each conjunction.</p>
        <p>In addition to local interpretability, most of the formulas produced in the experiments are in
fact so simple that they can be readily globally interpreted, meaning that their behaviour on
any input is clear simply by the form of the formula. Globally interpreting a short DNF-formula
involves looking at each of the few conjunctions separately and interpreting the meaning of
their particular combination of attributes. The behaviour of the entire formula is then given by
the fact that it accepts the cases given by the conjunctions and rejects everything else.</p>
        <p>As a concrete example of an interpretable classifier, we get the formula</p>
        <p>¬ 1 ∨ (¬ 2 ∧ ¬ 3)
from an experiment on a Breast Cancer dataset concerning the benignity of breast tumors. The
attributes  1,  2 and  3 relate to measures of uniformity of cell size, bare nuclei and bland
chromatin. The accuracy of this formula on the corresponding test data is 94.1 percent. The
average accuracy of our method over ten splits of the Breast Cancer data is 95.9 percent, while
XGBoost obtains 97.1 percent and random forests 97.4 percent. We stress that our formula is,
indeed, highly interpretable, while the classifiers obtained by XGBoost and random forests are
of very large and of a black box nature.</p>
        <p>As another example, from a Colon dataset we get the formula</p>
        <p>1 ∨  2
from the majority of our tests. The attributes  1 and  2 have been selected from a total of 5997
Booleanized attributes related to genes. The accuracy of our method on this dataset is 80.6
percent, while random forests get 83.9 percent and XGBoost 72.6 percent.</p>
        <p>In relation to accuracy, the experiments are summarized in Figure 1. In Table 1 we
report the average number of attributes (features) used by the final formulas and the dimensions
of the datasets.</p>
        <p>
          In general, our method has similar accuracy to the reference methods. For the UCI datasets
reported on the left in Figure 1, our method obtains accuracies on par with and sometimes
better than the state-of-the-art reference methods. For example, on the Hepatitis dataset we
obtain an average accuracy of 80.7 percent compared to the 78.2 percent of random forests, 79
percent of XGBoost and 79.4 percent of the formula-size method of [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The four benchmark
datasets from [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] reported on the right in Figure 1 were the most dificult for our method, but
even for these we obtained some surprisingly accurate and reasonably interpretable formulas.
For example we obtain a formula of the form
( 1 ∧  2 ∧  3 ∧ ¬ 4) ∨ ( 2 ∧ ¬ 1 ∧ ¬ 5 ∧ ¬ 4)
        </p>
        <p>∨ ( 4 ∧  3 ∧ ¬ 1 ∧ ¬ 5 ∧ ¬ 2)
from an experiment on the RoadSafety dataset. This formula is quite short and has a test
accuracy of 73.2 percent compared to the average 83.2 percent achieved by black box classifiers.</p>
        <p>Concerning our experiments, we stress once more that the main advantage of our method is
interpretability. As Table 1 indicates, most of our classifiers use very small numbers of features,
thus being interpretable. Some classifiers obtained in the tests are longer, but still within the
bounds of local (while not necessarily global) interpretability.</p>
        <p>Finally, a key insight behind our method is the idea of computing the ideal classifiers (i.e.,
the above mentioned best possible Boolean classifiers) and the fact that this can be done fast
using the  (| || || |) algorithm (where | | ≤ min(| |, 2 || )) when  is small. Since small  is
often suficient—which is another key insight in the method—the approach indeed works quite
accurately and fast. The ideal classifiers being central to the approach, we call the method the
ideal classifier method . A possible alternative for this is would be the ideal DNF-method.</p>
        <p>
          In addition to experiments, we also prove a novel theoretical sample bound result that can be
used for estimating whether a Boolean classifier obtained from data is in fact an ideal Boolean
classifier with respect to the background distribution the data comes from. See Section 3.1
for the related theorem and the Appendix for the proof. Our result is in flavour similar to
the various results in statistics that estimate the sizes of samples needed for obtaining a given
confidence interval; see, e.g., [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for further details. Results of this form can indeed be useful
for estimating if the classifiers we obtain from datasets are in fact best possible classifiers (for the
given features) with respect to the underlying probability distribution the data originates from.
In Section 3.1 we illustrate how our result can be used in practice in relation to the datasets of
the current paper.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>1.2. Further related work</title>
        <p>
          Concerning further related work, while the literature on explainability is rather extensive, only
a relatively small part of it is primarily based on logic. See [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] for a survey on logic-based
explainability, called formal explainable AI (or FXAI) there. We mention here the two prominent
works dealing with minimality notions for Boolean classifiers and pioneering much of the recent
work in FXAI, [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>
          Like the articles [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], in fact most of logic-based explainability difers significantly
from the current paper. Firstly, most papers in the field concern local rather than global
explanations, and also inherent interpretability (as opposed to explainability of existing classifiers) is
rarely the main focus. However, there is one method that is close enough to ours to require an
explicit and direct analysis in the current paper. That method—let us here call it the formula-size
method (FSM)—is investigated in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Just like the current work, FSM uses a validation-based
approach to avoid overfitting and find Boolean formulas with a small error over real-life datasets.
A major diference between our approach and FSM is that in our method, we investigate
increasing numbers of features as opposed to increasing length bounds on formulas as in FSM.
The algorithm of FSM searches through the space of all possible formulas of increasing lengths,
making it impossible to use with larger datasets. This contrasts with the fixed-parameter linear
 (| |2 || ) algorithm we use with | | being a constant.2 Our algorithm outputs classifiers with
the minimum error in relation to the set of input features used, whereas FSM optimizes with
respect to formula size. Together with other experiments, we also describe in Section 4.2 tests
we ran to compare FSM with our method.
        </p>
        <p>
          Concerning yet further related work on interpretable AI, the articles [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ] investigate
the use of sparse rule lists and scoring systems which are optimized with respect to their
error and size. These models are sparse in the sense that they try to use a small number of
features, which makes them interpretable. The empirical results reported in these papers also
demonstrate the surprising efectiveness of these interpretable models on real-world tabular
data. Using the methods proposed in the articles [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ] requires—as in the case of FSM in
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]—solving very resource-consuming combinatorial optimization problems, since in addition
to their error, classifiers are also optimized with respect to their size. For yet further related
work on interpretable AI, see [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
2In our experiments we constrain | | to be at most 10.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Preliminaries</title>
      <p>denoted  ,  ⊧ 
of  . For ,  ∈
 ,  ⊨</p>
      <p>.</p>
      <p>A vocabulary is a finite set of symbols   referred to as proposition symbols. For a vocabulary
 = { 1, … ,   } the syntax of propositional logic PL[ ] over  is given by the grammar  ∶∶=
 ∣ ¬ ∣  ∧  ∣  ∨</p>
      <p>, where  ∈  . We also define the exclusive or  ⊕  ∶= ( ∨  ) ∧ ¬( ∧  )
as an abbreviation. A formula  ∈</p>
      <p>PL[ ] is in disjunctive normal form (DNF) if  =
where each   is a conjunction of literals (i.e., formulas  or ¬ where  ∈  ).</p>
      <p>A  -model is a structure  = ( ,  )</p>
      <p>, where  is a finite non-empty set called the domain of
 and  ∶  →  ( )
is a valuation which assigns to each  ∈ 
the set of points  () ⊆ 
where  is true. Such a valuation extends in the usual way into a valuation  ∶
PL[ ] →  ( )
with ∧, ∨ and ¬ corresponding to the intersection, union and complementation (with respect
to  ) operations. A formula  ∈</p>
      <p>PL[ ] is true in the point  ∈ 
of a  -model  = ( ,  )</p>
      <p>⋁
=1   ,
, if  ∈  ()</p>
      <p>. Note that thereby each formula  corresponds to a subset  ()
PL[ ] , we define  ⊨  if for all models  and all  ∈ 
,  ,  ⊨ 
implies</p>
      <p>A  -type  is a conjunction such that for each  ∈  , precisely one of the literals  and ¬ is a
conjunct of  and  has no other conjuncts. We assume some standard bracketing and ordering
of literals so that there are exactly 2||  -types. We denote the set of  -types by   . Note that
each point  ∈ 
of a  -model  = ( ,  )</p>
      <p>satisfies exactly one  -type. Thus,  -types induce a
partition of the domain  . On the other hand, from the truth table of a formula  ∈
PL[ ] one
can obtain an equivalent formula  that is a disjunction of types and thus in DNF.</p>
      <p>For a vocabulary  , let  ∶  ∪{}</p>
      <p>
        → [
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ] be a probability distribution. Here  is the separate
target attribute of the classification task. For  ∈   , we define () = ( ∧ ) + ( ∧ ¬)
The distribution  corresponds to the real-world phenomenon that gives rise to practical data.
Thus, we generally assume that  is unknown and define the true error (or risk) of a formula
 ∈ PL[ ] with respect to  as
,
.
err () ∶= Pr [ ⊧  ⊕ ] =
∼
      </p>
      <p>∑
 ∈  ∪{}
⊨⊕
().</p>
      <p>This is the probability that  disagrees with  . Our goal is to obtain formulas with a small true
error. This is made dificult by the fact that  is unknown. We can, however, estimate the true
error via available data.</p>
      <p>Let  = ( ,  )</p>
      <p>be a  ∪ {} -model. For us,  corresponds to the available tabular data. Given
a propositional formula  ∈</p>
      <p>PL[ ] , we define the empirical error (or empirical risk) of  with
respect to  as
err () ∶=
| ( ⊕ )|
| |
.</p>
      <p>The empirical error err () is easily computable as the proportion of points where  disagrees
with  . If  is fairly sampled from  , then by the law of large numbers err () →
err () almost
surely when | | → ∞ .</p>
      <p>
        Given a distribution  , the formula
 id ∶= ⋁ { ∈   ∣
( ∧ )
()
≥ 1/2},
which we call the ideal classifier , has the smallest true error with respect to  among the
formulas in PL[ ] . This is a syntactic, logic-based representation of what is known as the Bayes
classifier in the literature [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], not to be confused with naive Bayesian classifiers. A Bayes
classifier always gives the best possible prediction, and clearly so does an ideal classifier. Now,
as  is unknown, the ideal classifier is again a theoretical goal for us to approximate.
      </p>
      <p>Given a  ∪ {} -model  = ( ,  )</p>
      <p>, the formula
 id ∶= ⋁ { ∈   ∣
| ( ∧ )|
| ()|
which we call the empirical ideal classifier , has the smallest empirical error with respect to
 among the formulas in PL[ ] . This formula is easily computable and an essential tool of our
study.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Feature selection and overfitting</title>
      <p>Our general goal is to use, for a suitable set  of attributes, the empirical ideal classifier to
approximate the ideal classifier. In this section, we specify our methodology and show bounds
on suficient sample size to guarantee that the two classifiers are identical with high probability.</p>
      <p>Let  be a small set of promising attributes chosen from the initially possibly large set of all
attributes. We describe a quadratic time algorithm to obtain the empirical ideal classifier. The
pseudocode Algorithm 1 below describes a formal implementation of this algorithm. Basically,
we scan the points  ∈ 
once. For every  -type  that is realized in  = ( ,  )
, we initiate and
maintain two counters,   and   . The first counter   counts how many times  is realized in  ,
while   counts how many times  ∧  is realized in  . The number   /  is then the probability
| ( ∧ )|/| ()|</p>
      <p>. The empirical ideal classifier 
over all the types  which are realized in  and for which   /  ≥ 1/2.

 can be constructed by taking a disjunction
Algorithm 1 Compute the ideal classifier</p>
      <p>Input: a ( ∪ {}) -model  = ( ,  )
1:  
2: for  ∈ 
3:
4:
5:
6:
7:
8:
9:</p>
      <p>do
 ← the  -type of 
if  ∉   then</p>
      <p>←   ∪ {}
  ,   ← 0, 0
  ←   + 1
if  ∈  ()</p>
      <p>then
 ←   + 1
if   /  ≥ 1/2 then</p>
      <p>←   ∨ 
10:</p>
      <p>12:
13:</p>
      <p>← ⊥
11: for  ∈   do
14: return  



← ∅ {All the  -types realized in  will be stored in the set   }</p>
      <p>It is clear that this algorithm runs in polynomial time with respect to the size of  , the
size being  (| || |) . A more precise analysis shows that the running time of this algorithm is
 (| ||  || |), where |  | counts the number of  -types that are realized in  . Since |  | ≤ | | ,
this gives a worst case time complexity of  (| | 2| |) . If  has a fixed size bound, as in our
experiments below, then this reduces to a linear time algorithm. Moreover, we note that clearly
the size |  | is  (| || |) .</p>
      <p>A full step-by-step description of our method follows:
1. We begin with a tabular dataset with features  1, … ,   ,  , where  is a Boolean target
attribute. We denote this full training data by  0.
2. We randomly separate 30 percent of  0 as validation data  val. The remaining part we
call the training data  train. We then discretize (or Booleanize) both of these datasets,
ending up with tabular datasets with strictly Boolean attributes  1, … ,   ,  . To this end,
we simply chop the numerical attributes at the median value of the training data  train,
above median meaning “yes” and at most median corresponding to “no”. For categorical
attributes we use one-hot encoding.
3. We iterate steps 4 and 5 for increasing numbers ℓ of features from 1 to 10.
4. We run three feature selection procedures on the training data, each selecting ℓ of the
attributes  1, … ,   to be used for classification.
5. Suppose one procedure selected the set  = {  1, … ,   ℓ} of features. We use Algorithm 1
to compute the empirical ideal classifier for  on the data  train using the attributes in  .
Note that this is the best possible classifier for  over  , given in DNF-form. We do this
step for all three sets of features given by the procedures. We select the formula with
the highest validation accuracy  on the data  val and record the tuple (ℓ,  ,  ) for future
steps.
6. Once step 5 has halted, we look back at the tuples (ℓ,  ,  ) recorded. We select the feature
set  ℓ corresponding to the smallest number ℓ with validation accuracy  ℓ within one
percentage point of the best accuracy in the full sequence</p>
      <p>(1,  1,  1), … , (10,  10,  10).</p>
      <p>We run the Booleanization again using the median of the full training data  0 and compute
the best possible classifier one last time using the selected set  ℓ of features and the data
 0.
7. Finally, we simplify the selected DNF-formula via simplify_logic from SymPy. This
gives a potentially simpler logically equivalent DNF-formula.</p>
      <p>To demonstrate the robustness of our method, the discretization steps are performed very
roughly, using simply the medians. Also the specific feature selection procedures are not in any
way custom-made for our method; we use three readily available ones (see Sect. 4 for details)
and use the one with the best accuracy.</p>
      <p>Regarding step 5, choosing the formula with the best validation accuracy is done to avoid
overfitting to the training data. With the increase of the number of features, the validation
accuracy generally first improves and at some point starts to decline. This decline is a sign of
overfitting. As overfitting often occurs already for small numbers of selected attributes, we
may regard overfitting as useful for the method, leading to short formulas. However, in some
cases, the accuracy stagnates and the overfitting point seems hard to find. In these cases, we
nevertheless stop at ten features, the maximum considered in the method. For some datasets it
could be necessary to go further before the accuracy stagnates, but for our experiments, ten
features was (more than) enough.</p>
      <p>While the last formula obtained in step 5 can be quite long, step 6 helps to obtain shorter
formulas. By choosing an earlier formula in the obtained sequence with almost the same
accuracy, we can reduce the number of features used, drastically improving the interpretability
of our formulas without sacrificing much accuracy.</p>
      <p>The formula obtained from step 6 is often very short and in most cases even globally
interpretable. Nevertheless, as a final step, a standard formula simplifying tool simplify_logic
from SymPy is used, and this gives a potentially even shorter DNF-formula. We note that even
without using simplify_logic, all formulas encountered in our experiments are readily locally
interpretable. Recall from the Introduction that in the local interpretation process, a positively
classified input will match with precisely one conjunction of the DNF-classifier, and a negatively
interpreted one will clash with at least one attribute of each conjunction.</p>
      <sec id="sec-4-1">
        <title>3.1. Bounds on sample size</title>
        <p>As our method consists of using the empirical ideal classifier as an approximation of the ideal
classifier, we would like to have some guarantees on when the two classifiers are the same. We
next present a theorem which tells us how large samples we need in order to be confident that
the empirical ideal classifier is also the true ideal classifier.</p>
        <p>
          The lower bound provided by our theorem depends in a crucial way on how “dificult” the
underlying distribution  is. More formally, we say that a probability distribution  ∶  ∪{} →
[
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ]  -separates  ∈   , if we have that |( ∧ ) − ( ∧ ¬)| ≥  . The larger the  is, the easier
it is to detect via sampling which of the types  ∧  and  ∧ ¬ has the higher probability of
occurring.
        </p>
        <p>
          The formal statement of the theorem is now as follows. See Appendix 6.1 for the proof.
Theorem 3.1. Fix a vocabulary  , a proposition symbol  ∉  and a probability distribution
 ∶  ∪{} → [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ]. Let ,  &gt; 0 and
 ≥
        </p>
        <p>Then with probability at least 1 −  , the empirical ideal classifier with respect to a sample  of
size  agrees with the ideal classifier with respect to  on every  ∈   which is  -separated by  . In
particular, if   -separates every  ∈   , then the empirical ideal classifier is the ideal classifier with
probability at least 1 −  .</p>
        <p>Corollary 3.2. Fix a vocabulary  , a proposition symbol  ∉  and a probability distribution
 ∶  ∪{}</p>
        <p>
          → [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ]. Let ,  &gt; 0 and
Then with probability at least 1 −  , we have that
 -separated by  , then
Proof. Fix ,  &gt; 0 . Given  &gt; 0 we know that if   agrees with   on every  ∈   which is

err (  ) − err (  ) =
        </p>
        <p>|( ∧ ) − ( ∧ ¬)|
 ≥
22||+1 ln(2||+1 / )
 2</p>
        <p>.</p>
        <p>err (  ) &lt; err (  ) + .</p>
        <p>∑
∈ 
 does not  -separate 
&lt; |  |.</p>
        <p>Setting  ∶= /|  | and applying Theorem 3.1 gives us the desired result.</p>
        <p>To illustrate the use of this latter bound, suppose that | | = 3,  = 0.01
and  = 0.05 . Corollary
3.2 shows that we need a sample</p>
        <p>of size at least 18887 to know that with probability at least
0.99 the theoretical error of   is less than that of  
in Section 4 contain three datasets (Covertype, Electricity, RoadSafety) that have at least 18887
data points. Thus we can be fairly confident that for these datasets any empirical ideal classifier
that uses at most three features obtains an accuracy which is similar to that of the true ideal
 plus 0.05. The datasets that we consider
classifier on those features.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <sec id="sec-5-1">
        <title>4.1. Experimental setup</title>
        <p>
          We compared our method empirically to random forests and XGBoost. These two comparison
methods are very commonly used and state-of-the-art for tabular data. We tested all three
methods on 12 tabular datasets. The datasets can be categorized as follows.
1. 7 binary classification tasks from the UCI machine learning repository. Five of these
were selected arbitrarily, while two further ones (BreastCancer and GermanCredit) were
randomly chosen among the ones used in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. One of the 7 datasets (StudentDropout)
was originally a ternary classification task; we converted it into a binary one.
2. 4 tabular-data benchmarks out of 7 binary classification benchmarks that were presented
in the paper [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. These datasets were also originally from the UCI machine learning
repository.
3. a high-dimensional binary classification task containing biological data from an
opensource feature selection repository presented in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <sec id="sec-5-1-1">
          <title>BankMarketing</title>
        </sec>
        <sec id="sec-5-1-2">
          <title>BreastCancer</title>
        </sec>
        <sec id="sec-5-1-3">
          <title>CongressionalVoting</title>
        </sec>
        <sec id="sec-5-1-4">
          <title>GermanCredit</title>
        </sec>
        <sec id="sec-5-1-5">
          <title>HeartDisease</title>
        </sec>
        <sec id="sec-5-1-6">
          <title>Hepatitis StudentDropout</title>
          <p>also include a comparison to the formula size method. When available, we have also included accuracies
reported in literature, though these are not directly comparable due to diferent technical particularities
in the experiments.</p>
          <p>All rows with missing values were removed from these datasets.3 Most of the datasets contained
both categorical and numerical features. While our method includes a rough median-based
Booleanization of the data as discussed in Section 3, random forests and XGBoost used the
original non-Booleanized data.
3In the case of the Hepatitis dataset we removed two columns that had several missing values.</p>
          <p>For all but one of the datasets we use ten-fold cross-validation for all methods. That is, we
split the data into ten equal parts and for each such 10 percent part we input the remaining
90 percent into the method being tested, obtain a classifier (using the chosen 90 percent of
the data) and record its accuracy on the 10 percent part. We report the averages and standard
deviations of the accuracies over the ten 90/10 splits.</p>
          <p>For the high-dimensional dataset Colon with less than 100 data points, we use leave-one-out
cross-validation. That is, for each data point, we set the point aside and input the remaining data
into the method. We test whether the obtained classifier classifies the omitted point correctly
or not. We report the average accuracy over all data points. This is a well-known standard
practice for small datasets.</p>
          <p>As mentioned in Section 3, our method utilizes readily available feature selection methods.
We used Scikit-learn’s SelectKBest, which returns  features that have the highest scores based
on a given univariate statistical test. SelectKBest supports three methods for calculating the
scores for the features: F-test, mutual information, and  2-test. When generating formulas,
we tested all three of these methods and selected the best in terms of validation accuracy. We
emphasize that feature selection was performed after Booleanizing the data.</p>
          <p>If the empirical ideal classifier is allowed to use too many features, it will most likely overfit on
its training data. As already discussed, we used nested cross-validation to determine how many
features the empirical ideal classifier can use without overfitting. For nested cross-validation
we used a 70/30-split. In all of the experiments we allowed the empirical ideal classifier to use
at most ten attributes. This is a hyperparameter one could optimize for each dataset separately.
For simplicity we used the same limit for all datasets. In all of our experiments, the validation
accuracy either declined or stagnated well before ten features, indicating that a larger number is
not needed. On the other hand, for interpretability, ten features is perhaps even a bit excessive.</p>
          <p>
            For random forests we used Scikit-learn’s implementation. For both random forests and
XGBoost, nested cross-validation was used for tuning the hyperparameters. As for our method,
we used a 70/30-split. For hyperparameter optimization, we used Optuna [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] with the number
of trials being 100. The hyperparameter spaces used for random forests and XGBoost were the
same as in [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. For the readers convenience, we also report the used hyperparameter spaces in
Appendix 6.2.
          </p>
          <p>
            We also ran the method of [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] for some datasets. As in the Introduction, we refer to their
approach as the formula-size method. This method is computationally resource consuming, so
we only ran it for the seven UCI repository datasets. The high-dimensional biological data and
benchmark data would be too dificult for the method.
          </p>
          <p>For the formula-size method we used the openly available implementation from https://
github.com/asptools/benchmarks. The runs were conducted on a Linux cluster featuring Intel
Xeon 2.40 GHz CPUs with 8 CPU cores per run, employing a timeout of 72 hours per instance
and a memory limit of 64 GB.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Results</title>
        <p>
          Figure 1 contains the average test accuracies and standard deviations that were obtained by
using our new method, random forests and XGBoost using ten-fold cross-validation. The left
side of Figure 1 also contains results obtained with the formula-size method of [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We also
report accuracies that we found in the literature for these datasets. For the left side of Figure 1,
we include the accuracies that were reported for these datasets in the UCI machine learning
repository. For the right side of Figure 1, for the case of Colon, we used the result reported in
[
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], while for remaining datasets we reported the accuries given in the full-version of [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
          We emphasize that the accuracies found from the literature are not directly comparable to the
accuracies that we obtained with our method. For example, in some cases a single 70/30-split
was used instead of our ten-fold cross-validation. The conventions concerning the handling of
missing values could also difer from ours. Furthermore, the accuracies in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] are given to the
nearest percentage point.
        </p>
        <p>
          From Figure 1 we see that our method produces classifiers that mostly have accuracies
similar to the ones obtained by the alternative methods. In particular, on medical data such as
BreastCancer and Hepatitis, our method compares well to the reference methods. Perhaps the
biggest gaps are found in the cases of CoverType and Electricity, two of the benchmark datasets.
However, we emphasize that in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], it was mentioned that all of the benchmark datasets were
specifically selected to be “not too easy”. Hence, one should not perhaps expect that “simple”
classifiers with focus on interpretability will perform too well with these datasets.
        </p>
        <p>We list the average runtime of our method on a single split of the data in the Appendix 6.3.
For six of the datasets, the average runtime was less than 30 seconds. Of the remaining datasets,
four had runtimes of less than 12 minutes. Even for the two largest datasets, Covertype and
RoadSafety, the runtime is still less than an hour and, notably, systematically less than that of
the hyperparameter optimizations of random forests and XGBoost. The experiments on our
method, random forests and XBGoost were run on a laptop.</p>
        <p>We turn our attention to our main goal of interpretability. In Table 1 we report the average of
the smallest number of features suficient to reach the reported accuracy. Table 1 also gives the
number of rows and attributes in the Booleanized datasets as well as the original datasets. We
see from Table 1 that for eight out of 12 datasets, the best accuracy was reached already with
on average less than four features, leading to very short formulas. Such short DNF-formulas
are globally interpretable as follows. To interpret a short DNF-formula, one looks at each of the
few conjunctions in the formula individually to determine their meaning. This is easy as each
conjunction has very few attributes. The meaning of the entire formula is then given by the
fact that it accepts only the cases given by the conjunctions and rejects everything else.</p>
        <p>
          Next we discuss the results obtained using the formula-size method of [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. By the size of a
propositional formula they mean the number of proposition symbols, conjunctions, disjunctions
and negations in the formula. The results are given in Figure 1.
        </p>
        <p>We see that for most datasets on the left side in Figure 1, the accuracies obtained by the
formula-size method are similar to our method. A meaningful diference can be seen in the
HeartDisease dataset, where both the accuracy of 73.6 percent and standard deviation of 12.2
percent are significantly behind other methods. The formulas obtained are short, with formula
size mostly less than 10.</p>
        <p>The downside of the formula-size method is computational resources. As described in the
previous subsection, we used a Linux cluster featuring Intel Xeon 2.40 GHz CPUs with 8
CPU cores per run, employing a timeout of 72 hours per instance and a memory limit of 64
GB. For all but one dataset we ran the formula-size method on, the computation took more
than 24 hours per split of the data with this high eficiency setup. Even for the easiest one,
BreastCancer, the computation time was in the order of 12 hours per split. Furthermore, for
the benchmark datasets and the high-dimensional biological dataset Colon, the formula-size
method is unfeasible already for very small formula sizes and thus we do not report any results
for those datasets.</p>
        <p>In contrast to the formula-size method, for all but the two largest datasets, the runs of our
method took less than 12 minutes per split, and in six cases only seconds. More detailed runtimes
are reported in Appendix 6.3. We conclude that our method obtains similar or better results
than the formula-size method with a fraction of the computational resources on datasets where
both methods are feasible. Our method is also usable on larger datasets, where the formula-size
method is is not.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Examples of obtained formulas</title>
        <p>Here we present selected examples of formulas that were generated by our method. Firstly, we
refer the reader back to the example formulas concerning the BreastCancer and Colon datasets
presented in the Introduction. Both of these formulas were very short and had high accuracies.</p>
        <p>Another example formula is from the HeartDisease dataset, where the task is to determine,
whether a patient has a heart disease. We obtain the formula</p>
        <p>( 1 ∧  2) ∨ ( 1 ∧ ¬ 3) ∨ ( 2 ∧ ¬ 3),
where the attributes  1,  2 and  3 relate to a color test of blood vessels, fixed thallium defects
and chest pain, respectively. This formula is obtained from two diferent splits of the data and
the respective accuracies are 76.7 percent and 86.7 percent. The average accuracy of our method
on this dataset is 81.2 percent. Random forests average at 81.8 percent, XGBoost at 80.8 percent
and the formula size method at 73.6 percent.</p>
        <p>The Hepatitis dataset concerns the mortality of patients with hepatitis. The formula
is obtained from five of the ten splits of the data. Here  1,  2 and  3 relate to albumin, bilirubin
and a histology test respectively. The accuracies of this formula on the five diferent test datasets
range from 61.5 percent to 92.3 percent. The average accuracy of our method on the Hepatitis
dataset is 80.7 while random forests get 78.2 percent and XGBoost 79.0 percent. The high
variance seems to be due to the data only consisting of 155 data points. This is supported by the
fact that all of the tested methods obtain a high standard deviation of accuracy on this dataset.</p>
        <p>In the Appendix 6.4, there are further examples of formulas obtained as outputs of our
method. For each dataset, we list the formula obtained for the first 90/10 split of our ten-fold
cross-validation. While some of these formulas are quite long, they should still be rather fast to
use for local explanations explicating why a particular input was classified in a particular way
(cf. the Introduction for details on local explanations). For these experiments, we used the same
maximum number 10 of features and the same limit of 1 percentage point to choose an accurate
enough formula. These are hyperparameters that one could optimize to obtain a better balance
of interpretability and accuracy on any specific dataset. For some datasets, raising the number
of maximum features could lead to more accurate, but longer, formulas. Raising the percentage
threshold on the other hand would lead to shorter, more interpretable formulas at the cost of
some accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions</title>
      <p>We have introduced a new method to compute immediately interpretable Boolean classifiers
for tabular data. While the main point is interpretability, even the accuracy of our formulas is
similar to the ones obtained via widely recognized current methods. We have also established a
theoretical result for estimating if the obtained formulas actually correspond to ideally accurate
ones in relation to the background distribution. In the future, we will especially consider
more custom-made procedures of discretization in the context of our method, as this time
discretization was carried out in a very rough way to demonstrate the efectiveness and potential
of our approach. We expect this to significantly improve our method over a notable class of
tabular datasets.</p>
      <p>Our approach need not limit to Boolean formulas only, as we can naturally extend our work to
general relational data. We can use, e.g., description logics and compute concepts  1, … ,   and
then perform our procedure using  1, … ,   , finding short Boolean combinations of concepts.
This of course difers from the approach of computing empirical ideal classifiers in the original
description logic, but can nevertheless be fruitful and interesting. We leave this for future work.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>T. Janhunen, A. Kuusisto, M. F. Rankooh and M. Vilander were supported by the Academy of
Finland consortium project Explaining AI via Logic (XAILOG), grant numbers 345633 (Janhunen)
and 345612 (Kuusisto). A. Kuusisto and M. Vilander were also supported by the Academy of
Finland project Theory of computational logics, grant numbers 352419, 352420, 353027, 324435,
328987.</p>
    </sec>
    <sec id="sec-8">
      <title>6. Appendix.</title>
      <sec id="sec-8-1">
        <title>6.1. Proof of Theorem 3.1</title>
        <p>
          Let  be a propositional vocabulary and  ∉  the proposition symbol that we need to explain. Let
 + ∶=  ∪ {} and fix a probability distribution  ∶   + → [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ]. In what follows, for notational
simplicity we will assume that ( ∧ ) &gt; ( ∧ ¬) , for every  -type  .
then
and
the number of times  is realized in  . The main idea of the proof is to show that if
Let  be a sample of size  , i.e., a  +-model of size  . For each  +-type  we use J K to denote
 ≥
 2
,
Pr [for every  -type  for which ( ∧ ) − ( ∧ ¬) ≥ 
we have that J∧ K &gt; J∧¬ K ] ≥ 1−.
        </p>
        <p>Indeed, the above clearly implies that also with probability at least 1 −  we have that the
empirical ideal classifier with respect to  agrees with the ideal classifier with respect to all
 -types  with ( ∧ ) − ( ∧ ¬) ≥ 
holds provided that
Setting  ∶= /2 , we have by assumption that</p>
        <p>J ∧  K &gt; J ∧ ¬ K
( ∧ ) − ( ∧ ¬) ≥ 2.
≥ Pr[(( ∧ ) − ) &lt;
Pr[J ∧  K ≥ J ∧ ¬ K ]
≥ 1 − (Pr[(( ∧ ) − ) ≥</p>
        <p>J ∧  K and J ∧ ¬ K</p>
        <p>&lt; (( ∧ ¬) + )]</p>
        <p>
          J ∧  K ] + Pr[J ∧ ¬ K ≥ (( ∧ ¬) + )])
For the second inequality we used the union bound. To get a lower bound of 1 −  on the latter
probability, it sufices to show that
Hoefding’s inequalities [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] imply that these bounds hold provided that
        </p>
        <p>Pr[(( ∧ ) − ) ≥</p>
        <p>J ∧  K ] ≤
Pr[J ∧ ¬ K ≥ (( ∧ ¬) + )] ≤

2

2
 ≥
ln(2/)
2 2
=
2 ln(2/)
 2</p>
        <p>.
2 ln(2/)
 2
,
Thus, if we take sample  of at least size
then with probability at least 1 −  we have that J ∧  K &gt; J ∧ ¬ K .
provided that
In particular, we will then have that
Pr [for every  -type  for which ( ∧ ) − ( ∧ ¬) ≥ 
which is what we wanted to show.</p>
        <p>Pr[J ∧  K ≤ J ∧  K ] ≤ /2 || ,
 ≥
ln(2||+1 /)
2 2
=
 2
.</p>
        <p>So far we focused on a single  -type  . Using union bound again we see that
where the sum is performed with respect to  -types  for which ( ∧ ) − ( ∧ ¬) ≥  . Setting
again  ∶= /2 it follows from our previous calculations that for every  ∈   for which
( ∧ ) − ( ∧ ¬) ≥  we have that</p>
      </sec>
      <sec id="sec-8-2">
        <title>6.2. Hyperparameter spaces for Random Forest and XGBoost</title>
        <p>
          Hyperparameters not listed here were kept at their default values.
6.2.1. Random Forest
• max_depth: None, 2, 3, 4
• criterion: gini, entropy
• n_estimators: Integer sampled from [9.5, 3000.5] using log domain
• max_features: sqrt, log2, None, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9
• min_samples_split: 2, 3
• min_samples_leaf: Integer sampled from [1.5, 50.5]
• bootstrap: True, False
• min_impurity_decrease: 0.0, 0.01, 0.02, 0.05
we have that J∧ K &gt; J∧¬ K ] ≥ 1−,
6.2.2. XGBoost
• max_depth: Integer sampled from [
          <xref ref-type="bibr" rid="ref1 ref11">1,11</xref>
          ]
• n_estimators: Integer sampled from [100,5900] with step-size being 200.
• min_child_weight: Float sampled from [1.0, 100.0] using log domain
• subsample: Float sampled from [0.5, 1.0]
• learning_rate: Float sampled from [1e-5, 0.7] using log domain
• colsample_bylevel: Float sampled from [0.5, 1.0]
• gamma: Float sampled from [1e-8, 7.0] using log domain
• reg_lambda: Float sampled from [1.0, 4.0] using log domain
• reg_alpha: Float sampled from [1e-8, 100.0] using log domain
        </p>
      </sec>
      <sec id="sec-8-3">
        <title>6.3. Runtimes</title>
        <p>We report the average runtime of our method over the diferent splits of the ten-fold (or in the
case of Colon leave-one-out) cross-validation. The experiments were run on a standard laptop.</p>
      </sec>
      <sec id="sec-8-4">
        <title>6.4. Example formula for each dataset</title>
        <p>Here we report for each dataset the formula given by the first of the ten splits in our ten-fold
cross-validation.
6.4.1. BankMarketing

_
_ 
_ 
∧
¬       
¬       
6
.
4
.
3
.</p>
        <p>C
o
n
g
r
e
s
s
i
o
n
a
l
V
o
t
i
n</p>
        <p>g
6
.
4
.
4
.</p>
        <p>G
e
r
m
a
n</p>
        <p>C
r
e
d
i
t

ℎ
    
_
 
_
    
_
_ 
_ 
_   
_   
_   
_  
_  
_  
( 
_
0
∧
ℎ 
_
2
∧
¬
      
_
    
_
     )
     
_
    
_
    
∨
¬
ℎ
      
6
.
4
.
5
.</p>
        <p>H
e
a
r
t
D
i
s
e
a
s
e
6
.
4
.
6
.</p>
        <p>H
e
p
a
t
i
t
i
s
6
.
4
.
7
.</p>
        <p>S
t
u
d
e
n
t
D
r
o
p
o
u</p>
        <p>t</p>
        <p>C
o
v
e
r
t
y
p</p>
        <p>e
_   
_   
(       
_   
_
1 
_  
_    
∧
¬</p>
        <p>_  
_
2 
_
1 
_  
_  
_      
_      
_   
_
2 
_  
_      
6
.
4
.
1
0
.</p>
        <p>E
y
e</p>
        <p>M
o
v
e
m
e
n</p>
        <p>t
∧
¬     
.
4
.
1
1
.</p>
        <p>R
o
a
d</p>
        <p>S
a
f
e
t
y
_
.
4
.
1
2
.</p>
        <p>C
o
l
o</p>
        <p>n
F
o
r
t
h
e
h
i
g
h
d
i
m
e
n
s
i
o
n
a
l
d
a
t
a
s
e
t</p>
        <p>C
o
l
o
n
u
s
e
d
l
e
a
v
e
o
n
e
o
u
t
c
r
o
s
s
v
a
l
i
d
a
t
i
o
n
;
f
o
r
m
u
l
a
o
b
t
a
i
n
e
d
f
r
o
m
l
e
a
v
i
n
g
t
ifr
s
t
d
a
t
a
p
o
i
n
t
o
u
t</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>EU-GDPR</surname>
          </string-name>
          ,
          <article-title>Regulation (EU) 2016/679 of the European parliament and of the council</article-title>
          ,
          <year>2016</year>
          . URL: https://www.privacy-regulation.eu/en/index.htm.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>California-OAG</surname>
          </string-name>
          ,
          <article-title>CCPA regulations: Final regulation text</article-title>
          ,
          <year>2021</year>
          . URL: https://oag.ca.gov/ privacy/ccpa.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shwartz-Ziv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Armon</surname>
          </string-name>
          ,
          <article-title>Tabular data: Deep learning is not all you need</article-title>
          ,
          <source>Inf. Fusion</source>
          <volume>81</volume>
          (
          <year>2022</year>
          )
          <fpage>84</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Borisov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Leemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Seßler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Haug</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pawelczyk</surname>
          </string-name>
          , G. Kasneci,
          <article-title>Deep neural networks and tabular data: A survey</article-title>
          ,
          <source>CoRR abs/2110</source>
          .
          <year>01889</year>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2110.
          <year>01889</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pawelczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Broelemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          ,
          <article-title>Learning model-agnostic counterfactual explanations for tabular data</article-title>
          , in: Y.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>King</surname>
          </string-name>
          , T. Liu, M. van Steen (Eds.),
          <source>WWW '20: The Web Conference</source>
          <year>2020</year>
          , Taipei, Taiwan,
          <source>April 20-24</source>
          ,
          <year>2020</year>
          , ACM / IW3C2,
          <year>2020</year>
          , pp.
          <fpage>3126</fpage>
          -
          <lpage>3132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahakyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Aung</surname>
          </string-name>
          , T. Rahwan,
          <article-title>Explainable artificial intelligence for tabular data: A survey</article-title>
          ,
          <source>IEEE Access 9</source>
          (
          <year>2021</year>
          )
          <fpage>135392</fpage>
          -
          <lpage>135422</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <article-title>Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>1</volume>
          (
          <year>2019</year>
          )
          <fpage>206</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Holte</surname>
          </string-name>
          ,
          <article-title>Very simple classification rules perform well on most commonly used datasets</article-title>
          ,
          <source>Mach. Learn</source>
          .
          <volume>11</volume>
          (
          <year>1993</year>
          )
          <fpage>63</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Grinsztajn</surname>
          </string-name>
          , E. Oyallon, G. Varoquaux,
          <article-title>Why do tree-based models still outperform deep learning on typical tabular data?</article-title>
          ,
          <source>in: Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track</source>
          ,
          <year>2022</year>
          . URL: https://openreview.net/ forum?id=Fp7__phQszn.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>Xgboost: A scalable tree boosting system</article-title>
          , in: B.
          <string-name>
            <surname>Krishnapuram</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          <string-name>
            <surname>Smola</surname>
            ,
            <given-names>C. C.</given-names>
          </string-name>
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Shen</surname>
          </string-name>
          , R. Rastogi (Eds.),
          <source>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , San Francisco, CA, USA,
          <year>August</year>
          13-
          <issue>17</issue>
          ,
          <year>2016</year>
          , ACM,
          <year>2016</year>
          , pp.
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          , Random forests,
          <source>Machine Learning</source>
          <volume>45</volume>
          (
          <year>2001</year>
          )
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          . doi:
          <volume>10</volume>
          .1023/A:
          <fpage>1010933404324</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jaakkola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Janhunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kuusisto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Rankooh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vilander</surname>
          </string-name>
          ,
          <article-title>Short Boolean formulas as explanations in practice, in: S. A</article-title>
          .
          <string-name>
            <surname>Gaggl</surname>
            ,
            <given-names>M. V.</given-names>
          </string-name>
          <string-name>
            <surname>Martinez</surname>
          </string-name>
          , M. Ortiz (Eds.),
          <source>Logics in Artificial Intelligence - 18th European Conference, JELIA</source>
          <year>2023</year>
          , Dresden, Germany,
          <source>September 20-22</source>
          ,
          <year>2023</year>
          , Proceedings, volume
          <volume>14281</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2023</year>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Dekking</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kraaikamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Lopuhaä</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Meester</surname>
          </string-name>
          , A Modern Introduction to Probability and Statistics:
          <source>Understanding Why and How</source>
          , Springer Texts in Statistics, Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Marques-Silva</surname>
          </string-name>
          ,
          <article-title>Logic-based explainability in machine learning</article-title>
          , in: L. E. Bertossi, G. Xiao (Eds.),
          <source>Reasoning Web. Causality, Explanations and Declarative Knowledge - 18th International Summer School</source>
          <year>2022</year>
          , Berlin, Germany,
          <source>September 27-30</source>
          ,
          <year>2022</year>
          , Tutorial Lectures, volume
          <volume>13759</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2022</year>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Darwiche</surname>
          </string-name>
          ,
          <article-title>A symbolic approach to explaining Bayesian network classifiers</article-title>
          , in: J.
          <string-name>
            <surname>Lang</surname>
          </string-name>
          (Ed.), IJCAI,
          <year>2018</year>
          , pp.
          <fpage>5103</fpage>
          -
          <lpage>5111</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ignatiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Narodytska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Marques-Silva</surname>
          </string-name>
          ,
          <article-title>Abduction-based explanations for machine learning models</article-title>
          ,
          <source>in: The Thirty-Third AAAI Conference on Artificial Intelligence</source>
          ,
          <source>AAAI</source>
          <year>2019</year>
          ,
          <source>The Thirty-First Innovative Applications of Artificial Intelligence Conference</source>
          ,
          <string-name>
            <surname>IAAI</surname>
          </string-name>
          <year>2019</year>
          ,
          <source>The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI</source>
          <year>2019</year>
          , Honolulu, Hawaii, USA, January 27 - February 1,
          <year>2019</year>
          , AAAI Press,
          <year>2019</year>
          , pp.
          <fpage>1511</fpage>
          -
          <lpage>1519</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Angelino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Larus-Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seltzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <article-title>Learning certifiably optimal rule lists for categorical data</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>18</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ustun</surname>
          </string-name>
          ,
          <article-title>Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice</article-title>
          ,
          <source>Interfaces</source>
          <volume>48</volume>
          (
          <year>2018</year>
          )
          <fpage>449</fpage>
          -
          <lpage>466</lpage>
          . doi:
          <volume>10</volume>
          .1287/inte.
          <year>2018</year>
          .
          <volume>0957</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Devroye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Györfi</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Lugosi, A Probabilistic Theory of Pattern Recognition</article-title>
          , volume
          <volume>31</volume>
          <source>of Stochastic Modelling and Applied Probability</source>
          , Springer,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , K. Cheng,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Morstatter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Trevino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , H. Liu,
          <article-title>Feature selection: A data perspective</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 50</source>
          (
          <year>2018</year>
          )
          <fpage>94</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Akiba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yanase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koyama</surname>
          </string-name>
          ,
          <article-title>Optuna: A next-generation hyperparameter optimization framework</article-title>
          ,
          <source>in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <year>2019</year>
          , p.
          <fpage>2623</fpage>
          -
          <lpage>2631</lpage>
          . doi:
          <volume>10</volume>
          .1145/3292500.3330701.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lindenbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kluger</surname>
          </string-name>
          ,
          <article-title>Locally sparse neural networks for tabular biomedical data</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hoefding</surname>
          </string-name>
          ,
          <article-title>Probability inequalities for sums of bounded random variables</article-title>
          ,
          <source>Journal of the American Statistical Association</source>
          <volume>58</volume>
          (
          <year>1963</year>
          )
          <fpage>13</fpage>
          -
          <lpage>30</lpage>
          . doi:
          <volume>10</volume>
          .1080/01621459.
          <year>1963</year>
          .
          <volume>10500830</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>