<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GLEAMS: Bridging the Gap Between Local and Global Explanations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio Visani</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Stanzione</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damien Garreau</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Bologna University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Julius-Maximilians-Universität Würzburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Leithà S.r.l. - Unipol Group</institution>
          ,
          <addr-line>via Stalingrado 53, Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The explainability of machine learning algorithms is crucial, and numerous methods have emerged recently. Local, post-hoc methods assign an attribution score to each feature, indicating its importance for the prediction. However, these methods require recomputing explanations for each example. On the other hand, while global approaches exist, they often produce explanations that are either overly simplistic and unreliable or excessively complex. To bridge this gap, we propose GLEAMS, a novel method that partitions the input space and learns an interpretable model within each sub-region, thereby providing both faithful local and global surrogates. We demonstrate GLEAMS' effectiveness on both synthetic and real-world data, highlighting its desirable properties and human-understandable insights.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable AI</kwd>
        <kwd>Global / local</kwd>
        <kwd>Partition-based machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In numerous applications, the data present themselves as large arrays—tabular data, where each
line corresponds to a data point and each column to a feature. In this setting, deep learning is
becoming as popular as it already is for more structured data types, and deep neural networks
achieve state-of-the-art in many scenarios [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A caveat of this approach is the difficulty of
obtaining insights on a specific prediction: the intricate architectures and the sheer number of
parameters prevent the user from getting a clear picture of what is actually important to the model.
      </p>
      <p>
        Since getting such explanations is desirable in many applications, and could even become
mandated by law in certain regions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], many interpretability methods have been proposed.
Without getting into details (we refer to the recent surveys [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for this purpose), we
want to make two distinctions clear. First, it is challenging for interpretable models to have
the same accuracy as non-interpretable ones. As a consequence, many explainability methods
rely on a post hoc approach, i.e., producing explanations in a second step, exploiting already
trained black-box models. This is the road we take, since post-hoc methods allow us to deal with
any given model and not being dependent on a specific architecture. Second, the explanations
can either be local (related to a specific example), or global (for all the input space). A problem
with local explanations is that they have to be computed individually for each new example,
which can be time-consuming. For instance, LIME [5] generates 5,000 new datapoints for each
example to explain. The black-box model then has to be queried on each of these points.
      </p>
      <p>In this paper, we propose GLEAMS (Global &amp; Local ExplainAbility of black-box Models
through Space partitioning), a post hoc interpretability method providing both global and local
explanations. Our goal is to build a global surrogate f̂ of the black-box model f that shares its global
shape but, at the same time, has a simple local structure. Our main motivation
in doing so is to extract simultaneously from f̂: i) local explanations for any instance; ii)
overall importance of the features. Secondary objectives are the production of “what-if”-type
explanations, and visual summaries. In more detail, the core idea of GLEAMS is to recursively
split the input space into rectangular cells on which the black-box model can be well-approximated
by linear models. We describe the method in Section 2. The local explanations are then simply
given by the feature coefficients of these local surrogate models, while global indicators can be
readily computed from those, without querying the model f further. We describe how to obtain
the other indicators in Section 2.4, and show experimentally that GLEAMS compares favorably
with existing methods in Section 3. GLEAMS code as well as the experiments for this paper are
publicly available.1 Related work can be found in Appendix A.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Description of the method</title>
      <p>We consider the following general setting: f is a mapping from the input space X to the output
space Y. This corresponds to the usual situation in machine learning, where y = f(x) is a good
prediction of the true label of x. In this paper, we assume that Y ⊆ R. We want to stress that
this can correspond both to the classification and regression setting. In the regression setting, for
any input x ∈ X, f(x) is simply a real-valued quantity of interest, whereas in the classification
case f(x) corresponds to the pseudo-probability to fall into a given class.</p>
      <p>The construction of the global surrogate on which GLEAMS relies to provide explanations is
a two-step process. The first step is to sample n points x1, . . . , xn, evenly spread in X. For
each of these points, we query f to obtain yi = f(xi). Thus we effectively create a dataset
(xi, yi), which we call measurement points.
1https://github.com/giorgiovisani/GLEAMS</p>
      <p>Subsequently, the idea is to recursively split the input space X and to approximate the model f
by a linear model on each cell of the partition, until the linear model in each rectangular leaf
fits the measurement points with a sufficiently good accuracy. This process is summarized in
Figure 1. Once this global surrogate is constructed, GLEAMS can provide both local and global
explanations for any new example to explain. Intuitively, given a partition with not too many
cells, it is convenient for the user to look at individual features and to visualize globally the
model as a piecewise-linear function. Concurrently, any new example to explain will fall into a
cell, and the user can get a local explanation solely by looking at the coefficients of the local
linear model. We now give more details on each of these steps.</p>
      <sec id="sec-2-1">
        <title>2.1. Measurement points</title>
        <p>Our main assumption, going into the description of the method, is the following:
Assumption 1. The input space X is a hyper-rectangle of R^d. That is, there exist real numbers
aj, bj ∈ R for any j ∈ [d] such that X = ∏ j=1..d [aj, bj].</p>
        <p>Intuitively, this corresponds to a situation where each variable has a prescribed range (for
instance, age takes values in [0, 120]). In particular, we assume that further examples to explain
all fall within this range. If they do not, we attribute to these points the explanations
corresponding to their projection on X. Note that we do not assume normalized data:
the aj and bj can be arbitrary. In practice, the boundaries are either inferred from a training set, or
directly provided by the user.</p>
        <p>
          A consequence of Assumption 1 is that we can easily find a set of points covering X efficiently.
To achieve this, we use Sobol sequences [6]. In a nutshell, a Sobol sequence is a quasi-random
sequence of points x1, . . . , xn filling the unit cube [0, 1]^d as n grows. We use Sobol sequences
because (a) we want to be able to provide explanations for the whole input space X, thus the
measurement points should cover X as n grows; and (b) the decision surface of the black-box
model can be quite irregular, thus we want low discrepancy.
        </p>
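As a concrete illustration, the sampling step can be sketched with SciPy's quasi-Monte Carlo module; the helper name `measurement_points` and the toy black box are ours, not part of GLEAMS:

```python
import numpy as np
from scipy.stats import qmc

def measurement_points(f, lower, upper, m=8):
    """Sample 2**m Sobol points in the hyper-rectangle prod_j [lower_j, upper_j]
    and query the black-box model f on them (the measurement points)."""
    sampler = qmc.Sobol(d=len(lower), scramble=False)
    unit = sampler.random_base2(m=m)      # 2**m quasi-random points in [0, 1]^d
    x = qmc.scale(unit, lower, upper)     # rescale to the input space X
    y = np.apply_along_axis(f, 1, x)      # y_i = f(x_i)
    return x, y

# toy black box on X = [0, 1] x [0, 120]
x, y = measurement_points(lambda p: p[0] - 2 * p[1], lower=[0, 0], upper=[1, 120])
```

The low-discrepancy sequence ensures the measurement points cover X evenly, which is exactly property (a) above.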
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Splitting criterion</title>
        <p>A very convenient way to partition the input space is to recursively split  , cutting a cell if
some numerical criterion is satisfied. Additionally, we consider splits that are parallel to the
axes. The main reason for doing so is the tree structure that emerges, allowing later on for fast
query time of new inputs (linear in the depth of the final tree). In this section, we describe the
criterion used by GLEAMS.</p>
        <p>A popular approach is to fit a constant model on each cell [7, CART]. We see two problems
with constant fits. First, this tends to produce large, and thus hard to interpret, trees (see, e.g.,
Chan and Loh [8]). Second, local interpretability would be lost, since the local models would
not depend on the features in that case. Thus we prefer to fit linear models on each leaf, and
our numerical criterion should indicate, on any given leaf, whether such a simple model is a good fit
for f. GLEAMS relies on a variant of model-based recursive partitioning [9, MOB] based on
score computations, described in detail in Appendix B.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Global surrogate model</title>
        <p>To build the global surrogate model f̂, starting from the hyper-rectangle X, we iteratively
split along the best dimension. This process is stopped once one of two criteria is met: i) the
leaf has R² higher than 0.95 (chosen empirically), or ii) the number of points in the leaf is less
than the nmin threshold, set to nmin = min(20, d + 1). This recursive procedure is described in
Algorithm 2, in Appendix B. f̂ is then obtained by sampling Sobol points on X, computing the
value of the black-box model at these points, and then running Algorithm 2 of Appendix B.</p>
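A minimal sketch of this recursive construction follows. For brevity it replaces the MOB-style criterion of Appendix B with a crude stand-in (split the widest dimension at its midpoint whenever the leaf's R² is below the threshold); all helper names are illustrative, not the library's API:

```python
import numpy as np

def fit_linear(x, y):
    """OLS fit with intercept; returns coefficients and R^2."""
    X1 = np.hstack([np.ones((len(x), 1)), x])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res, ss_tot = resid @ resid, float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return beta, r2

def build_tree(x, y, lower, upper, r2_min=0.95, n_min=20):
    """Recursively split the hyper-rectangle until each leaf's linear fit is
    good enough (R^2 >= r2_min) or the leaf holds fewer than n_min points."""
    beta, r2 = fit_linear(x, y)
    if r2 >= r2_min or len(x) < n_min:
        return {"leaf": True, "beta": beta, "lower": lower, "upper": upper}
    j = int(np.argmax(np.asarray(upper) - np.asarray(lower)))  # widest axis
    t = 0.5 * (lower[j] + upper[j])                            # midpoint split
    left = x[:, j] <= t
    lo_up, hi_lo = list(upper), list(lower)
    lo_up[j], hi_lo[j] = t, t
    return {"leaf": False, "dim": j, "thr": t,
            "lo": build_tree(x[left], y[left], lower, lo_up, r2_min, n_min),
            "hi": build_tree(x[~left], y[~left], hi_lo, upper, r2_min, n_min)}

# toy 1-d black box with a kink at 0.5: each half is exactly linear
x = np.linspace(0, 1, 200).reshape(-1, 1)
y = np.where(x[:, 0] <= 0.5, 2 * x[:, 0], 1.0 - x[:, 0])
tree = build_tree(x, y, [0.0], [1.0])
```

On this toy model the procedure stops after a single split at 0.5, each leaf holding the exact local linear coefficients.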
      </sec>
      <sec id="sec-2-4">
        <title>2.4. GLEAMS</title>
        <p>Let us now assume that we computed the global surrogate model f̂, that is, a piecewise-linear
model based on a partition of X. At query time, each new example to explain ξ ∈ X is routed
to the corresponding hyper-rectangle U in the tree structure, and thus associated to a local linear
model with coefficients β̂^U. From this local linear model, we can directly get different sorts of explanations.
Local explanations. Simply looking at the coefficient β̂_j^U tells us how important feature j is to
predict f(ξ), locally. The intuition here is that f is well-approximated by x ↦ (β̂^U)⊤x on the cell
U, thus local variations of the model along specific axes are well-described by the coefficients of
β̂^U. These local explanations can be obtained in constant time, without re-sampling and querying
the model. Moreover, we have precise insights regarding the scale of these explanations, which
is given by the size of U. We see this as an advantage of using GLEAMS: the portion of space
on which the black-box model is locally approximated is well-identified and accessible to the
user. This is not the case with other methods such as LIME.</p>
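The constant-time lookup can be made concrete on a tiny hand-built surrogate; the dictionary layout below is ours, not the actual GLEAMS data structure:

```python
def explain(tree, xi):
    """Route xi down the tree to its cell; the cell's coefficients are the
    local explanation (constant time, no further model queries)."""
    node = tree
    while not node["leaf"]:
        node = node["lo"] if xi[node["dim"]] <= node["thr"] else node["hi"]
    return node["beta"]  # (intercept, slope_1, ..., slope_d)

# hand-built two-cell surrogate on [0, 1]
surrogate = {
    "leaf": False, "dim": 0, "thr": 0.5,
    "lo": {"leaf": True, "beta": (0.0, 2.0)},   # f-hat = 2x on [0, 0.5]
    "hi": {"leaf": True, "beta": (1.0, -1.0)},  # f-hat = 1 - x on (0.5, 1]
}
print(explain(surrogate, [0.25]))  # -> (0.0, 2.0): slope 2 along feature 0
```

The query cost is linear in the depth of the tree, as noted in Section 2.2.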
        <sec id="sec-2-4-1">
          <title>Global explanations.</title>
          <p>We define the global importance of feature j as the aggregation of
the local importance of feature j over all cells of the partition P of X. There are several ways
to aggregate those; GLEAMS outputs the weighted average of the absolute values of the local
importances. More precisely, let us define β̂^U the local model on hyper-rectangle U ∈ P; then
the global importance of feature j is given by</p>
          <p>Ij := (1 / Vol(X)) ∑ U∈P Vol(U) |β̂_j^U| ,
(1)
where Vol(U) denotes the d-dimensional volume of U. The motivation for weighting by the cells’
volume in Eq. (1) is to take into account the fact that, even though feature j may be very important
in a tiny region of the space, if it is not important on all other cells then it is not globally important.
We take the absolute value to account for possible sign variations: if feature j is
positively important on many cells but negatively important on a similar number of cells, then
simply taking the average would tell us that the feature is not globally important. Such global
attributions provide a faithful ranking of the variables, based on the impact they have on the predictions of f.
What-if explanations. For a given example ξ and a feature j, we can look at the evolution of f̂ when all
coordinates of the input are fixed to those of ξ, except the jth: it suffices to travel through the cells
intersected by the axis-aligned line through ξ in direction j and to read the coefficients of the
associated linear models. To be precise, we compute
φj(t) := f̂(ξ1, . . . , ξj−1, t, ξj+1, . . . , ξd) .
(2)
This can be computed quickly since, again, there is no sampling and no further calls to f. This
allows us to present the user with a cursor which can be moved to set the feature value and
present explanations in real time (see Figure 2).</p>
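Both Eq. (1) and the what-if curve of Eq. (2) reduce to simple bookkeeping over the stored cells, as this sketch on a hand-made one-dimensional partition shows (helper names are ours):

```python
import numpy as np

# hand-made 1-d partition: (lower, upper, beta) with beta = (intercept, slopes...)
cells = [
    (np.array([0.0]), np.array([0.5]), np.array([0.0, 2.0])),   # f-hat = 2x
    (np.array([0.5]), np.array([1.0]), np.array([1.0, -1.0])),  # f-hat = 1 - x
]

def global_importance(cells, j):
    """Eq. (1): volume-weighted average of the absolute local slopes."""
    vols = np.array([np.prod(u - l) for l, u, _ in cells])
    slopes = np.array([abs(b[1 + j]) for _, _, b in cells])
    return float((vols * slopes).sum() / vols.sum())

def what_if(cells, xi, j, t):
    """Eq. (2): surrogate prediction with coordinate j of xi set to t."""
    z = np.array(xi, dtype=float)
    z[j] = t
    for l, u, b in cells:
        if np.all(l <= z) and np.all(z <= u):
            return float(b[0] + b[1:] @ z)
    raise ValueError("point outside the partition")

print(global_importance(cells, 0))      # 0.5*|2| + 0.5*|-1| = 1.5
print(what_if(cells, [0.25], 0, 0.75))  # falls in the second cell: 1 - 0.75 = 0.25
```

Neither quantity requires a call to the black-box model, which is what makes the cursor interaction real-time.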
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments on real data</title>
      <p>Our goal here is to show that GLEAMS provides explanations whose quality is comparable
to that of other attribution approaches, while providing these attributions in constant time
and with additional insights. We benchmark the various methods on the Wine dataset [10], the House
sales dataset2, and the Parkinson telemonitoring dataset [11]. We refer to Appendix D for more
experiments.</p>
      <sec id="sec-3-1">
        <title>Models.</title>
        <p>We considered two different models, trained similarly across all datasets. The first is
an XGBoost model [12]. XGBoost iteratively aggregates base regressors, greedily minimizing
a proxy loss. Following Friedman's prescriptions [13], we employ simple CART trees with
maximum depth 2 as base regressors, a learning rate of 0.05, and early stopping rounds set to
100. The second is a multi-layer perceptron (MLP), composed of two hidden dense layers of 264
neurons each, trained by adaptive stochastic gradient descent [14, ADAM] with early stopping on
a hold-out validation set. Both these models achieve state-of-the-art accuracy in many settings
and are not inherently interpretable.
2available at https://www.openml.org/search?type=data&amp;status=active&amp;id=42092
Methods. We compared GLEAMS to three methods, namely LIME for tabular data, SHAP,
and PDP. We used default parameters, except for the number of feature combinations tested
in SHAP for the recall metric, which was shrunk to 500 to limit computing time. GLEAMS
results were obtained using n = 2^15 (number of Sobol points), which guaranteed a running
time on the order of 5 minutes for each of the three datasets at hand.</p>
        <p>Evaluation. We compare GLEAMS to existing methodology using monotonicity and recall.
For each interpretability method to evaluate, for each point in the test set, we compute the
vector of attributions a ∈ R^d and the vector of expected losses e ∈ R^d of the model restricted to
the corresponding feature. Monotonicity is then defined as the Spearman rank correlation between |a| and e [15].
Depending on the region where the e values are computed, we obtain local monotonicity,
i.e., inside each specific GLEAMS leaf, or global monotonicity when considering the whole
input space. Let us now consider a model trained to use only a subset S of the features. We can
then be certain that some feature attributions should be 0. For any given example ξ, we can
define Aξ as the set of the top |S| features, ranked by |aj|. The Recall of Important Features [5] is
the fraction of the features of S retrieved in Aξ, averaged over the test units. More on this in
Appendix C.</p>
        <p>Results. We present our results in Table 1, reporting the average metrics taken on the test set.
All evaluation metrics are described in Appendix C.</p>
        <p>PDP generally achieves better monotonicity metrics, since it is designed precisely to obtain
feature attributions related to the feature's impact on the predictions, but it performs worse on
the recall task. We notice that GLEAMS shows better results on local and global monotonicity
compared to LIME and SHAP on the wine dataset, while GLEAMS' performance starts to degrade
with an increased number of features. In particular, we notice the poor performance on local
monotonicity of the MLP model for the Parkinson dataset. The degradation is due to the
higher dimensionality of the two datasets, which makes the models' surfaces more complex (the MLP's in
particular). Increasing n would likely improve the results, enabling smaller partitions and more
accurate local models. Unfortunately, we did not have time to run experiments with higher n.
Regarding the recall metric, SHAP stands as the best in class, while GLEAMS is usually slightly
worse but still comparable with LIME, and systematically better than PDP.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we presented GLEAMS, a new interpretability method that approximates a
black-box model by recursively partitioning the input space and fitting linear models on the elements
of the partition. Both local and global explanations can then be obtained in constant time, for
any new example, in sharp contrast with existing post hoc methods such as LIME and SHAP.</p>
      <p>The main limitation of the method is the assumption that all features are continuous. As
future work, we aim to extend GLEAMS to mixed features (both categorical and continuous).
Another limitation of the method is the initial number of model queries needed. One possible
way to reduce it would be to evaluate the local regularity of f and to sample less when the
model is smooth.</p>
      <p>[5] M. T. Ribeiro, S. Singh, C. Guestrin, Why Should I Trust You?: Explaining the Predictions
of Any Classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, ACM, 2016, pp. 1135–1144.
[6] I. M. Sobol, Points which uniformly fill a multidimensional cube, Mathematics, cybernetics
series (1985) 32.
[7] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and regression trees,</p>
      <p>Wadsworth &amp; Brooks / Cole Advanced Books &amp; Software, 1984.
[8] K.-Y. Chan, W.-Y. Loh, LOTUS: An algorithm for building accurate and comprehensible
logistic regression trees, Journal of Computational and Graphical Statistics 13 (2004)
826–852.
[9] A. Zeileis, T. Hothorn, K. Hornik, Model-based recursive partitioning, Journal of
Computational and Graphical Statistics 17 (2008) 492–514.
[10] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, J. Reis, Modeling wine preferences by data
mining from physicochemical properties, Decision support systems 47 (2009) 547–553.
[11] A. Tsanas, M. Little, P. McSharry, L. Ramig, Accurate telemonitoring of parkinson’s disease
progression by non-invasive speech tests, Nature Precedings (2009) 1–1.
[12] T. Chen, C. Guestrin, XGBoost: A Scalable Tree Boosting System, in: Proceedings of the
22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
2016, pp. 785–794.
[13] J. H. Friedman, Greedy function approximation: a gradient boosting machine, The Annals
of Statistics (2001) 1189–1232.
[14] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint
arXiv:1412.6980 (2014).
[15] A.-P. Nguyen, M. R. Martínez, On quantitative aspects of model interpretability, arXiv
preprint arXiv:2007.07584 (2020).
[16] M. Craven, J. W. Shavlik, Extracting tree-structured representations of trained networks,</p>
      <p>Advances in Neural Information Processing Systems (1996) 24–30.
[17] J. R. Quinlan, C4.5: Programs for Machine Learning, Elsevier, 1993.
[18] R. D. Gibbons, G. Hooker, M. D. Finkelman, D. J. Weiss, P. A. Pilkonis, E. Frank, T. Moore,
D. J. Kupfer, The CAD-MDD: A computerized adaptive diagnostic screening tool for
depression, The Journal of clinical psychiatry 74 (2013) 669.
[19] Y. Zhou, G. Hooker, Interpreting Models via Single Tree Approximation (2016).
[20] J. Lei, M. G’Sell, A. Rinaldo, R. J. Tibshirani, L. Wasserman, Distribution-free predictive
inference for regression, Journal of the American Statistical Association 113 (2018) 1094–
1111.
[21] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances
in neural information processing systems 30 (2017).
[22] M. T. Ribeiro, S. Singh, C. Guestrin, Anchors: High-precision model-agnostic explanations,
in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[23] M. Setzu, R. Guidotti, A. Monreale, F. Turini, D. Pedreschi, F. Giannotti, Glocalx-from local
to global explanations of black box AI models, Artificial Intelligence 294 (2021) 103457.
[24] F. Harder, M. Bauer, M. Park, Interpretable and differentially private predictions, in:
Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 4083–
4090.
[25] E. C. Merkle, A. Zeileis, Tests of Measurement Invariance without Subgroups: A
Generalization of Classical Methods, Psychometrika 78 (2013) 59–82.
[26] J. Zhou, A. H. Gandomi, F. Chen, A. Holzinger, Evaluating the quality of machine learning
explanations: A survey on methods and metrics, Electronics 10 (2021) 593.
[27] F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning,
arXiv preprint arXiv:1702.08608 (2017).
[28] C. Spearman, The proof and measurement of association between two things, The</p>
      <p>American journal of psychology 100 (1904) 441–471.
[29] G. F. Montufar, R. Pascanu, K. Cho, Y. Bengio, On the number of linear regions of deep
neural networks, Advances in neural information processing systems 27 (2014).</p>
    </sec>
    <sec id="sec-5">
      <title>A. Related work</title>
      <p>The idea of globally approximating a complex model such as a deep neural network by a simpler
surrogate model relying on a tree-based partition of the space can be dated back to Craven and
Shavlik [16]. The proposed method, TREPAN, approximates the outputs of the network on
the training set by a decision tree, choosing the splits using the gain ratio criterion [17]. More
recently, inspired by Gibbons et al. [18], Zhou and Hooker [19] proposed to use another split
criterion, based upon an asymptotic expansion of the Gini criterion. Both these approaches
are limited to the classification setting, and also require access to the training data (which is
implicitly assumed to have a good covering of the input space).</p>
      <p>From the interpretability side, a standard way to explain f, closely related to our
approach, is to compute attribution scores for each feature. Lei et al. [20] exclude a specific
feature from f and define its impact as the deterioration of the model's predictions. On the contrary,
Partial Dependence Plots (PDP), developed by Friedman [13], measure the sensitivity of f to
changes in the specific feature, isolating its effect from those of the other features. Ribeiro
et al. [5] propose LIME, which uses as attribution for a specific example the coefficients of
a linear model approximating f locally. Similarly, Lundberg and Lee [21] introduce the idea
of decomposing the predictions of f into single-variable attributions, using SHAP. The technique
samples combinations of features S and averages the changes in single-instance prediction between
the restricted f(S ∪ {j}) and f(S), obtaining local attributions for the feature j. Global SHAP
attributions arise by averaging local attributions over the entire dataset.</p>
      <p>Attribution methods usually have to be recomputed for each new example, a caveat that we
propose to overcome with GLEAMS by computing a global surrogate and relying on it to provide
explanations. We are not the first to propose explanations on a global scale: Ribeiro
et al. [22] propose Anchors, a method extracting local subsets of the features (anchors) that
are sufficient to recover the same prediction, while having a good global coverage. A caveat of
the method is that local explanations are not quantitative, since all members of a given anchor
are equivalent. Additionally, the anchors can have many elements if the example at hand is
near the decision boundary. Let us also mention Setzu et al. [23], who propose to aggregate
local explanations: starting from local decision rules, GlocalX combines them in a hierarchical
manner to form a global explanation of the model. In the ad hoc setting, Harder et al. [24]
propose to train an interpretable and differentially private model by using several local linear
maps per class: the pseudo-probability of belonging to a given class is the softmax of a mixture
of linear models. This approach is limited to the classification setting, as are GlocalX and
Anchors.</p>
    </sec>
    <sec id="sec-6">
      <title>B. Splitting criterion</title>
      <p>Intuitively, any well-behaved function can be locally approximated by a linear component,
yielding a statistical model which we now describe. On any given leaf, for any given feature j,
we can reorder the measurement points such that the points contained in the leaf are x1, . . . , xn,
with x1,j ≤ · · · ≤ xn,j. To these points correspond model values y1, . . . , yn. In accordance
with the intuition given above, we write yi = β⊤x̃i + εi for all i ∈ [n], where β ∈ R^{d+1} are the
coefficients of our local linear model, x̃i ∈ R^{d+1} is the concatenation of xi with a leading 1 to
take the intercept into account, and εi is an error term. Let us now follow Merkle and Zeileis
[25] and make an i.i.d. Gaussian assumption on the εi, giving rise to the likelihood of
a single observation:</p>
      <p>L(xi; β) = (2πσ²)^{−1/2} exp(−(yi − β⊤x̃i)² / (2σ²)) ,
(3)
where σ² &gt; 0 is the variance of εi for all i ∈ [n]. Taking the log in Eq. (3), the log-likelihood of
a single observation is</p>
      <p>ℓ(xi; β) = −(1/2) log(2πσ²) − (yi − β⊤x̃i)² / (2σ²) .
(4)</p>
      <p>The individual scores are obtained by taking the partial derivatives of the log-likelihood with
respect to the parameters, that is, for any i ∈ [n],</p>
      <p>ψ(xi; β) = ( ∂ℓ(xi; β)/∂β0 , ∂ℓ(xi; β)/∂β1 , . . . , ∂ℓ(xi; β)/∂βd )⊤ .
(5)</p>
      <p>Figure 3: Top panel: measurement points and the fitted
piecewise-linear model (dark dots) visualized along one axis. Bottom panel: evolution of ‖CS(t)‖1 as a
function of t (solid blue line). The process presents a clear maximum, which gives us a candidate split
for this axis (vertical red line).</p>
      <p>A straightforward computation yields</p>
      <p>ψ(xi; β) = (yi · x̃i − x̃i x̃i⊤ β) / σ² .
(6)</p>
      <p>The maximum likelihood estimate β̂ is obtained by solving ∑i ψ(xi; β) = 0: in this case this
coincides with the ordinary least-squares estimate, in a sense the best linear fit on the entire leaf.
However, what we want is to split this leaf if the deviations from the linear model computed on
the whole leaf are too strong. Therefore, we compute the cumulative score process defined by</p>
      <p>∀t ∈ [0, 1], CS(t) = (1/√n) ∑ i=1..⌊nt⌋ ψ(β̂; xi) ,
(7)</p>
      <p>and take as a criterion the maximum of the ℓ1 norm of CS(t), across all features. The
computation of the best split is described in Algorithm 1 and we provide an example in Figure 3.
      </p>
      <p>Note that Eq. (7) corresponds to Eq. (13) in Merkle and Zeileis [25], with the notable difference
that Eq. (13) is normalized (by the square root of the estimated covariance matrix of the scores).
We observed empirically that normalizing yields worse results when detecting the splits.</p>
      <p>Algorithm 1 GetBestSplit: Get the best split for a given leaf.</p>
      <p>
        Require: x1, . . . , xn: Sobol points in the current leaf; y1, . . . , yn: corresponding f values
1: for j = 1 to d do
2: re-order x1, . . . , xn according to feature j
3: re-order y1, . . . , yn accordingly
4: compute β̂ = OLS(x; y)
5: compute predictions ŷ
6: set σ̂² = (1/n) ∑ i=1..n (yi − ŷi)²
7: compute scores ψ(β̂; xi) according to Eq. (6)
8: compute CS(t) according to Eq. (7)
9: store Mj = max t∈[0,1] ‖CS(t)‖1
10: store tj = arg max t∈[0,1] ‖CS(t)‖1
11: return j⋆ = arg max j∈[d] Mj, feature to split
12: return tj⋆, value of the split
13: return β̂, local linear model
      </p>
      <p>
        Algorithm 3 GlobalSurrogate: Compute the global surrogate model used by GLEAMS.
Require: f: black-box model; X or its boundaries; n: number of Sobol points; ρ: R² threshold;
nmin: min. number of points per leaf
1: generate S = {x1, . . . , xn} Sobol points in X
2: compute yi = f(xi)
3: D ← {(x1, y1), . . . , (xn, yn)}
4: return T ← RecTree(D, ρ, nmin)
      </p>
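The score-based split search of Eqs. (6)-(7) can be sketched along a single feature as follows. This is a simplified, unnormalized version with illustrative names, not the library's implementation:

```python
import numpy as np

def best_split_1d(x, y):
    """Scan one feature: OLS fit on the whole leaf, per-point scores (Eq. (6)),
    cumulative score process (Eq. (7)), and the arg max of its l1 norm."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    X1 = np.column_stack([np.ones_like(xs), xs])        # x-tilde, with intercept
    beta, *_ = np.linalg.lstsq(X1, ys, rcond=None)
    resid = ys - X1 @ beta
    sigma2 = np.mean(resid ** 2)
    scores = (resid / sigma2)[:, None] * X1             # psi(x_i; beta-hat)
    csp = np.cumsum(scores, axis=0) / np.sqrt(len(xs))  # cumulative score process
    crit = np.abs(csp).sum(axis=1)                      # l1 norm for each candidate
    k = int(np.argmax(crit))
    return float(xs[k]), float(crit[k])                 # split value and criterion

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 400)
y = np.where(x < 0.5, 0.0, 10.0)        # abrupt change at 0.5
split, _ = best_split_1d(x, y)
```

On this toy response, the ℓ1 norm of the cumulative score process peaks right before the true breakpoint, so the returned split value falls just below 0.5.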
    </sec>
    <sec id="sec-7">
      <title>C. Evaluation</title>
      <p>Evaluating interpretability methods is quite challenging: for a given machine learning model,
there is rarely a ground-truth available, telling us which features were used for a given prediction.
Zhou et al. [26] outline two main evaluation methodologies: human-centered evaluations, and
functionality-grounded (quantitative, note that the terminology was introduced by Doshi-Velez
and Kim [27]). We focus on the latter, since human-centered evaluations are hard to obtain
and can be unreliable. Among those, we choose to compare GLEAMS to existing methodology
using monotonicity and recall. Below we briefly describe these two metrics.</p>
      <p>Monotonicity is defined as the correlation between the absolute value of the attributions
at a given point and the expected loss of the model restricted to the corresponding feature.
Intuitively, this takes high values if the model is very imprecise when it does not know the
feature at hand. This is Metric 2.3 in Nguyen and Martínez [15]. Following their notation, for
each interpretability method to evaluate, for each point in the test set, we compute on the one
hand a vector of attributions a ∈ R^d and on the other hand a vector of expected losses e ∈ R^d.
Monotonicity is then defined as the Spearman rank correlation [28] between |a| and e. Formally,
for all j ∈ [d], ej is defined as</p>
      <p>ej = ∫ aj..bj ℓ(f(ξ), fj(t)) μ(t) dt ,
(8)
where fj is the restriction of f to coordinate j (keeping all other coordinates equal to those of
ξ), μ is a probability distribution on [aj, bj] (recall that aj and bj denote the boundaries of X
along dimension j, see Assumption 1), and ℓ is the squared loss.</p>
      <p>Let us define U(K) the uniform distribution on a compact set K. In our experiments, we
consider two natural choices of μ in Eq. (8). First, μ ∼ U([ℓj, rj]), where ℓj (resp. rj) is the left
(resp. right) boundary along feature j of the cell containing ξ. This corresponds
to local monotonicity: how well the attributions reflect the variations in prediction locally,
that is, from the point of view of GLEAMS, on that cell. Second, μ ∼ U([aj, bj]), which corresponds
to global monotonicity: how well the attributions reflect the variations in prediction on the
whole input space along feature j.</p>
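A sketch of the monotonicity computation, estimating each ej of Eq. (8) by Monte Carlo with μ uniform on the global range; the helper name and toy model are ours:

```python
import numpy as np
from scipy.stats import spearmanr

def monotonicity(f, xi, attributions, lower, upper, n_mc=256, seed=0):
    """Spearman correlation between |attributions| and the Monte Carlo estimate
    of the expected squared loss e_j when feature j is resampled uniformly."""
    rng = np.random.default_rng(seed)
    e = np.empty(len(xi))
    for j in range(len(xi)):
        z = np.tile(xi, (n_mc, 1)).astype(float)
        z[:, j] = rng.uniform(lower[j], upper[j], n_mc)  # mu ~ U([a_j, b_j])
        e[j] = np.mean((f(xi) - np.apply_along_axis(f, 1, z)) ** 2)
    rho, _ = spearmanr(np.abs(attributions), e)
    return rho

f = lambda p: 3 * p[0] + p[1]  # toy model ignoring feature 2
rho = monotonicity(f, np.array([0.5, 0.5, 0.5]), [3.0, 1.0, 0.0],
                   lower=[0, 0, 0], upper=[1, 1, 1])
```

For this linear toy model, the ranking of |a| matches the ranking of the expected losses, so the Spearman correlation is 1.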
      <p>Let us conclude this section by formally defining the Recall of Important Features. For
certain models, we can actually be certain that some feature attributions should be 0. This is
the case, for instance, if we force our model f to use only a subset of the features. As a measure
of the quality of explanations, one can then compare the top features selected by an attribution
method to the set of true features, which we know. Let us call S the set of true features. For any
given example ξ, we define S_ξ as the set of the top |S| features, ranked by |λ_j|. Originally
proposed by Ribeiro et al. [5], the recall of important features is then defined as
R(ξ) := |S ∩ S_ξ| / |S| .
(9)
Intuitively, R(ξ) is large (close to one) if the considered attribution method gives large attributions
to the true features. In our experiments, we report the average recall over all points of the test
set unless otherwise mentioned.</p>
      <p>D. Toy data experiments</p>
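      <p>The recall of important features amounts to one set intersection per example; a minimal sketch (function and variable names are ours) could read:</p>
      <p>
```python
import numpy as np

def recall_important_features(attributions, true_features):
    """Recall of important features, Eq. (9): fraction of the true
    features S recovered among the top-|S| features ranked by
    absolute attribution."""
    k = len(true_features)
    # indices of the k largest |attribution| values
    top_k = set(np.argsort(np.abs(attributions))[::-1][:k])
    return len(top_k & set(true_features)) / k
```
      </p>
      <p>For instance, with attributions (0.1, −2.0, 0.0, 1.5) and true features {1, 3}, the top-2 features by absolute value are exactly {1, 3}, giving a recall of 1.</p>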
      <p>As a first sanity check, we ran the partition scheme of GLEAMS on simple models and
checked whether the surrogate model f̂ was indeed close to the true model f. In order to have
a ground truth for the partition, we chose partition-based models, i.e., models for which there exists a partition P
of the input space such that the parameters of f are constant on each rectangular cell R ∈ P. We considered
a simple setting where the model is piecewise-linear with rectangular cells, a situation in which
GLEAMS could, in theory, recover perfectly both the underlying partition and the coefficients of
the local models. We present here two examples: (i) a discontinuous piecewise-linear model with
arbitrary coefficients, and (ii) a piecewise-linear model with continuity constraints. Example
(i) is a generalization of the surface produced by a random forest regressor, while example
(ii) is reminiscent of a fully-connected, feedforward, ReLU-activated neural network. Note
however that for the latter, the cells of the partition are more complicated than simple rectangles,
see Montufar et al. [29] (in particular Figure 2 therein). In each case, GLEAMS recovers perfectly the
underlying model, as demonstrated in Figure 2.</p>
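      <p>A ground truth of type (i) is easy to construct: split the input space into rectangular cells and attach an independent affine model to each. The toy sketch below (with arbitrary coefficients of our choosing, not the exact models of Figure 2) splits the unit square into a 2×2 grid:</p>
      <p>
```python
# Discontinuous piecewise-linear ground truth on rectangular cells of
# [0, 1]^2: each cell of a 2x2 grid gets its own affine model, with no
# continuity constraint at the cell boundaries -- a model of type (i).
COEFS = {  # cell index -> (w1, w2, b), coefficients chosen arbitrarily
    (0, 0): (2.0, -1.0, 0.5),
    (0, 1): (-3.0, 0.0, 1.0),
    (1, 0): (0.0, 4.0, -2.0),
    (1, 1): (1.0, 1.0, 0.0),
}

def f(x):
    """Evaluate the piecewise-linear model at a point x in [0, 1]^2."""
    cell = (int(x[0] >= 0.5), int(x[1] >= 0.5))
    w1, w2, b = COEFS[cell]
    return w1 * x[0] + w2 * x[1] + b
```
      </p>
      <p>On such a model, a perfect surrogate must recover both the cell boundaries (here, the splits at 0.5) and the per-cell coefficients.</p>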
      <p>It is interesting to notice that, despite having a stopping criterion on the coefficient of
determination, not all leaves satisfy this criterion, since the other stopping criterion is a minimal
number of points per leaf. Thus it is possible to end up in situations where the R² on a given leaf
is subpar. Nevertheless, these leaves are by construction quite small. For instance, in example
(i), the average R² is equal to 1, and in example (ii) it is equal to 0.98.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Borisov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Leemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Seßler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Haug</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pawelczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          ,
          <article-title>Deep neural networks and tabular data: A survey</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wachter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mittelstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Floridi</surname>
          </string-name>
          ,
          <article-title>Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation</article-title>
          ,
          <source>International Data Privacy Law</source>
          <volume>7</volume>
          (
          <year>2017</year>
          )
          <fpage>76</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Adadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Berrada</surname>
          </string-name>
          ,
          <article-title>Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)</article-title>
          ,
          <source>IEEE Access</source>
          <volume>6</volume>
          (
          <year>2018</year>
          )
          <fpage>52138</fpage>
          -
          <lpage>52160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Linardatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Papastefanopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kotsiantis</surname>
          </string-name>
          ,
          <article-title>Explainable AI: A review of machine learning interpretability methods</article-title>
          ,
          <source>Entropy</source>
          <volume>23</volume>
          (
          <year>2020</year>
          )
          <fpage>18</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>