A comparative study of additive local explanation methods based on feature influences

Emmanuel Doumard (emmanuel.doumard@irit.fr) - Université de Toulouse-Paul Sabatier, IRIT (CNRS/UMR 5505), Toulouse, France
Julien Aligon (julien.aligon@irit.fr) - Université de Toulouse-Capitole, IRIT (CNRS/UMR 5505), Toulouse, France
Elodie Escriva (elodie.escriva@kaduceo.com) - Kaduceo and Université de Toulouse-Capitole, IRIT (CNRS/UMR 5505), Toulouse, France
Jean-Baptiste Excoffier (jeanbaptiste.excoffier@kaduceo.com) - Kaduceo, Toulouse, France
Paul Monsarrat (paul.monsarrat@univ-tlse3.fr) - RESTORE Research Center, Artificial and Natural Intelligence Toulouse Institute (ANITI) and Oral Medicine Department, Toulouse, France
Chantal Soulé-Dupuy (chantal.soule-dupuy@irit.fr) - Université de Toulouse-Capitole, IRIT (CNRS/UMR 5505), Toulouse, France

© 2022 Copyright for this paper by its author(s). Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Local additive explanation methods are increasingly used to understand the predictions of complex Machine Learning (ML) models. The most used additive methods, SHAP and LIME, suffer from limitations that are rarely measured in the literature. This paper aims to measure these limitations on a wide range (304) of OpenML datasets, and also to evaluate emergent coalitional-based methods that aim to tackle the weaknesses of the other methods. We illustrate and validate the results on a specific medical dataset, SA-Heart. Our findings reveal that LIME and SHAP's approximations are particularly efficient in high dimension and generate intelligible global explanations, but they suffer from a lack of precision regarding local explanations. Coalitional-based methods are computationally expensive in high dimension, but offer higher-quality local explanations. Finally, we present a roadmap summarizing our work by pointing out the most appropriate method depending on dataset dimensionality and the user's objectives.

KEYWORDS
Explainable Artificial Intelligence (XAI), Prediction explanation, Machine learning

1 INTRODUCTION
Machine Learning (ML) represents a real revolution in various domains, such as finance, insurance, healthcare and the biomedical field. However, machine learning models give a prediction without necessarily being accompanied by an understandable explanation. These models, often referred to as "black boxes", raise the challenging question of how humans can understand the determinants of the prediction. Explainability is also more than a technological problem; it involves, among others, ethical, societal and legal issues. In healthcare, this may involve the professional being able to explain to the patient how the algorithm works and the criteria for the decision process. The results of ML models must therefore be expressed in a way that can be understood by domain experts, like medical practitioners [1, 5]. Since SHAP [15], machine learning experts have shown a very clear interest in additive methods, as a huge number of works using these methods are published each year. The additive methods include LIME [19], SHAP [15] and, more recently, the coalitional-based methods [8]. The user-friendly representation of explanations, based on feature influences, allows domain and non-domain experts to better understand model predictions [18]. Existing explanation methods are model-specific or model-agnostic, depending on whether they can be applied to some or all types of machine learning models, with local or global explanations to understand either an individual prediction or the behaviour of the model as a whole. While these methods have been evaluated in a number of contexts, no in-depth evaluation is available for a rational choice of one technique over another. The objective of this work is to study the advantages and disadvantages of using each additive method to provide pertinent insights. In particular, we study the effects of the models used and of the type of dataset considered on the feature influences (both at the instance and feature level).

The paper is organised as follows. Section 2 reviews the existing work, classifying and comparing explanation methods for tabular data. Section 3 describes the four additive methods to be compared in this paper. The experiments are presented in Section 4, where we study the explanation characteristics and the impact of the predictive model on explanation profiles, highlighting the behavior of the explanation methods on a practical medical use case. Conclusive lessons learned are then detailed in Section 5.
2 RELATED WORKS
Few works [3, 27] exist in the literature to classify and categorize machine learning explanation methods. In [18], a complete description of explanation approaches from the literature is given. In particular, the authors explain their advantages and disadvantages, giving an overview of their limits. For example, even if the LIME and SHAP approaches are model-agnostic and human-friendly, they suffer from no consideration of feature correlation and possible instability of the explanations. Another paper tackling the limits of the additive methods (LIME and SHAP) is presented in [23]. The paper shows that biased classifiers can fool explanation methods, a problem that is even more accentuated with LIME.

Comparative studies between local explanation methods are also available, such as [6, 8, 16]. In [8], a new additive method was proposed, based on Shapley values and taking into account feature correlation. This method was compared with LIME and SHAP through computation time and an accuracy score. For this last measure, the authors consider as baseline the complete method, computing all Shapley values with each possible coalition of features. The authors show that their proposal is competitive with the literature, both in accuracy and in computation time. In [16], LIME and SHAP are used in a context of feature selection and compared to a Mean Decrease Accuracy (MDA) approach. A stability measure indicates that a feature selection obtained with LIME or SHAP seems more stable than via MDA. In [6], the authors compare 6 local model-agnostic techniques using custom quantitative measures, such as similarity, bias detection, execution time, and trust. From these experiments, no single method stands out for all metrics and all datasets. Each one has strengths and weaknesses depending on the metrics used, and choices between methods can only be made based on the user's goal and dataset.

These results are therefore indications of the absence of a single method that would provide the best explanations in all situations. However, none of these previous works clearly indicates in which situation a method should be preferred to another one. Consequently, our aim is to give the key factors to make an informed decision among the existing additive methods.

As indicated in [17], evaluating explanation methods is very subjective and no consensus yet exists to propose relevant metrics. As all additive methods give an influence score for each feature, we propose to compare them based on these influences. From there, we want to analyse and compare the effects of different predictive models and datasets on these influence scores.
3 ADDITIVE METHODS TO COMPARE
Additive methods are described as explanation models that produce a vector of weights representing the influence of each feature, the sum of which approximates the output of the original model. Explanations are computed for a single instance, and so for every instance of a dataset, hence the term "local". In this section, we explore several existing methods that fit this definition. We focus on post-hoc methods that deliver their explanations for a given, already trained model. The methods used in this study are all agnostic, meaning that they can be applied to any kind of machine learning model, except for the TreeSHAP method [14], which is designed specifically for tree-based models.

3.1 LIME
LIME is a well-known local explanation method described in [19]. LIME uses explainable models to locally approximate a complex black-box model and, for each instance, explain the influence of each feature on the prediction. For each instance to be explained, LIME generates new data in a close neighborhood and computes the predictions of these new instances with the black-box model. A linear regression model, which is interpretable, is then trained on this new dataset. This local model is used to explain the prediction of the instance of interest in the form of a weight vector associating each feature with its influence on the prediction. A well-known limitation of LIME is the restrictive hypotheses on which it is based, such as local linearity and feature independence [9, 23]. Defining the locality around an instance of interest can also be a challenge, as the fit of the surrogate model has a significant impact on the accuracy of the explanations [11] as well as on their stability [7].

The full implementation of LIME is available on GitHub: https://github.com/marcotcr/lime.
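To make this workflow concrete, here is a minimal, illustrative sketch of a LIME run on a stand-in scikit-learn dataset (not one of the paper's OpenML datasets); the weight vector returned by as_list() is the local explanation discussed in this subsection.

```python
# Minimal, illustrative LIME run on a stand-in dataset. It shows how the local
# linear surrogate yields one weight per feature for a single instance.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
X, y = data.data, data.target
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, mode="classification", feature_names=list(data.feature_names)
)

# LIME perturbs the instance, queries the black-box model and fits a weighted
# linear model; the coefficients of that surrogate are the feature influences.
exp = explainer.explain_instance(
    X[0],
    model.predict_proba,
    num_features=X.shape[1],
    num_samples=100,  # the paper sets 100 perturbed samples per instance (Section 4.1)
)
print(exp.as_list())  # [(feature description, weight), ...]
```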
3.2 Shapley Values (complete method)
To explain individual predictions, a method based on Shapley values is described in [24, 25, 28]. Shapley values 'fairly' weight groups of features according to their relative importance to a defined gain [21]. In machine learning, the gain can be linked to the prediction made by the model. The influence of each feature is computed based on its impact on the prediction for each coalition of features. The explanation method based on Shapley values is called the complete method. All coalitions are evaluated with and without each feature, and the change in the prediction is used to compute the influence of the feature. The complete method can be used as a baseline to compare other methods, as it is an exhaustive method close to the original intuition behind feature influence [8]. This method is however very expensive to compute, with an exponential complexity in the number of features in the dataset.

Several more recent methods, including SHAP [15] and the coalitional methods [8], are based on Shapley values and aim to overcome the limitations of the complete method.
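The enumeration behind the complete method can be written down directly. The sketch below is schematic only: it marginalizes absent features over a background sample, which is one common convention, whereas the complete method of [8] may handle absent features differently (e.g. by retraining); it illustrates the exponential coalition enumeration rather than reproducing that implementation.

```python
# Schematic sketch of the exact Shapley computation for one instance.
# Absent features are marginalized over a background sample here; this is an
# illustration of the exponential enumeration, not a faithful reimplementation
# of the complete method of [8].
import numpy as np
from itertools import combinations
from math import factorial

def coalition_value(model, x, coalition, background):
    """Expected model output when only the features in `coalition` are fixed to x."""
    samples = background.copy()
    if coalition:
        samples[:, list(coalition)] = x[list(coalition)]
    return model.predict_proba(samples)[:, 1].mean()

def complete_shapley(model, x, background):
    p = x.shape[0]
    influences = np.zeros(p)
    for k in range(p):
        others = [f for f in range(p) if f != k]
        for size in range(p):  # coalitions of every size, with and without feature k
            for coalition in combinations(others, size):
                weight = factorial(size) * factorial(p - size - 1) / factorial(p)
                gain = (coalition_value(model, x, coalition + (k,), background)
                        - coalition_value(model, x, coalition, background))
                influences[k] += weight * gain
    # Additivity: the influences sum to f(x) minus the average prediction on `background`.
    return influences
```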
3.3 SHAP
The SHAP (SHapley Additive exPlanations) method [15] works on improving computation time and explanation precision, especially for tree-based models [14]. It combines LIME [19] and Shapley values [28], along with other methods from the literature [2, 4, 13, 22], in a unique framework to produce local explanations. The main idea is to create perturbations to simulate the absence of a feature and to use a local linear model to approximate the change in the prediction, as in LIME. This avoids retraining the complex model without the feature of interest. Local explanations can be aggregated to explain the global behaviour of the model. Global and local explanations are then consistent with each other, as they have the same foundation. SHAP includes an agnostic explainer, KernelSHAP, as well as model-specific explainers, such as TreeSHAP, LinearSHAP and DeepSHAP for tree-based models, linear models and deep models respectively. While commonly used in a Machine Learning context [12], SHAP still suffers from a lack of precision [10, 23], mostly due to its restrictive hypotheses (local linearity and feature independence), as with LIME. Moreover, computation time is still high for models other than tree-based models [26].

The full implementation of SHAP is available on GitHub: https://github.com/slundberg/shap.
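The explainers named above map onto the shap library roughly as in the following sketch (stand-in model and data; the exact arguments of the paper's runs are given in Section 4.1 and may differ).

```python
# Hedged sketch of the SHAP explainers mentioned above, on a stand-in model.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target
model = RandomForestClassifier(random_state=0).fit(X, y)

# Model-agnostic KernelSHAP, with the background data summarized by K-Means
# centroids (the KernelSHAP10 setting of Section 4.1 uses K = 10).
kernel_explainer = shap.KernelExplainer(model.predict_proba, shap.kmeans(X, 10))
kernel_values = kernel_explainer.shap_values(X[:2])  # one array per class

# Model-specific TreeSHAP, either with background samples ("interventional")
# or using only the tree structures ("tree_path_dependent", no background).
tree_explainer = shap.TreeExplainer(model, data=X[:100],
                                    feature_perturbation="interventional")
tree_values = tree_explainer.shap_values(X[:2])

tree_approx = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")
approx_values = tree_approx.shap_values(X[:2])
```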
3.4 Coalitional-based method
Another agnostic explainer based on Shapley values, the coalitional method, was introduced to take into account the interdependence of features and to lift some restrictions of SHAP. It uses grouping methods such as Principal Component Analysis (PCA), the Spearman correlation factor (Spearman) and the Variance Inflation Factor (VIF) to pre-compute groups of features for the explanations [8]. These groups are then used as coalitions to compute Shapley values as in the complete method. The influence of each feature is defined as its impact on the prediction only within the pre-computed groups of features, approximating the complete method and reducing the computational time. Grouping methods are defined with a parameter that changes the number and size of the feature groups, in order to prioritise either a low computational time or a higher accuracy. As for SHAP, local explanations can be aggregated into global explanations with a common foundation to study the global and local behavior of the model.

The full implementation of the coalitional-based method is available on GitHub: https://github.com/kaduceo/coalitional_explanation_methods.
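As an illustration only, the sketch below shows the kind of Spearman-based pre-computation such a grouping relies on: features whose rank correlation exceeds a threshold end up in the same group, and each group is then used as a coalition. The grouping rule, its parameterisation and the coalition evaluation in the kaduceo implementation linked above may differ.

```python
# Illustrative only: greedy grouping of features by Spearman rank correlation,
# the kind of pre-computation the coalitional Spearman method relies on. The
# actual grouping rule of the kaduceo implementation may differ.
import numpy as np
from scipy.stats import spearmanr

def spearman_groups(X, threshold=0.5):
    """Merge features whose absolute Spearman correlation exceeds `threshold`."""
    corr = np.abs(spearmanr(X).correlation)  # (p, p) matrix, assumes p > 2
    p = X.shape[1]
    groups, assigned = [], set()
    for i in range(p):
        if i in assigned:
            continue
        group = [i] + [j for j in range(i + 1, p)
                       if j not in assigned and corr[i, j] >= threshold]
        assigned.update(group)
        groups.append(group)
    return groups  # each group is later used as a coalition of features

# On SA-Heart, strongly correlated features such as obesity and adiposity would
# typically end up in the same group, hence in the same coalitions.
```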
4 EXPERIMENTS
In this section, we propose experiments comparing the explanation methods presented in the previous section. The goal is to identify the general behavior of each method and how this behavior eventually differs according to the predictive model (learned from data) and the dimensionality of the data (number of features).

4.1 Experimental protocol
All experiments are run on an Intel Xeon Gold 6230 processor with 125 GB of RAM using Python 3.9.7. All runs are performed on a single core of CPU for optimization and reproducibility. To compare explanation methods, we apply them to a wide range of 304 datasets available on OpenML (www.openml.org). Due to the computational constraints of the explanation methods, we only considered datasets with at most 13 features and at most 10 000 instances. We also only considered classification tasks, to use comparable predictive models and metrics. We describe the number and size of the datasets per number of features in Table 1.

Number of    Number of    Number of instances
features     datasets     Min     Max     Mean
1            5            130     9100    3079
2            21           52      5456    901
3            43           60      9989    1729
4            23           96      8641    1016
5            35           62      7129    941
6            27           51      9517    949
7            33           54      4052    499
8            32           52      8192    1473
9            23           52      1473    484
10           37           57      5473    712
11           8            66      4898    942
12           12           123     8192    1175
13           5            178     506     293
Total        304          51      9989    1035
Table 1: Datasets description

As an explanation method needs a model to be applied to, we choose four widely used types of ML models for classification: Logistic Regression (LR), Support Vector Machines (SVM), Random Forests (RF) and Gradient Boosted Machines (GBM). For the first three, we use the implementation of the Python library scikit-learn, version 1.0.1. For GBM, we use the Python library XGBoost, version 1.5. We use default values for the model hyperparameters. For the explanation methods, we use the Python libraries shap 0.40 and lime 0.2.0.1.

Then, to be able to compare the explanations, we need to define metrics of interest. In Section 4.2, we present the three metrics that we use for this study.

Section 4.3 compares the four additive methods introduced in Section 3. In particular, we use two distinct coalitional-based methods: the Complete method, which serves as reference for the influence deviation measurement (second metric), and the Spearman method with a threshold of 25% of all groups of features. Regarding SHAP, we use the model-agnostic KernelSHAP on all datasets. As this method is very slow to execute if we use the whole dataset as background samples for the permutations, we choose to follow SHAP's recommendation¹ of doing a K-Means clustering on the input dataset and then taking the centroids as background samples. We choose K = 10 clusters for each dataset, thus naming the method KernelSHAP10. In addition, for the two tree-based predictive models, XGBoost and Random Forests, we use the model-specific explainer TreeSHAP through two implementations. The first one determines SHAP values with background samples, similarly to KernelSHAP but optimised for tree-based methods; we use the whole dataset as background samples for this method. The second one approximates SHAP values by considering the tree structures and does not need background samples as input, so we name it TreeSHAPapprox. Last, we consider LIME, which requires a number of perturbed samples to be created for explaining each instance. We choose to set this number to 100 samples for all datasets.

¹ The KernelSHAP documentation includes a recommendation to use the K-Means algorithm to speed up computation time: https://shap-lrjball.readthedocs.io/en/latest/generated/shap.KernelExplainer.html

With a similar methodology, Section 4.4 identifies the impact of the predictive model on specific explanation methods.

Lastly, we present in Section 4.5 a practical example of the different explanation methods applied to a specific dataset, SA-Heart. This dataset is chosen for its medical context (coronary heart diseases) and a sufficient number of instances (462) and features (10) to train a coherent model and compute the explanations in acceptable computational times. The underlying idea is to illustrate the highlighted behaviors by taking a concrete example as it could be used by an end user (e.g. a physician).
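The sketch below condenses this protocol on a stand-in dataset: the four model families with default hyperparameters, and a per-instance timing of KernelSHAP10. The dataset, the timing helper and the small tweaks flagged in the comments are ours, not part of the paper's code.

```python
# Condensed sketch of the protocol of Section 4.1 on a stand-in dataset.
import time
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

# The four predictive model types; hyperparameters are left at their defaults
# as in the paper (probability=True and max_iter are tweaks for this stand-in).
models = {
    "LR": LogisticRegression(max_iter=5000),
    "SVM": SVC(probability=True),
    "RF": RandomForestClassifier(),
    "GBM": XGBClassifier(),
}

def kernelshap10_seconds_per_instance(model, n_explained=3):
    """Fit the model, then time KernelSHAP with a 10-centroid K-Means background."""
    model.fit(X, y)
    explainer = shap.KernelExplainer(model.predict_proba, shap.kmeans(X, 10))
    start = time.perf_counter()
    explainer.shap_values(X[:n_explained])
    return (time.perf_counter() - start) / n_explained

for name, model in models.items():
    print(name, f"{kernelshap10_seconds_per_instance(model):.2f} s/instance")
```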
4.2 Metrics of interest
Because of the subjective nature of explanations, there is no consensus on objective mathematical ways to evaluate them. Therefore, to evaluate the performance of the explanation methods and compare them over a high number of datasets, we define three different metrics that only need the influence values given by each method.

The first one is the computational time per instance, which is the amount of time taken by a given method to compute the local influences of a whole dataset, divided by the number of instances in the dataset.

The second one is a quantification of the average deviation of the influences given by a method from the Complete method (see Section 3). This error rate is defined as:

$$err(I, X) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{p}\sum_{k=1}^{p}\left|I_k(X_i) - I_k^{C}(X_i)\right|$$

where, for a given dataset, $n$ is the number of instances, $p$ the number of features, $X_i$ the feature vector of instance $i$, $I_k(x)$ the influence of feature $k$ for a given instance $x$, a given explanation method and a given machine learning model, and $I_k^{C}(x)$ the influence given by the Complete method for the same model, same feature and same instance.

The third metric evaluates the distribution of the feature importance assigned by a given explanation. As the raw values are not necessarily comparable between explanation methods, the cumulative importance proportion of the features given by a method is considered. This metric shows whether an explanation method favours the attribution of great importance to a few features or, on the contrary, a more homogeneous distribution among a larger number of features. The importance of a feature is defined here as the mean absolute value of the influence assigned to the instances for this feature. For example, in a dataset with 2 features, if a method gives 80% of the importance to the most important feature (and so 20% to the second), it would have a cumulative importance proportion vector of [0, 0.8, 1]. We can then compute the (normalised) Area Under Curve (AUC) of such a vector $C$ with:

$$AUC(I, X) = \frac{1}{p}\sum_{i=0}^{p-1}\frac{C_i + C_{i+1}}{2}$$

where $C_i$ is the total importance proportion taken by the $i$ most important features. As this cumulative sum is sorted by construction from the most important to the least important features, this value is bounded between 0.5 and 1. A value of 0.5 means that the explanation method gives the same importance to all features, while a value of 1 means that the explanation method gives non-zero influences to only a single feature, explaining the model's predictions with a single feature.
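Both metrics translate directly into numpy, as in the sketch below, which assumes `influences` and `influences_complete` are the (n, p) matrices of local influences produced by the method under study and by the Complete method respectively.

```python
# Direct translation of the two metrics above. `influences` and
# `influences_complete` are assumed to be (n, p) arrays of local influences.
import numpy as np

def error_rate(influences, influences_complete):
    """err(I, X): mean absolute deviation from the Complete method."""
    return np.abs(influences - influences_complete).mean()

def importance_auc(influences):
    """Normalised AUC of the cumulative feature-importance proportion."""
    importance = np.abs(influences).mean(axis=0)         # mean |influence| per feature
    proportions = np.sort(importance)[::-1] / importance.sum()
    c = np.concatenate([[0.0], np.cumsum(proportions)])  # C_0 = 0, ..., C_p = 1
    p = len(importance)
    return ((c[:-1] + c[1:]) / 2).sum() / p               # bounded between 0.5 and 1

# Two-feature example from the text: C = [0, 0.8, 1] gives
# AUC = ((0 + 0.8)/2 + (0.8 + 1)/2) / 2 = 0.65.
```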
4.3 Additive methods comparison
We show in Figure 1 the evolution of the execution time of each method for each predictive model, averaged over the datasets that share the same number of features. LIME, having a linear complexity with the number of features, is computationally expensive compared to the other methods in low dimension (few features), but is less expensive than the coalitional-based methods and KernelSHAP in higher dimensions. LIME also seems to have very low inter-dataset time variability, resulting in smaller error bars on the graph. Coalitional-based methods show an exponential complexity with the number of features, having a high execution time in high dimension, but have an execution time similar to the other methods in low dimension. The Spearman method's execution time seems naturally correlated to the Complete method's execution time, taking a fraction (roughly 25%) of the time of the Complete method. KernelSHAP, despite the limitation on the amount of background samples, has a high execution time in high dimension, comparable to the coalitional-based methods for non-tree-based models. For tree-based models, KernelSHAP is slower in low dimension, but faster in high dimension, than the coalitional-based methods. Last, the tree-based explainers seem to have a constant execution time per instance no matter the number of features, and the approximate tree-path-dependent version of TreeSHAP has the lowest execution time per instance.

Figure 1: Execution time of each method per instance, averaged by number of features, for each model

Regarding the second metric, Figure 2 shows the average absolute difference in influence between each method and the Complete method (reference). First, we can see that, overall, the more features there are in a dataset, the closer (as measured by the second metric) the influences are to the Complete method. This is probably due to the fact that, usually, the more features there are, the less influence amplitude each individual feature has on the prediction. We also note that, no matter the model, the common methods are ranked in the same way. In low dimension (less than 6 features), KernelSHAP is the closest to the Complete method, followed by Spearman, while LIME is the farthest. In higher dimensions, Spearman becomes more precise than KernelSHAP. TreeSHAP (both the approximate and the data-dependent version) is more precise than KernelSHAP, but still less precise than Spearman in high dimensions. Note that the approximate version of TreeSHAP is not shown on the graph for XGBoost because its implementation forces its SHAP values to be in log odds instead of probabilities, making it impossible to compare to the other methods.

Figure 2: Mean absolute difference of each method with the Complete, averaged by number of features, for each model

Finally, we show in Figure 3 an example of the graphical representation of the cumulative feature importance proportion. The figure shows the average of the cumulative importance proportion of the most important features for the 37 datasets having 10 features. This way, for each predictive model and for each method, we obtain a curve from which we compute the third metric: the AUC of the curve. We see on the figure that some methods present steeper curves than others. For example, with Logistic Regression and SVM, LIME gives a smaller proportion of the total importance to the first few most important features, compared to the coalitional-based and SHAP methods. For tree-based models, we see that SHAP, no matter the variant, gives much more importance to the first few most important features than the other methods.

Figure 3: Most-important features cumulative importance proportion by method, for each model. Only influences computed on datasets with 10 features are shown.

According to the method for computing the AUC illustrated in Figure 3, we represent the average values of the AUC for datasets from 2 to 13 features, for each ML model and explanation method, in Figure 4. For all models, we can see that the SHAP methods tend to produce influences with a higher AUC compared to the other methods. This means that SHAP methods tend to assign most of the feature importance to fewer most-important features, while the other methods tend to distribute the feature importance more uniformly over all features. The two coalitional-based methods seem to generate similar AUCs for the feature importance. Finally, LIME tends to produce influences with lower AUCs for non-tree-based models, while it produces AUCs closer to the coalitional-based methods for tree-based models.

Figure 4: AUC of each method, averaged by number of features, for each model
4.4 Machine Learning models explanations comparison
We show in Figure 5 the computational time per instance needed to compute the explanations of each predictive model, for each explanation method.

Figure 5: Execution time of each model per instance, averaged by number of features, for each method

We can see that LIME's execution time has almost no inter-model variability: the computation time per instance is the same no matter the model. For the other methods, the ranking of the methods' computational performance according to the model is roughly the same, from slowest to fastest: Random Forests, XGBoost, SVM and Logistic Regression. SVM has an overall higher variability, presenting steeper curves and higher error bars. SVM even presents outlying results when applied to KernelSHAP in higher dimensions. Overall, we do not observe a specific behavior of a method's computation time with regard to the model used, except for TreeSHAPapprox, where Random Forests are faster to compute. This may be related to the fact that TreeSHAPapprox only considers tree structures, as Random Forests tree structures are simpler than XGBoost's. In general, the faster a model is to train and predict and the simpler it is, the faster the explanations are to compute, no matter the method.

We present in Figure 6 the mean absolute difference between each method applied to each model and the Complete method applied to the same model. The figure does not present the results for TreeSHAPapprox because the only relevant model for this method is Random Forests, so there is no other model to compare the results with. For the three model-agnostic methods (LIME, KernelSHAP and Spearman), the Logistic Regression and SVM models generate the most precise explanations compared to the Complete method on the same models. We can see that the explanations based on Logistic Regression are usually more precise than SVM's, especially in low dimensions. XGBoost explanations are less precise than Random Forests', except for the Spearman method (where similar results are observed). Overall, it seems that the simpler the model, the more precise it is with regard to the Complete method.

Figure 6: Mean absolute difference of each method with the Complete, averaged by number of features, for each model

Finally, regarding the AUC, we present all the results in Figure 7. We observe that, for LIME and KernelSHAP, there is no significant difference between the AUCs of the models' explanations. However, for the coalitional-based methods, we can see a clear separation between tree-based models and non-tree-based models: the latter have a higher AUC than the others. This means that, when using coalitional-based methods, one should be aware that different models may yield a different importance distribution over the features. For the tree-specific methods, we can see that XGBoost generates explanations with slightly higher AUCs than Random Forests on average.

Figure 7: AUC of each model, averaged by number of features, for each method
4.5 Example on a medical dataset
Amongst the OpenML datasets previously studied, we choose a medical dataset, SA-Heart, to compare the explanations given by the different additive methods on an example. This way, we aim to both illustrate and validate the conclusions of the previous sections regarding the characteristics of the explanation methods. We also aim to highlight practical differences that can be seen in the influences of the different methods for the same model and dataset.

SA-Heart is a dataset extracted from a larger database of South Africans detailed in a 1983 study [20]. The extracted dataset is a retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. The dataset is composed of 462 individuals and 10 features. The main objective is to predict the binary target feature 'chd', a coronary heart disease, according to 9 explanatory factors: tobacco (cumulative tobacco consumption), age (at onset), ldl (low-density lipoprotein cholesterol), adiposity (estimation of the body fat percentage), obesity (through the body mass index), family (family history of heart disease, present or absent), alcohol (current alcohol consumption), sbp (systolic blood pressure) and type-A (Type-A behavior scale). After model training, the explanatory profiles obtained by the different explanation methods are compared.

Considering the end-user side, the health care practitioners, explanatory profiles should be used 1) at the population level (global explanations), for example to highlight high-risk patient profiles, develop new prevention programs and develop new physio-pathological hypotheses, but also 2) at the instance level (local explanations), for personalized medicine.

For conciseness, in this paper we limit the analysis to a single machine learning model. We choose Random Forests, as every explanation method that we benchmark is applicable to it. We present the results with the SVM, Logistic Regression and XGBoost models in supplementary data.

To compare the explanations of the different additive methods, we look at the global explanations given by each method. We use SHAP-like representations to visualize global explanations by aggregating local explanations on the same representation. This way, we build different figures. The first one, in Figure 8, represents a global explanation of the predictive model, given by each explanation method, by plotting the explanation profile of each feature on a separate line. For each method, the features are sorted in decreasing feature importance, the top one being the most contributing feature on average, and the bottom one the least contributing feature on average. For each feature, each dot represents an individual from the dataset, its color representing the value of the associated feature. Its position on the x-axis represents the contribution of the feature to the prediction for this individual, and overlapping dots are jittered on the y-axis.

Figure 8: Summary plots of each method on the SA-Heart dataset

We can see that most of the features have a similar ranking among the different methods: tobacco and age are the two most important features, except for the Spearman method, which ranks age 5th. On the opposite side, alcohol, sbp and type-A are always among the 4 least important features. These features also have similar explanation profiles. Conversely, some other features exhibit more marked differences depending on the methods. The most important difference is observed on the binary feature family history of heart disease. This feature is assigned fairly low importance by the coalitional-based methods, relatively high importance (3rd most important feature) by the SHAP methods, and very high importance by LIME (most important feature). Obesity and adiposity also have different influences depending on the method: obesity is ranked second least contributing by LIME and SHAP, but more important by the coalitional-based methods. It is important to note that obesity and adiposity are highly correlated (Pearson's correlation r=0.72). We hypothesize that this may be the reason for such differences. Overall, the three SHAP methods give similar explanations and have almost identical rankings of the features. From a global perspective, we can also see that SHAP and LIME present a more homogeneous "gradient" of colors for the explanations, whereas the coalitional-based methods present mixed-up colors in the explanations. This means that LIME and SHAP's explanations are more locally monotonic, in the sense that the influence value of a feature for an individual is more locally correlated to the value of the feature for LIME and SHAP than it is for the coalitional-based methods.
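Summary plots of this kind can be reproduced for any additive method by feeding its influence matrix to shap's plotting helper. The sketch below builds one such matrix with TreeSHAP on a stand-in dataset, but the same call accepts the (n, p) influences of the Complete, Spearman, KernelSHAP10 or LIME methods.

```python
# Sketch of a SHAP-style summary plot; any method's (n, p) influence matrix
# can be passed in place of the TreeSHAP values computed on this stand-in data.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target
model = RandomForestClassifier(random_state=0).fit(X, y)

influences = shap.TreeExplainer(model).shap_values(X)[1]  # class-1 influences, (n, p)

# Each dot is one individual: x-position = influence, color = feature value;
# features are sorted by mean absolute influence, as described above.
shap.summary_plot(influences, features=X, feature_names=list(data.feature_names))
```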
The second visualization that we present is the Partial Dependence Plot (PDP). PDPs focus on the relationship between a feature and the influence of this feature on the model's prediction, by plotting each pair of feature value and influence value on a 2-dimensional axis. We compare the PDPs of several important features in Figure 9.

Figure 9: Partial dependence plots of age, tobacco, adiposity and obesity for each method

Looking at the PDPs for the age feature, we show that LIME seems to form clusters of points around specific cut-off age values. To a lesser extent, this phenomenon can also be seen for the SHAP methods. Conversely, the coalitional-based methods have similar PDPs and do not seem to find such cut-offs. They do, however, capture a particular behavior of the explanation at specific ages. For example, subjects around 50 years old have a markedly lower contribution of this feature to the prediction of the presence of coronary heart disease than people even slightly younger or older. This may hint at an over-fitting of the machine learning model that would not have been captured by the other explanation methods. The explanation of the tobacco feature also largely differs among explanation methods. Where all the methods agree on attributing a low value to non-smoking individuals, the evolution of the contribution varies with the quantity of tobacco. Once again, LIME and SHAP explanations seem to find a cut-off value for tobacco consumption, of around 7 and 9 respectively, while the coalitional-based methods capture a non-monotonic, more complex relationship.

We also look at the adiposity PDPs. Once again, the three SHAP explanations are close to each other. Interestingly, they capture a non-monotonic relationship between the feature and the outcome, giving people around 30% of adiposity a higher influence for this feature (in absolute value) than people further from this value. This relationship seems to be captured to a lesser extent by the coalitional-based methods, but is not captured at all by LIME. We also note that the Complete and Spearman influences are more scattered, which means that more variance exists amongst subjects of the same adiposity for these methods than for the others.

Lastly, looking at the obesity PDPs, LIME and SHAP methods find a negative relationship between obesity and the chd prediction. This seems counter-intuitive, as obesity is a strong known comorbidity factor of heart diseases. As previously mentioned, obesity and adiposity are strongly correlated (r=0.72), and this may be the reason for such an observation. Furthermore, we mentioned in Section 3 that SHAP works under the hypothesis that features are independent, but with such a correlation, it is very unlikely that obesity and adiposity are independent. To better understand the relationship between these two features, as found by the methods, we plot in Figure 10 the influence values of adiposity and obesity given by each method.

Figure 10: Influence value of adiposity against the influence value of obesity

The Complete and Spearman methods seem to find a positive correlation between the influences of the two features: when an individual is assigned a high influence value for obesity, a high influence value for adiposity is usually assigned, and conversely. We can even distinguish two clusters of individuals: one for individuals that have a high influence value for both features, and one for individuals that have a low influence value for both features. Such a pattern is not found by LIME or SHAP, thus confirming the lack of ability of these methods to consider dependent features.

On a more global scale, we see that LIME and SHAP produce explanations that are easier to read at a first glance compared to the Complete and Spearman explanations. However, LIME and SHAP seem to capture different cut-offs and relationships, and it is hard to confirm such values without further biological knowledge. Coalitional-based methods seem to produce explanations that are harder to read on a global scale, but more precise at an individual level and able to take into account the dependencies between features. PDPs for all features are available at https://github.com/EmmanuelDoumard/local_explanation_comparative_study.
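The influence-versus-value plots of Figure 9 and the influence-versus-influence plot of Figure 10 reduce to simple scatter plots. The sketch below assumes an (n, p) influence matrix `influences` from one of the methods, aligned with a feature matrix `X` whose column names include, for instance, "adiposity" and "obesity"; the helper functions themselves are ours, not the paper's code.

```python
# Minimal sketch of the Figure 9 / Figure 10 style plots. `X`, `influences`
# (both of shape (n, p)) and `feature_names` are assumed inputs from one of
# the explanation methods.
import matplotlib.pyplot as plt

def influence_pdp(X, influences, feature_names, feature):
    """Figure-9-style plot: a feature's values against its local influences."""
    k = list(feature_names).index(feature)
    plt.scatter(X[:, k], influences[:, k], s=10)
    plt.xlabel(feature)
    plt.ylabel(f"influence of {feature}")
    plt.show()

def influence_vs_influence(influences, feature_names, feat_a, feat_b):
    """Figure-10-style plot: one feature's influences against another's."""
    a, b = list(feature_names).index(feat_a), list(feature_names).index(feat_b)
    plt.scatter(influences[:, a], influences[:, b], s=10)
    plt.xlabel(f"influence of {feat_a}")
    plt.ylabel(f"influence of {feat_b}")
    plt.show()
```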
5 LESSONS-LEARNED FOR THE USE OF ADDITIVE LOCAL METHODS
Table 2 summarizes the advantages and drawbacks of each method studied in this paper. Overall, we highlight the fact that coalitional-based methods should be better at producing precise local explanations, while SHAP should be better at producing coherent and easily interpretable global explanations. This is also confirmed by the fact that SHAP tends to assign more importance to a few features than the other methods, producing global explanations that are easier to read, but potentially hiding other features' contributions and inter-dependences. Technically, KernelSHAP gives access to hyper-parameters to balance between execution time and explanation precision, but they are less accessible than Spearman's and LIME's parameters. Indeed, without extensive KernelSHAP knowledge or a careful documentation readout, users can easily miss these parameters.

Method name                     Advantages                                 Drawbacks
Coalitional based               Consider feature interdependence           Global explanations can be hard to read
  - Complete                    Exact Shapley values                       Slow in high dimension
  - Spearman                    Parameter α to control the level
                                of approximation
LIME                            Fast in high dimension;                    Slow in low dimension; low quality explanations;
                                parameters to control approximation        tends to miss non-linear and non-monotonic influences
SHAP                            Easy to interpret global explanations      Approximations may be imprecise
  - KernelSHAP                                                             Slow in high dimension
  - TreeSHAP, TreeSHAPapprox    Very fast in low and high dimensions       Tree-based models specific
Table 2: Summary table of advantages and drawbacks of each method

We use all the results presented in this paper to draw a simplified roadmap, in the form of a decision tree, in Figure 11, with the intent of helping readers find the most suitable explanation method according to their datasets and objectives.

Figure 11: Roadmap for the most appropriate use of methods

In this figure, high dimension refers to the number of features present in the studied dataset. Indeed, there is no "hard" cut-off defining when a dataset goes from low to high dimension, but with our experiments we can place this cut-off somewhere between 11 and 15 features, depending on the dataset complexity and on the computational time and material available to the user. "Accurate tree-based model" represents the ability to train a satisfactory (as defined by the user's objectives) tree-based model on the dataset. The model can then be explained thanks to the optimization done in TreeSHAP. If the desired model is not tree-based, we advise the user to look at KernelSHAP's and LIME's parameters to reduce the number of background samples and perturbation samples respectively, until the explanations are computed in a reasonable time. However, we warn the user about the loss of precision induced by such method approximations. Finally, we show that SHAP and LIME can make important approximations in some cases, and that coalitional-based methods cannot be executed in a reasonable time in high dimension. This leaves an empty space for precise explanations in high dimension that is, to our knowledge, not yet addressed.
6 CONCLUSION AND PERSPECTIVES
In this paper we performed a practical analysis of several local explainability methods for tabular data. Our findings indicate that there is not a single method that is the most appropriate for every usage. Such usages include the need for high precision in local explanations or, on the contrary, the need for explanations that can be aggregated to produce a better and clearer global understanding, while taking into account the complexity level of the data, especially in the high-dimension case. Therefore, this thorough analysis allowed us to identify the strengths and limitations of each method, along with practical recommendations on which method is most suitable for the user's use case. The Complete method is of course the most accurate, but suffers from a very long computational time. Nevertheless, coalitional-based methods allow an acceptable computational time while maintaining a strong precision of explanations. On the contrary, LIME and SHAP methods offer a more intelligible global view of feature effects. The greatest problem arises when high dimension (i.e., a high number of features) is involved, as is often the case in statistics and Machine Learning. In this case, the exponential complexity of coalitional-based methods makes them too long to compute. Indeed, the worst-case scenario is the need for high-precision local explanations in high dimension, since there is a clear lack of methods addressing this problem in the current literature. However, it is still possible to have local explanations of limited quality in high dimension, with the level of quality mostly depending on the time available for the user to generate such explanations. It is thus a very interesting future axis of work to benchmark the performance, in terms of precision of local explanations, of every local explainability method in a high-dimension context under the constraint of a time limit. This would add value to our recommendations by filling out the 'high-precision in high-dimension' gap identified in our study. It would also be interesting to look into other machine learning models, especially deep neural networks, which are more and more used. The very high complexity of this type of model hints at a different behavior of the explanation methods, but also at an increase in computation time.

ACKNOWLEDGMENTS
This study has been partially supported through the grant EUR CARe N°ANR-18-EURE-0003 in the framework of the Programme des Investissements d'Avenir. We also thank the French National Association for Research and Technology (ANRT) and the Kaduceo company for providing us with PhD grants (no. 2020/0964).
REFERENCES
[1] Julia Amann, Alessandro Blasimme, Effy Vayena, Dietmar Frey, and Vince I. Madai. 2020. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Medical Informatics and Decision Making 20 (Nov. 2020), 310. https://doi.org/10.1186/s12911-020-01332-6
[2] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. 2015. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLOS ONE 10, 7 (2015). https://doi.org/10.1371/journal.pone.0130140
[3] Nadia Burkart and Marco F. Huber. 2021. A Survey on the Explainability of Supervised Machine Learning. J. Artif. Int. Res. 70 (May 2021), 245–317. https://doi.org/10.1613/jair.1.12228
[4] Anupam Datta, Shayak Sen, and Yair Zick. 2016. Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems. In 2016 IEEE Symposium on Security and Privacy (SP). 598–617. https://doi.org/10.1109/SP.2016.42
[5] William K Diprose, Nicholas Buist, Ning Hua, Quentin Thurier, George Shand, and Reece Robinson. 2020. Physician understanding, explainability, and trust in a hypothetical machine learning risk calculator. Journal of the American Medical Informatics Association: JAMIA 27, 4 (Feb. 2020), 592–600. https://doi.org/10.1093/jamia/ocz229
[6] Radwa El Shawi, Youssef Sherif, Mouaz Al-Mallah, and Sherif Sakr. 2019. Interpretability in HealthCare: A Comparative Study of Local Machine Learning Interpretability Techniques. In 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS). 275–280. https://doi.org/10.1109/CBMS.2019.00065
[7] Radwa ElShawi, Youssef Sherif, Mouaz Al-Mallah, and Sherif Sakr. 2020. Interpretability in healthcare: A comparative study of local machine learning interpretability techniques. Computational Intelligence (2020).
[8] Gabriel Ferrettini, Elodie Escriva, Julien Aligon, Jean-Baptiste Excoffier, and Chantal Soulé-Dupuy. 2021. Coalitional Strategies for Efficient Individual Prediction Explanation. Information Systems Frontiers (2021). https://doi.org/10.1007/s10796-021-10141-9
[9] D. Garreau and U. von Luxburg. 2020. Explaining the Explainer: A First Theoretical Analysis of LIME. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) (Proceedings of Machine Learning Research, Vol. 108). PMLR, 1287–1296. http://proceedings.mlr.press/v108/garreau20a.html
[10] I. Elizabeth Kumar, Suresh Venkatasubramanian, Carlos Scheidegger, and Sorelle Friedler. 2020. Problems with Shapley-value-based explanations as feature importance measures. In International Conference on Machine Learning. PMLR, 5491–5500.
[11] Thibault Laugel, Xavier Renard, Marie-Jeanne Lesot, Christophe Marsala, and Marcin Detyniecki. 2018. Defining Locality for Surrogates in Post-hoc Interpretability. Workshop on Human Interpretability for Machine Learning (WHI) - International Conference on Machine Learning (ICML) (2018). https://hal.sorbonne-universite.fr/hal-01905924
[12] Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. 2021. Explainable AI: A review of machine learning interpretability methods. Entropy 23, 1 (2021), 18.
[13] Stan Lipovetsky and Michael Conklin. 2001. Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry 17, 4 (2001), 319–330. https://doi.org/10.1002/asmb.446
[14] Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. 2018. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888 (2018).
[15] Scott M. Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 4765–4774. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
[16] Xin Man and Ernest P. Chan. 2021. The Best Way to Select Features? Comparing MDA, LIME, and SHAP. The Journal of Financial Data Science 3, 1 (2021), 127–139. https://doi.org/10.3905/jfds.2020.1.047
[17] Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267 (2019), 1–38. https://doi.org/10.1016/j.artint.2018.07.007
[18] Christoph Molnar. 2018. A guide for making black box models explainable. https://christophm.github.io/interpretable-ml-book/
[19] Marco Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. 97–101. https://doi.org/10.18653/v1/N16-3020
[20] Rossouw, du Plessis, Benade, Jordaan, Kotze, Jooste, and Ferreira. 1983. Coronary risk factor screening in three rural communities - the CORIS baseline study. South African Medical Journal 64, 12 (1983), 430–436.
[21] Lloyd S. Shapley. 2016. 17. A value for n-person games. Princeton University Press.
[22] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning Important Features Through Propagating Activation Differences. In International Conference on Machine Learning. PMLR, 3145–3153. http://proceedings.mlr.press/v70/shrikumar17a.html
[23] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2020. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (2020).
[24] Erik Štrumbelj and Igor Kononenko. 2008. Towards a model independent method for explaining classification for individual instances. In International Conference on Data Warehousing and Knowledge Discovery. Springer, 273–282.
[25] Erik Štrumbelj and Igor Kononenko. 2010. An Efficient Explanation of Individual Classifications Using Game Theory. J. Mach. Learn. Res. 11 (March 2010), 1–18. JMLR.org.
[26] Guy Van den Broeck, Anton Lykov, Maximilian Schleich, and Dan Suciu. 2021. On the tractability of SHAP explanations. In Proceedings of AAAI.
[27] Giulia Vilone and Luca Longo. 2021. Notions of explainability and evaluation approaches for explainable artificial intelligence. Information Fusion 76 (2021), 89–106. https://doi.org/10.1016/j.inffus.2021.05.009
[28] Erik Štrumbelj and Igor Kononenko. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41, 3 (2014), 647–665. https://doi.org/10.1007/s10115-013-0679-x