A comparative study of additive local explanation methods based on feature influences

Emmanuel Doumard (emmanuel.doumard@irit.fr) - Université de Toulouse-Paul Sabatier, IRIT (CNRS/UMR 5505), Toulouse, France
Julien Aligon (julien.aligon@irit.fr) - Université de Toulouse-Capitole, IRIT (CNRS/UMR 5505), Toulouse, France
Elodie Escriva (elodie.escriva@kaduceo.com) - Kaduceo and Université de Toulouse-Capitole, IRIT (CNRS/UMR 5505), Toulouse, France
Jean-Baptiste Excoffier (jeanbaptiste.excoffier@kaduceo.com) - Kaduceo, Toulouse, France
Paul Monsarrat (paul.monsarrat@univ-tlse3.fr) - RESTORE Research Center, Artificial and Natural Intelligence Toulouse Institute (ANITI) and Oral Medicine Department, Toulouse, France
Chantal Soulé-Dupuy (chantal.soule-dupuy@irit.fr) - Université de Toulouse-Capitole, IRIT (CNRS/UMR 5505), Toulouse, France

© 2022 Copyright for this paper by its author(s). Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Local additive explanation methods are increasingly used to understand the predictions of complex Machine Learning (ML) models. The most used additive methods, SHAP and LIME, suffer from limitations that are rarely measured in the literature. This paper aims to measure these limitations on a wide range (304) of OpenML datasets, and also to evaluate emergent coalitional-based methods that aim to tackle the weaknesses of the other methods. We illustrate and validate the results on a specific medical dataset, SA-Heart. Our findings reveal that LIME and SHAP's approximations are particularly efficient in high dimension and generate intelligible global explanations, but they suffer from a lack of precision regarding local explanations. Coalitional-based methods are computationally expensive in high dimension, but offer higher-quality local explanations. Finally, we present a roadmap summarizing our work by pointing out the most appropriate method depending on dataset dimensionality and the user's objectives.

KEYWORDS
Explainable Artificial Intelligence (XAI), Prediction explanation, Machine learning

1 INTRODUCTION
Machine Learning (ML) represents a real revolution in various domains, such as finance, insurance, healthcare and the biomedical field. However, machine learning models give a prediction without necessarily being accompanied by an understandable explanation. These models, often referred to as "black boxes", raise the challenging question of how humans can understand the determinants of the prediction. Explainability is also more than a technological problem; it involves, among others, ethical, societal and legal issues. In healthcare, this may involve the professional being able to explain to the patient how the algorithm works and the criteria for the decision process. The results of ML models must therefore be expressed in a way that can be understood by domain experts, like medical practitioners [1, 5]. Since SHAP [15], machine learning experts have shown a very clear interest in additive methods, as a huge number of works using these methods are published each year. The additive methods include LIME [19], SHAP [15] and, more recently, the coalitional-based methods [8]. The user-friendly representation of explanations, based on feature influences, allows domain and non-domain experts to better understand model predictions [18]. Existing explanation methods are model-specific or model-agnostic, depending on whether they can be applied to some or all types of machine learning models, with local or global explanations to understand either an individual prediction or the behaviour of the model as a whole. While these methods have been evaluated in a number of contexts, no in-depth evaluation is available for a rational choice of one technique over another. The objective of this work is to study the advantages and disadvantages of using each additive method to provide pertinent insights. In particular, we study the effects of the models used and of the type of dataset considered on the feature influences (both at the instance and feature level).

The paper is organised as follows. Section 2 reviews the existing work, classifying and comparing explanation methods for tabular data. Section 3 describes the four additive methods to be compared in this paper. The experiments are presented in Section 4, where we study the explanation characteristics and the impact of the predictive model on explanation profiles, highlighting the behavior of the explanation methods on a practical medical use case. Conclusive lessons learned are then detailed in Section 5.
2 RELATED WORKS
Few works [3, 27] exist in the literature to classify and categorize machine learning explanation methods. In [18], a complete description of explanation approaches from the literature is given. In particular, the authors explain their advantages and disadvantages, giving an overview of their limits. For example, even if the LIME and SHAP approaches are model-agnostic and human-friendly, they suffer from no consideration of feature correlation and possible instability of the explanations. Another paper tackling the limits of the additive methods (LIME and SHAP) is presented in [23]. The paper shows that biased classifiers can fool explanation methods, a problem that is even more accentuated with LIME.

Comparative studies between local explanation methods are also available, such as [6, 8, 16]. In [8], a new additive method was proposed, based on Shapley values and taking into account feature correlation. This method was compared with LIME and SHAP through computation time and an accuracy score. For this last measure, the authors consider as baseline the complete method, computing all Shapley values with each possible coalition of features. The authors show that their proposal is competitive with the literature, both in accuracy and in computation time. In [16], LIME and SHAP are used in a context of feature selection and compared to a Mean Decrease Accuracy (MDA) approach. A stability measure indicates that a feature selection obtained with LIME or SHAP seems more stable than via MDA. In [6], the authors compare 6 local model-agnostic techniques using custom quantitative measures, such as similarity, bias detection, execution time, and trust. From these experiments, no single method stands out for all metrics and all datasets. Each one has strengths and weaknesses depending on the metrics used, and choices between methods can only be made based on the user's goal and dataset.

These results are therefore indications of the absence of a single method that would provide the best explanations in all situations. However, none of these previous works clearly indicates in which situation a method should be preferred to another one. Consequently, our aim is to give the key factors to make an informed decision among the existing additive methods.

As indicated in [17], evaluating explanation methods is very subjective and no consensus yet exists to propose relevant metrics. As all additive methods give an influence score for each feature, we propose to compare them based on these influences. From there, we want to analyse and compare the effects of different predictive models and datasets on these influence scores.
3 ADDITIVE METHODS TO COMPARE
Additive methods are described as explanation models that produce a vector of weights representing the influence of each feature, the sum of which approximates the output of the original model. Explanations are computed for a single instance, and so for every instance of a dataset, hence the term "local". In this section, we explore several existing methods that fit this definition. We focus on post-hoc methods that deliver their explanations for a given, already trained model. The methods used in this study are all agnostic, meaning that they can be applied to any kind of machine learning model, except for the TreeSHAP method [14], which is designed specifically for tree-based models.

3.1 LIME
LIME is a well-known local explanation method described in [19]. LIME uses explainable models to locally approximate a complex black-box model and, for each instance, explain the influence of each feature on the prediction. For each instance to be explained, LIME generates new data in a close neighborhood and computes the predictions of these new instances with the black-box model. A linear regression model, which is interpretable, is then trained on this new dataset. This local model is used to explain the prediction of the instance of interest in the form of a weight vector associating each feature with its influence on the prediction. A well-known limitation of LIME is the restrictive hypotheses on which it is based, such as local linearity and feature independence [9, 23]. Defining the locality around an instance of interest can also be a challenge, as the fit of the surrogate model has a significant impact on the accuracy of the explanations [11] as well as on their stability [7].

The full implementation of LIME is available on GitHub: https://github.com/marcotcr/lime.
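To make this workflow concrete, here is a minimal, illustrative sketch of a LIME run on a stand-in scikit-learn dataset (not one of the paper's OpenML datasets); the weight vector returned by as_list() is the local explanation discussed in this subsection.

```python
# Minimal, illustrative LIME run on a stand-in dataset. It shows how the local
# linear surrogate yields one weight per feature for a single instance.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
X, y = data.data, data.target
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, mode="classification", feature_names=list(data.feature_names)
)

# LIME perturbs the instance, queries the black-box model and fits a weighted
# linear model; the coefficients of that surrogate are the feature influences.
exp = explainer.explain_instance(
    X[0],
    model.predict_proba,
    num_features=X.shape[1],
    num_samples=100,  # the paper sets 100 perturbed samples per instance (Section 4.1)
)
print(exp.as_list())  # [(feature description, weight), ...]
```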
3.2 Shapley Values (complete method)
To explain individual predictions, a method based on Shapley values is described in [24, 25, 28]. Shapley values 'fairly' weight groups of features according to their relative importance to a defined gain [21]. In machine learning, the gain can be linked to the prediction made by the model. The influence of each feature is computed based on its impact on the prediction for each coalition of features. The explanation method based on Shapley values is called the complete method. All coalitions are evaluated with and without each feature, and the change in the prediction is used to compute the influence of the feature. The complete method can be used as a baseline to compare other methods, as it is an exhaustive method close to the original intuition behind feature influence [8]. This method is however very expensive to compute, with an exponential complexity in the number of features in the dataset.

Several more recent methods, including SHAP [15] and the coalitional methods [8], are based on Shapley values and aim to overcome the limitations of the complete method.
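The enumeration behind the complete method can be written down directly. The sketch below is schematic only: it marginalizes absent features over a background sample, which is one common convention, whereas the complete method of [8] may handle absent features differently (e.g. by retraining); it illustrates the exponential coalition enumeration rather than reproducing that implementation.

```python
# Schematic sketch of the exact Shapley computation for one instance.
# Absent features are marginalized over a background sample here; this is an
# illustration of the exponential enumeration, not a faithful reimplementation
# of the complete method of [8].
import numpy as np
from itertools import combinations
from math import factorial

def coalition_value(model, x, coalition, background):
    """Expected model output when only the features in `coalition` are fixed to x."""
    samples = background.copy()
    if coalition:
        samples[:, list(coalition)] = x[list(coalition)]
    return model.predict_proba(samples)[:, 1].mean()

def complete_shapley(model, x, background):
    p = x.shape[0]
    influences = np.zeros(p)
    for k in range(p):
        others = [f for f in range(p) if f != k]
        for size in range(p):  # coalitions of every size, with and without feature k
            for coalition in combinations(others, size):
                weight = factorial(size) * factorial(p - size - 1) / factorial(p)
                gain = (coalition_value(model, x, coalition + (k,), background)
                        - coalition_value(model, x, coalition, background))
                influences[k] += weight * gain
    # Additivity: the influences sum to f(x) minus the average prediction on `background`.
    return influences
```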
3.3 SHAP
The SHAP (SHapley Additive exPlanations) method [15] works on improving computation time and explanation precision, especially for tree-based models [14]. It combines LIME [19] and Shapley values [28], along with other methods from the literature [2, 4, 13, 22], in a unique framework to produce local explanations. The main idea is to create perturbations to simulate the absence of a feature and to use a local linear model to approximate the change in the prediction, as in LIME. This avoids retraining the complex model without the feature of interest. Local explanations can be aggregated to explain the global behaviour of the model. Global and local explanations are then consistent with each other, as they have the same foundation. SHAP includes an agnostic explainer, KernelSHAP, as well as model-specific explainers, such as TreeSHAP, LinearSHAP and DeepSHAP for tree-based models, linear models and deep models respectively. While commonly used in a Machine Learning context [12], SHAP still suffers from a lack of precision [10, 23], mostly due to its restrictive hypotheses (local linearity and feature independence), as with LIME. Moreover, computation time is still high for models other than tree-based models [26].

The full implementation of SHAP is available on GitHub: https://github.com/slundberg/shap.
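The explainers named above map onto the shap library roughly as in the following sketch (stand-in model and data; the exact arguments of the paper's runs are given in Section 4.1 and may differ).

```python
# Hedged sketch of the SHAP explainers mentioned above, on a stand-in model.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target
model = RandomForestClassifier(random_state=0).fit(X, y)

# Model-agnostic KernelSHAP, with the background data summarized by K-Means
# centroids (the KernelSHAP10 setting of Section 4.1 uses K = 10).
kernel_explainer = shap.KernelExplainer(model.predict_proba, shap.kmeans(X, 10))
kernel_values = kernel_explainer.shap_values(X[:2])  # one array per class

# Model-specific TreeSHAP, either with background samples ("interventional")
# or using only the tree structures ("tree_path_dependent", no background).
tree_explainer = shap.TreeExplainer(model, data=X[:100],
                                    feature_perturbation="interventional")
tree_values = tree_explainer.shap_values(X[:2])

tree_approx = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")
approx_values = tree_approx.shap_values(X[:2])
```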
3.4 Coalitional-based method
Another agnostic explainer based on Shapley values, the coalitional method, was introduced to take into account the interdependence of features and to lift some restrictions of SHAP. It uses grouping methods such as Principal Component Analysis (PCA), the Spearman correlation factor (Spearman) and the Variance Inflation Factor (VIF) to pre-compute groups of features for the explanations [8]. These groups are then used as coalitions to compute Shapley values as in the complete method. The influence of each feature is defined as its impact on the prediction only within the pre-computed groups of features, approximating the complete method and reducing the computational time. Grouping methods are defined with a parameter that changes the number and size of the feature groups, in order to prioritise either a low computational time or a higher accuracy. As for SHAP, local explanations can be aggregated into global explanations with a common foundation to study the global and local behavior of the model.

The full implementation of the coalitional-based method is available on GitHub: https://github.com/kaduceo/coalitional_explanation_methods.
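As an illustration only, the sketch below shows the kind of Spearman-based pre-computation such a grouping relies on: features whose rank correlation exceeds a threshold end up in the same group, and each group is then used as a coalition. The grouping rule, its parameterisation and the coalition evaluation in the kaduceo implementation linked above may differ.

```python
# Illustrative only: greedy grouping of features by Spearman rank correlation,
# the kind of pre-computation the coalitional Spearman method relies on. The
# actual grouping rule of the kaduceo implementation may differ.
import numpy as np
from scipy.stats import spearmanr

def spearman_groups(X, threshold=0.5):
    """Merge features whose absolute Spearman correlation exceeds `threshold`."""
    corr = np.abs(spearmanr(X).correlation)  # (p, p) matrix, assumes p > 2
    p = X.shape[1]
    groups, assigned = [], set()
    for i in range(p):
        if i in assigned:
            continue
        group = [i] + [j for j in range(i + 1, p)
                       if j not in assigned and corr[i, j] >= threshold]
        assigned.update(group)
        groups.append(group)
    return groups  # each group is later used as a coalition of features

# On SA-Heart, strongly correlated features such as obesity and adiposity would
# typically end up in the same group, hence in the same coalitions.
```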
4 EXPERIMENTS
In this section, we propose experiments comparing the explanation methods presented in the previous section. The goal is to identify the general behavior of each method and how this behavior eventually differs according to the predictive model (learned from data) and the dimensionality of the data (number of features).

4.1 Experimental protocol
All experiments are run on an Intel Xeon Gold 6230 processor with 125 GB of RAM using Python 3.9.7. All runs are performed on a single core of CPU for optimization and reproducibility. To compare explanation methods, we apply them to a wide range of 304 datasets available on OpenML (www.openml.org). Due to the computational constraints of the explanation methods, we only considered datasets with at most 13 features and at most 10 000 instances. We also only considered classification tasks, to use comparable predictive models and metrics. We describe the number and size of the datasets per number of features in Table 1.

Number of    Number of    Number of instances
features     datasets     Min     Max     Mean
1            5            130     9100    3079
2            21           52      5456    901
3            43           60      9989    1729
4            23           96      8641    1016
5            35           62      7129    941
6            27           51      9517    949
7            33           54      4052    499
8            32           52      8192    1473
9            23           52      1473    484
10           37           57      5473    712
11           8            66      4898    942
12           12           123     8192    1175
13           5            178     506     293
Total        304          51      9989    1035
Table 1: Datasets description

As an explanation method needs a model to be applied to, we choose four widely used types of ML models for classification: Logistic Regression (LR), Support Vector Machines (SVM), Random Forests (RF) and Gradient Boosted Machines (GBM). For the first three, we use the implementation of the Python library scikit-learn, version 1.0.1. For GBM, we use the Python library XGBoost, version 1.5. We use default values for the model hyperparameters. For the explanation methods, we use the Python libraries shap 0.40 and lime 0.2.0.1.

Then, to be able to compare the explanations, we need to define metrics of interest. In Section 4.2, we present the three metrics that we use for this study.

Section 4.3 compares the four additive methods introduced in Section 3. In particular, we use two distinct coalitional-based methods: the Complete method, which serves as reference for the influence deviation measurement (second metric), and the Spearman method with a threshold of 25% of all groups of features. Regarding SHAP, we use the model-agnostic KernelSHAP on all datasets. As this method is very slow to execute if we use the whole dataset as background samples for the permutations, we choose to follow SHAP's recommendation¹ of doing a K-Means clustering on the input dataset and then taking the centroids as background samples. We choose K = 10 clusters for each dataset, thus naming the method KernelSHAP10. In addition, for the two tree-based predictive models, XGBoost and Random Forests, we use the model-specific explainer TreeSHAP through two implementations. The first one determines SHAP values with background samples, similarly to KernelSHAP but optimised for tree-based methods; we use the whole dataset as background samples for this method. The second one approximates SHAP values by considering the tree structures and does not need background samples as input, so we name it TreeSHAPapprox. Last, we consider LIME, which requires a number of perturbed samples to be created for explaining each instance. We choose to set this number to 100 samples for all datasets.

¹ The KernelSHAP documentation includes a recommendation to use the K-Means algorithm to speed up computation time: https://shap-lrjball.readthedocs.io/en/latest/generated/shap.KernelExplainer.html

With a similar methodology, Section 4.4 identifies the impact of the predictive model on specific explanation methods.

Lastly, we present in Section 4.5 a practical example of the different explanation methods applied to a specific dataset, SA-Heart. This dataset is chosen for its medical context (coronary heart diseases) and a sufficient number of instances (462) and features (10) to train a coherent model and compute the explanations in acceptable computational times. The underlying idea is to illustrate the highlighted behaviors by taking a concrete example as it could be used by an end user (e.g. a physician).
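The sketch below condenses this protocol on a stand-in dataset: the four model families with default hyperparameters, and a per-instance timing of KernelSHAP10. The dataset, the timing helper and the small tweaks flagged in the comments are ours, not part of the paper's code.

```python
# Condensed sketch of the protocol of Section 4.1 on a stand-in dataset.
import time
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

# The four predictive model types; hyperparameters are left at their defaults
# as in the paper (probability=True and max_iter are tweaks for this stand-in).
models = {
    "LR": LogisticRegression(max_iter=5000),
    "SVM": SVC(probability=True),
    "RF": RandomForestClassifier(),
    "GBM": XGBClassifier(),
}

def kernelshap10_seconds_per_instance(model, n_explained=3):
    """Fit the model, then time KernelSHAP with a 10-centroid K-Means background."""
    model.fit(X, y)
    explainer = shap.KernelExplainer(model.predict_proba, shap.kmeans(X, 10))
    start = time.perf_counter()
    explainer.shap_values(X[:n_explained])
    return (time.perf_counter() - start) / n_explained

for name, model in models.items():
    print(name, f"{kernelshap10_seconds_per_instance(model):.2f} s/instance")
```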
4.2 Metrics of interest
Because of the subjective nature of explanations, there is no consensus on objective mathematical ways to evaluate them. Therefore, to evaluate the performance of the explanation methods and compare them over a high number of datasets, we define three different metrics that only need the influence values given by each method.

The first one is the computational time per instance, which is the amount of time taken by a given method to compute the local influences of a whole dataset, divided by the number of instances in the dataset.

The second one is a quantification of the average deviation of the influences given by a method from the Complete method (see Section 3). This error rate is defined as:

$$err(I, X) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{p}\sum_{k=1}^{p}\left|I_k(X_i) - I_k^{C}(X_i)\right|$$

where, for a given dataset, $n$ is the number of instances, $p$ the number of features, $X_i$ the feature vector of instance $i$, $I_k(x)$ the influence of feature $k$ for a given instance $x$, a given explanation method and a given machine learning model, and $I_k^{C}(x)$ the influence given by the Complete method for the same model, same feature and same instance.

The third metric evaluates the distribution of the feature importance assigned by a given explanation. As the raw values are not necessarily comparable between explanation methods, the cumulative importance proportion of the features given by a method is considered. This metric shows whether an explanation method favours the attribution of great importance to a few features or, on the contrary, a more homogeneous distribution among a larger number of features. The importance of a feature is defined here as the mean absolute value of the influence assigned to the instances for this feature. For example, in a dataset with 2 features, if a method gives 80% of the importance to the most important feature (and so 20% to the second), it would have a cumulative importance proportion vector of [0, 0.8, 1]. We can then compute the (normalised) Area Under Curve (AUC) of such a vector $C$ with:

$$AUC(I, X) = \frac{1}{p}\sum_{i=0}^{p-1}\frac{C_i + C_{i+1}}{2}$$

where $C_i$ is the total importance proportion taken by the $i$ most important features. As this cumulative sum is sorted by construction from the most important to the least important features, this value is bounded between 0.5 and 1. A value of 0.5 means that the explanation method gives the same importance to all features, while a value of 1 means that the explanation method gives non-zero influences to only a single feature, explaining the model's predictions with a single feature.
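Both metrics translate directly into numpy, as in the sketch below, which assumes `influences` and `influences_complete` are the (n, p) matrices of local influences produced by the method under study and by the Complete method respectively.

```python
# Direct translation of the two metrics above. `influences` and
# `influences_complete` are assumed to be (n, p) arrays of local influences.
import numpy as np

def error_rate(influences, influences_complete):
    """err(I, X): mean absolute deviation from the Complete method."""
    return np.abs(influences - influences_complete).mean()

def importance_auc(influences):
    """Normalised AUC of the cumulative feature-importance proportion."""
    importance = np.abs(influences).mean(axis=0)         # mean |influence| per feature
    proportions = np.sort(importance)[::-1] / importance.sum()
    c = np.concatenate([[0.0], np.cumsum(proportions)])  # C_0 = 0, ..., C_p = 1
    p = len(importance)
    return ((c[:-1] + c[1:]) / 2).sum() / p               # bounded between 0.5 and 1

# Two-feature example from the text: C = [0, 0.8, 1] gives
# AUC = ((0 + 0.8)/2 + (0.8 + 1)/2) / 2 = 0.65.
```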
4.3 Additive methods comparison
We show in Figure 1 the evolution of the execution time of each method for each predictive model, averaged over the datasets that share the same number of features. LIME, having a linear complexity with the number of features, is computationally expensive compared to the other methods in low dimension (few features), but is less expensive than the coalitional-based methods and KernelSHAP in higher dimensions. LIME also seems to have very low inter-dataset time variability, resulting in smaller error bars on the graph. Coalitional-based methods show an exponential complexity with the number of features, having a high execution time in high dimension, but have an execution time similar to the other methods in low dimension. The Spearman method's execution time seems naturally correlated to the Complete method's execution time, taking a fraction (roughly 25%) of the time of the Complete method. KernelSHAP, despite the limitation on the amount of background samples, has a high execution time in high dimension, comparable to the coalitional-based methods for non-tree-based models. For tree-based models, KernelSHAP is slower in low dimension, but faster in high dimension, than the coalitional-based methods. Last, the tree-based explainers seem to have a constant execution time per instance no matter the number of features, and the approximate tree-path-dependent version of TreeSHAP has the lowest execution time per instance.

Figure 1: Execution time of each method per instance, averaged by number of features, for each model

Regarding the second metric, Figure 2 shows the average absolute difference in influence between each method and the Complete method (reference). First, we can see that, overall, the more features there are in a dataset, the closer (as measured by the second metric) the influences are to the Complete method. This is probably due to the fact that, usually, the more features there are, the less influence amplitude each individual feature has on the prediction. We also note that, no matter the model, the common methods are ranked in the same way. In low dimension (less than 6 features), KernelSHAP is the closest to the Complete method, followed by Spearman, while LIME is the farthest. In higher dimensions, Spearman becomes more precise than KernelSHAP. TreeSHAP (both the approximate and the data-dependent version) is more precise than KernelSHAP, but still less precise than Spearman in high dimensions. Note that the approximate version of TreeSHAP is not shown on the graph for XGBoost because its implementation forces its SHAP values to be in log odds instead of probabilities, making it impossible to compare to the other methods.

Figure 2: Mean absolute difference of each method with the Complete, averaged by number of features, for each model

Finally, we show in Figure 3 an example of the graphical representation of the cumulative feature importance proportion. The figure shows the average of the cumulative importance proportion of the most important features for the 37 datasets having 10 features. This way, for each predictive model and for each method, we obtain a curve from which we compute the third metric: the AUC of the curve. We see on the figure that some methods present steeper curves than others. For example, with Logistic Regression and SVM, LIME gives a smaller proportion of the total importance to the first few most important features, compared to the coalitional-based and SHAP methods. For tree-based models, we see that SHAP, no matter the variant, gives much more importance to the first few most important features than the other methods.

Figure 3: Most-important features cumulative importance proportion by method, for each model. Only influences computed on datasets with 10 features are shown.

According to the method for computing the AUC illustrated in Figure 3, we represent the average values of the AUC for datasets from 2 to 13 features, for each ML model and explanation method, in Figure 4. For all models, we can see that the SHAP methods tend to produce influences with a higher AUC compared to the other methods. This means that SHAP methods tend to assign most of the feature importance to fewer most-important features, while the other methods tend to distribute the feature importance more uniformly over all features. The two coalitional-based methods seem to generate similar AUCs for the feature importance. Finally, LIME tends to produce influences with lower AUCs for non-tree-based models, while it produces AUCs closer to the coalitional-based methods for tree-based models.

Figure 4: AUC of each method, averaged by number of features, for each model
4.4 Machine Learning models explanations comparison
We show in Figure 5 the computational time per instance needed to compute the explanations of each predictive model, for each explanation method.

Figure 5: Execution time of each model per instance, averaged by number of features, for each method

We can see that LIME's execution time has almost no inter-model variability: the computation time per instance is the same no matter the model. For the other methods, the ranking of the methods' computational performance according to the model is roughly the same, from slowest to fastest: Random Forests, XGBoost, SVM and Logistic Regression. SVM has an overall higher variability, presenting steeper curves and higher error bars. SVM even presents outlying results when applied to KernelSHAP in higher dimensions. Overall, we do not observe a specific behavior of a method's computation time with regard to the model used, except for TreeSHAPapprox, where Random Forests are faster to compute. This may be related to the fact that TreeSHAPapprox only considers tree structures, as Random Forests tree structures are simpler than XGBoost's. In general, the faster a model is to train and predict and the simpler it is, the faster the explanations are to compute, no matter the method.

We present in Figure 6 the mean absolute difference between each method applied to each model and the Complete method applied to the same model. The figure does not present the results for TreeSHAPapprox because the only relevant model for this method is Random Forests, so there is no other model to compare the results with. For the three model-agnostic methods (LIME, KernelSHAP and Spearman), the Logistic Regression and SVM models generate the most precise explanations compared to the Complete method on the same models. We can see that the explanations based on Logistic Regression are usually more precise than SVM's, especially in low dimensions. XGBoost explanations are less precise than Random Forests', except for the Spearman method (where similar results are observed). Overall, it seems that the simpler the model, the more precise it is with regard to the Complete method.

Figure 6: Mean absolute difference of each method with the Complete, averaged by number of features, for each model

Finally, regarding the AUC, we present all the results in Figure 7. We observe that, for LIME and KernelSHAP, there is no significant difference between the AUCs of the models' explanations. However, for the coalitional-based methods, we can see a clear separation between tree-based models and non-tree-based models: the latter have a higher AUC than the others. This means that, when using coalitional-based methods, one should be aware that different models may yield a different importance distribution over the features. For the tree-specific methods, we can see that XGBoost generates explanations with slightly higher AUCs than Random Forests on average.

Figure 7: AUC of each model, averaged by number of features, for each method
4.5 Example on a medical dataset
Amongst the OpenML datasets previously studied, we choose a medical dataset, SA-Heart, to compare the explanations given by the different additive methods on an example. This way, we aim to both illustrate and validate the conclusions of the previous sections regarding the characteristics of the explanation methods. We also aim to highlight practical differences that can be seen in the influences of the different methods for the same model and dataset.

SA-Heart is a dataset extracted from a larger database of South Africans detailed in a 1983 study [20]. The extracted dataset is a retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. The dataset is composed of 462 individuals and 10 features. The main objective is to predict the binary target feature 'chd', a coronary heart disease, according to 9 explanatory factors: tobacco (cumulative tobacco consumption), age (at onset), ldl (low-density lipoprotein cholesterol), adiposity (estimation of the body fat percentage), obesity (through the body mass index), family (family history of heart disease, present or absent), alcohol (current alcohol consumption), sbp (systolic blood pressure) and type-A (Type-A behavior scale). After model training, the explanatory profiles obtained by the different explanation methods are compared.

Considering the end-user side, the health care practitioners, explanatory profiles should be used 1) at the population level (global explanations), for example to highlight high-risk patient profiles, develop new prevention programs and develop new physio-pathological hypotheses, but also 2) at the instance level (local explanations), for personalized medicine.

For conciseness, in this paper we limit the analysis to a single machine learning model. We choose Random Forests, as every explanation method that we benchmark is applicable to it. We present the results with the SVM, Logistic Regression and XGBoost models in supplementary data.

To compare the explanations of the different additive methods, we look at the global explanations given by each method. We use SHAP-like representations to visualize global explanations by aggregating local explanations on the same representation. This way, we build different figures. The first one, in Figure 8, represents a global explanation of the predictive model, given by each explanation method, by plotting the explanation profile of each feature on a separate line. For each method, the features are sorted in decreasing feature importance, the top one being the most contributing feature on average, and the bottom one the least contributing feature on average. For each feature, each dot represents an individual from the dataset, its color representing the value of the associated feature. Its position on the x-axis represents the contribution of the feature to the prediction for this individual, and overlapping dots are jittered on the y-axis.

Figure 8: Summary plots of each method on the SA-Heart dataset

We can see that most of the features have a similar ranking among the different methods: tobacco and age are the two most important features, except for the Spearman method, which ranks age 5th. On the opposite side, alcohol, sbp and type-A are always among the 4 least important features. These features also have similar explanation profiles. Conversely, some other features exhibit more marked differences depending on the methods. The most important difference is observed on the binary feature family history of heart disease. This feature is assigned fairly low importance by the coalitional-based methods, relatively high importance (3rd most important feature) by the SHAP methods, and very high importance by LIME (most important feature). Obesity and adiposity also have different influences depending on the method: obesity is ranked second least contributing by LIME and SHAP, but more important by the coalitional-based methods. It is important to note that obesity and adiposity are highly correlated (Pearson's correlation r=0.72). We hypothesize that this may be the reason for such differences. Overall, the three SHAP methods give similar explanations and have almost identical rankings of the features. From a global perspective, we can also see that SHAP and LIME present a more homogeneous "gradient" of colors for the explanations, whereas the coalitional-based methods present mixed-up colors in the explanations. This means that LIME and SHAP's explanations are more locally monotonic, in the sense that the influence value of a feature for an individual is more locally correlated to the value of the feature for LIME and SHAP than it is for the coalitional-based methods.
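Summary plots of this kind can be reproduced for any additive method by feeding its influence matrix to shap's plotting helper. The sketch below builds one such matrix with TreeSHAP on a stand-in dataset, but the same call accepts the (n, p) influences of the Complete, Spearman, KernelSHAP10 or LIME methods.

```python
# Sketch of a SHAP-style summary plot; any method's (n, p) influence matrix
# can be passed in place of the TreeSHAP values computed on this stand-in data.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target
model = RandomForestClassifier(random_state=0).fit(X, y)

influences = shap.TreeExplainer(model).shap_values(X)[1]  # class-1 influences, (n, p)

# Each dot is one individual: x-position = influence, color = feature value;
# features are sorted by mean absolute influence, as described above.
shap.summary_plot(influences, features=X, feature_names=list(data.feature_names))
```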
The second visualization that we present is the Partial Dependence Plot (PDP). PDPs focus on the relationship between a feature and the influence of this feature on the model's prediction, by plotting each pair of feature value and influence value on a 2-dimensional axis. We compare the PDPs of several important features in Figure 9.

Figure 9: Partial dependence plots of age, tobacco, adiposity and obesity for each method

Looking at the PDPs for the age feature, we show that LIME seems to form clusters of points around specific cut-off age values. To a lesser extent, this phenomenon can also be seen for the SHAP methods. Conversely, the coalitional-based methods have similar PDPs and do not seem to find such cut-offs. They do, however, capture a particular behavior of the explanation at specific ages. For example, subjects around 50 years old have a markedly lower contribution of this feature to the prediction of the presence of coronary heart disease than people even slightly younger or older. This may hint at an over-fitting of the machine learning model that would not have been captured by the other explanation methods. The explanation of the tobacco feature also largely differs among explanation methods. Where all the methods agree on attributing a low value to non-smoking individuals, the evolution of the contribution varies with the quantity of tobacco. Once again, LIME and SHAP explanations seem to find a cut-off value for tobacco consumption, of around 7 and 9 respectively, while the coalitional-based methods capture a non-monotonic, more complex relationship.

We also look at the adiposity PDPs. Once again, the three SHAP explanations are close to each other. Interestingly, they capture a non-monotonic relationship between the feature and the outcome, giving people around 30% of adiposity a higher influence for this feature (in absolute value) than people further from this value. This relationship seems to be captured to a lesser extent by the coalitional-based methods, but is not captured at all by LIME. We also note that the Complete and Spearman influences are more scattered, which means that more variance exists amongst subjects of the same adiposity for these methods than for the others.

Lastly, looking at the obesity PDPs, LIME and SHAP methods find a negative relationship between obesity and the chd prediction. This seems counter-intuitive, as obesity is a strong known comorbidity factor of heart diseases. As previously mentioned, obesity and adiposity are strongly correlated (r=0.72), and this may be the reason for such an observation. Furthermore, we mentioned in Section 3 that SHAP works under the hypothesis that features are independent, but with such a correlation, it is very unlikely that obesity and adiposity are independent. To better understand the relationship between these two features, as found by the methods, we plot in Figure 10 the influence values of adiposity and obesity given by each method.

Figure 10: Influence value of adiposity against the influence value of obesity

The Complete and Spearman methods seem to find a positive correlation between the influences of the two features: when an individual is assigned a high influence value for obesity, a high influence value for adiposity is usually assigned, and conversely. We can even distinguish two clusters of individuals: one for individuals that have a high influence value for both features, and one for individuals that have a low influence value for both features. Such a pattern is not found by LIME or SHAP, thus confirming the lack of ability of these methods to consider dependent features.

On a more global scale, we see that LIME and SHAP produce explanations that are easier to read at a first glance compared to the Complete and Spearman explanations. However, LIME and SHAP seem to capture different cut-offs and relationships, and it is hard to confirm such values without further biological knowledge. Coalitional-based methods seem to produce explanations that are harder to read on a global scale, but more precise at an individual level and able to take into account the dependencies between features. PDPs for all features are available at https://github.com/EmmanuelDoumard/local_explanation_comparative_study.
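The influence-versus-value plots of Figure 9 and the influence-versus-influence plot of Figure 10 reduce to simple scatter plots. The sketch below assumes an (n, p) influence matrix `influences` from one of the methods, aligned with a feature matrix `X` whose column names include, for instance, "adiposity" and "obesity"; the helper functions themselves are ours, not the paper's code.

```python
# Minimal sketch of the Figure 9 / Figure 10 style plots. `X`, `influences`
# (both of shape (n, p)) and `feature_names` are assumed inputs from one of
# the explanation methods.
import matplotlib.pyplot as plt

def influence_pdp(X, influences, feature_names, feature):
    """Figure-9-style plot: a feature's values against its local influences."""
    k = list(feature_names).index(feature)
    plt.scatter(X[:, k], influences[:, k], s=10)
    plt.xlabel(feature)
    plt.ylabel(f"influence of {feature}")
    plt.show()

def influence_vs_influence(influences, feature_names, feat_a, feat_b):
    """Figure-10-style plot: one feature's influences against another's."""
    a, b = list(feature_names).index(feat_a), list(feature_names).index(feat_b)
    plt.scatter(influences[:, a], influences[:, b], s=10)
    plt.xlabel(f"influence of {feat_a}")
    plt.ylabel(f"influence of {feat_b}")
    plt.show()
```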
5 LESSONS-LEARNED FOR THE USE OF ADDITIVE LOCAL METHODS
Table 2 summarizes the advantages and drawbacks of each method studied in this paper. Overall, we highlight the fact that coalitional-based methods should be better at producing precise local explanations, while SHAP should be better at producing coherent and easily interpretable global explanations. This is also confirmed by the fact that SHAP tends to assign more importance to a few features than the other methods, producing global explanations that are easier to read, but potentially hiding other features' contributions and inter-dependences. Technically, KernelSHAP gives access to hyper-parameters to balance between execution time and explanation precision, but they are less accessible than Spearman's and LIME's parameters. Indeed, without extensive KernelSHAP knowledge or a careful documentation readout, users can easily miss these parameters.

Method name                     Advantages                                 Drawbacks
Coalitional based               Consider feature interdependence           Global explanations can be hard to read
  - Complete                    Exact Shapley values                       Slow in high dimension
  - Spearman                    Parameter α to control the level
                                of approximation
LIME                            Fast in high dimension;                    Slow in low dimension; low quality explanations;
                                parameters to control approximation        tends to miss non-linear and non-monotonic influences
SHAP                            Easy to interpret global explanations      Approximations may be imprecise
  - KernelSHAP                                                             Slow in high dimension
  - TreeSHAP, TreeSHAPapprox    Very fast in low and high dimensions       Tree-based models specific
Table 2: Summary table of advantages and drawbacks of each method

We use all the results presented in this paper to draw a simplified roadmap, in the form of a decision tree, in Figure 11, with the intent of helping readers find the most suitable explanation method according to their datasets and objectives.

Figure 11: Roadmap for the most appropriate use of methods

In this figure, high dimension refers to the number of features present in the studied dataset. Indeed, there is no "hard" cut-off defining when a dataset goes from low to high dimension, but with our experiments we can place this cut-off somewhere between 11 and 15 features, depending on the dataset complexity and on the computational time and material available to the user. "Accurate tree-based model" represents the ability to train a satisfactory (as defined by the user's objectives) tree-based model on the dataset. The model can then be explained thanks to the optimization done in TreeSHAP. If the desired model is not tree-based, we advise the user to look at KernelSHAP's and LIME's parameters to reduce the number of background samples and perturbation samples respectively, until the explanations are computed in a reasonable time. However, we warn the user about the loss of precision induced by such method approximations. Finally, we show that SHAP and LIME can make important approximations in some cases, and that coalitional-based methods cannot be executed in a reasonable time in high dimension. This leaves an empty space for precise explanations in high dimension that is, to our knowledge, not yet addressed.
6 CONCLUSION AND PERSPECTIVES
In this paper we performed a practical analysis of several local explainability methods for tabular data. Our findings indicate that there is not a single method that is the most appropriate for every usage. Such usages include the need for high precision in local explanations or, on the contrary, the need for explanations that can be aggregated to produce a better and clearer global understanding, while taking into account the complexity level of the data, especially in the high-dimension case. Therefore, this thorough analysis allowed us to identify the strengths and limitations of each method, along with practical recommendations on which method is most suitable for the user's use case. The Complete method is of course the most accurate, but suffers from a very long computational time. Nevertheless, coalitional-based methods allow an acceptable computational time while maintaining a strong precision of explanations. On the contrary, LIME and SHAP methods offer a more intelligible global view of feature effects. The greatest problem arises when high dimension (i.e., a high number of features) is involved, as is often the case in statistics and Machine Learning. In this case, the exponential complexity of coalitional-based methods makes them too long to compute. Indeed, the worst-case scenario is the need for high-precision local explanations in high dimension, since there is a clear lack of methods addressing this problem in the current literature. However, it is still possible to have local explanations of limited quality in high dimension, with the level of quality mostly depending on the time available for the user to generate such explanations. It is thus a very interesting future axis of work to benchmark the performance, in terms of precision of local explanations, of every local explainability method in a high-dimension context under the constraint of a time limit. This would add value to our recommendations by filling out the 'high-precision in high-dimension' gap identified in our study. It would also be interesting to look into other machine learning models, especially deep neural networks, which are more and more used. The very high complexity of this type of model hints at a different behavior of the explanation methods, but also at an increase in computation time.

ACKNOWLEDGMENTS
This study has been partially supported through the grant EUR CARe N°ANR-18-EURE-0003 in the framework of the Programme des Investissements d'Avenir. We also thank the French National Association for Research and Technology (ANRT) and the Kaduceo company for providing us with PhD grants (no. 2020/0964).
REFERENCES
[1] Julia Amann, Alessandro Blasimme, Effy Vayena, Dietmar Frey, and Vince I. Madai. 2020. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Medical Informatics and Decision Making 20 (Nov. 2020), 310. https://doi.org/10.1186/s12911-020-01332-6
[2] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. 2015. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLOS ONE 10, 7 (2015). https://doi.org/10.1371/journal.pone.0130140
[3] Nadia Burkart and Marco F. Huber. 2021. A Survey on the Explainability of Supervised Machine Learning. J. Artif. Int. Res. 70 (May 2021), 245–317. https://doi.org/10.1613/jair.1.12228
[4] Anupam Datta, Shayak Sen, and Yair Zick. 2016. Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems. In 2016 IEEE Symposium on Security and Privacy (SP). 598–617. https://doi.org/10.1109/SP.2016.42
[5] William K Diprose, Nicholas Buist, Ning Hua, Quentin Thurier, George Shand, and Reece Robinson. 2020. Physician understanding, explainability, and trust in a hypothetical machine learning risk calculator. Journal of the American Medical Informatics Association: JAMIA 27, 4 (Feb. 2020), 592–600. https://doi.org/10.1093/jamia/ocz229
[6] Radwa El Shawi, Youssef Sherif, Mouaz Al-Mallah, and Sherif Sakr. 2019. Interpretability in HealthCare: A Comparative Study of Local Machine Learning Interpretability Techniques. In 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS). 275–280. https://doi.org/10.1109/CBMS.2019.00065
[7] Radwa ElShawi, Youssef Sherif, Mouaz Al-Mallah, and Sherif Sakr. 2020. Interpretability in healthcare: A comparative study of local machine learning interpretability techniques. Computational Intelligence (2020).
[8] Gabriel Ferrettini, Elodie Escriva, Julien Aligon, Jean-Baptiste Excoffier, and Chantal Soulé-Dupuy. 2021. Coalitional Strategies for Efficient Individual Prediction Explanation. Information Systems Frontiers (2021). https://doi.org/10.1007/s10796-021-10141-9
[9] D. Garreau and U. von Luxburg. 2020. Explaining the Explainer: A First Theoretical Analysis of LIME. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) (Proceedings of Machine Learning Research, Vol. 108). PMLR, 1287–1296. http://proceedings.mlr.press/v108/garreau20a.html
[10] I. Elizabeth Kumar, Suresh Venkatasubramanian, Carlos Scheidegger, and Sorelle Friedler. 2020. Problems with Shapley-value-based explanations as feature importance measures. In International Conference on Machine Learning. PMLR, 5491–5500.
[11] Thibault Laugel, Xavier Renard, Marie-Jeanne Lesot, Christophe Marsala, and Marcin Detyniecki. 2018. Defining Locality for Surrogates in Post-hoc Interpretability. Workshop on Human Interpretability for Machine Learning (WHI) - International Conference on Machine Learning (ICML) (2018). https://hal.sorbonne-universite.fr/hal-01905924
[12] Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. 2021. Explainable AI: A review of machine learning interpretability methods. Entropy 23, 1 (2021), 18.
[13] Stan Lipovetsky and Michael Conklin. 2001. Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry 17, 4 (2001), 319–330. https://doi.org/10.1002/asmb.446
[14] Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. 2018. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888 (2018).
[15] Scott M. Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 4765–4774. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
[16] Xin Man and Ernest P. Chan. 2021. The Best Way to Select Features? Comparing MDA, LIME, and SHAP. The Journal of Financial Data Science 3, 1 (2021), 127–139. https://doi.org/10.3905/jfds.2020.1.047
[17] Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267 (2019), 1–38. https://doi.org/10.1016/j.artint.2018.07.007
[18] Christoph Molnar. 2018. A guide for making black box models explainable. https://christophm.github.io/interpretable-ml-book/
[19] Marco Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. 97–101. https://doi.org/10.18653/v1/N16-3020
[20] Rossouw, du Plessis, Benade, Jordaan, Kotze, Jooste, and Ferreira. 1983. Coronary risk factor screening in three rural communities - the CORIS baseline study. South African Medical Journal 64, 12 (1983), 430–436.
[21] Lloyd S. Shapley. 2016. 17. A value for n-person games. Princeton University Press.
[22] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning Important Features Through Propagating Activation Differences. In International Conference on Machine Learning. PMLR, 3145–3153. http://proceedings.mlr.press/v70/shrikumar17a.html
[23] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2020. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (2020).
[24] Erik Štrumbelj and Igor Kononenko. 2008. Towards a model independent method for explaining classification for individual instances. In International Conference on Data Warehousing and Knowledge Discovery. Springer, 273–282.
[25] Erik Štrumbelj and Igor Kononenko. 2010. An Efficient Explanation of Individual Classifications Using Game Theory. J. Mach. Learn. Res. 11 (March 2010), 1–18. JMLR.org.
[26] Guy Van den Broeck, Anton Lykov, Maximilian Schleich, and Dan Suciu. 2021. On the tractability of SHAP explanations. In Proceedings of AAAI.
[27] Giulia Vilone and Luca Longo. 2021. Notions of explainability and evaluation approaches for explainable artificial intelligence. Information Fusion 76 (2021), 89–106. https://doi.org/10.1016/j.inffus.2021.05.009
[28] Erik Štrumbelj and Igor Kononenko. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41, 3 (2014), 647–665. https://doi.org/10.1007/s10115-013-0679-x