=Paper=
{{Paper
|id=Vol-2699/paper03
|storemode=property
|title=OptiLIME: Optimized LIME Explanations for Diagnostic Computer Algorithms
|pdfUrl=https://ceur-ws.org/Vol-2699/paper03.pdf
|volume=Vol-2699
|authors=Giorgio Visani,Enrico Bagli,Federico Chesani
|dblpUrl=https://dblp.org/rec/conf/cikm/VisaniBC20
}}
==OptiLIME: Optimized LIME Explanations for Diagnostic Computer Algorithms==
OptiLIME: Optimized LIME Explanations for Diagnostic Computer Algorithms

Giorgio Visani (a,b), Enrico Bagli (b) and Federico Chesani (a)

(a) University of Bologna, School of Informatics & Engineering, viale Risorgimento 2, 40136 Bologna (BO), Italy
(b) CRIF S.p.A., via Mario Fantin 1-3, 40131 Bologna (BO), Italy

Proceedings of the CIKM 2020 Workshops, October 19-20, 2020, Galway, Ireland
Email: giorgio.visani2@unibo.it (G. Visani)
ORCID: 0000-0001-6818-3526 (G. Visani); 0000-0003-3913-7701 (E. Bagli); 0000-0003-1664-9632 (F. Chesani)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Local Interpretable Model-Agnostic Explanations (LIME) is a popular method to perform interpretability of any kind of Machine Learning (ML) model. It explains one ML prediction at a time, by learning a simple linear model around the prediction. The model is trained on randomly generated data points, sampled from the training dataset distribution and weighted according to the distance from the reference point, the one being explained by LIME. Feature selection is applied to keep only the most important variables, and their coefficients are regarded as the explanation. LIME is widespread across different domains, although its instability - a single prediction may obtain different explanations - is one of its major shortcomings. The instability is due to the randomness in the sampling step, and it determines a lack of reliability in the retrieved explanations, making LIME adoption problematic. In Medicine especially, clinical professionals' trust is mandatory to determine the acceptance of an explainable algorithm, considering the importance of the decisions at stake and the related legal issues. In this paper, we highlight a trade-off between an explanation's stability and its adherence, namely how much it resembles the ML model. Exploiting this finding, we propose a framework to maximise stability while retaining a predefined level of adherence. OptiLIME provides the freedom to choose the best adherence-stability trade-off level and, more importantly, it clearly highlights the mathematical properties of the retrieved explanation. As a result, the practitioner is provided with tools to decide whether the explanation is reliable, according to the problem at hand. We extensively test OptiLIME on a toy dataset - to present the geometrical findings visually - and on a medical dataset. On the latter, we show how the method comes up with meaningful explanations both from a medical and a mathematical standpoint.

Keywords: Explainable AI (XAI), Interpretable Machine Learning, Explanation, Model Agnostic, LIME, Healthcare, Stability

1. Introduction

Nowadays Machine Learning (ML) is pervasive and widespread across multiple domains. Medicine is no exception; on the contrary, it is considered one of the greatest challenges of Artificial Intelligence [1]. The idea of exploiting computers to provide assistance to medical personnel is not new: a historical overview on the topic, starting from the early '60s, is provided in [2]. More recently, computer algorithms have proven useful for patient and medical concept representation [3], outcome prediction [4],[5],[6] and new phenotype discovery [7],[8]. An accurate overview of ML successes in health-related environments is provided by Topol in [9].

Unfortunately, ML methods are hardly perfect and, especially in the medical field where human lives are at stake, Explainable Artificial Intelligence (XAI) is urgently needed [10]. Medical education, research and accountability ("who is accountable for wrong decisions?") are some of the main topics XAI tries to address. To achieve explainability, quite a few techniques have been proposed in the recent literature. These approaches can be grouped according to different criteria [11],[12], such as i) model agnostic or model specific, ii) local, global or example based, iii) intrinsic or post-hoc, iv) perturbation or saliency based. Among them, model agnostic approaches are quite popular in practice, since the algorithm is designed to be effective on any type of ML model.

LIME [13] is a well-known instance-based, model agnostic algorithm. The method generates data points, sampled from the training dataset distribution and weighted according to their distance from the instance being explained. Feature selection is applied to keep only the most important variables, and a linear model is trained on the weighted dataset. The model coefficients are regarded as the explanation. LIME has already been employed several times in medicine, such as on Intensive Care data [14] and cancer data [15],[16]. The technique is known to suffer from instability, mainly caused by the randomness introduced in the sampling step. Stability is a desirable property for an interpretable model, whereas the lack of it reduces trust in the retrieved explanations, especially in the medical field.

In our contribution, we review the geometrical idea LIME is based upon. Relying on statistical theory and simulations, we highlight a trade-off between the explanation's stability and adherence, namely how much LIME's simple model resembles the ML model. Exploiting this finding, we propose OptiLIME: a framework to maximise stability while retaining a predefined level of adherence. OptiLIME provides both i) the freedom to choose the best adherence-stability trade-off level and ii) a clear view of the mathematical properties of the retrieved explanation. As a result, the practitioner is provided with tools to decide whether each explanation is reliable, according to the problem at hand.

We test the validity of the framework on a medical dataset, where the method comes up with meaningful explanations both from a medical and a mathematical standpoint. In addition, a toy dataset is employed to present the geometrical findings visually.

The code used for the experiments is available at https://github.com/giorgiovisani/LIME_stability.
2. Related Work

For the sake of brevity, in the following review we consider only model agnostic techniques, which are effective on any kind of ML model by construction.

A popular approach is to exclude a certain feature, or group of features, from the model and evaluate the loss incurred in terms of model goodness. Such a value quantifies the importance of the excluded feature: a high loss value indicates an important variable for the prediction task. The idea was first introduced by Breiman [17] for the Random Forest model and has been generalised to a model-agnostic framework, named LOCO [18]. Based on variable exclusion, the predictive power of ML models has been decomposed into single-variable contributions in PDP [19], ICE [20] and ALE [21] plots, each relying on different assumptions about the ML model. The same idea is exploited for local explanations in SHAP [22], where the decomposition is obtained through a game-theoretic setting.

Another common approach is to train a surrogate model mimicking the behaviour of the ML model. In this vein, approximations on the entire input space are provided in [23] and [24] among others, while LIME [13] and its extension using decision rules [25] rely on this technique to provide local approximations.

2.1. LIME Framework

Here we examine LIME from a geometrical perspective; a detailed algorithmic description can be found in [13]. We may consider the ML model as a multivariate surface in the ℝ^(d+1) space spanned by the d independent variables X_1, ..., X_d and the dependent variable Y.

LIME's objective is to find the tangent plane to the ML surface at the point we want to explain. This task is analytically unfeasible, since we do not have a parametric formulation of the function; besides, the ML surface may have a huge number of discontinuity points, preventing the existence of a proper derivative and tangent. To find an approximation of the tangent, LIME uses a Ridge linear model to fit points on the ML surface in the neighbourhood of the reference individual. Points all over the ℝ^d space are generated, sampling the X values from a Normal distribution inferred from the training set. The Y coordinate values are obtained from ML predictions, so that the generated points are guaranteed to lie perfectly on the ML surface. The concept of neighbourhood is introduced using a kernel function (RBF kernel), which smoothly assigns higher weights to points closer to the reference. A Ridge model is trained on the generated dataset, each point weighted by the kernel function, to estimate the linear relationship E(Y) = α + Σ_{j=1}^d β_j X_j. The β coefficients are regarded as the LIME explanation.
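To make the procedure concrete, the following minimal sketch reproduces the steps described above (Gaussian sampling, RBF weighting, weighted Ridge fit) for a generic black-box regressor. It is an illustrative re-implementation, not the official LIME code; the names `lime_explanation` and `predict_fn` are our own, and the kernel width handling is simplified.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explanation(predict_fn, X_train, x_ref, kernel_width,
                     n_samples=5000, seed=None):
    """Minimal LIME-style local surrogate: sample, weight, fit weighted Ridge."""
    rng = np.random.default_rng(seed)
    # Sample points from a Normal distribution inferred from the training set
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    X_sampled = rng.normal(mu, sigma, size=(n_samples, X_train.shape[1]))
    # Y coordinates come from the ML model, so the points lie on the ML surface
    y_sampled = predict_fn(X_sampled)
    # RBF kernel weights: points closer to the reference get weights nearer 1
    sq_dist = ((X_sampled - x_ref) ** 2).sum(axis=1)
    weights = np.exp(-sq_dist / kernel_width)
    # Weighted Ridge fit; the coefficients are the explanation
    model = Ridge(alpha=1.0).fit(X_sampled, y_sampled, sample_weight=weights)
    return model.intercept_, model.coef_
```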
2.2. LIME Instability

One of the main issues of LIME is the lack of stability. Explanations derived from repeated LIME calls, under the same conditions, are considered stable when statistically equal [26]. In [27] the authors provide insight into LIME's lack of robustness, a notion similar to the above-mentioned stability; analogous findings appear in [28]. Often, practitioners are either not aware of this drawback or distrust the method because of its unreliability. By all means, unambiguous explanations are a key desideratum for interpretable frameworks.

The major source of LIME instability is the sampling step, in which new observations are randomly selected. Some approaches, grouped into two high-level concepts, have recently been laid out to solve the stability issue.

Avoid the sampling step. In [29] the authors propose to bypass the sampling step by using the training units only, combined with Hierarchical Clustering and K-Nearest Neighbour techniques. Although this method achieves stability, it may find a bad approximation of the ML function in regions with only few training points.

Evaluate the post-hoc stability. The shared idea is to repeat the LIME method under the same conditions and test whether the results are equivalent. Among the various propositions on how to conduct the test, in [30] the authors compare the standard deviations of the Ridge coefficients, whereas [31] examines the stability of the feature selection step, i.e. whether the selected variables are the same. In [26] two complementary indices have been developed, based on the statistical comparison of the Ridge models generated by repeated LIME calls: the Variables Stability Index (VSI) checks the stability of the feature selection step, whereas the Coefficients Stability Index (CSI) asserts the equality of the coefficients attributed to the same feature.
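As a crude, hedged illustration of the post-hoc idea (far simpler than the CSI/VSI statistical tests of [26]), one can repeat the surrogate fit under identical conditions and inspect the spread of the coefficients; `lime_explanation` is the hypothetical helper sketched above.

```python
import numpy as np

def coefficient_spread(explain_once, n_repeats=20):
    """Repeat a LIME-style call and report the per-feature mean and standard
    deviation of the coefficients; large deviations signal unstable
    explanations (in the spirit of, not equivalent to, the CSI index)."""
    coefs = np.array([explain_once() for _ in range(n_repeats)])
    return coefs.mean(axis=0), coefs.std(axis=0)

# Usage (hypothetical): coefficient_spread(
#     lambda: lime_explanation(predict_fn, X_train, x_ref, kernel_width=0.5)[1])
```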
3. Methodology

OptiLIME is a framework to guarantee the highest reachable level of stability, constrained to finding a relevant local explanation. From a geometrical perspective, the relevance of the explanation corresponds to the adherence of the linear plane to the ML surface. To evaluate stability we rely on the CSI and VSI indices [26], while adherence is assessed using the R² statistic, which measures the goodness of the linear approximation through a set of points [32]. All the figures of merit above span the range [0, 1], where higher values define, respectively, higher stability and higher adherence. To fully explain the rationale of the proposition, we first cover three important concepts about LIME. In this section we employ a toy dataset to show our theoretical findings.

Toy Dataset

The dataset is generated from the Data Generating Process

Y = sin(X) · X + 10.

100 distinct points have been generated uniformly in the X range [0, 10], and only 20 of them were kept, at random. In Figure 1, the blue line represents the true DGP function, whereas the green one is its best approximation using a Polynomial Regression of degree 5 on the generated dataset (blue points). In the following we will regard the polynomial as our ML function; we will not make use of the true DGP function (blue line), which is usually not available in practical data mining scenarios. The red dot is the reference point at which we will evaluate the local LIME explanation. The dataset is intentionally one dimensional, so that the geometrical ideas about LIME may be well represented in a 2D plot.

[Figure 1: Toy Dataset]

3.1. Kernel Width defines locality

Locality is enforced through a kernel function; the default is the RBF kernel (Formula 1). It is applied to each point x^(i) generated in the sampling step, obtaining an individual weight. The formulation provides smooth weights in the range [0, 1] and flexibility through the kernel width parameter kw:

RBF(x^(i)) = exp( −‖x^(i) − x^(ref)‖² / kw )    (1)

The RBF's flexibility makes it suitable for each situation, although it requires proper tuning: setting a high kw value results in considering a neighbourhood of large size, while shrinking kw shrinks the width of the neighbourhood.

In Figure 2, the LIME-generated points are displayed as green dots and the corresponding LIME explanations (red lines) are shown. The points are scattered all over the ML function, but their size is proportional to the weight assigned by the RBF kernel. Small kernel widths assign significant weights only to the closest points, making the further ones almost invisible; in this way, they do not contribute to the local linear model. The concept of locality is crucial to LIME: a neighbourhood too large may cause the LIME model not to be adherent to the ML function in the considered neighbourhood.

[Figure 2: LIME explanations for different kernel widths]
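A minimal sketch of the toy setup follows, under our own assumptions about details the paper leaves unstated (the random seed, the polynomial fit via numpy.polyfit, and the position of the reference point):

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, not specified in the paper

# Toy DGP: Y = sin(X) * X + 10; 100 uniform points on [0, 10], 20 kept at random
X_all = rng.uniform(0, 10, 100)
X = rng.choice(X_all, size=20, replace=False)
y = np.sin(X) * X + 10

# A degree-5 polynomial regression plays the role of the ML function
ml_function = np.poly1d(np.polyfit(X, y, deg=5))

# Formula 1: RBF kernel weight of a sampled point w.r.t. the reference
def rbf_weight(x_i, x_ref, kw):
    return np.exp(-np.sum((x_i - x_ref) ** 2) / kw)

x_ref = 5.0  # hypothetical reference point; the paper's red dot is not specified
```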
3.2. Ridge penalty is harmful to LIME

In statistics, data are assumed to be generated from a Data Generating Process (DGP) combined with a source of white noise, so that the standard formulation of the problem is Y = f(X) + ε, where ε ∼ N(0, σ²). The aim of each statistical model is to retrieve the best specification of the DGP function f(X), given the noisy dataset.

Ridge Regression [33] assumes a linear DGP, namely f(X) = α + Σ_{j=1}^d β_j X_j, and applies a penalty proportional to the norm of the β coefficients, enforced during the estimation process through the penalty parameter λ. This technique is useful when dealing with very noisy datasets (where the stochastic component exhibits high variance σ²) [34]: the noise makes various sets of coefficients appear as viable solutions, whereas tuning λ to its proper value allows Ridge to retrieve a unique solution.

In the LIME setting, the ML function acts as the DGP, while the sampled points are the dataset. Recalling that the Y coordinate of each point is given by the ML prediction, the points are guaranteed to lie exactly on the ML surface by construction. Hence, no noise is present in our dataset. For this reason, we argue that the Ridge penalty is not needed; on the contrary, it can be harmful and distort the estimates of the parameters, as shown in Figure 3.

[Figure 3: Effects of Ridge Penalty on LIME explanations; panel (a): Ridge Penalty = 0, panel (b): Ridge Penalty = 1]

In panel 3b, the Ridge penalty λ = 1 (the LIME default) is employed, whereas in panel 3a no penalty (λ = 0) is imposed. The estimation gets severely distorted by the penalty, as proven also by the R² values. This happens especially for small kernel width values: each unit has a very small weight, so the weighted residuals become almost irrelevant in the Ridge loss, which is then dominated by the penalty term. To minimise the penalty term, the coefficients are shrunk towards 0.
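The distortion is easy to reproduce with the sketch below, which continues the toy setup above (`ml_function` and `x_ref` are the hypothetical names introduced earlier; the sampling distribution and kernel width are our own choices). With a tiny kernel width, the λ = 1 fit collapses towards zero slope, while the unpenalised fit still tracks the local tangent.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# LIME-style samples on the toy ML surface (no noise by construction)
X_s = rng.normal(5.0, 3.0, size=(500, 1))          # assumed sampling distribution
y_s = ml_function(X_s.ravel())
w = np.exp(-((X_s.ravel() - x_ref) ** 2) / 0.05)   # tiny kernel width

# With lambda = 1 (LIME default), the penalty dominates the tiny weighted
# residuals and the slope collapses towards 0; lambda = 0 tracks the tangent.
for alpha in (0.0, 1.0):
    fit = Ridge(alpha=alpha).fit(X_s, y_s, sample_weight=w)
    print(f"lambda={alpha}: slope={fit.coef_[0]:+.3f}")
```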
3.3. Relationship between Stability, Adherence and Kernel Width

Since the kernel width is the main hyper-parameter of LIME, we wish to understand how stability and adherence vary with respect to it. From the theory, we have a few helpful results:

• Taylor's Theorem [32] gives a polynomial approximation of any differentiable function, calculated at a given point. If we truncate the formula at the first-degree polynomial, we obtain a linear function whose approximation error depends on the distance between the point where the error is evaluated and the given point. Thus, if we assume the ML function to be differentiable in the neighbourhood of x^(ref), the adherence of the linear model is expected to be inversely proportional to the width of the neighbourhood, i.e. to the kernel width, since the approximation error depends on the distance between the two points, namely the neighbourhood size.

• In Linear Regression, the standard deviation of the coefficients is inversely correlated with the standard deviation of the X variables [32]. The stability of the explanations depends on the spread of the X variables in our weighted dataset; we therefore expect the kernel width and stability to be directly proportional.

To illustrate the conjectures above, we run LIME for different kernel width values and evaluate both the R² and CSI metrics (VSI is not considered on the toy dataset, since only one variable is present). Figure 4 shows the results of this experiment for the reference unit. Both adherence and stability are noisy functions of the kernel width: they contain some stochasticity, due to the different datasets generated by each LIME call. Despite this, a clear pattern is detectable: monotonically increasing for the CSI index and monotonically decreasing for the R² statistic.

[Figure 4: Relationship among kernel width, R² and CSI]

For numerical evidence of these properties, we fit the logistic function [35], which retrieves the best monotonous approximation to a set of points. The goodness of the logistic approximation is confirmed by a low value of the Mean Absolute Error (MAE). To corroborate our assumption, the same process has been repeated on all the units of the toy dataset, obtaining an average MAE of 0.005 for the R² approximation and of 0.026 for the CSI. The logistic growth rate has also been inspected: the highest R² growth rate is -10.78 and the lowest CSI growth rate is 7.20. These results support the monotonous relationships of adherence and stability with the kernel width, respectively decreasing and increasing.

3.4. OptiLIME

Previously, we empirically showed that adherence and stability are monotonous noisy functions of the kernel width: for increasing kernel width we observe, on average, decreasing adherence and increasing stability. Our proposition is a framework which enables the best choice of trade-off between the stability and adherence of the explanations. OptiLIME sets a desired level of adherence and finds the largest kernel width matching the request. At the same time, this kernel width provides the highest stability value, constrained to the chosen level of adherence. In essence, OptiLIME is an automated way of finding the best kernel width; moreover, it empowers the practitioner to be in control of the trade-off between the two most important properties of LIME local explanations.

To retrieve the best width, OptiLIME converts the decreasing R² function into l(kw, R̃²) by means of Formula 2:

l(kw, R̃²) = R²(kw),            if R²(kw) ≤ R̃²
l(kw, R̃²) = 2R̃² − R²(kw),      if R²(kw) > R̃²      (2)

where R̃² is the requested adherence. For a fixed R̃², chosen by the practitioner, the function l(kw, R̃²) presents a global maximum. We are particularly interested in argmax_kw l(kw, R̃²), namely the best kernel width.

To solve the optimisation problem, Bayesian Optimization is employed, since it is the most suitable technique to find the global optimum of noisy functions [36]. The technique relies on two parameters to be set beforehand: p, the number of preliminary calls with random kw values, and m, the number of iterations of the refinement strategy. Increasing the parameters ensures a better kernel width value, at the cost of longer computation time.

In Figure 5, an application of OptiLIME to the reference unit of the toy dataset is presented. R̃² has been set to 0.9, with p = 20 and m = 40. The points in the plot represent the distinct evaluations performed by the Bayesian search in order to find the optimum. Comparing the plot with Figure 4, we observe the effect of Formula 2 on the left part of the R² and l(kw, R̃²) functions. In Figure 5 the search has converged to the maximum, evaluating various points close to the best kernel width. At the same time, the stochastic nature of the CSI function is evident: the several CSI measurements performed in the proximity of the kernel width value 0.3 show a certain variation. Nonetheless, the increasing CSI trend is still recognisable.

[Figure 5: OptiLIME search for the best kernel width]
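A sketch of the optimisation step follows, assuming scikit-optimize as the Bayesian search backend (the paper does not name a library); `lime_r2` is a stand-in for the adherence of one LIME call, and the search range for kw is our own assumption.

```python
import numpy as np
from skopt import gp_minimize  # assumed backend; the paper names no library

R2_TILDE = 0.9  # requested adherence, chosen by the practitioner

def lime_r2(kw):
    """Stand-in for the adherence of one LIME call at kernel width kw; in
    practice this would fit the local surrogate and return its R^2. Here: a
    decreasing noisy curve mimicking Figure 4 (pure illustration)."""
    return float(np.clip(1.0 / (1.0 + kw) + np.random.normal(0, 0.01), 0, 1))

def l(kw, r2_tilde=R2_TILDE):
    """Formula 2: fold the decreasing R^2 curve so it peaks where R^2 ~ r2_tilde."""
    r2 = lime_r2(kw)
    return r2 if r2 <= r2_tilde else 2 * r2_tilde - r2

# Maximise l by minimising its negative: p random calls, then m refinements
result = gp_minimize(lambda params: -l(params[0]),
                     dimensions=[(1e-3, 5.0)],   # assumed search range for kw
                     n_initial_points=10,        # p
                     n_calls=40)                 # total evaluations (p + m)
best_kw = result.x[0]
```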
4. Case Study

Dataset

To validate our methodology we use a well-known medical dataset: NHANES I. It has been employed for medical research [37],[38] as well as a benchmark to test explanation methods [39]. The original dataset is described in [40]. We use a reformatted version, released at http://github.com/suinleelab/treexplainer-study. It contains 79 features, based on clinical measurements of 14,407 individuals. The aim is to model the risk of death over twenty years of follow-up.

Diagnostic Algorithm

Following Lundberg's prescriptions [39], the dataset has been divided into a 64/16/20 split for train/validation/test. The features have been mean-imputed and standardized based on statistics computed on the training set. A Survival Gradient Boosting model has been trained using the XGBoost framework [41]. Its hyper-parameters have been optimized by coordinate descent, using the C-statistic [42] on the validation set as the figure of merit.

Explanations

We use the OptiLIME framework to achieve the optimal explanation of the XGBoost model on the dataset. We consider two randomly chosen individuals to show the results visually. In our simulation, we consider 0.9 a reasonable level of adherence. OptiLIME is employed to find the kernel width achieving an R² value close to 0.9 while maximizing the stability indices of the local explanation models.

The model prediction is the hazard ratio for each individual: a higher prediction means the individual is likely to survive a shorter time. Therefore, positive coefficients define risk factors, whereas protective factors have negative values. The interpretation of the LIME model is the same as for a Linear Regression model, but with the additional concept of locality. As an example, for the Age variable we distinguish a different impact based on the individual's characteristics: one more year for Unit 100 (increasing from 65 to 66 years) raises the death risk by 3.56 base points, while for Unit 7207 one year of ageing (from 49 to 50) increases the risk by just 0.79. Another example is the impact of Sex, which is more pronounced in elderly people: being female is a protective factor worth 1.49 points at age 49, whereas at age 65 being male has a much stronger impact, as a risk factor worth 3.04 points.

[Figure 6: NHANES individual explanations using OptiLIME; panel (a): best LIME explanation, Unit 100, panel (b): best LIME explanation, Unit 7207]

For Unit 100 in Figure 6a, the optimal kernel width is a bit higher compared with Unit 7207 in Figure 6b. This is probably caused by the ML model having a higher degree of non-linearity around the latter unit: to achieve the same adherence, we are forced to consider a smaller portion of the ML model, hence a smaller neighbourhood. A smaller kernel width also implies reduced stability, testified by the smaller values of the VSI and CSI indices. Whenever the practitioner desires more stable results, it is possible to re-run OptiLIME with a less strict requirement on adherence. It is important to remark that low degrees of adherence will make the explanations increasingly more global: the linear surface retrieved by LIME will amount to an average of many local non-linearities of the ML model.

The computation time largely depends on the Bayesian search, controlled by the parameters p and m. In our setting, p = 10 and m = 30 produce good results for both the units in Figure 6. On a laptop with 4 Intel i7 CPUs at 2.50 GHz, the OptiLIME evaluations for Unit 100 and Unit 7207 took 123 and 147 seconds respectively. For faster, but less accurate, results the Bayesian search parameters can be reduced.
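As an illustration of the diagnostic model setup, here is a minimal sketch of survival gradient boosting with XGBoost's Cox objective. The paper's exact hyper-parameters and data pipeline are not reproduced; the data below are synthetic placeholders standing in for the NHANES I features.

```python
import numpy as np
import xgboost as xgb

# Placeholder data; in XGBoost's Cox objective, the label is the follow-up
# time, negated for censored individuals.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = rng.uniform(1, 20, size=100) * rng.choice([1, -1], size=100)

params = {"objective": "survival:cox",  # the model outputs hazard ratios
          "eta": 0.1, "max_depth": 3}   # assumed values, not from the paper
booster = xgb.train(params, xgb.DMatrix(X_train, label=y_train),
                    num_boost_round=200)

# A higher predicted hazard ratio means a shorter expected survival time
hazard = booster.predict(xgb.DMatrix(X_train))
```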
In our framework may be improved to allow for a faster and setting, 𝑝 = 10 and 𝑚 = 30 produce good results for more precise computation. both the units in Figure 6. On a 4 Intel-i7 CPUs 2.50GHz laptop, the OptiLIME evaluation for Unit 100 and Unit 7207 took respectively Acknowledgments 123 and 147 seconds to compute. For faster, but less ac- We acknowledge financial support by CRIF S.p.A. and curate results, the Bayesian Search parameters can be Università degli Studi di Bologna. reduced. 5. Conclusions References [1] A. Holzinger, G. Langs, H. Denk, K. Zatloukal, In Medicine, diagnostic computer algorithms provid- H. Müller, Causability and explainability of ar- ing accurate predictions have countless benefits, no- tificial intelligence in medicine, Wiley Interdis- tably they may help in saving lives as well as reduc- ciplinary Reviews: Data Mining and Knowledge ing medical costs. However, precisely because of the Discovery 9 (2019) e1312. importance of these matters, the rationale of the de- [2] I. Kononenko, Machine learning for medical di- cisions must be clear and understandable. A plethora agnosis: History, state of the art and perspective, of techniques to explain the ML decisions has grown Artificial Intelligence in medicine 23 (2001) 89– in recent years, though there is no consensus on the 109. best in class, since each method presents some draw- [3] R. Miotto, L. Li, B. A. Kidd, J. T. Dudley, Deep pa- backs. Explainable models are required to be reliable, tient: An unsupervised representation to predict thus stability is regarded as a key desiderata. the future of patients from the electronic health We consider the LIME technique, whose major draw- records, Scientific reports 6 (2016) 1–10. back lies in the lack of stability. Moreover, it is difficult [4] E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, to tune properly its main parameter: different values J. Sun, Doctor ai: Predicting clinical events via of the kernel width provide substantially different ex- recurrent neural networks, in: Machine Learning planations. for Healthcare Conference, 2016, pp. 301–318. The main contribution of this paper consists in the [5] A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Ha- clear decomposition of the LIME framework in its rel- jaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun, evant components and the exhaustive analysis of each Scalable and accurate deep learning with elec- one, starting from the geometrical meaning through tronic health records, NPJ Digital Medicine 1 the empirical experiments to validate our intuitions. (2018) 18. We showed that Ridge penalty is not needed and LIME [6] B. Shickel, P. J. Tighe, A. Bihorac, P. Rashidi, works best with simple Linear Regression as explain- Deep EHR: A survey of recent advances in deep able model. In addition, smaller kernel width values learning techniques for electronic health record provide a more adherent LIME plane to the ML surface, (EHR) analysis, IEEE journal of biomedical and therefore a more realistic local explanation. Eventu- health informatics 22 (2017) 1589–1604. ally, the trade-off between the adherence and stabil- [7] Z. Che, D. Kale, W. Li, M. T. Bahadori, Y. Liu, Deep ity properties is extremely valuable since it empowers computational phenotyping, in: Proceedings of the practitioner to choose the best kernel width con- the 21th ACM SIGKDD International Conference tional expectation, Journal of Computational and on Knowledge Discovery and Data Mining, 2015, Graphical Statistics 24 (2015) 44–65. pp. 507–516. [21] D. 
[8] T. A. Lasko, J. C. Denny, M. A. Levy, Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data, PLoS ONE 8 (2013).
[9] E. J. Topol, High-performance medicine: The convergence of human and artificial intelligence, Nature Medicine 25 (2019) 44–56.
[10] A. Holzinger, From machine learning to explainable AI, in: 2018 World Symposium on Digital Intelligence for Systems and Machines (DISA), IEEE, 2018, pp. 55–66.
[11] C. Molnar, Interpretable Machine Learning, Lulu.com, 2020.
[12] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Computing Surveys (CSUR) 51 (2018) 93.
[13] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 1135–1144.
[14] G. J. Katuwal, R. Chen, Machine learning model interpretability for precision medicine, arXiv preprint arXiv:1610.09045 (2016).
[15] A. Y. Zhang, S. S. W. Lam, N. Liu, Y. Pang, L. L. Chan, P. H. Tang, Development of a radiology decision support system for the classification of MRI brain scans, in: 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT), IEEE, 2018, pp. 107–115.
[16] C. Moreira, R. Sindhgatta, C. Ouyang, P. Bruza, A. Wichert, An investigation of interpretability techniques for deep learning in predictive process analytics, arXiv preprint arXiv:2002.09192 (2020).
[17] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[18] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, L. Wasserman, Distribution-free predictive inference for regression, Journal of the American Statistical Association 113 (2018) 1094–1111.
[19] J. H. Friedman, Greedy function approximation: A gradient boosting machine, Annals of Statistics (2001) 1189–1232.
[20] A. Goldstein, A. Kapelner, J. Bleich, E. Pitkin, Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation, Journal of Computational and Graphical Statistics 24 (2015) 44–65.
[21] D. W. Apley, J. Zhu, Visualizing the effects of predictor variables in black box supervised learning models, arXiv preprint arXiv:1612.08468 (2016).
[22] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems, 2017, pp. 4765–4774.
[23] M. Craven, J. W. Shavlik, Extracting tree-structured representations of trained networks, in: Advances in Neural Information Processing Systems, 1996, pp. 24–30.
[24] Y. Zhou, G. Hooker, Interpreting models via single tree approximation, arXiv preprint arXiv:1610.09036 (2016).
[25] M. T. Ribeiro, S. Singh, C. Guestrin, Anchors: High-precision model-agnostic explanations, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[26] G. Visani, E. Bagli, F. Chesani, A. Poluzzi, D. Capuzzo, Statistical stability indices for LIME: Obtaining reliable explanations for machine learning models, arXiv preprint arXiv:2001.11757 (2020).
[27] D. Alvarez-Melis, T. S. Jaakkola, On the robustness of interpretability methods, arXiv preprint arXiv:1806.08049 (2018).
[28] A. Gosiewska, P. Biecek, iBreakDown: Uncertainty of model explanations for non-additive predictive models, arXiv preprint arXiv:1903.11420 (2019).
[29] M. R. Zafar, N. M. Khan, DLIME: A deterministic local interpretable model-agnostic explanations approach for computer-aided diagnosis systems, arXiv preprint arXiv:1906.10263 (2019).
[30] S. M. Shankaranarayana, D. Runje, ALIME: Autoencoder based approach for local interpretability, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2019, pp. 454–463.
[31] C. Molnar, Limitations of Interpretable Machine Learning Methods, 2020.
[32] W. H. Greene, Econometric Analysis, Pearson Education India, 2003.
[33] A. E. Hoerl, R. W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67. doi:10.1080/00401706.1970.10488634.
[34] W. N. van Wieringen, Lecture notes on ridge regression, arXiv preprint arXiv:1509.09169 (2019).
[35] P.-F. Verhulst, Correspondance mathématique et physique, Ghent and Brussels 10 (1838) 113.
[36] B. Letham, B. Karrer, G. Ottoni, E. Bakshy, Constrained Bayesian optimization with noisy experiments, Bayesian Analysis 14 (2019) 495–519.
[37] J. Fang, M. H. Alderman, Serum uric acid and cardiovascular mortality: The NHANES I epidemiologic follow-up study, 1971–1992, JAMA 283 (2000) 2404–2410.
[38] L. J. Launer, T. Harris, C. Rumpel, J. Madans, Body mass index, weight change, and risk of mobility disability in middle-aged and older women: The epidemiologic follow-up study of NHANES I, JAMA 271 (1994) 1093–1098.
[39] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, S.-I. Lee, From local explanations to global understanding with explainable AI for trees, Nature Machine Intelligence 2 (2020) 2522–5839.
[40] C. S. Cox, Plan and Operation of the NHANES I Epidemiologic Followup Study, 1987, US Department of Health and Human Services, Public Health Service, Centers …, 1992.
[41] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[42] P. J. Heagerty, T. Lumley, M. S. Pepe, Time-dependent ROC curves for censored survival data and a diagnostic marker, Biometrics 56 (2000) 337–344.