=Paper=
{{Paper
|id=Vol-2699/paper03
|storemode=property
|title=OptiLIME: Optimized LIME Explanations for Diagnostic Computer Algorithms
|pdfUrl=https://ceur-ws.org/Vol-2699/paper03.pdf
|volume=Vol-2699
|authors=Giorgio Visani,Enrico Bagli,Federico Chesani
|dblpUrl=https://dblp.org/rec/conf/cikm/VisaniBC20
}}
==OptiLIME: Optimized LIME Explanations for Diagnostic Computer Algorithms==
OptiLIME: Optimized LIME Explanations for Diagnostic Computer Algorithms

Giorgio Visani (a,b), Enrico Bagli (b) and Federico Chesani (a)

(a) University of Bologna, School of Informatics & Engineering, viale Risorgimento 2, 40136 Bologna (BO), Italy
(b) CRIF S.p.A., via Mario Fantin 1-3, 40131 Bologna (BO), Italy

Proceedings of the CIKM 2020 Workshops, October 19-20, 2020, Galway, Ireland
Email: giorgio.visani2@unibo.it (G. Visani)
ORCID: 0000-0001-6818-3526 (G. Visani); 0000-0003-3913-7701 (E. Bagli); 0000-0003-1664-9632 (F. Chesani)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Local Interpretable Model-Agnostic Explanations (LIME) is a popular method to perform interpretability of any kind of Machine Learning (ML) model. It explains one ML prediction at a time, by learning a simple linear model around the prediction. The model is trained on randomly generated data points, sampled from the training dataset distribution and weighted according to the distance from the reference point, the one being explained by LIME. Feature selection is applied to keep only the most important variables, and their coefficients are regarded as the explanation. LIME is widespread across different domains, although its instability - a single prediction may obtain different explanations - is one of its major shortcomings. The instability is due to the randomness in the sampling step, and it determines a lack of reliability in the retrieved explanations, making LIME adoption problematic. In Medicine especially, clinical professionals' trust is mandatory to determine the acceptance of an explainable algorithm, considering the importance of the decisions at stake and the related legal issues. In this paper, we highlight a trade-off between an explanation's stability and its adherence, namely how much it resembles the ML model. Exploiting this finding, we propose a framework to maximise stability while retaining a predefined level of adherence. OptiLIME provides the freedom to choose the best adherence-stability trade-off level and, more importantly, it clearly highlights the mathematical properties of the retrieved explanation. As a result, the practitioner is provided with tools to decide whether the explanation is reliable, according to the problem at hand. We extensively test OptiLIME on a toy dataset - to present the geometrical findings visually - and on a medical dataset. On the latter, we show how the method comes up with meaningful explanations both from a medical and a mathematical standpoint.

Keywords: Explainable AI (XAI), Interpretable Machine Learning, Explanation, Model Agnostic, LIME, Healthcare, Stability

1. Introduction

Nowadays Machine Learning (ML) is pervasive and widespread across multiple domains. Medicine is no exception; on the contrary, it is considered one of the greatest challenges of Artificial Intelligence [1]. The idea of exploiting computers to provide assistance to medical personnel is not new: a historical overview on the topic, starting from the early '60s, is provided in [2]. More recently, computer algorithms have proven useful for patient and medical concept representation [3], outcome prediction [4],[5],[6] and new phenotype discovery [7],[8]. An accurate overview of ML successes in health-related environments is provided by Topol in [9].

Unfortunately, ML methods are hardly perfect and, especially in the medical field where human lives are at stake, Explainable Artificial Intelligence (XAI) is urgently needed [10]. Medical education, research and accountability ("who is accountable for wrong decisions?") are some of the main topics XAI tries to address. To achieve explainability, quite a few techniques have been proposed in the recent literature. These approaches can be grouped according to different criteria [11],[12], such as i) model agnostic or model specific, ii) local, global or example based, iii) intrinsic or post-hoc, iv) perturbation or saliency based. Among them, model agnostic approaches are quite popular in practice, since the algorithm is designed to be effective on any type of ML model.

LIME [13] is a well-known instance-based, model agnostic algorithm. The method generates data points, sampled from the training dataset distribution and weighted according to their distance from the instance being explained. Feature selection is applied to keep only the most important variables, and a linear model is trained on the weighted dataset. The model coefficients are regarded as the explanation. LIME has already been employed several times in medicine, such as on Intensive Care data [14] and cancer data [15],[16]. The technique is known to suffer from instability, mainly caused by the randomness introduced in the sampling step. Stability is a desirable property for an interpretable model, whereas the lack of it reduces trust in the retrieved explanations, especially in the medical field.

In our contribution, we review the geometrical idea LIME is based upon. Relying on statistical theory and simulations, we highlight a trade-off between the explanation's stability and adherence, namely how much LIME's simple model resembles the ML model. Exploiting this finding, we propose OptiLIME: a framework to maximise stability while retaining a predefined level of adherence. OptiLIME provides both i) the freedom to choose the best adherence-stability trade-off level and ii) a clear view of the mathematical properties of the retrieved explanation. As a result, the practitioner is provided with tools to decide whether each explanation is reliable, according to the problem at hand.

We test the validity of the framework on a medical dataset, where the method comes up with meaningful explanations both from a medical and a mathematical standpoint. In addition, a toy dataset is employed to present the geometrical findings visually.

The code used for the experiments is available at https://github.com/giorgiovisani/LIME_stability.
2. Related Work

For the sake of brevity, in the following review we consider only model agnostic techniques, which are effective on any kind of ML model by construction.

A popular approach is to exclude a certain feature, or group of features, from the model and evaluate the loss incurred in terms of model goodness. Such a value quantifies the importance of the excluded feature: a high loss value indicates an important variable for the prediction task. The idea was first introduced by Breiman [17] for the Random Forest model and has been generalised to a model-agnostic framework, named LOCO [18]. Based on variable exclusion, the predictive power of ML models has been decomposed into single-variable contributions in PDP [19], ICE [20] and ALE [21] plots, each relying on different assumptions about the ML model. The same idea is exploited for local explanations in SHAP [22], where the decomposition is obtained through a game-theoretic setting.

Another common approach is to train a surrogate model mimicking the behaviour of the ML model. In this vein, approximations on the entire input space are provided in [23] and [24] among others, while LIME [13] and its extension using decision rules [25] rely on this technique to provide local approximations.

2.1. LIME Framework

Here we examine LIME from a geometrical perspective; a detailed algorithmic description can be found in [13]. We may consider the ML model as a multivariate surface in the ℝ^(d+1) space spanned by the d independent variables X_1, ..., X_d and the dependent variable Y.

LIME's objective is to find the tangent plane to the ML surface at the point we want to explain. This task is analytically unfeasible, since we do not have a parametric formulation of the function; besides, the ML surface may have a huge number of discontinuity points, preventing the existence of a proper derivative and tangent. To find an approximation of the tangent, LIME uses a Ridge linear model to fit points on the ML surface in the neighbourhood of the reference individual. Points all over the ℝ^d space are generated, sampling the X values from a Normal distribution inferred from the training set. The Y coordinate values are obtained from ML predictions, so that the generated points are guaranteed to lie perfectly on the ML surface. The concept of neighbourhood is introduced using a kernel function (RBF kernel), which smoothly assigns higher weights to points closer to the reference. A Ridge model is trained on the generated dataset, each point weighted by the kernel function, to estimate the linear relationship E(Y) = α + Σ_{j=1}^d β_j X_j. The β coefficients are regarded as the LIME explanation.
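To make the procedure concrete, the following minimal sketch reproduces the steps described above (Gaussian sampling, RBF weighting, weighted Ridge fit) for a generic black-box regressor. It is an illustrative re-implementation, not the official LIME code; the names `lime_explanation` and `predict_fn` are our own, and the kernel width handling is simplified.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explanation(predict_fn, X_train, x_ref, kernel_width,
                     n_samples=5000, seed=None):
    """Minimal LIME-style local surrogate: sample, weight, fit weighted Ridge."""
    rng = np.random.default_rng(seed)
    # Sample points from a Normal distribution inferred from the training set
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    X_sampled = rng.normal(mu, sigma, size=(n_samples, X_train.shape[1]))
    # Y coordinates come from the ML model, so the points lie on the ML surface
    y_sampled = predict_fn(X_sampled)
    # RBF kernel weights: points closer to the reference get weights nearer 1
    sq_dist = ((X_sampled - x_ref) ** 2).sum(axis=1)
    weights = np.exp(-sq_dist / kernel_width)
    # Weighted Ridge fit; the coefficients are the explanation
    model = Ridge(alpha=1.0).fit(X_sampled, y_sampled, sample_weight=weights)
    return model.intercept_, model.coef_
```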
2.2. LIME Instability

One of the main issues of LIME is the lack of stability. Explanations derived from repeated LIME calls, under the same conditions, are considered stable when statistically equal [26]. In [27] the authors provide insight into LIME's lack of robustness, a notion similar to the above-mentioned stability; analogous findings appear in [28]. Often, practitioners are either not aware of this drawback or distrust the method because of its unreliability. By all means, unambiguous explanations are a key desideratum for interpretable frameworks.

The major source of LIME instability is the sampling step, in which new observations are randomly selected. Some approaches, grouped into two high-level concepts, have recently been laid out to solve the stability issue.

Avoid the sampling step. In [29] the authors propose to bypass the sampling step by using the training units only, combined with Hierarchical Clustering and K-Nearest Neighbour techniques. Although this method achieves stability, it may find a bad approximation of the ML function in regions with only few training points.

Evaluate the post-hoc stability. The shared idea is to repeat the LIME method under the same conditions and test whether the results are equivalent. Among the various propositions on how to conduct the test, in [30] the authors compare the standard deviations of the Ridge coefficients, whereas [31] examines the stability of the feature selection step, i.e. whether the selected variables are the same. In [26] two complementary indices have been developed, based on the statistical comparison of the Ridge models generated by repeated LIME calls: the Variables Stability Index (VSI) checks the stability of the feature selection step, whereas the Coefficients Stability Index (CSI) asserts the equality of the coefficients attributed to the same feature.
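As a crude, hedged illustration of the post-hoc idea (far simpler than the CSI/VSI statistical tests of [26]), one can repeat the surrogate fit under identical conditions and inspect the spread of the coefficients; `lime_explanation` is the hypothetical helper sketched above.

```python
import numpy as np

def coefficient_spread(explain_once, n_repeats=20):
    """Repeat a LIME-style call and report the per-feature mean and standard
    deviation of the coefficients; large deviations signal unstable
    explanations (in the spirit of, not equivalent to, the CSI index)."""
    coefs = np.array([explain_once() for _ in range(n_repeats)])
    return coefs.mean(axis=0), coefs.std(axis=0)

# Usage (hypothetical): coefficient_spread(
#     lambda: lime_explanation(predict_fn, X_train, x_ref, kernel_width=0.5)[1])
```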
3. Methodology

OptiLIME is a framework to guarantee the highest reachable level of stability, constrained to finding a relevant local explanation. From a geometrical perspective, the relevance of the explanation corresponds to the adherence of the linear plane to the ML surface. To evaluate stability we rely on the CSI and VSI indices [26], while adherence is assessed using the R² statistic, which measures the goodness of the linear approximation through a set of points [32]. All the figures of merit above span the range [0, 1], where higher values define, respectively, higher stability and higher adherence. To fully explain the rationale of the proposition, we first cover three important concepts about LIME. In this section we employ a toy dataset to show our theoretical findings.

Toy Dataset

The dataset is generated from the Data Generating Process

Y = sin(X) · X + 10.

100 distinct points have been generated uniformly in the X range [0, 10], and only 20 of them were kept, at random. In Figure 1, the blue line represents the true DGP function, whereas the green one is its best approximation using a Polynomial Regression of degree 5 on the generated dataset (blue points). In the following we will regard the polynomial as our ML function; we will not make use of the true DGP function (blue line), which is usually not available in practical data mining scenarios. The red dot is the reference point at which we will evaluate the local LIME explanation. The dataset is intentionally one dimensional, so that the geometrical ideas about LIME may be well represented in a 2D plot.

[Figure 1: Toy Dataset]

3.1. Kernel Width defines locality

Locality is enforced through a kernel function; the default is the RBF kernel (Formula 1). It is applied to each point x^(i) generated in the sampling step, obtaining an individual weight. The formulation provides smooth weights in the range [0, 1] and flexibility through the kernel width parameter kw:

RBF(x^(i)) = exp( −‖x^(i) − x^(ref)‖² / kw )    (1)

The RBF's flexibility makes it suitable for each situation, although it requires proper tuning: setting a high kw value results in considering a neighbourhood of large size, while shrinking kw shrinks the width of the neighbourhood.

In Figure 2, the LIME-generated points are displayed as green dots and the corresponding LIME explanations (red lines) are shown. The points are scattered all over the ML function, but their size is proportional to the weight assigned by the RBF kernel. Small kernel widths assign significant weights only to the closest points, making the further ones almost invisible; in this way, they do not contribute to the local linear model. The concept of locality is crucial to LIME: a neighbourhood too large may cause the LIME model not to be adherent to the ML function in the considered neighbourhood.

[Figure 2: LIME explanations for different kernel widths]
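A minimal sketch of the toy setup follows, under our own assumptions about details the paper leaves unstated (the random seed, the polynomial fit via numpy.polyfit, and the position of the reference point):

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, not specified in the paper

# Toy DGP: Y = sin(X) * X + 10; 100 uniform points on [0, 10], 20 kept at random
X_all = rng.uniform(0, 10, 100)
X = rng.choice(X_all, size=20, replace=False)
y = np.sin(X) * X + 10

# A degree-5 polynomial regression plays the role of the ML function
ml_function = np.poly1d(np.polyfit(X, y, deg=5))

# Formula 1: RBF kernel weight of a sampled point w.r.t. the reference
def rbf_weight(x_i, x_ref, kw):
    return np.exp(-np.sum((x_i - x_ref) ** 2) / kw)

x_ref = 5.0  # hypothetical reference point; the paper's red dot is not specified
```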
3.2. Ridge penalty is harmful to LIME

In statistics, data are assumed to be generated from a Data Generating Process (DGP) combined with a source of white noise, so that the standard formulation of the problem is Y = f(X) + ε, where ε ∼ N(0, σ²). The aim of each statistical model is to retrieve the best specification of the DGP function f(X), given the noisy dataset.

Ridge Regression [33] assumes a linear DGP, namely f(X) = α + Σ_{j=1}^d β_j X_j, and applies a penalty proportional to the norm of the β coefficients, enforced during the estimation process through the penalty parameter λ. This technique is useful when dealing with very noisy datasets (where the stochastic component exhibits high variance σ²) [34]: the noise makes various sets of coefficients appear as viable solutions, whereas tuning λ to its proper value allows Ridge to retrieve a unique solution.

In the LIME setting, the ML function acts as the DGP, while the sampled points are the dataset. Recalling that the Y coordinate of each point is given by the ML prediction, the points are guaranteed to lie exactly on the ML surface by construction. Hence, no noise is present in our dataset. For this reason, we argue that the Ridge penalty is not needed; on the contrary, it can be harmful and distort the estimates of the parameters, as shown in Figure 3.

[Figure 3: Effects of Ridge Penalty on LIME explanations; panel (a): Ridge Penalty = 0, panel (b): Ridge Penalty = 1]

In panel 3b, the Ridge penalty λ = 1 (the LIME default) is employed, whereas in panel 3a no penalty (λ = 0) is imposed. The estimation gets severely distorted by the penalty, as proven also by the R² values. This happens especially for small kernel width values: each unit has a very small weight, so the weighted residuals become almost irrelevant in the Ridge loss, which is then dominated by the penalty term. To minimise the penalty term, the coefficients are shrunk towards 0.
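The distortion is easy to reproduce with the sketch below, which continues the toy setup above (`ml_function` and `x_ref` are the hypothetical names introduced earlier; the sampling distribution and kernel width are our own choices). With a tiny kernel width, the λ = 1 fit collapses towards zero slope, while the unpenalised fit still tracks the local tangent.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# LIME-style samples on the toy ML surface (no noise by construction)
X_s = rng.normal(5.0, 3.0, size=(500, 1))          # assumed sampling distribution
y_s = ml_function(X_s.ravel())
w = np.exp(-((X_s.ravel() - x_ref) ** 2) / 0.05)   # tiny kernel width

# With lambda = 1 (LIME default), the penalty dominates the tiny weighted
# residuals and the slope collapses towards 0; lambda = 0 tracks the tangent.
for alpha in (0.0, 1.0):
    fit = Ridge(alpha=alpha).fit(X_s, y_s, sample_weight=w)
    print(f"lambda={alpha}: slope={fit.coef_[0]:+.3f}")
```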
3.3. Relationship between Stability, Adherence and Kernel Width

Since the kernel width is the main hyper-parameter of LIME, we wish to understand how stability and adherence vary with respect to it. From the theory, we have a few helpful results:

• Taylor's Theorem [32] gives a polynomial approximation of any differentiable function, calculated at a given point. If we truncate the formula at the first-degree polynomial, we obtain a linear function whose approximation error depends on the distance between the point where the error is evaluated and the given point. Thus, if we assume the ML function to be differentiable in the neighbourhood of x^(ref), the adherence of the linear model is expected to be inversely proportional to the width of the neighbourhood, i.e. to the kernel width, since the approximation error depends on the distance between the two points, namely the neighbourhood size.

• In Linear Regression, the standard deviation of the coefficients is inversely correlated with the standard deviation of the X variables [32]. The stability of the explanations depends on the spread of the X variables in our weighted dataset; we therefore expect the kernel width and stability to be directly proportional.

To illustrate the conjectures above, we run LIME for different kernel width values and evaluate both the R² and CSI metrics (VSI is not considered on the toy dataset, since only one variable is present). Figure 4 shows the results of this experiment for the reference unit. Both adherence and stability are noisy functions of the kernel width: they contain some stochasticity, due to the different datasets generated by each LIME call. Despite this, a clear pattern is detectable: monotonically increasing for the CSI index and monotonically decreasing for the R² statistic.

[Figure 4: Relationship among kernel width, R² and CSI]

For numerical evidence of these properties, we fit the logistic function [35], which retrieves the best monotonous approximation to a set of points. The goodness of the logistic approximation is confirmed by a low value of the Mean Absolute Error (MAE). To corroborate our assumption, the same process has been repeated on all the units of the toy dataset, obtaining an average MAE of 0.005 for the R² approximation and of 0.026 for the CSI. The logistic growth rate has also been inspected: the highest R² growth rate is -10.78 and the lowest CSI growth rate is 7.20. These results support the monotonous relationships of adherence and stability with the kernel width, respectively decreasing and increasing.

3.4. OptiLIME

Previously, we empirically showed that adherence and stability are monotonous noisy functions of the kernel width: for increasing kernel width we observe, on average, decreasing adherence and increasing stability. Our proposition is a framework which enables the best choice of trade-off between the stability and adherence of the explanations. OptiLIME sets a desired level of adherence and finds the largest kernel width matching the request. At the same time, this kernel width provides the highest stability value, constrained to the chosen level of adherence. In essence, OptiLIME is an automated way of finding the best kernel width; moreover, it empowers the practitioner to be in control of the trade-off between the two most important properties of LIME local explanations.

To retrieve the best width, OptiLIME converts the decreasing R² function into l(kw, R̃²) by means of Formula 2:

l(kw, R̃²) = R²(kw),            if R²(kw) ≤ R̃²
l(kw, R̃²) = 2R̃² − R²(kw),      if R²(kw) > R̃²      (2)

where R̃² is the requested adherence. For a fixed R̃², chosen by the practitioner, the function l(kw, R̃²) presents a global maximum. We are particularly interested in argmax_kw l(kw, R̃²), namely the best kernel width.

To solve the optimisation problem, Bayesian Optimization is employed, since it is the most suitable technique to find the global optimum of noisy functions [36]. The technique relies on two parameters to be set beforehand: p, the number of preliminary calls with random kw values, and m, the number of iterations of the refinement strategy. Increasing the parameters ensures a better kernel width value, at the cost of longer computation time.

In Figure 5, an application of OptiLIME to the reference unit of the toy dataset is presented. R̃² has been set to 0.9, with p = 20 and m = 40. The points in the plot represent the distinct evaluations performed by the Bayesian search in order to find the optimum. Comparing the plot with Figure 4, we observe the effect of Formula 2 on the left part of the R² and l(kw, R̃²) functions. In Figure 5 the search has converged to the maximum, evaluating various points close to the best kernel width. At the same time, the stochastic nature of the CSI function is evident: the several CSI measurements performed in the proximity of the kernel width value 0.3 show a certain variation. Nonetheless, the increasing CSI trend is still recognisable.

[Figure 5: OptiLIME search for the best kernel width]
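A sketch of the optimisation step follows, assuming scikit-optimize as the Bayesian search backend (the paper does not name a library); `lime_r2` is a stand-in for the adherence of one LIME call, and the search range for kw is our own assumption.

```python
import numpy as np
from skopt import gp_minimize  # assumed backend; the paper names no library

R2_TILDE = 0.9  # requested adherence, chosen by the practitioner

def lime_r2(kw):
    """Stand-in for the adherence of one LIME call at kernel width kw; in
    practice this would fit the local surrogate and return its R^2. Here: a
    decreasing noisy curve mimicking Figure 4 (pure illustration)."""
    return float(np.clip(1.0 / (1.0 + kw) + np.random.normal(0, 0.01), 0, 1))

def l(kw, r2_tilde=R2_TILDE):
    """Formula 2: fold the decreasing R^2 curve so it peaks where R^2 ~ r2_tilde."""
    r2 = lime_r2(kw)
    return r2 if r2 <= r2_tilde else 2 * r2_tilde - r2

# Maximise l by minimising its negative: p random calls, then m refinements
result = gp_minimize(lambda params: -l(params[0]),
                     dimensions=[(1e-3, 5.0)],   # assumed search range for kw
                     n_initial_points=10,        # p
                     n_calls=40)                 # total evaluations (p + m)
best_kw = result.x[0]
```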
4. Case Study

Dataset

To validate our methodology we use a well-known medical dataset: NHANES I. It has been employed for medical research [37],[38] as well as a benchmark to test explanation methods [39]. The original dataset is described in [40]. We use a reformatted version, released at http://github.com/suinleelab/treexplainer-study. It contains 79 features, based on clinical measurements of 14,407 individuals. The aim is to model the risk of death over twenty years of follow-up.

Diagnostic Algorithm

Following Lundberg's prescriptions [39], the dataset has been divided into a 64/16/20 split for train/validation/test. The features have been mean-imputed and standardized based on statistics computed on the training set. A Survival Gradient Boosting model has been trained using the XGBoost framework [41]. Its hyper-parameters have been optimized by coordinate descent, using the C-statistic [42] on the validation set as the figure of merit.

Explanations

We use the OptiLIME framework to achieve the optimal explanation of the XGBoost model on the dataset. We consider two randomly chosen individuals to show the results visually. In our simulation, we consider 0.9 a reasonable level of adherence. OptiLIME is employed to find the kernel width achieving an R² value close to 0.9 while maximizing the stability indices of the local explanation models.

The model prediction is the hazard ratio for each individual: a higher prediction means the individual is likely to survive a shorter time. Therefore, positive coefficients define risk factors, whereas protective factors have negative values. The interpretation of the LIME model is the same as for a Linear Regression model, but with the additional concept of locality. As an example, for the Age variable we distinguish a different impact based on the individual's characteristics: one more year for Unit 100 (increasing from 65 to 66 years) raises the death risk by 3.56 base points, while for Unit 7207 one year of ageing (from 49 to 50) increases the risk by just 0.79. Another example is the impact of Sex, which is more pronounced in elderly people: being female is a protective factor worth 1.49 points at age 49, whereas at age 65 being male has a much stronger impact, as a risk factor worth 3.04 points.

[Figure 6: NHANES individual explanations using OptiLIME; panel (a): best LIME explanation, Unit 100, panel (b): best LIME explanation, Unit 7207]

For Unit 100 in Figure 6a, the optimal kernel width is a bit higher compared with Unit 7207 in Figure 6b. This is probably caused by the ML model having a higher degree of non-linearity around the latter unit: to achieve the same adherence, we are forced to consider a smaller portion of the ML model, hence a smaller neighbourhood. A smaller kernel width also implies reduced stability, testified by the smaller values of the VSI and CSI indices. Whenever the practitioner desires more stable results, it is possible to re-run OptiLIME with a less strict requirement on adherence. It is important to remark that low degrees of adherence will make the explanations increasingly more global: the linear surface retrieved by LIME will amount to an average of many local non-linearities of the ML model.

The computation time largely depends on the Bayesian search, controlled by the parameters p and m. In our setting, p = 10 and m = 30 produce good results for both the units in Figure 6. On a laptop with 4 Intel i7 CPUs at 2.50 GHz, the OptiLIME evaluations for Unit 100 and Unit 7207 took 123 and 147 seconds respectively. For faster, but less accurate, results the Bayesian search parameters can be reduced.
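As an illustration of the diagnostic model setup, here is a minimal sketch of survival gradient boosting with XGBoost's Cox objective. The paper's exact hyper-parameters and data pipeline are not reproduced; the data below are synthetic placeholders standing in for the NHANES I features.

```python
import numpy as np
import xgboost as xgb

# Placeholder data; in XGBoost's Cox objective, the label is the follow-up
# time, negated for censored individuals.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = rng.uniform(1, 20, size=100) * rng.choice([1, -1], size=100)

params = {"objective": "survival:cox",  # the model outputs hazard ratios
          "eta": 0.1, "max_depth": 3}   # assumed values, not from the paper
booster = xgb.train(params, xgb.DMatrix(X_train, label=y_train),
                    num_boost_round=200)

# A higher predicted hazard ratio means a shorter expected survival time
hazard = booster.predict(xgb.DMatrix(X_train))
```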
In our framework may be improved to allow for a faster and setting, 𝑝 = 10 and 𝑚 = 30 produce good results for more precise computation. both the units in Figure 6. On a 4 Intel-i7 CPUs 2.50GHz laptop, the OptiLIME evaluation for Unit 100 and Unit 7207 took respectively Acknowledgments 123 and 147 seconds to compute. For faster, but less ac- We acknowledge financial support by CRIF S.p.A. and curate results, the Bayesian Search parameters can be Università degli Studi di Bologna. reduced. 5. Conclusions References [1] A. Holzinger, G. Langs, H. Denk, K. Zatloukal, In Medicine, diagnostic computer algorithms provid- H. Müller, Causability and explainability of ar- ing accurate predictions have countless benefits, no- tificial intelligence in medicine, Wiley Interdis- tably they may help in saving lives as well as reduc- ciplinary Reviews: Data Mining and Knowledge ing medical costs. However, precisely because of the Discovery 9 (2019) e1312. importance of these matters, the rationale of the de- [2] I. Kononenko, Machine learning for medical di- cisions must be clear and understandable. A plethora agnosis: History, state of the art and perspective, of techniques to explain the ML decisions has grown Artificial Intelligence in medicine 23 (2001) 89– in recent years, though there is no consensus on the 109. best in class, since each method presents some draw- [3] R. Miotto, L. Li, B. A. Kidd, J. T. Dudley, Deep pa- backs. Explainable models are required to be reliable, tient: An unsupervised representation to predict thus stability is regarded as a key desiderata. the future of patients from the electronic health We consider the LIME technique, whose major draw- records, Scientific reports 6 (2016) 1–10. back lies in the lack of stability. Moreover, it is difficult [4] E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, to tune properly its main parameter: different values J. Sun, Doctor ai: Predicting clinical events via of the kernel width provide substantially different ex- recurrent neural networks, in: Machine Learning planations. for Healthcare Conference, 2016, pp. 301–318. The main contribution of this paper consists in the [5] A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Ha- clear decomposition of the LIME framework in its rel- jaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun, evant components and the exhaustive analysis of each Scalable and accurate deep learning with elec- one, starting from the geometrical meaning through tronic health records, NPJ Digital Medicine 1 the empirical experiments to validate our intuitions. (2018) 18. We showed that Ridge penalty is not needed and LIME [6] B. Shickel, P. J. Tighe, A. Bihorac, P. Rashidi, works best with simple Linear Regression as explain- Deep EHR: A survey of recent advances in deep able model. In addition, smaller kernel width values learning techniques for electronic health record provide a more adherent LIME plane to the ML surface, (EHR) analysis, IEEE journal of biomedical and therefore a more realistic local explanation. Eventu- health informatics 22 (2017) 1589–1604. ally, the trade-off between the adherence and stabil- [7] Z. Che, D. Kale, W. Li, M. T. Bahadori, Y. Liu, Deep ity properties is extremely valuable since it empowers computational phenotyping, in: Proceedings of the practitioner to choose the best kernel width con- the 21th ACM SIGKDD International Conference tional expectation, Journal of Computational and on Knowledge Discovery and Data Mining, 2015, Graphical Statistics 24 (2015) 44–65. pp. 507–516. [21] D. 
[8] T. A. Lasko, J. C. Denny, M. A. Levy, Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data, PLoS ONE 8 (2013).
[9] E. J. Topol, High-performance medicine: The convergence of human and artificial intelligence, Nature Medicine 25 (2019) 44–56.
[10] A. Holzinger, From machine learning to explainable AI, in: 2018 World Symposium on Digital Intelligence for Systems and Machines (DISA), IEEE, 2018, pp. 55–66.
[11] C. Molnar, Interpretable Machine Learning, Lulu.com, 2020.
[12] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Computing Surveys (CSUR) 51 (2018) 93.
[13] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 1135–1144.
[14] G. J. Katuwal, R. Chen, Machine learning model interpretability for precision medicine, arXiv preprint arXiv:1610.09045 (2016).
[15] A. Y. Zhang, S. S. W. Lam, N. Liu, Y. Pang, L. L. Chan, P. H. Tang, Development of a radiology decision support system for the classification of MRI brain scans, in: 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT), IEEE, 2018, pp. 107–115.
[16] C. Moreira, R. Sindhgatta, C. Ouyang, P. Bruza, A. Wichert, An investigation of interpretability techniques for deep learning in predictive process analytics, arXiv preprint arXiv:2002.09192 (2020).
[17] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[18] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, L. Wasserman, Distribution-free predictive inference for regression, Journal of the American Statistical Association 113 (2018) 1094–1111.
[19] J. H. Friedman, Greedy function approximation: A gradient boosting machine, Annals of Statistics (2001) 1189–1232.
[20] A. Goldstein, A. Kapelner, J. Bleich, E. Pitkin, Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation, Journal of Computational and Graphical Statistics 24 (2015) 44–65.
[21] D. W. Apley, J. Zhu, Visualizing the effects of predictor variables in black box supervised learning models, arXiv preprint arXiv:1612.08468 (2016).
[22] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems, 2017, pp. 4765–4774.
[23] M. Craven, J. W. Shavlik, Extracting tree-structured representations of trained networks, in: Advances in Neural Information Processing Systems, 1996, pp. 24–30.
[24] Y. Zhou, G. Hooker, Interpreting models via single tree approximation, arXiv preprint arXiv:1610.09036 (2016).
[25] M. T. Ribeiro, S. Singh, C. Guestrin, Anchors: High-precision model-agnostic explanations, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[26] G. Visani, E. Bagli, F. Chesani, A. Poluzzi, D. Capuzzo, Statistical stability indices for LIME: Obtaining reliable explanations for machine learning models, arXiv preprint arXiv:2001.11757 (2020).
[27] D. Alvarez-Melis, T. S. Jaakkola, On the robustness of interpretability methods, arXiv preprint arXiv:1806.08049 (2018).
[28] A. Gosiewska, P. Biecek, iBreakDown: Uncertainty of model explanations for non-additive predictive models, arXiv preprint arXiv:1903.11420 (2019).
[29] M. R. Zafar, N. M. Khan, DLIME: A deterministic local interpretable model-agnostic explanations approach for computer-aided diagnosis systems, arXiv preprint arXiv:1906.10263 (2019).
[30] S. M. Shankaranarayana, D. Runje, ALIME: Autoencoder based approach for local interpretability, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2019, pp. 454–463.
[31] C. Molnar, Limitations of Interpretable Machine Learning Methods, 2020.
[32] W. H. Greene, Econometric Analysis, Pearson Education India, 2003.
[33] A. E. Hoerl, R. W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67. doi:10.1080/00401706.1970.10488634.
[34] W. N. van Wieringen, Lecture notes on ridge regression, arXiv preprint arXiv:1509.09169 (2019).
[35] P.-F. Verhulst, Correspondance mathématique et physique, Ghent and Brussels 10 (1838) 113.
[36] B. Letham, B. Karrer, G. Ottoni, E. Bakshy, Constrained Bayesian optimization with noisy experiments, Bayesian Analysis 14 (2019) 495–519.
[37] J. Fang, M. H. Alderman, Serum uric acid and cardiovascular mortality: The NHANES I epidemiologic follow-up study, 1971–1992, JAMA 283 (2000) 2404–2410.
[38] L. J. Launer, T. Harris, C. Rumpel, J. Madans, Body mass index, weight change, and risk of mobility disability in middle-aged and older women: The epidemiologic follow-up study of NHANES I, JAMA 271 (1994) 1093–1098.
[39] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, S.-I. Lee, From local explanations to global understanding with explainable AI for trees, Nature Machine Intelligence 2 (2020) 2522–5839.
[40] C. S. Cox, Plan and Operation of the NHANES I Epidemiologic Followup Study, 1987, US Department of Health and Human Services, Public Health Service, Centers …, 1992.
[41] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[42] P. J. Heagerty, T. Lumley, M. S. Pepe, Time-dependent ROC curves for censored survival data and a diagnostic marker, Biometrics 56 (2000) 337–344.