=Paper=
{{Paper
|id=Vol-2448/SSS19_Paper_Upload_237
|storemode=property
|title=Diagnostic Model Explanations: A Medical Narrative
|pdfUrl=https://ceur-ws.org/Vol-2448/SSS19_Paper_Upload_237.pdf
|volume=Vol-2448
|authors=Umang Bhatt,Brian Davis,Jose M.F. Moura
|dblpUrl=https://dblp.org/rec/conf/aaaiss/BhattDM19
}}
==Diagnostic Model Explanations: A Medical Narrative ==
Diagnostic Model Explanations: A Medical Narrative
Umang Bhatt, Brian Davis, & José M.F. Moura
Carnegie Mellon University
{usb, bmdavis, moura}@andrew.cmu.edu
Abstract

Explainability techniques are important to understanding machine learning models used in decision-critical settings. We explore how pattern recognition techniques ought to couple requisite transparency with predictive power. Leveraging medical data with the task of predicting the onset of sepsis, we expose the most important features for the model prediction. We uncover how important training points and consensus feature attributions vary over the learning process of these models. We then pose a counterfactual question to explore trained predictors in the medical domain.

Overview

As machine learning becomes pervasive, transparency and intelligibility of the underlying machine learning models precedes adoption of these technologies (Doshi-Velez and Kim 2017; Lipton 2016). Recent machine learning interpretability techniques fall under either gradient-based methods, which compute the gradient of the output with respect to the input and treat gradient flow as a saliency map (Shrikumar, Greenside, and Kundaje 2017; Sundararajan, Taly, and Yan 2017), or perturbation-based methods, which approximate a complex model using a locally additive model and thus explain the difference between a test output-input pair and some reference output-input pair (Lundberg and Lee 2017; Ribeiro, Singh, and Guestrin 2016). While gradient-based techniques like (Shrikumar, Greenside, and Kundaje 2017) consider infinitesimal changes to the decision surface and then take the first-order term in the Taylor expansion as the additive explanatory model, perturbation-based additive models consider the difference between an input and a reference vector. There have also been approaches that assign attributions to features with more complex counterfactuals than those discussed above for gradients (Kusner et al. 2017; Datta, Sen, and Zick 2016).

There is also a burgeoning line of research in using deep learning for medical diagnostics (Choi et al. 2016; Caruana et al. 2015; Rajkomar et al. 2018). Starting with tabular data (where each feature is semantically meaningful) from the medical domain (Purushotham et al. 2017), we fuse the aforementioned interpretability techniques into these nascent medical diagnostic models.

We note that medical professionals and doctors undergo rigorous training to learn how to determine the outcome for a given patient. This training mandates that doctors learn how to proactively search for particular risk predictors upon seeing a patient; for example, a cardiologist learns to look at past patients to determine if a patient has a given valve disease. Doctors are not only trained to identify which attributes of patients (vital signs, personal information, family history, etc.) deem the patient at risk for a particular disease or outcome, but also develop an intuition from past patients based on years of experience; for example, if a doctor treated a rare disease over a decade ago, that patient can be extremely vital to the doctor's diagnosis engine when attributes alone are uninformative about how to proceed. The doctor treats the patient from ten years ago as an anchor for future patients with similar symptoms. Over time, the doctor learns a larger set of diagnoses he or she feels comfortable making: this growth of a doctor describes a powerful narrative that uncovers how a doctor reasons over time.

Dataset

MIMIC-III (Medical Information Mart for Intensive Care III) is a large electronic health record dataset comprised of health-related data of over 40,000 patients who were admitted to the critical care units of Beth Israel Deaconess Medical Center between the years 2001 and 2012 (Johnson et al. 2016). MIMIC-III consists of demographics, vital sign measurements, lab test results, medications, procedures, caregiver notes, imaging reports, and mortality of the ICU patients. Using the MIMIC-III dataset, we extracted seventeen real-valued features deemed critical in the sepsis diagnosis task as per (Purushotham et al. 2018). These are the processed features we extracted for every sepsis diagnosis (a binary variable indicating the presence of sepsis): Glasgow Coma Scale, Systolic Blood Pressure, Heart Rate, Body Temperature, PaO2/FiO2 ratio, Urine Output, Serum Urea Nitrogen Level, White Blood Cells Count, Serum Bicarbonate Level, Sodium Level, Potassium Level, Bilirubin Level, Age, Acquired Immunodeficiency Syndrome, Hematologic Malignancy, Metastatic Cancer, Admission Type.
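For concreteness, a minimal sketch of assembling this feature matrix and binary sepsis label from a preprocessed table is given below; the column names and the file sepsis_features.csv are illustrative placeholders, not part of the MIMIC-III schema or of the original study's pipeline.

<pre>
# Minimal sketch (assumed preprocessing): load the seventeen extracted features
# and the binary sepsis label from a hypothetical preprocessed CSV file.
import pandas as pd

FEATURES = [
    "glasgow_coma_scale", "systolic_bp", "heart_rate", "body_temperature",
    "pao2_fio2_ratio", "urine_output", "serum_urea_nitrogen", "white_blood_cells",
    "serum_bicarbonate", "sodium", "potassium", "bilirubin", "age",
    "aids", "hematologic_malignancy", "metastatic_cancer", "admission_type",
]

def load_sepsis_data(path="sepsis_features.csv"):
    """Return (X, y): the d=17 feature matrix and the binary sepsis indicator."""
    df = pd.read_csv(path)
    X = df[FEATURES].to_numpy(dtype=float)   # real-valued, already preprocessed
    y = df["sepsis"].to_numpy(dtype=int)     # 1 if sepsis onset, else 0
    return X, y
</pre>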
Approach

Once we train a predictor, f, on the aforementioned dataset, we use the attribution aggregation approach, AVA, proposed in (Bhatt, Ravikumar, and Moura 2019) to concurrently find influential patients and the features that were important to sepsis diagnoses in the test set. First let us introduce some notation. Let x ∈ R^d be a datapoint's feature vector, where x_i ∈ R is a specific feature of this datapoint. Let D = {x^(j)}_{j=1}^N represent the training datapoints, where D ∈ R^{d×N} is the entire training set in matrix form with D_{i,j} = x_i^(j). Let f be the learned predictor we wish to explain. Using the tractable approximation derived in (Koh and Liang 2017), we define the influence weight ρ_j of training point x^(j) on a test point x_test as the derivative of the test loss with respect to an infinitesimal upweighting ε of x^(j):

    ρ_j = I_up,loss(x^(j), x_test) = (d/dε) L(f_{ε,x^(j)}, x_test) |_{ε=0}
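Koh and Liang (2017) show that this derivative can be computed in closed form as −∇_θ L(x_test)ᵀ H⁻¹ ∇_θ L(x^(j)) for a differentiable model with parameters θ and training-loss Hessian H. The sketch below illustrates that computation on a logistic-regression surrogate; it is our illustration under those assumptions, not the paper's released code, and the helper names are ours.

<pre>
# Sketch of the Koh & Liang (2017) influence approximation on a differentiable
# surrogate (regularized logistic regression).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(theta, x, y, l2=1e-3):
    """Gradient of the regularized log loss at a single point (x, y)."""
    p = sigmoid(x @ theta)
    return (p - y) * x + l2 * theta

def hessian_loss(theta, X, l2=1e-3):
    """Hessian of the mean regularized log loss over the training set."""
    P = sigmoid(X @ theta)
    W = P * (1.0 - P)
    return (X * W[:, None]).T @ X / len(X) + l2 * np.eye(len(theta))

def influence_weight(theta, X_train, x_j, y_j, x_test, y_test, l2=1e-3):
    """rho_j = -grad L(x_test)^T H^{-1} grad L(x_j): effect of upweighting x_j."""
    H = hessian_loss(theta, X_train, l2)
    g_test = grad_loss(theta, x_test, y_test, l2)
    g_j = grad_loss(theta, x_j, y_j, l2)
    return -g_test @ np.linalg.solve(H, g_j)
</pre>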
Next, we find the feature attribution of x_test via an explanation function g. Suppose we let g be a Shapley value attribution from classical game theory and from (Lundberg and Lee 2017); then the attribution of the ith feature of point x is given by the Shapley value, which is the sum of the contributions to f of the ith feature over all possible subsets S of the features F, where R = |S|! (|F| − |S| − 1)! / |F|!:

    g_i(x) = Σ_{S ⊆ F \ {i}} R ( f_{S∪{i}}(x_{S∪{i}}) − f_S(x_S) )
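For small d this Shapley value can be computed exactly by enumerating subsets. A minimal sketch follows, assuming f accepts a single d-dimensional vector and using the common convention (similar in spirit to KernelSHAP) of filling "missing" features with a background reference value; that filling convention is our assumption for illustration.

<pre>
# Exact Shapley attribution for feature i by subset enumeration (tractable only
# for small d). Features outside S are replaced by a background reference value.
from itertools import combinations
from math import factorial
import numpy as np

def shapley_attribution(f, x, background, i):
    d = len(x)
    others = [k for k in range(d) if k != i]
    total = 0.0
    for size in range(d):
        for S in combinations(others, size):
            R = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
            x_S = background.copy()
            x_S[list(S)] = x[list(S)]   # only features in S are "present"
            x_Si = x_S.copy()
            x_Si[i] = x[i]              # add feature i to the coalition
            total += R * (f(x_Si) - f(x_S))
    return total
</pre>

With seventeen features this requires 2^16 model evaluations per feature, which is why sampling approximations such as SHAP are used in practice.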
If we let g be the gradient-based attribution from (Sundararajan, Taly, and Yan 2017), the attribution of the ith feature of point x is given by integrating the gradient of f along the ith dimension of x with respect to a baseline x̄:

    g_i(x) = (x_i − x̄_i) ∫_{α=0}^{1} ∂f(x̄ + α(x − x̄)) / ∂x_i dα
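In practice this integral is approximated with a Riemann sum. A minimal sketch is shown below, assuming access to a gradient oracle grad_f(x) that returns ∂f/∂x (e.g., obtained via automatic differentiation); the oracle and step count are our assumptions.

<pre>
# Riemann-sum (midpoint) approximation of Integrated Gradients. grad_f is assumed
# to return the d-dimensional gradient of the model output at a point; the baseline
# x_bar is often the all-zeros or training-mean vector.
import numpy as np

def integrated_gradients(grad_f, x, x_bar, steps=50):
    alphas = (np.arange(steps) + 0.5) / steps          # midpoints of [0, 1]
    avg_grad = np.zeros_like(x, dtype=float)
    for a in alphas:
        avg_grad += grad_f(x_bar + a * (x - x_bar))
    avg_grad /= steps
    return (x - x_bar) * avg_grad                      # g_i(x) for every feature i
</pre>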
Using the aggregation methodology from AVA, we aggregate the attributions for past patients (the training set) to explain model predictions for new patients (the test set). For a given test set example, we provide an aggregate feature attribution to explain why the given prediction was made. Aggregate feature attributions can be found via the weighted variants of AVA from the original paper or via rank aggregation techniques like Borda count and Markov chain aggregation. We then provide the most important patient from the previously seen patients. We find this important patient, x_imp, as follows:

    x_imp = argmax_{x^(j) ∈ D} ρ_j
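A minimal sketch of influence-weighted aggregation and of selecting x_imp is given below; attribution_fn stands in for any of the attribution methods above, rho is the vector of influence weights computed per training point, and the simple weighted average is our reading of AVA's weighted variant rather than a re-implementation of the paper's exact estimator.

<pre>
# Sketch: aggregate per-training-point attributions, weighted by influence, and
# return the single most influential past patient.
import numpy as np

def aggregate_explanation(attribution_fn, X_train, rho):
    weights = np.abs(rho) / np.abs(rho).sum()           # normalized influence weights
    attributions = np.stack([attribution_fn(x) for x in X_train])
    aggregate = weights @ attributions                  # weighted consensus attribution
    x_imp = X_train[int(np.argmax(rho))]                # x_imp = argmax_j rho_j
    return aggregate, x_imp
</pre>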
We are interested in how, over time, different training points become more or less influential to the retrained predictor. To be concrete, to make a prediction on Day 10, the doctor can use a patient she saw on Day 5. After "retraining" her internal predictor that day, she can now use the patients from Day 10 to explain and predict patients on subsequent days. The influential anchors in the training data change as a function of time; therefore, model explanations capture how different patients serve as anchors based on the exhaustiveness of the predictor's training set. We are also interested in understanding how the predictor deals with unknown unknowns that lie in an uncharted portion of the feature space. The predictor might not be confident about its predictions in a given region, but as more training data is added, the predictor may be able to learn about a particular region unbeknownst to it a few time steps ago. Note that we do not make any assumptions about the model class of f: we assume black box access to the predictor we wish to explain.

Experimentation

We first create different candidate feed-forward models to be explained and then train them on the aforementioned sepsis data set. We varied the depth of the models from 1 to 3 hidden layers, with ReLU or Sigmoid activations, trained with the ADAM optimizer and cross-entropy loss. To explain the models, we use SHAP, IG, and the various proposed techniques (WSHAP is the AVA weighted SHAP technique, WIG is the AVA weighted IG technique, MIG is Markovian aggregation with IG attribution, MSHAP is Markovian aggregation with SHAP attribution, BIG is Borda aggregation with IG attribution, and BSHAP is Borda aggregation with SHAP attribution). We report recall of a decision tree's gold set, averaged over all the test instances of the sepsis dataset, in Table 1, as done in (Ribeiro, Singh, and Guestrin 2016). For these experiments, we fix k to be five, use the mean values of the training input as the region of perturbation for SHAP, and use the aforementioned greedy technique to determine m to be five. Note that random attribution will recall m/d, where d is the total number of features; for these experiments, random attribution will have a recall of 13%.
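The gold-set recall metric compares an explanation's top-m features against the most important features of an interpretable reference model. A minimal sketch is shown below, assuming scikit-learn and treating a depth-limited decision tree's highest-importance features as the gold set; the specific tree settings are an arbitrary choice for illustration.

<pre>
# Sketch of the gold-set recall evaluation: the "gold set" is the top-m features of
# an interpretable classifier (a decision tree, as in Ribeiro et al. 2016), and an
# explanation is scored by how many gold features appear among its own top-m.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gold_set(X_train, y_train, m=5):
    tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
    return set(np.argsort(tree.feature_importances_)[-m:])

def gold_set_recall(attribution, gold, m=5):
    """Fraction of gold features recovered by the explanation's top-m features."""
    top_m = set(np.argsort(np.abs(attribution))[-m:])
    return len(top_m & gold) / len(gold)
</pre>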
m-Sensitivity

We also run experiments where we analyze how gold set recall changes as a function of m, the size of the gold set. If m = d, then all attribution techniques, including random, will have 100% recall. In Figure 1, we see that all attributions (other than random) recall a high percentage of the important features. As such, for all following experiments we set m to 5.

Expectation over Explanations

We cannot declare the absolute feature attribution for any arbitrary test point. We therefore aim to see how our methods perform in expectation by iterating 1000 times to find a probability distribution over the rank of the explanation algorithms. We use gold set recall to rank every method on each iteration and keep the ordinal position of each method. For every iteration, we sample 100 points at random with replacement and find explanations for those points using every method in question. After 1000 iterations, we can then say that with 55% probability weighted aggregation with SHAP attribution yields the best explanation (in the first position) and with 44% probability Markovian aggregation with SHAP attribution yields the best explanation. Interestingly, Markovian aggregation with SHAP attribution appears in the top two positions 99.5% of the time, while weighted aggregation with SHAP attribution appears there only 94.3% of the time. A graph of the distribution for each method can be found in Figure 2. For every iteration of every method, we can also keep track of the position of each feature.
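A minimal sketch of this bootstrap ranking procedure follows; explanation_methods maps method names to attribution functions that return a length-d attribution for a single test point, gold is the gold set of feature indices, and the gold_set_recall helper from the sketch above is reused. The sampling sizes mirror those described in the text.

<pre>
# Sketch of the bootstrap over explanations: on each iteration, sample test points
# with replacement, score every method by mean gold-set recall, and record the
# resulting rank of each method. Repeating this yields a distribution over ranks.
import numpy as np

def rank_distribution(explanation_methods, X_test, gold, iters=1000, sample=100, m=5):
    names = list(explanation_methods)
    ranks = {name: np.zeros(len(names)) for name in names}   # rank histogram per method
    rng = np.random.default_rng(0)
    for _ in range(iters):
        idx = rng.choice(len(X_test), size=sample, replace=True)
        scores = {
            name: np.mean([gold_set_recall(fn(X_test[i]), gold, m) for i in idx])
            for name, fn in explanation_methods.items()
        }
        ordered = sorted(names, key=lambda n: scores[n], reverse=True)
        for position, name in enumerate(ordered):
            ranks[name][position] += 1
    return {name: hist / iters for name, hist in ranks.items()}
</pre>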
Model        Accuracy (%)   SHAP   IG   WSHAP   WIG   BSHAP   BIG   MSHAP   MIG
1-Sigmoid        85.3         60    29     68     37      65    26      67    31
1-ReLU           82.8         62    33     69     47      65    37      69    38
2-Sigmoid        86.7         61    34     75     41      73    75      76    40
2-ReLU           87.2         55    35     64     35      60    30      62    33
3-Sigmoid        83.0         64    31     68     41      67    29      71    31
3-ReLU           87.0         55    38     65     48      57    44      64    43

Table 1: Gold set recall (%) on important features from an interpretable classifier, used to explain models trained on the sepsis dataset.
Figure 1: m-Sensitivity for the sepsis dataset for different models trained with the ADAM optimizer: (a) recall for the 2-hidden-layer Sigmoid model; (b) recall for the 2-hidden-layer ReLU model.

Figure 2: Representative distribution of method ranks.
This gives us a probability distribution of rankings for each feature, which gives us better insight into how important features are in expectation. In Figure 3, we find that sofa and sepsis_cdc appear in the top position of importance among all explanations most often: this is expected because both are highly correlated with the onset of sepsis. Simultaneously, as a sanity check, we find that race (e.g., race_hispanic) does not matter (it appears at a lower rank) in expectation for all explanations; therefore, the top model learns not to correlate race and sepsis.
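Tracking feature positions across iterations can be sketched in the same style as the method ranking above; attribution_fn is a placeholder for any single explanation method, and the sampling scheme mirrors the bootstrap described earlier.

<pre>
# Sketch: across bootstrap iterations, record where each feature lands when the
# mean attribution is sorted by importance, giving a rank distribution per feature.
import numpy as np

def feature_rank_distribution(attribution_fn, X_test, d, iters=1000, sample=100):
    rng = np.random.default_rng(0)
    rank_counts = np.zeros((d, d))                     # rank_counts[i, r]: feature i at rank r
    for _ in range(iters):
        idx = rng.choice(len(X_test), size=sample, replace=True)
        mean_attr = np.mean([np.abs(attribution_fn(X_test[i])) for i in idx], axis=0)
        order = np.argsort(-mean_attr)                 # rank 0 = most important
        for rank, feature in enumerate(order):
            rank_counts[feature, rank] += 1
    return rank_counts / iters
</pre>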
Figure 3: Feature rank distribution for race_hispanic, sepsis_cdc, and sofa (left to right).

Counterfactual Intuition

It is instructive to consider the counterfactual entailed in temporal explanations. Feature attribution techniques like (Sundararajan, Taly, and Yan 2017) calculate attribution by finding the partial derivative of the output with respect to every input feature (Ancona et al. 2018). One perspective on this is as a counterfactual of how perturbing the jth input infinitesimally would perturb the learnt predictor f. Indeed, such counterfactual intuition allows humans to intuit about the impact of a cause by having the baseline be the absence of the cause: from here, humans can tell the importance of a cause by seeing how the output changes in the cause's absence. Influence functions from (Koh and Liang 2017) consider the counterfactual of how upweighting a training data point x infinitesimally will affect the loss at a test point x_test.

The counterfactual posed by temporal explanations is as follows: which training point (past patient) perturbations, when used to train the predictor (doctor), would influence the test prediction (current patient) the most? Suppose we perturb a training point (past patient) as x_δ = x + δ, and denote by f_{ε,x_δ,−x} the predictor obtained by substituting the training data point x with x_δ and upweighting it by some constant ε. Then the counterfactual mentioned above, at a test point x_test, would compute:

    ∇_δ (d/dε) L(f_{ε,x_δ,−x}, x_test) |_{ε=0, δ=0}
We "freeze" the predictor function, since we only have black box access to the model, and ask: given influential training points, what would be the change to the frozen, trained predictor as we perturb those training points? This counterfactual allows us to create explanations that capture global patterns in the local neighborhood of the test point, which lets users audit the global trends of a model whilst still having the fidelity of local explanations. Essentially, we can explore the following: had a doctor seen a different patient at time step t − 1, one who would have expanded the doctor's (that is, the predictor's) understanding of the feature space, would the doctor have made a different prediction at time step t? Such an understanding would not only debug the predictors learned from real data but also ensure diagnostic models align with doctor intuition.
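A minimal finite-difference sketch of this counterfactual is given below: it perturbs one training patient while keeping the predictor frozen and measures how the influence weight on the test prediction shifts. It reuses the influence_weight helper sketched earlier and is our illustration of the idea, not the paper's procedure.

<pre>
# Sketch: finite-difference sensitivity of the influence weight rho_j to a small
# perturbation delta of training patient x_j, with the predictor parameters frozen.
import numpy as np

def counterfactual_sensitivity(theta, X_train, x_j, y_j, x_test, y_test, eps=1e-3):
    """Approximate grad_delta of (d/d epsilon) L(f_{epsilon, x_delta, -x}, x_test)."""
    base = influence_weight(theta, X_train, x_j, y_j, x_test, y_test)
    sensitivity = np.zeros_like(x_j, dtype=float)
    for i in range(len(x_j)):
        x_delta = x_j.copy()
        x_delta[i] += eps                              # perturb one feature of the past patient
        shifted = influence_weight(theta, X_train, x_delta, y_j, x_test, y_test)
        sensitivity[i] = (shifted - base) / eps        # change in influence per unit perturbation
    return sensitivity
</pre>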
References

Ancona, M.; Ceolini, E.; Oztireli, C.; and Gross, M. 2018. Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations.

Bhatt, U.; Ravikumar, P.; and Moura, J. M. F. 2019. Towards aggregating weighted feature attributions. In the AAAI 2019 Workshop on Network Interpretability, abs/1901.10040.

Caruana, R.; Lou, Y.; Gehrke, J.; Koch, P.; Sturm, M.; and Elhadad, N. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, 1721–1730.

Choi, E.; Bahadori, M. T.; Schuetz, A.; Stewart, W. F.; and Sun, J. 2016. Doctor AI: Predicting clinical events via recurrent neural networks. In Doshi-Velez, F.; Fackler, J.; Kale, D.; Wallace, B.; and Wiens, J., eds., Proceedings of the 1st Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, 301–318. Children's Hospital LA, Los Angeles, CA, USA: PMLR.

Datta, A.; Sen, S.; and Zick, Y. 2016. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), 598–617.

Doshi-Velez, F., and Kim, B. 2017. Towards a rigorous science of interpretable machine learning.

Johnson, A. E.; Pollard, T. J.; Shen, L.; Lehman, L.-w. H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L. A.; Mark, R. G.; et al. 2016. MIMIC-III, a freely accessible critical care database.

Koh, P. W., and Liang, P. 2017. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 1885–1894. International Convention Centre, Sydney, Australia: PMLR.

Kusner, M. J.; Loftus, J.; Russell, C.; and Silva, R. 2017. Counterfactual fairness. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30, 4066–4076. Curran Associates, Inc.

Lipton, Z. C. 2016. The mythos of model interpretability.

Lundberg, S. M., and Lee, S.-I. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, 4765–4774.

Purushotham, S.; Meng, C.; Che, Z.; and Liu, Y. 2017. Benchmark of deep learning models on large healthcare MIMIC datasets.

Purushotham, S.; Meng, C.; Che, Z.; and Liu, Y. 2018. Benchmarking deep learning models on large healthcare datasets. Journal of Biomedical Informatics 83:112–134.

Rajkomar, A.; Oren, E.; Chen, K.; Dai, A. M.; Hajaj, N.; Liu, P. J.; Liu, X.; Sun, M.; Sundberg, P.; Yee, H.; Zhang, K.; Duggan, G. E.; Flores, G.; Hardt, M.; Irvine, J.; Le, Q. V.; Litsch, K.; Marcus, J.; Mossin, A.; Tansuwan, J.; Wang, D.; Wexler, J.; Wilson, J.; Ludwig, D.; Volchenboum, S. L.; Chou, K.; Pearson, M.; Madabushi, S.; Shah, N. H.; Butte, A. J.; Howell, M.; Cui, C.; Corrado, G.; and Dean, J. 2018. Scalable and accurate deep learning for electronic health records. CoRR abs/1801.07860.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144.

Shrikumar, A.; Greenside, P.; and Kundaje, A. 2017. Learning important features through propagating activation differences. CoRR abs/1704.02685.

Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 3319–3328. International Convention Centre, Sydney, Australia: PMLR.