                         Diagnostic Model Explanations: A Medical Narrative

                                     Umang Bhatt, Brian Davis, & José M.F. Moura
                                                  Carnegie Mellon University
                                         {usb, bmdavis, moura}@andrew.cmu.edu




                            Abstract

Explainability techniques are important to understanding machine learning models used in decision-critical settings. We explore how pattern recognition techniques ought to couple requisite transparency with predictive power. By leveraging medical data with the task of predicting the onset of sepsis, we expose the most important features for the model prediction. We uncover how important training points and consensus feature attributions vary over the learning process of these models. We then pose a counterfactual question to explore trained predictors in the medical domain.
                            Overview

As machine learning becomes pervasive, the transparency and intelligibility of the underlying machine learning models precede adoption of these technologies (Doshi-Velez and Kim 2017; Lipton 2016). Recent machine learning interpretability techniques fall into two families: gradient-based methods, which compute the gradient of the output with respect to the input and treat gradient flow as a saliency map (Shrikumar, Greenside, and Kundaje 2017; Sundararajan, Taly, and Yan 2017), and perturbation-based methods, which approximate a complex model with a locally additive model and thus explain the difference between a test output-input pair and some reference output-input pair (Lundberg and Lee 2017; Ribeiro, Singh, and Guestrin 2016). While gradient-based techniques like (Shrikumar, Greenside, and Kundaje 2017) consider infinitesimal changes to the decision surface and take the first-order term of the Taylor expansion as the additive explanatory model, perturbation-based additive models consider the difference between an input and a reference vector. There have also been approaches that assign attributions to features with more complex counterfactuals than those discussed above for gradients (Kusner et al. 2017; Datta, Sen, and Zick 2016).

There is also a burgeoning line of research in using deep learning for medical diagnostics (Choi et al. 2016; Caruana et al. 2015; Rajkomar et al. 2018). Starting with tabular data (where each feature is semantically meaningful) from the medical domain (Purushotham et al. 2017), we fuse the aforementioned interpretability techniques into these nascent medical diagnostic models.

We note that medical professionals undergo rigorous training to learn how to determine the outcome for a given patient. This training mandates that doctors learn to proactively search for particular risk predictors upon seeing a patient; for example, a cardiologist learns to look at past patients to determine whether a patient has a given valve disease. Doctors are not only trained to identify which attributes of a patient (vital signs, personal information, family history, etc.) deem the patient at risk for a particular disease or outcome, but also develop an intuition from past patients based on years of experience; for example, if a doctor treated a patient with a rare disease over a decade ago, that patient can be extremely valuable to the doctor's diagnostic reasoning when attributes alone are uninformative about how to proceed. The doctor treats the patient from ten years ago as an anchor for future patients with similar symptoms. Over time, the doctor learns a larger set of diagnoses he or she feels comfortable making: this growth describes a powerful narrative that uncovers how a doctor reasons over time.

                            Dataset

MIMIC-III (Medical Information Mart for Intensive Care III) is a large electronic health record dataset comprised of health-related data from over 40,000 patients who were admitted to the critical care units of Beth Israel Deaconess Medical Center between 2001 and 2012 (Johnson et al. 2016). MIMIC-III consists of demographics, vital sign measurements, lab test results, medications, procedures, caregiver notes, imaging reports, and mortality of the ICU patients. Using the MIMIC-III dataset, we extracted seventeen real-valued features deemed critical to the sepsis diagnosis task as per (Purushotham et al. 2018). The processed features we extracted for every sepsis diagnosis (a binary variable indicating the presence of sepsis) are: Glasgow Coma Scale, Systolic Blood Pressure, Heart Rate, Body Temperature, PaO2/FiO2 ratio, Urine Output, Serum Urea Nitrogen Level, White Blood Cell Count, Serum Bicarbonate Level, Sodium Level, Potassium Level, Bilirubin Level, Age, Acquired Immunodeficiency Syndrome, Hematologic Malignancy, Metastatic Cancer, and Admission Type.

                            Approach

Once we train a predictor, f, on the aforementioned dataset, we use the attribution aggregation approach, AVA, proposed in (Bhatt, Ravikumar, and Moura 2019) to concurrently find influential patients and the features that were important to sepsis diagnoses in the test set.
First, let us introduce some notation. Let x ∈ R^d be a datapoint's feature vector, where x_i ∈ R is a specific feature of this datapoint. Let D = {x^(j)}_{j=1}^N represent the training datapoints, where D ∈ R^(d×N) is the entire training set in matrix form with D_{i,j} = x_i^(j). Let f be the learned predictor we wish to explain. Using the tractable approximation derived in (Koh and Liang 2017), we define the influence weight, ρ_j, of a training point, x^(j), on a test point, x_test, as:

\[ \rho_j = I_{\mathrm{up,loss}}\big(x^{(j)}, x_{\mathrm{test}}\big) = \left. \frac{d}{d\epsilon}\, L\big(f_{\epsilon,\, x^{(j)}},\, x_{\mathrm{test}}\big) \right|_{\epsilon = 0} \]
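To make the influence weight concrete, the sketch below ranks training points by an influence-style score in Python. It is our illustration, not the code used in this work: for brevity it replaces the inverse-Hessian-vector product of (Koh and Liang 2017) with a plain gradient-alignment surrogate, and the helper loss_grad is hypothetical.

```python
import numpy as np

def influence_scores(loss_grad, theta, train_points, test_point):
    """Score how much upweighting each training point would change the test loss.

    loss_grad(theta, point) -> gradient of the loss at `point` w.r.t. parameters theta.
    Surrogate: rho_j ~ -g_test . g_j, i.e. the Koh & Liang (2017) approximation
    with the inverse Hessian replaced by the identity.
    """
    g_test = loss_grad(theta, test_point)
    return np.array([-g_test @ loss_grad(theta, x_j) for x_j in train_points])

# The most influential past patient is then the training point with the largest
# score, e.g. train_points[np.argmax(scores)].
```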
Next, we find the feature attribution of x_test via an explanation function g. Suppose we let g be a Shapley value attribution from classical game theory, as in (Lundberg and Lee 2017). Then the attribution of the ith feature of a point x is given by the Shapley value: the sum of the contributions of the ith feature to f over all possible subsets S of the features F, weighted by R = |S|!(|F| − |S| − 1)!/|F|!:

\[ g_i(x) = \sum_{S \subseteq F \setminus \{i\}} R \,\Big( f_{S \cup \{i\}}\big(x_{S \cup \{i\}}\big) - f_S\big(x_S\big) \Big) \]

If we instead let g be the gradient-based attribution from (Sundararajan, Taly, and Yan 2017), the attribution of the ith feature of a point x is given by integrating the gradient of f along the ith dimension of x with respect to a baseline x̄:

\[ g_i(x) = \big(x_i - \bar{x}_i\big) \int_{\alpha=0}^{1} \frac{\partial f\big(\bar{x} + \alpha\,(x - \bar{x})\big)}{\partial x_i}\, d\alpha \]
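To illustrate the gradient-based attribution above, the following sketch approximates the path integral with a midpoint Riemann sum. The callable grad_f (returning the vector of partial derivatives of f at an input) is a hypothetical helper, and this is a minimal illustration rather than the implementation used here.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=50):
    """Approximate g_i(x) = (x_i - xbar_i) * integral_0^1 df(xbar + a(x - xbar))/dx_i da
    along the straight-line path from the baseline to the input."""
    alphas = (np.arange(steps) + 0.5) / steps                      # midpoints in (0, 1)
    path_grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * path_grads.mean(axis=0)                # Riemann-sum average
```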
Using the aggregation methodology from AVA, we aggregate the attributions for past patients (the training set) to explain model predictions for new patients (the test set). For a given test-set example, we provide an aggregate feature attribution to explain why the given prediction was made. The aggregate feature attribution can be found via the weighted variants of AVA from the original paper or via rank aggregation techniques like Borda count and Markov chain aggregation. We then provide the most important patient from the previously seen patients. We find this important patient, x_imp, as follows:

\[ x_{\mathrm{imp}} = \arg\max_{x^{(j)} \in D} \rho_j \]
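As a sketch of how the aggregation step can be carried out (our illustration under simplified assumptions; the exact weighting scheme of AVA is specified in Bhatt, Ravikumar, and Moura 2019), the helpers below combine per-training-point attributions either by influence weighting or by Borda count.

```python
import numpy as np

def weighted_aggregate(attributions, rho):
    """Influence-weighted aggregate of per-training-point attributions.

    attributions: (N, d) array, one attribution vector per training point.
    rho:          (N,) array of influence weights for those points.
    """
    w = np.abs(rho) / (np.abs(rho).sum() + 1e-12)        # normalize the weights
    return (w[:, None] * attributions).sum(axis=0)

def borda_aggregate(attributions):
    """Borda count: each training point ranks the d features by |attribution|;
    a feature earns (d - rank) points per voter, and the points are summed."""
    n, d = attributions.shape
    ranks = np.argsort(np.argsort(-np.abs(attributions), axis=1), axis=1)
    return (d - ranks).sum(axis=0)                       # higher score = more important
```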
We are interested in how, over time, different training points become more or less influential to the retrained predictor. To be concrete, to make a prediction on Day 10, the doctor can use a patient she saw on Day 5. After "retraining" her internal predictor that day, she can then use the patients from Day 10 to explain and predict patients on subsequent days. The influential anchors in the training data change as a function of time; therefore, model explanations capture how different patients serve as anchors based on the exhaustiveness of the predictor's training set. We are also interested in understanding how the predictor deals with unknown unknowns that lie in an uncharted portion of the feature space. The predictor might not be confident about its predictions in a given region, but as more training data is added, the predictor may be able to learn about a region unbeknownst to it a few time steps earlier. Note that we make no assumptions about the model class of f: we assume only black-box access to the predictor we wish to explain.
                          Experimentation

We first create different candidate feed-forward models to be explained and train them on the aforementioned sepsis dataset. We vary the depth of the models from one to three hidden layers, with ReLU or Sigmoid activations, and train them with the ADAM optimizer and a cross-entropy loss.
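As one concrete way to instantiate this grid of candidates (a sketch under our own assumptions: the paper does not report layer widths or other hyperparameters), scikit-learn's MLPClassifier covers the depth and activation choices while optimizing a cross-entropy (log) loss with ADAM.

```python
from sklearn.neural_network import MLPClassifier

# One candidate per (depth, activation) pair; a hidden width of 64 is our choice.
candidates = {
    f"{depth}-{activation}": MLPClassifier(
        hidden_layer_sizes=(64,) * depth,   # 1, 2, or 3 hidden layers
        activation=activation,              # 'relu' or 'logistic' (sigmoid)
        solver="adam",                      # ADAM optimizer
        max_iter=200,
    )
    for depth in (1, 2, 3)
    for activation in ("relu", "logistic")
}
# e.g. candidates["2-relu"].fit(X_train, y_train) trains one candidate on the sepsis data.
```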
To explain the models, we use SHAP, IG, and several proposed techniques: WSHAP is the AVA-weighted SHAP technique, WIG is the AVA-weighted IG technique, MIG is Markovian aggregation with IG attribution, MSHAP is Markovian aggregation with SHAP attribution, BIG is Borda aggregation with IG attribution, and BSHAP is Borda aggregation with SHAP attribution. We report the recall of a decision tree's gold set, averaged over all test instances of the sepsis dataset, in Table 1, as done in (Ribeiro, Singh, and Guestrin 2016). For these experiments, we fix k to be five, use the mean values of the training input as the region of perturbation for SHAP, and use the aforementioned greedy technique to determine m to be five. Note that random attribution will recall m/d, where d is the total number of features; for these experiments, random attribution will have a recall of 13%.
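Gold set recall, as used above, can be computed as in the following sketch (our illustration; gold_importances would come from the interpretable reference model, e.g. a decision tree's feature_importances_).

```python
import numpy as np

def gold_set_recall(attribution, gold_importances, m=5):
    """Fraction of the gold set (the interpretable model's top-m features)
    that also appears among the explanation's top-m features."""
    top_explained = set(np.argsort(-np.abs(attribution))[:m])
    gold_set = set(np.argsort(-np.abs(gold_importances))[:m])
    return len(top_explained & gold_set) / m
```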
m-Sensitivity

We also run experiments analyzing how gold set recall changes as a function of m, the size of the gold set. If m = d, then all attribution techniques, including random attribution, will have 100% recall. In Figure 1, we see that all attributions (other than random) recall a high percentage of the important features. As such, for all following experiments we set m to 5.
Expectation over Explanations

We cannot declare the absolute feature attribution for any arbitrary test point. We therefore aim to see how our methods perform in expectation by iterating 1000 times to find a probability distribution over the ranks of the explanation algorithms. On every iteration, we sample 100 test points at random with replacement, find explanations for those points using every method in question, use gold set recall to rank the methods, and record the ordinal position of each method. After 1000 iterations, we can say that with 55% probability weighted aggregation with SHAP attribution yields the best explanation (i.e., it takes the first position) and with 44% probability Markovian aggregation with SHAP attribution yields the best explanation. Interestingly, Markovian aggregation with SHAP attribution appears in the top two positions 99.5% of the time, while weighted aggregation with SHAP attribution appears there only 94.3% of the time. A graph of the distribution for each method can be found in Figure 2.
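A minimal sketch of this bootstrap ranking procedure is given below (our illustration: methods maps method names to explanation functions, gold returns the interpretable model's importances for a point, and gold_set_recall is the helper sketched earlier).

```python
import numpy as np

def rank_distribution(methods, gold, X_test, n_iters=1000, sample_size=100, m=5):
    """For each method, estimate a distribution over the rank it achieves when
    methods are ordered by mean gold set recall on a resampled batch of test points."""
    rng = np.random.default_rng(0)
    counts = {name: np.zeros(len(methods), dtype=int) for name in methods}
    for _ in range(n_iters):
        idx = rng.choice(len(X_test), size=sample_size, replace=True)
        mean_recall = {
            name: np.mean([gold_set_recall(explain(X_test[i]), gold(X_test[i]), m)
                           for i in idx])
            for name, explain in methods.items()
        }
        ordering = sorted(methods, key=mean_recall.get, reverse=True)   # best first
        for position, name in enumerate(ordering):
            counts[name][position] += 1
    return {name: c / n_iters for name, c in counts.items()}
```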
Model        Accuracy   SHAP   IG   WSHAP   WIG   BSHAP   BIG   MSHAP   MIG
1-Sigmoid    85.3       60     29   68      37    65      26    67      31
1-ReLU       82.8       62     33   69      47    65      37    69      38
2-Sigmoid    86.7       61     34   75      41    73      75    76      40
2-ReLU       87.2       55     35   64      35    60      30    62      33
3-Sigmoid    83         64     31   68      41    67      29    71      31
3-ReLU       87         55     38   65      48    57      44    64      43

Table 1: Gold set recall on important features from an interpretable classifier to explain models trained on the sepsis dataset




Figure 1: m-Sensitivity for the sepsis dataset for different models trained with the ADAM optimizer: (a) recall for the 2-hidden-layer Sigmoid model; (b) recall for the 2-hidden-layer ReLU model

Figure 2: Representative distribution of method ranks

For every iteration of every method, we can also keep track of the position of each feature. This gives us a probability distribution over the rankings of each feature, which offers better insight into how important features are in expectation. In Figure 3, we find that sofa and sepsis cdc appear in the top position of importance among all explanations most often: this is expected, because both are highly correlated with the onset of sepsis. Simultaneously, as a sanity check, we find that race (e.g., hispanic) does not matter (it appears at a lower rank) in expectation for all explanations; the top model therefore learns not to correlate race and sepsis.

Figure 3: Feature rank distribution for race hispanic, sepsis cdc, and sofa (left to right)

                       Counterfactual Intuition

It is instructive to consider the counterfactual entailed in temporal explanations. Feature attribution techniques like (Sundararajan, Taly, and Yan 2017) calculate attribution by finding the partial derivative of the output with respect to every input feature (Ancona et al. 2018). One perspective of this is as a counterfactual of how perturbing the j-th input infinitesimally would perturb the learnt predictor f.
Indeed, such counterfactual intuition allows humans to reason about the impact of a cause by taking the absence of the cause as the baseline: from there, one can tell the importance of a cause by seeing how the output changes in the cause's absence. Influence functions from (Koh and Liang 2017) consider the counterfactual of how upweighting a training data point x infinitesimally will affect the loss at a test point x_test.

The counterfactual posed by temporal explanations is as follows: which perturbations of a training point (a past patient), when used to train the predictor (the doctor), would influence the test prediction (the current patient) the most? Suppose we perturb a training point (past patient) as x_δ = x + δ, and denote by f_{ε, x_δ, −x} the predictor obtained by substituting the training data point x with x_δ and upweighting it by some constant ε. Then the counterfactual mentioned above, at a test point x_test, would compute:

\[ \nabla_\delta \left. \frac{d}{d\epsilon}\, L\big(f_{\epsilon,\, x_\delta,\, -x},\, x_{\mathrm{test}}\big) \right|_{\epsilon = 0,\; \delta = 0}. \]

We "freeze" the predictor function, since we only have black-box access to the model, and ask: given influential training points, what would be the change to the frozen, trained predictor's behavior as we perturb those training points? This counterfactual allows us to create explanations that capture global patterns in the local neighborhood of the test point, which lets users audit the global trends of a model whilst still retaining the fidelity of local explanations. Essentially, we can explore the following: had a doctor seen a different patient at time step t − 1, one who would have expanded the doctor's (that is, the predictor's) understanding of the feature space, would the doctor have made a different prediction at time step t? Such an understanding would not only help debug predictors learned from real data but also ensure diagnostic models align with doctor intuition.

                           References

Ancona, M.; Ceolini, E.; Oztireli, C.; and Gross, M. 2018. Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations.

Bhatt, U.; Ravikumar, P.; and Moura, J. M. F. 2019. Towards aggregating weighted feature attributions. In AAAI 2019 Workshop on Network Interpretability. abs/1901.10040.

Caruana, R.; Lou, Y.; Gehrke, J.; Koch, P.; Sturm, M.; and Elhadad, N. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, 1721–1730.

Choi, E.; Bahadori, M. T.; Schuetz, A.; Stewart, W. F.; and Sun, J. 2016. Doctor AI: Predicting clinical events via recurrent neural networks. In Doshi-Velez, F.; Fackler, J.; Kale, D.; Wallace, B.; and Wiens, J., eds., Proceedings of the 1st Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, 301–318. Children's Hospital LA, Los Angeles, CA, USA: PMLR.

Datta, A.; Sen, S.; and Zick, Y. 2016. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), 598–617.

Doshi-Velez, F., and Kim, B. 2017. Towards a rigorous science of interpretable machine learning.

Johnson, A. E.; Pollard, T. J.; Shen, L.; Lehman, L.-w. H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L. A.; Mark, R. G.; et al. 2016. MIMIC-III, a freely accessible critical care database.

Koh, P. W., and Liang, P. 2017. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 1885–1894. International Convention Centre, Sydney, Australia: PMLR.

Kusner, M. J.; Loftus, J.; Russell, C.; and Silva, R. 2017. Counterfactual fairness. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30, 4066–4076. Curran Associates, Inc.

Lipton, Z. C. 2016. The mythos of model interpretability.

Lundberg, S. M., and Lee, S.-I. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, 4765–4774.

Purushotham, S.; Meng, C.; Che, Z.; and Liu, Y. 2017. Benchmark of deep learning models on large healthcare MIMIC datasets.

Purushotham, S.; Meng, C.; Che, Z.; and Liu, Y. 2018. Benchmarking deep learning models on large healthcare datasets. Journal of Biomedical Informatics 83:112–134.

Rajkomar, A.; Oren, E.; Chen, K.; Dai, A. M.; Hajaj, N.; Liu, P. J.; Liu, X.; Sun, M.; Sundberg, P.; Yee, H.; Zhang, K.; Duggan, G. E.; Flores, G.; Hardt, M.; Irvine, J.; Le, Q. V.; Litsch, K.; Marcus, J.; Mossin, A.; Tansuwan, J.; Wang, D.; Wexler, J.; Wilson, J.; Ludwig, D.; Volchenboum, S. L.; Chou, K.; Pearson, M.; Madabushi, S.; Shah, N. H.; Butte, A. J.; Howell, M.; Cui, C.; Corrado, G.; and Dean, J. 2018. Scalable and accurate deep learning for electronic health records. CoRR abs/1801.07860.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144.

Shrikumar, A.; Greenside, P.; and Kundaje, A. 2017. Learning important features through propagating activation differences. CoRR abs/1704.02685.

Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 3319–3328. International Convention Centre, Sydney, Australia: PMLR.