Explaining Deep Classification of Time-Series Data with Learned Prototypes

Alan H. Gee∗,1,2, Diego Garcia-Olano∗,1,2, Joydeep Ghosh1 and David Paydarfar2
1 Electrical and Computer Engineering, The University of Texas at Austin
2 Neurology, Dell Medical School, The University of Texas at Austin
{alangee, diegoolano}@utexas.edu, ghosh@ece.utexas.edu, david.paydarfar@austin.utexas.edu


Abstract

The emergence of deep learning networks raises a need for explainable AI so that users and domain experts can be confident applying them to high-risk decisions. In this paper, we leverage data from the latent space induced by deep learning models to learn stereotypical representations or “prototypes” during training to elucidate the algorithmic decision-making process. We study how leveraging prototypes affects classification decisions of two-dimensional time-series data in a few different settings: (1) electrocardiogram (ECG) waveforms to detect clinical bradycardia, a slowing of heart rate, in preterm infants, (2) respiration waveforms to detect apnea of prematurity, and (3) audio waveforms to classify spoken digits. We improve upon existing models by optimizing for increased prototype diversity and robustness, visualize how these prototypes in the latent space are used by the model to distinguish classes, and show that prototypes are capable of learning features on two-dimensional time-series data to produce explainable insights during classification tasks. We show that the prototypes are capable of learning real-world features - bradycardia in ECG, apnea in respiration, and articulation in speech - as well as features within sub-classes. Our novel work leverages a learned prototypical framework on two-dimensional time-series data to produce explainable insights during classification tasks.

1 Introduction

Despite the recent surge of machine learning, adoption of deep learning models in decision critical domains, such as healthcare, has been slow because of limited transparency and explanations in black-box algorithms. This observation points to the critical need for black-box models to offer interpretable, faithful explanations of their decisions so that practitioners in high-risk domains can trust model outputs and leverage their results. One such high-risk domain is treating preterm infants (∼10% of births worldwide) in the neonatal intensive care unit (NICU).

A common disorder observed in the majority of preterm infants is recurrent episodes of apnea (cessation of breathing) and bradycardia (slowing of heart rate). Both of these spontaneous events may cause end organ damage related to hypoxemia (low oxygenation of blood) and ischemia (reduced blood flow) [Martin and Wilson, 2012]. Early detection of apnea and bradycardia can help prevent hypoxic-ischemic injury in tissue with high-metabolic demands [Schmid et al., 2015; Pichler et al., 2003] and prevent the cascade into intermittent hypoxia, which leads to complications of retinopathy, developmental delays, and neuropsychiatric disorders [Williamson et al., 2013; Poets et al., 2015; Di Fiore et al., 2015]. Leveraging explainability in deep neural network classification of these time series can reveal complex morphological and physiological features that clinicians may not readily see. Thus, machine learning algorithms need transparency in their decision-making process to highlight subtle patterns. One such technique in deep explainability is prototypes, a case-based reasoning technique.

Prototypes are representative examples, learned in-process during model training, that describe influential data regions in latent representations and provide insight into aggregated features across training data that are utilized by the model for classification. In contrast to post-hoc explainability, which trains a secondary model to infer decision reasoning from a primary model by only leveraging inputs and outputs, in-process explainable methods offer faithful explanations of a primary model’s decisions [Rudin, 2018]. So, users who employ prototypes can confidently gain direct insight into the decisions algorithms are making for classification tasks.

On data with unclear class boundaries, in-process methods can misbehave. For example, when the model in [Li et al., 2017] is applied to the MNIST dataset, the prototypes easily separate in the latent space because the latent data representation is separable and well-structured (Fig. 1). However, when class boundaries and features do not form distinguishable clusters, learned prototypes become archetypes (extreme corner cases) that exist near the convex hull of the data in the latent space (Fig. 4). This phenomenon yields prototypes that represent extreme class types (i.e. archetypes) and can underperform on classifying data in overlapping class regions.

In this work, we provide a deep classification method with

∗ Equal Contribution
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Figure 1: Learned prototypes of handwritten digits (MNIST) using the architecture from [Li et al., 2017]. While colors represent the handwritten digits 0-9, the labels represent the learned prototypes. Because the latent representations of MNIST cluster distinctly, the prototypes are diverse. This may not be true when classes overlap.

Figure 2: Prototype Architecture from [Li et al., 2017]

explainable insights for health time-series data. We introduce a prototype diversity penalty that explicitly accounts for prototype clustering and encourages the model to learn more diverse prototypes. These diverse prototypes will help focus on areas of the latent space where class separation is most difficult and least defined to improve classification accuracies. We show the utility of this approach on three tasks in two-dimensional time-series classification: (1) bradycardia from ECG; (2) apnea from respiration; and (3) spoken digits from audio waveforms. The two-dimensional representation of time-series provides an interpretable method for domain experts (e.g. clinicians) to understand the evolution of clinically relevant features based on visible phenotypes in time-series data. Our work enables a closed-loop collaboration between experts and machine learning algorithms to accelerate the efficacy of outcome predictions. The learning algorithms can find nuanced features through development of explainable prototypes, and the experts can fine-tune the algorithms by providing feedback through the regularization of the diversity penalty. This is especially important for clinician experts who need explainability in black-box models to understand and diagnose different pathological mechanisms. To the best of our knowledge, this is the first application of prototypes and latent space analysis for health time-series data that could help reveal clinically relevant and explainable phenotypes to improve the baseline for standard of care with automatic monitoring and detection.

1.1 Relevant Work

Explainable methods [Ribeiro et al., 2016; Caruana et al., 2015; Zhou et al., 2015] have largely focused on labeled image and tabular data sets where classes are clearly separable and less so on time-series data in general. Recent work has focused on using prototypes to provide in-process explainability of classification models, either by learning meaningful pixels in the entire image [Li et al., 2017] or by applying attention through the use of sub-regions or patches over an image [Chen et al., 2018]. Class activation maps (CAMs) provide probability maps to highlight areas of images that lead to a certain prediction [Zhou et al., 2015], but do not give prototypical examples of the data or explanations of how the training data relates to the end result. We focus on the former work [Li et al., 2017] for example-based explainability where the generation of prototypes is intended to look like global representations of the training data.

Time-series classification on 1-D data with deep neural networks is a rapidly growing field, with almost 9,000 deep learning models [Fawaz et al., 2018; Pons et al., 2017; Faust et al., 2018; Goodfellow et al., 2018]. One such example leverages global average pooling to produce CAMs to provide explainability for a deep CNN to classify atrial fibrillation in ECG data [Goodfellow et al., 2018]. However, the number of available healthcare datasets, specifically ECG waveforms, is limited [Fawaz et al., 2018]. Within this context, time-series classification on ECG waveforms has been done on a small scale, typically with single beat or short-duration (10 s) arrhythmia classification [Faust et al., 2018; Yildirim et al., 2018].

2 Methods

2.1 Time-Series Explanation via Prototypes

We adopt the autoencoder-prototype architecture from [Li et al., 2017]. Let $X = \{(x_i, y_i)\}_{i=1}^{n}$ be the training set with $x_i \in \mathbb{R}^p$ and class labels $y_i \in \{1, ..., K\}$ for each training point $i \in \{1, ..., n\}$. The front-end autoencoder network learns a lower-dimension latent representation of the data with an encoder network, $f : \mathbb{R}^p \to \mathbb{R}^q$. The latent space is then projected back to the original dimension using a decoder function, $g : \mathbb{R}^q \to \mathbb{R}^p$. The latent representation, $f(x)$, is also passed to a feed-forward prototype network, $h : \mathbb{R}^q \to \mathbb{R}^K$, for classification. The prototype network learns $m$ prototype vectors, $p_1, p_2, ..., p_m \in \mathbb{R}^q$, using a four-layer fully-connected network over the latent space that learns a probability distribution over the class labels $y_i$ (Fig. 2). The learned prototypes can then be decoded using $g$ and examined to infer what the network has learned. The choice of $m$ is determined a priori, with larger values allowing for higher throughput and model capacity.

We improve prior work by adding a penalty for learned prototypes in the objective function of the above network to


increase prototype diversity and coverage of the data in latent representations. To align with the minimization of the objective function, this new prototype diversity penalty needs to be (1) small when prototypes are far apart, and (2) large when prototypes are close together. We can evaluate the feasibility of a set of prototypes by considering the distance of the two closest prototypes across all prototype combinations. So, we consider the average minimum squared L2 distance between any two prototypes, $p_i$, $p_j$, for our loss function. To achieve the desired property above, we take the inverse of this average distance:

\[
\mathrm{PDL}(p_1, ..., p_m) = \frac{1}{\log\left(\frac{1}{m}\sum_{j=1}^{m} \min_{i>j \in [1,m]} \|p_i - p_j\|_2^2\right) + \epsilon} \tag{1}
\]

The logarithm function tapers large distances so that the penalty does not quickly vanish, and the $\epsilon$ term is for numeric stability. By taking the inverse of the log of the prototype distances, we penalize prototypes that are close in distance while making sure the minimum distance between prototypes does not get too large. This prototype diversity loss (PDL) promotes coverage over the latent space. We update the objective function to:

\[
\begin{aligned}
L((f, g, h), X) = {} & E(h \circ f, X) + \lambda_R R(g \circ f, X) \\
& + \lambda_1 R_1(p_1, ..., p_m, X) \\
& + \lambda_2 R_2(p_1, ..., p_m, X) \\
& + \lambda_{pd}\, \mathrm{PDL}(p_1, ..., p_m)
\end{aligned} \tag{2}
\]

where $E$ is the classification (cross entropy) loss, $R$ is the reconstruction loss of the autoencoder (i.e. L2 norm), and $R_1$ and $R_2$ are the loss terms that relate the distances of the feature vectors to the prototype vectors in latent space [Li et al., 2017]:

\[
R_1(p_1, ..., p_m, X) = \frac{1}{m}\sum_{j=1}^{m} \min_{i \in [1,n]} \|p_j - f(x_i)\|_2^2, \tag{3}
\]

\[
R_2(p_1, ..., p_m, X) = \frac{1}{n}\sum_{i=1}^{n} \min_{j \in [1,m]} \|f(x_i) - p_j\|_2^2 \tag{4}
\]

The minimization of the $R_1$ loss term promotes each prototype vector to learn one of the encoded training examples, while the minimization of the $R_2$ loss promotes encoded training examples to be close to one of the prototypes. This balance gives meaningful pixel-to-pixel representations between the prototypes and training data.

We train our models with randomly shuffled batches of size 100 (ECG, Speech) and 125 (Respiration). We parameterize the number of prototypes (see supplement) and the regularization term $\lambda_{pd}$ for the classification tasks while keeping the other hyperparameters as in [Li et al., 2017].

2.2 Prototype Diversity Score

We adopt a version of the group fairness metric presented in [Mehrotra et al., 2018] and refer to it as the prototype diversity score, $\Psi$:

\[
\Psi = \frac{1}{Z}\sum_{i=1}^{t_p} |\phi_i| \tag{5}
\]

where $\phi_i$, $i \in \{1, ..., t\}$ is defined for a specific metric and $Z$ is the normalization constant. For the neighbor diversity metric $\Psi_N$, $\phi_i$ is the set of prototypes that have nearest neighbor $i$ and $Z$ is the number of prototypes $m$. For the class diversity metric $\Psi_C$, $\phi_i$ is the set of prototypes that are from class $i$ and $Z$ is the number of classes $K$. Higher scores will occur when prototypes have more unique elements. Thus, $\max(\Psi_D) = 1$.

2.3 Datasets

The neonatal intensive care unit (NICU) dataset is composed of two sources: (1) ECG and Respiration waveforms from PhysioNet’s PICS database [Gee et al., 2017; Goldberger et al., 2000]; and (2) ECG waveforms (500 Hz, Intellivue MP450) collected from a preterm infant over their entire stay (∼10 weeks) at Seton Medical Center Austin. The inclusion of (2) helps supplement the ECG events from (1). The image data used in this study are made publicly available.^1

The inter-breath intervals (IBIs) from the respiration were extracted using a standard peak finder. The respiration signals were clipped into 60 second segments that were normalized to zero-mean, unit variance. The R-R intervals for the ECG of the NICU dataset were extracted using a Morlet wavelet transformation of the ECG signal. An open-source peak finder was applied to the wavelet scale range (0.01 to 0.04 scales) related to QRS complex formation in the spectrogram. The ECG waveforms were clipped at 15 seconds with the event in the middle. All ECG segments were band-pass filtered from 3 to 45 Hz, scaled to zero-mean, unit-variance, and scaled to the median QRS complex amplitude. Images were then captured to mimic what a clinician would see upon investigation of an ECG signal. Waveforms with no

^1 https://physionet.org/physiobank/database/picsdb

Figure 3: Examples of waveforms for each task: (A) Electrocardiogram (ECG) waveforms related to bradycardia classification, (B) Respiration waveforms related to apnea classification, and (C) Speech waveforms for a particular speaker (Jackson). For (A) and (B) we classify the segments based on severity (i.e. time difference between peaks), and for (C) we classify based on digit class.
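As an illustration of Eqs. (1), (3) and (4) in Section 2.1, the following is a minimal pure-Python sketch, not the authors' implementation. The function names (`sq_dist`, `r1`, `r2`, `pdl`), the default `eps` and the toy vectors are assumptions made for this example. It also assumes that the stability term sits outside the logarithm and that the last prototype, which has no partner with index i > j, contributes no term to the sum.

```python
import math

def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def r1(prototypes, encoded):
    """Eq. (3): each prototype should sit near some encoded training example."""
    return sum(min(sq_dist(p, z) for z in encoded) for p in prototypes) / len(prototypes)

def r2(prototypes, encoded):
    """Eq. (4): each encoded training example should sit near some prototype."""
    return sum(min(sq_dist(z, p) for p in prototypes) for z in encoded) / len(encoded)

def pdl(prototypes, eps=1e-4):
    """Eq. (1): inverse log of the average distance to the closest later prototype."""
    m = len(prototypes)
    # Assumption: j = m has no partner i > j, so only j = 1..m-1 contribute.
    closest = [
        min(sq_dist(prototypes[i], prototypes[j]) for i in range(j + 1, m))
        for j in range(m - 1)
    ]
    avg = sum(closest) / m
    # Assumption: eps sits outside the log. The log is <= 0 when avg <= 1,
    # so this toy version is only meaningful for avg > 1.
    return 1.0 / (math.log(avg) + eps)

# Hypothetical 2-D latent vectors, for illustration only.
encoded = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
close_protos = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
far_protos = [[0.0, 0.0], [20.0, 0.0], [0.0, 20.0]]

print(r1(encoded[:2], encoded), r2(encoded[:2], encoded))
print(pdl(close_protos), pdl(far_protos))
```

In this toy case the well-separated prototype set receives the smaller penalty, which is the behavior the diversity penalty is meant to have. In Eq. (2) these terms would be weighted by $\lambda_1$, $\lambda_2$ and $\lambda_{pd}$ and added to the classification and reconstruction losses.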


Figure 4: Effect of loss regularization on the latent space and spread of prototypes for the NICU classification task using 10 prototypes with $\lambda_{pd} = 0$ (baseline) and $\lambda_{pd} = 10^3$. The second and third dimensions of a t-SNE projection on each space show prototypes with more coverage and diversity in the latter case.
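Projections such as the one in Figure 4 are produced with the procedure of Section 2.4: PCA, then cosine similarity, then t-SNE. The sketch below covers only the first two steps, using NumPy on random stand-in data. The helper names, the array sizes and the choice of 10 components (rather than 500) are assumptions made so the example runs quickly; it is not the authors' code.

```python
import numpy as np

def pca_reduce(latent, n_components):
    """Project the rows of `latent` onto their top principal components."""
    centered = latent - latent.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    # Fraction of total variance retained by the kept components.
    explained = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    return reduced, explained

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T

rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 40))  # hypothetical encoded examples f(x_i)
reduced, explained = pca_reduce(latent, n_components=10)
similarity = cosine_similarity_matrix(reduced)
print(reduced.shape, similarity.shape, round(explained, 3))
```

The similarity matrix would then be handed to a t-SNE implementation with three output dimensions; scikit-learn's `TSNE(n_components=3)` is one option, though the paper does not say which implementation was used.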


visibly distinguishable QRS complexes or respiratory peaks were discarded because these waveforms are too obscure for even a clinician expert to evaluate.

Class breakdowns for bradycardia in the ECG signal follow clinical thresholds [Perlman and Volpe, 1985]: $X_{ECG}$ = { normal (>100 beats per minute (bpm)): 1039, mild (100-80 bpm): 634, moderate (80-60 bpm): 306, severe (<60 bpm): 132 }. Moderate and severe events were combined into a single class. The class breakdown for apneas in respiration is: $X_{RESP}$ = { normal (1-3 s): 1939, mild (4-6 s): 1921, moderate/severe (> 6 s): 1487 }.

The Free Spoken Digit Dataset [Jackson et al., 2018] consists of 2000 audio clips (8 kHz) of four speakers repeating the digits 0 through 9, 50 times each. Each segment was normalized to zero-mean, unit-variance and clipped for white space (Fig. 3). This data can be thought of as “spoken MNIST”. We perform speaker classification and digit classification within a speaker.

2.4 Visualization of Latent Space

We use PCA to reduce the latent space vectors to a dimension of 500, which retains 98% of the variability. We then calculate the cosine similarity between these 500 dimensional vectors to produce a similarity matrix and use t-distributed stochastic neighbor embedding (t-SNE) from [Van der Maaten and Hinton, 2008] to reduce the 500 x 500 similarity matrix down to three dimensions for visualization purposes. This technique calculates the KL-Divergence between the higher-order dimensional latent space and the lower dimensional space used to represent the former visually. This approach is non-deterministic, so the global position in the lower space is uninformative and instead proximity to neighbors is the key insight to gain. Additionally, while the first two dimensions of the projection show the general spread of information, the second and third dimensions may be useful for visualizing within group information. Thus, we use the second and third dimensions for our visualizations.

3 Results

3.1 Classification of ECG with 2-D Prototypes

We test our prototype implementation on ECG waveforms related to bradycardia using the NICU data for a 3-class classification task using 10 prototypes. We treat the input waveforms as 2-D images and use a four-layer autoencoder to learn complex representations over the data.

We observe more diverse prototypes and comparable or better test accuracy with our model (93.1±0.4%) compared with 92.1±0.1% from the baseline model in [Li et al., 2017] (Table 1). Both models perform well on the classification of the normal class, as expected since normal waveforms have near-constant phase. Both models additionally have difficulty separating between the mild and moderate/severe classes, often confusing the classification between these two (see supplement). This behavior is expected since data near these


                     ECG: Bradycardia                               Respiration: Apnea
λpd     Acc.           ΨN             ΨC             Acc.           ΨN             ΨC
0       92.1 ± 0.1%    0.83 ± 0.04    0.78 ± 0.19    81.4 ± 3.6%    0.96 ± 0.07    1.00 ± 0.00
500     92.7 ± 1.0%    0.86 ± 0.07    0.89 ± 0.19    82.3 ± 3.8%    0.94 ± 0.09    1.00 ± 0.00
1e3     92.4 ± 1.3%    0.87 ± 0.11    0.89 ± 0.19    77.1 ± 0.6%    1.00 ± 0.00    1.00 ± 0.00
2e3     93.1 ± 0.4%    0.90 ± 0.04    1.00 ± 0.00    80.2 ± 2.5%    0.97 ± 0.04    0.84 ± 0.23

Table 1: Diversity score for neighbors ΨN and class ΨC . We report Ψ’s related to the epoch with the highest test accuracy. Our model,
λpd > 0, returns better accuracies and diversity scores (bold) than the baseline model, which is row λpd = 0, across ECG and Respiration
waveforms. (Model details: 3-class, 10-prototypes, learning rate = 0.002).
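The scores in Table 1 follow Section 2.2. One reading that is consistent with the statements that higher scores arise from more unique elements and that the maximum is 1 is to count the distinct nearest neighbors (for ΨN) or the distinct classes represented (for ΨC) and normalize by Z. The sketch below assumes that reading. The function names and the example inputs are hypothetical, and this is not the authors' code.

```python
def neighbor_diversity(nearest_neighbor_ids):
    """Psi_N under the assumed reading: distinct nearest neighbors / m prototypes."""
    return len(set(nearest_neighbor_ids)) / len(nearest_neighbor_ids)

def class_diversity(prototype_classes, num_classes):
    """Psi_C under the assumed reading: distinct classes covered / K classes."""
    return len(set(prototype_classes)) / num_classes

# Hypothetical example: 4 prototypes, two of which share training example 7
# as their nearest neighbor, in a 3-class problem where class 1 has no prototype.
nn_ids = [3, 7, 7, 12]
proto_classes = [0, 0, 2, 2]

print(neighbor_diversity(nn_ids))         # 3 distinct neighbors out of 4
print(class_diversity(proto_classes, 3))  # 2 classes covered out of 3
```

Under this reading, a score of 1 means every prototype has its own nearest neighbor (ΨN) or every class has at least one prototype (ΨC).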


two class boundaries are difficult to discern, even for domain experts, due to events existing in both classes with possible subtle time differences in cardiac firing. Our model also improves prototype diversity (Table 1) over the baseline model. This result suggests that the prototype diversity loss encourages exploration, through learning diverse prototypes, within the data represented in the latent space. As a result, our model finds more helpful features and prototypes and thus improves classification results.

Because prototypes are generated during training, we infer features that the algorithm utilized to classify waveforms at different points during training (Fig. 5). For example, by epoch 100, we see that some of the prototypes exhibit global morphological features of the normal waveform class after random initialization at epoch 0. As training progresses, we observe other complex phenotypes emerging: one prototype learns that large gaps in cardiac firings are important for identifying severe cases and another prototype learns that a consistent pattern of spikes is important for mild cases. Since the mild class shares mixed features of both normal and positive events, it is not surprising that more prototypes are needed in this class to learn subtleties of the class features (see supplement). Thus, prototypes highlight waveform structures that the algorithm deemed as important when trying to learn the classification of bradycardia. This finding aligns with the idea of clinicians using visible features present in a bradycardia (i.e. the increasing distance between QRS complexes) to decide whether or not a bradycardia exists in an image.

Figure 5: Prototype evolution with in-process explainability over training time. High level features are easily learned in early epochs of training, while more complex features are developed over time. The final nearest neighbors are depicted on the right. The prototypes correspond to a subset of the $\lambda_{pd} = 10^3$ latent space cloud in Figure 4. Model details: 3-class, 10-prototypes.

We compare the latent space of [Li et al., 2017] to the latent space of our model with prototype diversity loss via t-SNE projections, where proximity in 2-D space suggests that points are “close” in distance in the original latent space. We represent the learned prototypes by mapping each prototype to its nearest neighbor (Fig. 4). We find that by increasing our loss term, PDL, our model increases the local coverage of the prototypes compared with the baseline model (i.e. $\lambda_{pd} = 0$). However, if we regularize our loss term too much (i.e. $\lambda_{pd} > 10^4$), we begin to introduce clustering of prototypes and diversity suffers. Thus, with the additional prototype distance penalty, we achieve higher diversity scores and classification accuracies for various hyperparameters (Fig. 9).

3.2 Case Study with Prototypes: Exploring ECG Morphology and Bradycardia Classification

We observe that ECG events in a local neighborhood share similar QRS complex morphology, despite having different class labels and cardiac firing periods (Fig. 6, bottom). Even though we did not impose a class constraint, we observe that the algorithm found two separate features within the moderate/severe class that were important in the classification task (i.e. prototypes 2 and 10, shown at the top of Fig. 6). These two prototypes explore two different cardiac timings: prototype 2 exhibits a progressive delay in cardiac firing, while prototype 10 exhibits a large spontaneous delay. The incorporation of the prototype diversity loss encouraged this exploration of the latent space. These results suggest that there are physiologic dependencies (i.e. clustering based on cardiac morphology and function) that can be learned using our model to investigate physiological phenomena, and possibly applied to other clinical areas, like cardiac ischemia or apnea of prematurity in respiration - both exhibit visible, abnormal waveform behavior. This work provides a visualization tool for clinician experts to evaluate different morphologies of physiological time-series data.^2

3.3 Classification of Apnea in Respiration

Apnea of prematurity is common among preterm infants, and is visually apparent as a pause of inhalation and exhalation (i.e. absence of sinusoidal behavior) in the respiratory signal. We next test our prototype implementation on respiration

^2 https://github.com/alangee/ijcai19-ts-prototypes


Figure 6: Learned prototypes showcase the diversity of features that are important for understanding ECG morphology while classifying bradycardia events. (10-prototypes, $\lambda_{pd} = 10^4$).


Figure 7: Learned prototypes showcase the diversity of features         Figure 8: Learned prototypes from audio waveforms of spoken dig-
across classes that are important for understanding respiration mor-    its by Nicolas from the FSDD (λpd = 500).
phology while classifying apnea events. For this classification task,
we observe a variety of prototypes (at epoch 500) that learn vari-
ous cases with cessation of breathing (6 and 9 second gaps) and the     ological examples and generate learned prototypes that dis-
global features within the segment that are important for the model’s   tinctly relate to physiological behavior. For example, in Fig.
classification. (8-prototypes, λpd = 500).
                                                                        7, we see that algorithm finds segments that are related to
                                                                        periodic breathing of 9 second duration (moderate/severe).
waveforms that are related to apnea in a 3-class classifica-            These segments are physiologically different from normal ap-
tion task. We treat the input waveforms as 2-D images again,            neas of 6 seconds (mild), and clearly different from normal
since clinicians evaluate apneas through visual inspection of           breathing with periodicity of 1 second (Fig 7). In the set of
the respiration signal.                                                 eight learned prototypes, the algorithm finds three different
                                                                        classes easily, each with different respiratory phenomena, that
   We observe more diverse prototypes and comparable or
                                                                        are critical in the classifying various types of apneas.
better test accuracy with our model 82.3±3.8% compared
with 81.4±3.6% from the baseline model, and with overall
unique nearest neighbors (ΨN = 1) and class diversity (ΨC =             3.4   Spoken Digits Classification and Analysis
1) (Table 1). Both models have difficulty separating between            Speech abnormalities can be suggestive of underlying patho-
the event classes because data near these two class boundaries          logical dysfunction, and common features that clinicians vis-
are difficult to visually discern (i.e. 6 second gap versus 7 sec-      ibly discern in waveforms to assess speech include cadence,
ond gap) and have common behavior with regular respiratory              prosody, and syllable articulation. To aid in speech fea-
function that is found in the normal class. We find that the            ture detection, we assess our model on high-frequency audio
addition of a prototype diversity loss maintains or improves            waveforms of spoken digits (FSDD) from medically-normal
performance and yields more diverse prototypes (Table 1).               individuals. These digits are treated as 2-D images for 4-class
   We also note that the algorithm is able to discern physi-            speaker and 10-digit classification tasks with 4 and 10 proto-


                                                                       ences in the time series features, or conversely accentuate
                                                                       nuanced differences in learned prototypes as clinically impor-
                                                                       tant signs of impending adverse outcomes. Therefore, our im-
                                                                       plementation offers a collaborative method for clinician ex-
                                                                       perts to use their insight interactively with machine learning
                                                                       algorithms: increasing λpd promotes large observable differ-
                                                                       ences in the prototypes, while decreasing λpd promotes di-
                                                                       verse features and prototypes. In turn, our model enables a
                                                                       closed-loop feedback framework to accelerate phenotype dis-
                                                                       covery to lead clinicians to better-informed decisions.
                                                                          We evaluate the performance of our model on increas-
                                                                       ingly difficult physiological datasets to demonstrate the ef-
                                                                       fect of λpd . The ECG signal is more robust against move-
Figure 9: Accuracy and diversity metrics for the spoken digits ex-     ment artifact and produces a cleaner signal for the 2-D vi-
periments using the FSDD. We divide this dataset into two tasks: (1)   sualization task, whereas the respiration signal, which is
classifying the person speaking and (2) classifying the digit spoken   the resultant voltage change across diaphragm movement, is
within each person.                                                    highly susceptible to signal artifact. Additionally, speech
                                                                       waveforms are compressed, high-frequency waveforms (kHz)
                                                                       which make it difficult to visibly extract high-resolution fea-
types, respectively. The waveform envelope and syllables of            tures. We find that our model allocates more prototypes to
these spoken digits are discernible to the eye (see “six” and          learn the intricacies of the more indistinguishable classes (i.e.
“se-ven” in Fig 2) and, as such, make good candidates for              mild and moderate/severe) that are hard for a human to dis-
our image-based explainability model. We demonstrate some              cern, especially the mild cases because this class is a mixture
of the learned prototypes in Fig. 8, which show representa-            and intermediary of the two extreme classes.
tions the model finds useful in classifying digits for a given            We observe, however, that the high number of loss terms
speaker. Experiments show that by varying regularization of            creates a trade-off between prototype interpretability and
the prototype diversity penalty, we observe slightly better or         model accuracy. For example, we observe that for a small
similar accuracies when compared to the baseline model (Fig.           number of prototypes, we achieve near-perfect prototype re-
9). With a fine-tuned λpd we can increase diversity of the             construction but at the cost of classification accuracy. When
prototypes and correspondingly see improved accuracy and               the number of prototypes is large, we achieve higher ac-
data coverage (see supplement). For example, λpd = 500                 curacy but obtain noisy prototypes. In future implementa-
gives a higher diversity score across all tasks, indicating pro-       tions, we can replace the front-end autoencoder with a model
totypes with more unique nearest neighbors as compared with            that operates well on 1-D time series, like a recurrent neural
the baseline model (Fig 9).                                            network, to balance accuracy and prototype interpretability.
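The diversity scores discussed above can be illustrated with a minimal sketch. The definitions used here are assumptions made for illustration, not the paper's exact formulas: psi_n is taken to be the number of distinct nearest neighbors of the prototypes divided by the number of prototypes, and psi_c is taken to be the fraction of classes represented among those nearest neighbors. The function name and its arguments are hypothetical.

```python
import numpy as np

def nearest_neighbor_diversity(prototypes, encoded, labels):
    """Return (psi_n, psi_c) under the assumed definitions in the lead-in.

    prototypes: (m, d) array of prototype vectors in the latent space.
    encoded:    (n, d) array of encoded training points.
    labels:     (n,) array of class labels for the encoded points.
    """
    prototypes = np.asarray(prototypes, dtype=float)
    encoded = np.asarray(encoded, dtype=float)
    labels = np.asarray(labels)

    # Squared L2 distance between every prototype and every encoded point.
    diffs = prototypes[:, None, :] - encoded[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)

    # Index of the closest encoded point for each prototype.
    nearest = np.argmin(sq_dists, axis=1)

    # Assumed psi_n: distinct nearest neighbors per prototype.
    psi_n = len(np.unique(nearest)) / len(prototypes)
    # Assumed psi_c: share of classes covered by those neighbors.
    psi_c = len(np.unique(labels[nearest])) / len(np.unique(labels))
    return psi_n, psi_c
```

Under these assumed definitions, a score of 1 for both quantities means every prototype maps to a different training example and every class is represented by at least one prototype.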
   Experiments show that increasing the depth of the network              There has also been work on computing prototypical
and fine-tuning the learning rate lead to both increased accu-         patches over 2-D images to generate explainable sub-features
racy and diversity over all tasks. Similarly, recent data aug-         [Chen et al., 2018]. Extending the idea of patches to 1-D
mentation techniques in medical [Bahadori and Lipton, 2019]            time-series signals would allow for parsing the signal for sub-
and speech recognition [Park et al., 2019] domains could help          frequencies and features that could better explain how events
further improve performance. The purpose of this work, how-            are triggered. Nonetheless, the work presented in this paper
ever, is not to obtain the best performance on these tasks, but        provides a more robust prototype model to help explain al-
rather to show the utility of learned prototypes as faithful ex-       gorithmic behavior and decision-making in deep time-series
planations of decisions made by a model.                               classification tasks with promising results in clinically rele-
                                                                       vant datasets.
4    Discussion
We presented a new autoencoder-prototype model that pro-               Acknowledgement
motes diversity in learned prototypes by penalizing proto-             The authors would like to thank Sinead Williamson and the
types that are too close in squared L2 distance in the la-             three reviewers for providing helpful feedback and critical re-
tent space. The new term, λpd PDL(p1, ..., pm), in the loss            views of our work.
function (Eq. 2) promotes prototype diversity while improv-
ing classification accuracy and prototype coverage of data             References
represented in the latent space. These prototypes help ex-
plain which global features and representative segments in             [Bahadori and Lipton, 2019] Mohammad Taha Bahadori and
the training data are most useful for deep time-series classi-           Zachary Chase Lipton. Temporal-clustering invariance
fication. This in-process generation of prototypes offers ex-            in irregular healthcare time series.       arXiv preprint
plainable insights into deep classifiers.                                arXiv:1904.12206, 2019.
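As a rough illustration of how a diversity penalty on prototypes can behave, the sketch below penalizes prototypes whose nearest other prototype is close in squared L2 distance. The functional form (an inverse of the log of the mean nearest-prototype distance), the eps constant, and the function name are assumptions chosen for this example; the actual term is the one defined in Eq. 2, which is not reproduced in this section.

```python
import numpy as np

def prototype_diversity_penalty(prototypes, eps=1e-4):
    """Assumed penalty that grows as prototypes crowd together.

    prototypes: (m, d) array of prototype vectors, with m >= 2.
    """
    prototypes = np.asarray(prototypes, dtype=float)

    # Pairwise squared L2 distances between prototypes.
    diffs = prototypes[:, None, :] - prototypes[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)

    # Exclude each prototype's zero distance to itself.
    np.fill_diagonal(sq_dists, np.inf)

    # Squared distance from each prototype to its nearest other prototype.
    min_sq = sq_dists.min(axis=1)

    # log1p keeps the denominator non-negative; eps avoids division by zero.
    return 1.0 / (np.log1p(np.mean(min_sq)) + eps)

# Hypothetical use inside a training objective:
# total_loss = base_loss + lambda_pd * prototype_diversity_penalty(prototypes)
```

With this assumed form, collapsed prototypes give a penalty near 1/eps, while well-separated prototypes give a small penalty, so the weight λpd controls how strongly separation is encouraged.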
   Our model and results offer an important capability                 [Caruana et al., 2015] Rich Caruana, Yin Lou, Johannes
that previous works lack. Depending on the clinical context              Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad.
of the case, experts may want to either trivialize big differ-           Intelligible models for healthcare: Predicting pneumonia


   risk and hospital 30-day readmission. In Proceedings           [Park et al., 2019] Daniel S Park, William Chan, Yu Zhang,
   of the 21st ACM SIGKDD International Conference on                Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and
   Knowledge Discovery and Data Mining, KDD ’15, pages               Quoc V Le. Specaugment: A simple data augmentation
   1721–1730, New York, NY, USA, 2015. ACM.                          method for automatic speech recognition. arXiv preprint
[Chen et al., 2018] Chaofan Chen, Oscar Li, Alina Barnett,           arXiv:1904.08779, 2019.
   Jonathan Su, and Cynthia Rudin. This looks like that:          [Perlman and Volpe, 1985] Jeffrey M. Perlman and Joseph J.
   deep learning for interpretable image recognition. CoRR,          Volpe. Episodes of apnea and bradycardia in the preterm
   abs/1806.10574, 2018.                                             newborn: Impact on cerebral circulation. Pediatrics,
[Di Fiore et al., 2015] J.M. Di Fiore, E Gauda, R.J. Martin,         76(3):333–338, 1985.
   and P MacFarlane. Cardiorespiratory events in preterm          [Pichler et al., 2003] G. Pichler, B. Urlesberger, and
   infants: interventions and consequences. Journal Of Peri-         W. Muller. Impact of bradycardia on cerebral oxygenation
   natology, 36(251), 2015.                                          and cerebral blood volume using apnoea in preterm
[Faust et al., 2018] Oliver Faust, Yuki Hagiwara, Tan Jen            infants. Physio. Measurement, 24(3):671–680, 2003.
   Hong, Oh Shu Lih, and U Rajendra Acharya. Deep learn-          [Poets et al., 2015] Christian F. Poets, Robin S. Roberts,
   ing for healthcare applications based on physiological sig-       Barbara Schmidt, Robin K. Whyte, Elizabeth V. Aszta-
   nals: A review. Computer Methods and Programs in                  los, David Bader, Aida Bairam, Diane Moddemann, Abra-
   Biomedicine, 161:1 – 13, 2018.                                    ham Peliowski, Yacov Rabi, Alfonso Solimano, and Har-
[Fawaz et al., 2018] Hassan Ismail Fawaz,             Germain        vey Nelson. Association between intermittent hypoxemia
   Forestier, Jonathan Weber, Lhassane Idoumghar, and                or bradycardia and late death or disability in extremely
   Pierre-Alain Muller.       Deep learning for time series          preterm infants. JAMA, 314(6):595–603, 08 2015.
   classification: a review. CoRR, abs/1809.04356, 2018.          [Pons et al., 2017] Jordi Pons, Oriol Nieto, Matthew
[Gee et al., 2017] A. H. Gee, R. Barbieri, D. Paydarfar, and         Prockup, Erik M. Schmidt, Andreas F. Ehmann, and
                                                                     Xavier Serra.      End-to-end learning for music audio
   P. Indic. Predicting bradycardia in preterm infants using
                                                                     tagging at scale. CoRR, abs/1711.02520, 2017.
   point process analysis of heart rate. IEEE Transactions on
   Biomedical Engineering, 64(9):2300–2308, 2017.                 [Ribeiro et al., 2016] Marco Túlio Ribeiro, Sameer Singh,
[Goldberger et al., 2000] Ary L. Goldberger, Luis A. N.              and Carlos Guestrin.        "Why should I trust you?":
                                                                     Explaining the predictions of any classifier. CoRR,
   Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch.
                                                                     abs/1602.04938, 2016.
   Ivanov, Roger G. Mark, Joseph E. Mietus, George B.
   Moody, Chung-Kang Peng, and H. Eugene Stanley. Phys-           [Rudin, 2018] Cynthia Rudin.        Please stop explaining
   ioBank, PhysioToolkit, and PhysioNet: Components of               black box models for high stakes decisions. CoRR,
   a new research resource for complex physiologic signals.          abs/1811.10154, 11 2018.
   Circulation, 101(23):e215–e220, June 2000.                     [Schmid et al., 2015] M.B. Schmid, R.J. Hopfner, S. Lenhof,
[Goodfellow et al., 2018] Sebastian Goodfellow, Andrew               H.D. Hummler, and H. Fuchs. Cerebral oxygenation dur-
   Goodwin, Danny Eytan, Robert Greer, Mjaye Mazwi, and              ing intermittent hypoxemia and bradycardia in preterm in-
   Peter Laussen. Towards understanding ECG rhythm clas-             fants. Neonatology, 107:137–146, 2015.
   sification using convolutional neural networks and atten-      [Van der Maaten and Hinton, 2008] L. Van der Maaten and
   tion mappings. In Proceedings of Machine Learning for             G. Hinton. Visualizing data using t-SNE. Journal of Ma-
   Healthcare, MLHC ’18, pages 2243–2251, 08 2018.                   chine Learning Research, 9:2579–2605, 2008.
[Jackson et al., 2018] Zohar Jackson, César Souza, Jason         [Williamson et al., 2013] James R. Williamson, Daniel W.
   Flaks, Yuxin Pan, Hereman Nicolas, and Adhish Thite.              Bliss, and David Paydarfar. Forecasting respiratory col-
   Free spoken digit dataset (FSDD). 2018.                           lapse: Theory and practice for averting life-threatening
[Li et al., 2017] Oscar Li, Hao Liu, Chaofan Chen, and               infant apneas. Respiratory Physiology & Neurobiology,
   Cynthia Rudin. Deep learning for case-based reasoning             189(2):223 – 231, 2013.
   through prototypes: A neural network that explains its pre-    [Yildirim et al., 2018] Ozal Yildirim, Pawel Plawiak, Ru-
   dictions. CoRR, abs/1710.04806, 2017.                             San Tan, and U. Rajendra Acharya. Arrhythmia detec-
[Martin and Wilson, 2012] Richard J. Martin and Christo-             tion using deep convolutional neural network with long
   pher G. Wilson. Apnea of prematurity. pages 2923–2931,            duration ECG signals. Computers in Biology and Medicine,
   2012.                                                             102:411 – 420, 2018.
[Mehrotra et al., 2018] Rishabh Mehrotra, James McIner-           [Zhou et al., 2015] Bolei Zhou, Aditya Khosla, Àgata
   ney, Hugues Bouchard, Mounia Lalmas, and Fernando                 Lapedriza, Aude Oliva, and Antonio Torralba. Learn-
   Diaz. Towards a fair marketplace: Counterfactual eval-            ing deep features for discriminative localization. CoRR,
   uation of the trade-off between relevance, fairness & satis-      abs/1512.04150, 2015.
   faction in recommendation systems. In Proceedings of the
   27th ACM International Conference on Information and
   Knowledge Management, pages 2243–2251, 2018.

