<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Explaining Deep Classification of Time-Series Data with Learned Prototypes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alan H. Gee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego Garcia-Olano</string-name>
          <email>diegoolanog@utexas.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joydeep Ghosh</string-name>
          <email>ghosh@ece.utexas.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Paydarfar</string-name>
          <email>david.paydarfar@austin.utexas.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Electrical and Computer Engineering, The University of Texas at Austin</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Neurology, Dell Medical School, The University of Texas at Austin</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The emergence of deep learning networks raises a need for explainable AI so that users and domain experts can be confident applying them to high-risk decisions. In this paper, we leverage data from the latent space induced by deep learning models to learn stereotypical representations, or “prototypes”, during training to elucidate the algorithmic decision-making process. We study how prototypes affect classification decisions for two-dimensional time-series data in three settings: (1) electrocardiogram (ECG) waveforms, to detect clinical bradycardia (a slowing of heart rate) in preterm infants; (2) respiration waveforms, to detect apnea of prematurity; and (3) audio waveforms, to classify spoken digits. We improve upon existing models by optimizing for increased prototype diversity and robustness, and we visualize how the model uses prototypes in the latent space to distinguish classes. We show that the prototypes are capable of learning real-world features - bradycardia in ECG, apnea in respiration, and articulation in speech - as well as features within sub-classes. Our novel work leverages a learned prototypical framework on two-dimensional time-series data to produce explainable insights during classification tasks. (Equal contribution.)</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>Despite the recent surge of machine learning, adoption of
deep learning models in decision critical domains, such as
healthcare, has been slow because of limited transparency
and explanations in black-box algorithms. This observation
points to the critical need for black-box models to offer
interpretable, faithful explanations of their decisions so that
practitioners in high-risk domains can trust model outputs and
leverage their results. One such high-risk domain is treating
preterm infants (∼10% of births worldwide) in the neonatal
intensive care unit (NICU).</p>
<p>A common disorder observed in the majority of preterm
infants is recurrent episodes of apnea (cessation of breathing)
and bradycardia (slowing of heart rate). Both of these
spontaneous events may cause end organ damage related to
hypoxemia (low oxygenation of blood) and ischemia (reduced
blood flow) [Martin and Wilson, 2012]. Early detection of
apnea and bradycardia can help prevent hypoxic-ischemic
injury in tissue with high-metabolic demands [Schmid et al.,
2015; Pichler et al., 2003] and prevent the cascade into
intermittent hypoxia, which leads to complications of
retinopathy, developmental delays, and neuropsychiatric disorders
[Williamson et al., 2013; Poets et al., 2015; Di Fiore et al.,
2015]. Leveraging explainability in deep neural network
classification of these time series can reveal complex
morphological and physiological features that clinicians may not readily
see. Thus, machine learning algorithms need transparency
in their decision-making process to highlight subtle patterns.
One such technique in deep explainability is prototypes, a
case-based reasoning technique.</p>
      <p>Prototypes are representative examples, learned in-process
during model training, that describe influential data regions
in latent representations and provide insight into aggregated
features across training data that are utilized by the model for
classification. In contrast to post-hoc explainability, which
trains a secondary model to infer decision reasoning from
a primary model by only leveraging inputs and outputs,
in-process explainable methods offer faithful explanations of a
primary model’s decisions [Rudin, 2018]. So, users who
employ prototypes can confidently gain direct insight into the
decisions algorithms are making for classification tasks.</p>
      <p>On data with unclear class boundaries, in-process methods
can misbehave. For example when the model in [Li et al.,
2017] is applied to the MNIST dataset, the prototypes
easily separate in the latent space because the latent data
representation is separable and well-structured (Fig 1). However,
when class boundaries and features do not form
distinguishable clusters, learned prototypes become archetypes (extreme
corner cases) that exist near the convex hull of the data in the
latent space (Fig. 4). This phenomenon yields prototypes that
represent extreme class types (i.e. archetypes) and can
underperform on classifying data in overlapping class regions.</p>
      <p>In this work, we provide a deep classification method with
explainable insights for health time-series data. We introduce
a prototype diversity penalty that explicitly accounts for
prototype clustering and encourages the model to learn more
diverse prototypes. These diverse prototypes will help focus
on areas of the latent space where class separation is most
difficult and least defined to improve classification
accuracies. We show the utility of this approach on three tasks in
two-dimensional time-series classification: (1) bradycardia
from ECG; (2) apnea from respiration; and (3) spoken digits
from audio waveforms. The two-dimensional representation
of time-series provides an interpretable method for domain
experts (e.g. clinicians) to understand the evolution of
clinically relevant features based on visible phenotypes in
timeseries data. Our work enables a closed-loop collaboration
between experts and machine learning algorithms to accelerate
the efficacy of outcome predictions. The learning algorithms
can find nuanced features through development of explainable
prototypes, and the experts can fine-tune the algorithms by
providing feedback through the regularization of the diversity
penalty. This is especially important for clinician experts who
need explainability in black-box models to understand and
diagnose different pathological mechanisms. To the best of our
knowledge this is the first application of prototypes and
latent space analysis for health time-series data that could help
reveal clinically relevant and explainable phenotypes to
improve the baseline for standard of care with automatic
monitoring and detection.</p>
    </sec>
    <sec id="sec-2">
      <title>Relevant Work</title>
      <p>Explainable methods [Ribeiro et al., 2016; Caruana et al.,
2015; Zhou et al., 2015] have largely focused on labeled
image and tabular data sets where classes are clearly separable
and less so on time-series data in general. Recent work has
focused on using prototypes to provide in-process
explainability of classification models, either by learning
meaningful pixels in the entire image [Li et al., 2017] or by applying
attention through the use of sub-regions or patches over an
image [Chen et al., 2018]. Class attention maps (CAMs)
provide probability maps to highlight areas of images that lead
to a certain prediction [Zhou et al., 2015], but do not give
examples of prototypical examples of the data or explanations
of how the training data relates to the end result. We focus
on the former work [Li et al., 2017] for example-based
explainability where the generation of prototypes are intended
to look like global representations of the training data.</p>
      <p>Time-series classification on 1-D data with deep neural
networks is a rapidly growing field, with almost 9,000 deep
learning models [Fawaz et al., 2018; Pons et al., 2017;
Faust et al., 2018; Goodfellow et al., 2018]. One such
example leverages global average pooling to produce CAMs to
provide explainability for a deep CNN to classify atrial
fibrillation in ECG data [Goodfellow et al., 2018]. However,
the number of available healthcare datasets, specifically ECG
waveforms, is limited [Fawaz et al., 2018]. Within this
context, time-series classification on ECG waveforms has been
done on a small scale, typically with single-beat or short-duration (10 s) arrhythmia classification [Faust et al., 2018;
Yildirim et al., 2018].
We adopt the autoencoder-prototype architecture from [Li et al., 2017]. Let X = {(x_i, y_i)}_{i=1}^{n} be the training set, with x_i ∈ ℝ^p and class labels y_i ∈ {1, …, K} for each training point i ∈ {1, …, n}. The front-end autoencoder network learns a lower-dimensional latent representation of the data with an encoder network, f : ℝ^p → ℝ^q. The latent space is then projected back to the original dimension using a decoder function, g : ℝ^q → ℝ^p. The latent representation f(x) is also passed to a feed-forward prototype network, h : ℝ^q → ℝ^K, for classification. The prototype network learns m prototype vectors, p_1, p_2, …, p_m ∈ ℝ^q, using a four-layer fully-connected network over the latent space that learns a probability distribution over the class labels y_i (Fig 2). The learned prototypes can then be decoded using g and examined to infer what the network has learned. The choice of m is determined a priori, with larger values allowing for higher throughput and model capacity.</p>
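<p>For concreteness, the classification path just described can be sketched in a few lines of numpy. The sizes, weights, and names below are illustrative stand-ins, not the paper's trained networks or hyperparameters.</p>
<preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: input dim p, latent dim q, m prototypes, K classes.
p, q, m, K = 64, 10, 4, 3

W_enc = rng.normal(size=(q, p)) * 0.1   # stand-in for the encoder f
protos = rng.normal(size=(m, q))        # learned prototype vectors in latent space
W_cls = rng.normal(size=(K, m)) * 0.1   # prototype layer feeding the class logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(x):
    # f(x): encode the input into the latent space.
    z = W_enc @ x
    # Squared L2 distance from f(x) to every prototype p_j.
    d = ((protos - z) ** 2).sum(axis=1)
    # h: distances to the prototypes determine the class distribution.
    return softmax(W_cls @ d)

probs = classify(rng.normal(size=p))
```
</preformat>
<p>In the actual model the encoder, decoder, prototypes, and classification weights are trained jointly; the sketch only illustrates how prototype distances drive the class probabilities.</p>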
      <p>We improve prior work by adding a penalty for learned
prototypes in the objective function of the above network to
increase prototype diversity and coverage of the data in latent
representations. To align with the minimization of the
objective function, this new prototype diversity penalty needs to
be (1) small when distances between prototypes are far apart,
and (2) large when distances between prototypes are close in
distance. We can evaluate the feasibility of a set of
prototypes by considering the distance of the two closest
prototypes across all prototype combinations. So, we consider the
average minimum squared L2 distance between any two
prototypes, p_i and p_j, for our loss function. To achieve the desired property above, we take the inverse of the log of this average distance:</p>
<p>PDL(p_1, …, p_m) = 1 / log( (1/m) Σ_{j=1}^{m} min_{i&gt;j} ‖p_i − p_j‖₂² + ε ),  (1)</p>
<p>The logarithm tapers large distances so that the penalty does not vanish too quickly, and the ε term is for numeric stability. By taking the inverse of the log of the prototype distances, we penalize prototypes that are close together while making sure the minimum distance between prototypes does not grow too large. This prototype diversity loss (PDL) promotes coverage over the latent space. We update the objective function to:</p>
<p>L((f, g, h), X) = E(h ∘ f, X) + λ_R R(g ∘ f, X) + λ_1 R_1(p_1, …, p_m, X) + λ_2 R_2(p_1, …, p_m, X) + λ_pd PDL(p_1, …, p_m),  (2)</p>
<p>where E is the classification (cross-entropy) loss, R is the reconstruction loss of the autoencoder (i.e. the L2 norm), and R_1 and R_2 are the loss terms that relate the distances of the feature vectors to the prototype vectors in latent space [Li et al., 2017]:</p>
<p>R_1(p_1, …, p_m, X) = (1/m) Σ_{j=1}^{m} min_{i∈[1,n]} ‖p_j − f(x_i)‖₂²,  (3)</p>
<p>R_2(p_1, …, p_m, X) = (1/n) Σ_{i=1}^{n} min_{j∈[1,m]} ‖f(x_i) − p_j‖₂².  (4)</p>
<p>The minimization of the R_1 loss term promotes each prototype vector to learn one of the encoded training examples, while the minimization of the R_2 loss promotes encoded training examples to be close to one of the prototypes. This balance gives meaningful pixel-to-pixel representations between the prototypes and the training data.</p>
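<p>The three prototype-related penalties described above can be sketched in numpy as follows; the exact placement of ε and the variable names are assumptions of this sketch.</p>
<preformat>
```python
import numpy as np

def r1(protos, z):
    # Eq. (3): each prototype should be near some encoded training point.
    d = ((protos[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)  # (m, n)
    return d.min(axis=1).mean()

def r2(protos, z):
    # Eq. (4): each encoded training point should be near some prototype.
    d = ((z[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)  # (n, m)
    return d.min(axis=1).mean()

def pdl(protos, eps=1e-6):
    # Eq. (1): inverse log of the average minimum pairwise prototype
    # distance; the penalty grows when prototypes crowd together.
    m = protos.shape[0]
    d = ((protos[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    mins = [d[j, j + 1:].min() for j in range(m - 1)]
    return 1.0 / np.log(np.mean(mins) + eps)
```
</preformat>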
      <p>We train our models with a randomly shuffled batch size of
100 (ECG, Speech) and 125 (Respiration). We parameterize
the number of prototypes (see supplement) and the
regularization term λ_pd for the classification tasks while keeping the
other hyperparameters as in [Li et al., 2017].</p>
    </sec>
    <sec id="sec-3">
<title>Prototype Diversity Score</title>
<p>We adopt a version of the group fairness metric presented in
[Mehrotra et al., 2018] and refer to it as the prototype
diversity score, ψ:</p>
<p>ψ_t = (1/Z) Σ_{i=1}^{t} √|φ_i|,  (5)</p>
<p>where φ_i, i ∈ {1, …, t}, is defined for a specific metric and Z
is the normalization constant. For the neighbor diversity
metric ψ_N, φ_i is the set of prototypes that have nearest neighbor
i and Z is the number of prototypes m. For the class
diversity metric ψ_C, φ_i is the set of prototypes that are from
class i and Z is the number of classes K. Higher scores
occur when prototypes have more unique elements. Thus,
max(ψ) = 1.</p>
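<p>As a sketch, assuming the score sums √|φ_i| over the groups and normalizes by Z as described, the neighbor diversity metric can be computed as follows (function names are illustrative):</p>
<preformat>
```python
import numpy as np

def diversity_score(counts, Z):
    # Eq. (5): sum sqrt(|phi_i|) over groups, normalized by Z.
    return np.sqrt(np.asarray(counts, dtype=float)).sum() / Z

def neighbor_diversity(nearest_neighbor_ids):
    # phi_i groups prototypes by shared nearest training neighbor;
    # Z is the number of prototypes m. Score is 1 when all are unique.
    _, counts = np.unique(nearest_neighbor_ids, return_counts=True)
    return diversity_score(counts, Z=len(nearest_neighbor_ids))
```
</preformat>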
    </sec>
    <sec id="sec-4">
      <title>Datasets</title>
      <p>The neonatal intensive care unit (NICU) dataset is composed
of two sources: (1) ECG and Respiration waveforms from
PhysioNet’s PICS database [Gee et al., 2017; Goldberger
et al., 2000]; and (2) ECG waveforms (500 Hz, Intellivue
MP450) collected from a preterm infant over their entire stay
(∼10 weeks) at Seton Medical Center Austin. The inclusion
of (2) helps supplement the ECG events from (1). The image
data used in this study are made publicly available¹.</p>
      <p>The inter-breath intervals (IBIs) from the respiration were
extracted using a standard peak finder. The respiration
signals were clipped into 60 second segments that were
normalized to zero-mean, unit variance. The R-R intervals for
the ECG of the NICU dataset were extracted using a Morlet
wavelet transformation of the ECG signal. An open-source
peak finder was applied to the wavelet scale range (0.01 to 0.04 scales) related to QRS complex formation in the spectrogram. The ECG waveforms were clipped at 15 seconds with the event in the middle. All ECG segments were band-pass filtered from 3 to 45 Hz, scaled to zero-mean, unit-variance, and scaled to the median QRS complex amplitude.
Images were then captured to mimic what a clinician would
see upon investigation of an ECG signal. Waveforms with no</p>
      <sec id="sec-4-1">
        <title>1https://physionet.org/physiobank/database/picsdb</title>
        <p>visibly distinguishable QRS complexes or respiratory peaks
were discarded because these waveforms are too obscure for
even a clinician expert to evaluate.</p>
        <p>Class breakdowns for bradycardia in the ECG signal follow
clinical thresholds [Perlman and Volpe, 1985]: X_ECG = { normal (&gt;100 beats per minute (bpm)): 1039, mild (100-80 bpm): 634, moderate (80-60 bpm): 306, severe (&lt;60 bpm): 132 }. Moderate and severe events were combined into a single class. The class breakdown for apneas in respiration is: X_RESP = { normal (1-3 s): 1939, mild (4-6 s): 1921, moderate/severe (&gt;6 s): 1487 }.</p>
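<p>A minimal sketch of these class assignments; the handling of values falling exactly on a bin boundary is an assumption of this sketch:</p>
<preformat>
```python
def bradycardia_class(bpm):
    # Clinical thresholds from [Perlman and Volpe, 1985] as used above;
    # moderate and severe are merged into a single class.
    if bpm > 100:
        return "normal"
    if bpm > 80:
        return "mild"
    return "moderate/severe"

def apnea_class(pause_seconds):
    # Pause-duration bins from the text: normal 1-3 s, mild 4-6 s,
    # moderate/severe over 6 s.
    if pause_seconds > 6:
        return "moderate/severe"
    if pause_seconds > 3:
        return "mild"
    return "normal"
```
</preformat>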
        <p>The Free Spoken Digit Dataset [Jackson et al., 2018]
consists of 2000 audio clips (8 kHz) of four speakers
repeating the digits 0 through 9, 50 times each. Each segment
was normalized to zero-mean, unit-variance and clipped for
white space (Fig. 3). This data can be thought of as “spoken
MNIST”. We perform speaker classification and digit
classification within a speaker.</p>
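<p>The normalization and white-space clipping step can be sketched as follows; the silence threshold is an illustrative assumption, not a value from the paper:</p>
<preformat>
```python
import numpy as np

def preprocess_clip(signal, silence_ratio=0.05):
    # Trim leading/trailing near-silence relative to peak amplitude
    # (the 5% threshold is an assumption), then normalize the remaining
    # segment to zero mean and unit variance.
    x = np.asarray(signal, dtype=float)
    dev = np.abs(x - x.mean())
    active = np.where(dev > silence_ratio * dev.max())[0]
    x = x[active[0]:active[-1] + 1]
    return (x - x.mean()) / x.std()
```
</preformat>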
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Visualization of Latent Space</title>
      <p>We use PCA to reduce the latent space vectors to a
dimension of 500, which retains 98% of the variability. We
then calculate the cosine similarity between these 500
dimensional vectors to produce a similarity matrix and use
t-distributed stochastic neighbor embedding (t-SNE) from
[Van der Maaten and Hinton, 2008] to reduce the 500 x 500
similarity matrix down to three dimensions for visualization
purposes. This technique calculates the KL-Divergence
between the higher-order dimensional latent space and the lower
dimensional space used to represent the former visually. This
approach is non-deterministic so the global position in the
lower space is uninformative and instead proximity to
neighbors is the key insight to gain. Additionally while the first
two dimensions of the projection show the general spread of
information, the second and third dimensions maybe useful
for visualizing within group information. Thus, we use the
second and third dimensions for our visualizations.
3
3.1</p>
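<p>The first two stages of this pipeline (PCA reduction and the cosine similarity matrix) can be sketched in numpy; the t-SNE step would then be applied to the resulting similarity matrix with a library implementation. Function names here are illustrative.</p>
<preformat>
```python
import numpy as np

def pca_reduce(Z, k):
    # Project latent vectors onto the top-k principal components via SVD.
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:k].T

def cosine_similarity(X):
    # Pairwise cosine similarity between row vectors.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T
```
</preformat>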
      <sec id="sec-5-1">
        <title>Results</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Classification of ECG with 2-D Prototypes</title>
      <p>We test our prototype implementation on ECG waveforms
related to bradycardia using the NICU data for a 3-class
classification task using 10 prototypes. We treat the input waveforms
as 2-D images and use a four-layer autoencoder to learn
complex representations over the data.</p>
      <p>We observe more diverse prototypes and comparable or
better test accuracy with our model (93.1 ± 0.4%) compared with 92.1 ± 0.1% from the baseline model in [Li et al., 2017]
(Table 1). Both models perform well on the classification of
the normal class, as expected since normal waveforms have
near-constant phase. Both models additionally have difficulty
separating between the mild and moderate/severe classes,
often confusing the classification between these two (see
supplement). [Table 1 excerpt - ECG: Bradycardia, test accuracy vs. λ_pd ∈ {0, 500, 1e3, 2e3}: 92.1%, 92.7%, 92.4%, 93.1%.] This behavior is expected since data near these two class boundaries are difficult to discern, even for domain
experts, due to events existing in both classes with possible
subtle time differences in cardiac firing. Our model also
improves prototype diversity (Table 1) over the baseline model.
This result suggests that the prototype diversity loss
encourages exploration, through learning diverse prototypes, within
the data represented in the latent space. As a result, our model
finds more helpful features and prototypes and thus, improves
classification results.</p>
      <p>Because prototypes are generated during training, we
infer features that the algorithm utilized to classify waveforms
at different points during training (Fig 5). For example, by
epoch 100, we see that some of the prototypes exhibit global
morphological features of the normal waveform class after
random initialization at epoch 0. As training progresses, we
observe other complex phenotypes emerging: one prototype
learns that large gaps in cardiac firings are important for
identifying severe cases and another prototype learns the
consistent pattern of spikes are important for mild cases. Since the
mild class shares mixed features of both normal and positive
events, it is not surprising that more prototypes are needed in
this class to learn subtleties of the class features (see
supplement). Thus, prototypes highlight waveform structures that
the algorithm deemed as important when trying to learn the
classification of bradycardia. This finding aligns with the idea
of clinicians using visible features present in a bradycardia
(i.e. the increasing distance between QRS complexes) to
decide whether or not a bradycardia exists in an image.</p>
      <p>We compare the latent space of [Li et al., 2017] to the
latent space of our model with prototype diversity loss via
t-SNE projections, where proximity in 2-D space suggests that
points are “close” in distance in the original latent space. We
represent the learned prototypes by mapping each prototype
to its nearest neighbor (Fig 4). We find that by increasing
our loss term, PDL, our model increases the local coverage of the prototypes compared with the baseline model (i.e. λ_pd = 0). However, if we regularize our loss term too much (i.e. λ_pd &gt; 10⁴), we begin to introduce clustering of prototypes and diversity suffers. Thus, with the additional prototype distance penalty, we achieve higher diversity scores and classification accuracies for various hyperparameters (Fig 9).</p>
    </sec>
    <sec id="sec-8">
      <title>Case Study with Prototypes: Exploring ECG Morphology and Bradycardia Classification</title>
      <p>We observe that ECG events in a local neighborhood share
similar QRS complex morphology, despite having different
class labels and cardiac firing periods (Fig. 6, bottom). Even
though we did not impose a class constraint, we observe that
the algorithm found two separate features within the
moderate/severe class that were important in the classification task
(i.e. prototypes 2 and 10, shown at the top of Fig 6).
These two prototypes explore two different cardiac timings
as prototype 2 exhibits a progressive delay in cardiac firing,
while prototype 10 exhibits a large spontaneous delay. The
incorporation of the prototype diversity loss encouraged this
exploration of the latent space. These results suggest that
there are physiologic dependencies (i.e. clustering based on
cardiac morphology and function) that can be learned using
our model to investigate physiological phenomena, and
possibly applied to other clinical areas, like cardiac ischemia or
apnea of prematurity in respiration - both exhibit visible,
abnormal waveform behavior. This work provides a visualization
tool for clinician experts to evaluate different morphologies of physiological time-series data².</p>
    </sec>
    <sec id="sec-9">
      <title>Classification of Apnea in Respiration</title>
      <p>Apnea of prematurity is common among preterm infants, and
is visually apparent as a pause of inhalation and exhalation
(i.e. absence of sinusoidal behavior) in the respiratory
signal. We next test our prototype implementation on respiration</p>
      <sec id="sec-9-1">
        <title>2https://github.com/alangee/ijcai19-ts-prototypes</title>
        <p>waveforms that are related to apnea in a 3-class
classification task. We treat the input waveforms as 2-D images again,
since clinicians evaluate apneas through visual inspection of
the respiration signal.</p>
        <p>We observe more diverse prototypes and comparable or
better test accuracy with our model (82.3 ± 3.8%) compared with 81.4 ± 3.6% from the baseline model, and with overall unique nearest neighbors (ψ_N = 1) and class diversity (ψ_C =
1) (Table 1). Both models have difficulty separating between
the event classes because data near these two class boundaries
are difficult to visually discern (i.e. 6 second gap versus 7
second gap) and have common behavior with regular respiratory
function that is found in the normal class. We find that the
addition of a prototype diversity loss maintains or improves
performance and yields more diverse prototypes (Table 1).</p>
        <p>We also note that the algorithm is able to discern
physiological examples and generate learned prototypes that
distinctly relate to physiological behavior. For example, in Fig.
7, we see that algorithm finds segments that are related to
periodic breathing of 9 second duration (moderate/severe).
These segments are physiologically different from normal
apneas of 6 seconds (mild), and clearly different from normal
breathing with periodicity of 1 second (Fig 7). In the set of
eight learned prototypes, the algorithm finds three different
classes easily, each with different respiratory phenomena, that
are critical in the classifying various types of apneas.
3.4</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Spoken Digits Classification and Analysis</title>
      <p>Speech abnormalities can be suggestive of underlying
pathological dysfunction, and common features that clinicians
visibly discern in waveforms to assess speech include cadence,
prosody, and syllable articulation. To aid in speech
feature detection, we assess our model on high-frequency audio
waveforms of spoken digits (FSDD) from medically-normal
individuals. These digits are treated as 2-D images for 4 class
speaker and 10 digit classification tasks with 4 and 10
prototypes, respectively. The waveform envelope and syllables of
these spoken digits are discernible to the eye (see “six” and
“se-ven” in Fig 2) and, as such, make good candidates for
our image-based explainability model. We demonstrate some
of the learned prototypes in Fig. 8, which show
representations the model finds useful in classifying digits for a given
speaker. Experiments show that by varying regularization of
the prototype diversity penalty, we observe slightly better or
similar accuracies when compared to the baseline model (Fig.
9). With a fine-tuned λ_pd we can increase diversity of the prototypes and correspondingly see improved accuracy and data coverage (see supplement). For example, λ_pd = 500
gives a higher diversity score across all tasks, indicating
prototypes with more unique nearest neighbors as compared with
the baseline model (Fig 9).</p>
      <p>Experiments show that increasing the depth of the network
and fine-tuning the learning rate lead to both increased
accuracy and diversity over all tasks. Similarly, recent data
augmentation techniques in medical [Bahadori and Lipton, 2019]
and speech recognition [Park et al., 2019] domains could help
further improve performance. The purpose of this work,
however, is not to obtain the best performance on these tasks, but
rather to show the utility of learned prototypes as faithful
explanations of decisions made by a model.</p>
      <sec id="sec-10-1">
        <title>Discussion</title>
        <p>We presented a new autoencoder-prototype model that
promotes diversity in learned prototypes by penalizing
prototypes that are too close in squared L2 distance in the
latent space. The new term, pd P DL(p1; :::; pm), in the loss
function (Eq. 2) promotes prototype diversity while
improving classification accuracy and prototype coverage of data
represented in the latent space. These prototypes help
explain which global features and representative segments in
the training data are most useful for deep time-series
classification. This in-process generation of prototypes offers
explainable insights into deep classifiers.</p>
        <p>Our model and results provide an important significance
that previous works lack. Depending on the clinical context
of the case, experts may want to either trivialize big
differences in the time series features, or conversely accentuate
nuanced differences in learned prototypes as clinically
important signs of impending adverse outcomes. Therefore, our
implementation offers a collaborative method for clinician
experts to use their insight interactively with machine learning
algorithms: increasing λ_pd promotes large observable differences in the prototypes, while decreasing λ_pd promotes diverse features and prototypes. In turn, our model enables a closed-loop feedback framework that accelerates phenotype discovery and leads clinicians to better-informed decisions.</p>
        <p>We evaluate the performance of our model on
increasingly difficult physiological datasets to demonstrate the
effect of pd. The ECG signal is more robust against
movement artifact and produces a cleaner signal for the 2-D
visualization task, whereas the respiration signal, which is
the resultant voltage change across diaphragm movement, is
highly susceptible to signal artifact. Additionally, speech
waveforms are compressed, high-frequency waveforms (kHz)
which make it difficult to visibly extract high-resolution
features. We find that our model allocates more prototypes to
learn the intricacies of the more indistinguishable classes (i.e.
mild and moderate/severe) that are hard for a human to
discern, especially the mild cases because this class is a mixture
and intermediary of the two extreme classes.</p>
        <p>We observe, however, that the high number of loss terms
creates a trade-off between prototype interpretability and
model accuracy. For example, we observe that for a small
number of prototypes, we achieve near-perfect prototype
reconstruction but at the cost of classification accuracy. When
the number of prototypes was large, we achieve higher
accuracy but received noisy prototypes. In future
implementations, we can replace the front-end autoencoder with a model
that operates well on 1-D time series, like a recurrent neural
network, to balance accuracy and prototype interpretability.</p>
        <p>There has also been work on computing prototypical
patches over 2-D images to generate explainable sub-features
[Chen et al., 2018]. Extending the idea of patches to 1-D
time-series signals would allow for parsing the signal for
subfrequencies and features that could better explain how events
are triggered. Nonetheless, the work presented in this paper
provides a more robust prototype model to help explain
algorithmic behavior and decision-making in deep time-series
classification tasks with promising results in clinically
relevant datasets.</p>
      </sec>
      <sec id="sec-10-2">
        <title>Acknowledgement</title>
        <p>The authors would like to thank Sinead Williamson and the
three reviewers for providing helpful feedback and critical
reviews of our work.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Bahadori and Lipton, 2019] Mohammad Taha Bahadori and Zachary Chase Lipton.
          <article-title>Temporal-clustering invariance in irregular healthcare time series</article-title>
          . arXiv preprint arXiv:1904.12206,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Caruana et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Rich</given-names>
            <surname>Caruana</surname>
          </string-name>
          , Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and
          <string-name>
            <given-names>Noemie</given-names>
            <surname>Elhadad</surname>
          </string-name>
          .
          <article-title>Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission</article-title>
          .
          <source>In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15</source>
          , pages
          <fpage>1721</fpage>
          -
          <lpage>1730</lpage>
          , New York, NY, USA,
          <year>2015</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Chen et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Chaofan</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Oscar</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alina</given-names>
            <surname>Barnett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Su</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Cynthia</given-names>
            <surname>Rudin</surname>
          </string-name>
          .
          <article-title>This looks like that: deep learning for interpretable image recognition</article-title>
          .
          <source>CoRR</source>
          , abs/1806.10574,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Di Fiore et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>J.M.</given-names>
            <surname>Di Fiore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gauda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.J.</given-names>
            <surname>Martin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>MacFarlane</surname>
          </string-name>
          .
          <article-title>Cardiorespiratory events in preterm infants: interventions and consequences</article-title>
          .
          <source>Journal of Perinatology</source>
          ,
          <volume>36</volume>
          (
          <issue>251</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Faust et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Oliver</given-names>
            <surname>Faust</surname>
          </string-name>
          ,
          <string-name>Yuki Hagiwara</string-name>
          ,
          <string-name>Tan Jen Hong</string-name>
          ,
          <string-name>Oh Shu Lih</string-name>
          , and
          <string-name>
            <given-names>U Rajendra</given-names>
            <surname>Acharya</surname>
          </string-name>
          .
          <article-title>Deep learning for healthcare applications based on physiological signals: A review</article-title>
          .
          <source>Computer Methods and Programs in Biomedicine</source>
          ,
          <volume>161</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Fawaz et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Hassan</given-names>
            <surname>Ismail Fawaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Germain</given-names>
            <surname>Forestier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lhassane</given-names>
            <surname>Idoumghar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pierre-Alain</given-names>
            <surname>Muller</surname>
          </string-name>
          .
          <article-title>Deep learning for time series classification: a review</article-title>
          .
          <source>CoRR</source>
          , abs/1809.04356,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Gee et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Gee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Paydarfar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Indic</surname>
          </string-name>
          .
          <article-title>Predicting bradycardia in preterm infants using point process analysis of heart rate</article-title>
          .
          <source>IEEE Transactions on Biomedical Engineering</source>
          ,
          <volume>64</volume>
          (
          <issue>9</issue>
          ):
          <fpage>2300</fpage>
          -
          <lpage>2308</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Goldberger et al.,
          <year>2000</year>
          ]
          <string-name>
            <given-names>Ary L.</given-names>
            <surname>Goldberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luis A. N.</given-names>
            <surname>Amaral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Leon</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jeffrey M.</given-names>
            <surname>Hausdorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Plamen Ch.</given-names>
            <surname>Ivanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Roger G.</given-names>
            <surname>Mark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Joseph E.</given-names>
            <surname>Mietus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>George B.</given-names>
            <surname>Moody</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chung-Kang</given-names>
            <surname>Peng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. Eugene</given-names>
            <surname>Stanley</surname>
          </string-name>
          .
          <article-title>PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals</article-title>
          .
          <source>Circulation</source>
          ,
          <volume>101</volume>
          (
          <issue>23</issue>
          ):
          <fpage>e215</fpage>
          -
          <lpage>e220</lpage>
          ,
          <year>June 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Goodfellow et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Goodwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Danny</given-names>
            <surname>Eytan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Robert</given-names>
            <surname>Greer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mjaye</given-names>
            <surname>Mazwi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Laussen</surname>
          </string-name>
          .
          <article-title>Towards understanding ECG rhythm classification using convolutional neural networks and attention mappings</article-title>
          .
          <source>In Proceedings of Machine Learning for Healthcare, MLHC '18</source>
          , pages
          <fpage>2243</fpage>
          -
          <lpage>2251</lpage>
          ,
          <year>08 2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Jackson et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Zohar</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>César</given-names>
            <surname>Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jason</given-names>
            <surname>Flaks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yuxin</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hereman</given-names>
            <surname>Nicolas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Adhish</given-names>
            <surname>Thite</surname>
          </string-name>
          .
          <article-title>Free spoken digit dataset (FSDD)</article-title>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Li et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Oscar</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hao</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chaofan</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Cynthia</given-names>
            <surname>Rudin</surname>
          </string-name>
          .
          <article-title>Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions</article-title>
          .
          <source>CoRR, abs/1710.04806</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Martin and Wilson,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Richard J.</given-names>
            <surname>Martin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher G.</given-names>
            <surname>Wilson</surname>
          </string-name>
          .
          <article-title>Apnea of prematurity</article-title>
          . pages
          <fpage>2923</fpage>
          -
          <lpage>2931</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Mehrotra et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Rishabh</given-names>
            <surname>Mehrotra</surname>
          </string-name>
          ,
          ,
          <string-name>
            <given-names>James</given-names>
            <surname>McInerney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hugues</given-names>
            <surname>Bouchard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mounia</given-names>
            <surname>Lalmas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Fernando</given-names>
            <surname>Diaz</surname>
          </string-name>
          .
          <article-title>Towards a fair marketplace: Counterfactual evaluation of the trade-off between relevance, fairness &amp; satisfaction in recommendation systems</article-title>
          .
          <source>In Proceedings of the 27th ACM International Conference on Information and Knowledge Management</source>
          , pages
          <fpage>2243</fpage>
          -
          <lpage>2251</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Park et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Daniel S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>William</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yu</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chung-Cheng</given-names>
            <surname>Chiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Barret</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ekin D.</given-names>
            <surname>Cubuk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <article-title>SpecAugment: A simple data augmentation method for automatic speech recognition</article-title>
          .
          <source>arXiv preprint arXiv:1904.08779</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Perlman and Volpe,
          <year>1985</year>
          ]
          <string-name>
            <given-names>Jeffrey M.</given-names>
            <surname>Perlman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joseph J.</given-names>
            <surname>Volpe</surname>
          </string-name>
          .
          <article-title>Episodes of apnea and bradycardia in the preterm newborn: Impact on cerebral circulation</article-title>
          .
          <source>Pediatrics</source>
          ,
          <volume>76</volume>
          (
          <issue>3</issue>
          ):
          <fpage>333</fpage>
          -
          <lpage>338</lpage>
          ,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Pichler et al.,
          <year>2003</year>
          ]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pichler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Urlesberger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Muller</surname>
          </string-name>
          .
          <article-title>Impact of bradycardia on cerebral oxygenation and cerebral blood volume during apnoea in preterm infants</article-title>
          .
          <source>Physiological Measurement</source>
          ,
          <volume>24</volume>
          (
          <issue>3</issue>
          ):
          <fpage>671</fpage>
          -
          <lpage>680</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Poets et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Christian F.</given-names>
            <surname>Poets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Robin S.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Barbara</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Robin K.</given-names>
            <surname>Whyte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Elizabeth V.</given-names>
            <surname>Asztalos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David</given-names>
            <surname>Bader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aida</given-names>
            <surname>Bairam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Diane</given-names>
            <surname>Moddemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Abraham</given-names>
            <surname>Peliowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yacov</given-names>
            <surname>Rabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alfonso</given-names>
            <surname>Solimano</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Harvey</given-names>
            <surname>Nelson</surname>
          </string-name>
          .
          <article-title>Association between intermittent hypoxemia or bradycardia and late death or disability in extremely preterm infants</article-title>
          .
          <source>JAMA</source>
          ,
          <volume>314</volume>
          (
          <issue>6</issue>
          ):
          <fpage>595</fpage>
          -
          <lpage>603</lpage>
          ,
          <year>08 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Pons et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Jordi</given-names>
            <surname>Pons</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Nieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Prockup</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Erik M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andreas F.</given-names>
            <surname>Ehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <article-title>End-to-end learning for music audio tagging at scale</article-title>
          .
          <source>CoRR, abs/1711.02520</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Ribeiro et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Marco Túlio</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sameer</given-names>
            <surname>Singh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Guestrin</surname>
          </string-name>
          .
          <article-title>"Why should I trust you?": Explaining the predictions of any classifier</article-title>
          .
          <source>CoRR, abs/1602.04938</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Rudin,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Cynthia</given-names>
            <surname>Rudin</surname>
          </string-name>
          .
          <article-title>Please stop explaining black box models for high stakes decisions</article-title>
          .
          <source>CoRR</source>
          , abs/1811.10154,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Schmid et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>M.B.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.J.</given-names>
            <surname>Hopfner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lenhof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.D.</given-names>
            <surname>Hummler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Fuchs</surname>
          </string-name>
          .
          <article-title>Cerebral oxygenation during intermittent hypoxemia and bradycardia in preterm infants</article-title>
          .
          <source>Neonatology</source>
          ,
          <volume>107</volume>
          :
          <fpage>137</fpage>
          -
          <lpage>146</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [Van der Maaten and Hinton,
          <year>2008</year>
          ]
          <string-name>
            <given-names>L.</given-names>
            <surname>Van der Maaten</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Visualizing data using t-SNE</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>9</volume>
          :
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [Williamson et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>James R.</given-names>
            <surname>Williamson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Daniel W.</given-names>
            <surname>Bliss</surname>
          </string-name>
          , and
          <string-name>
            <given-names>David</given-names>
            <surname>Paydarfar</surname>
          </string-name>
          .
          <article-title>Forecasting respiratory collapse: Theory and practice for averting life-threatening infant apneas</article-title>
          .
          <source>Respiratory Physiology &amp; Neurobiology</source>
          ,
          <volume>189</volume>
          (
          <issue>2</issue>
          ):
          <fpage>223</fpage>
          -
          <lpage>231</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [Yildirim et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Ozal</given-names>
            <surname>Yildirim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pawel</given-names>
            <surname>Plawiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ru San</given-names>
            <surname>Tan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U. Rajendra</given-names>
            <surname>Acharya</surname>
          </string-name>
          .
          <article-title>Arrhythmia detection using deep convolutional neural network with long duration ECG signals</article-title>
          .
          <source>Computers in Biology and Medicine</source>
          ,
          <volume>102</volume>
          :
          <fpage>411</fpage>
          -
          <lpage>420</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [Zhou et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Bolei</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Àgata</given-names>
            <surname>Lapedriza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aude</given-names>
            <surname>Oliva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Antonio</given-names>
            <surname>Torralba</surname>
          </string-name>
          .
          <article-title>Learning deep features for discriminative localization</article-title>
          .
          <source>CoRR, abs/1512.04150</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>