Explaining Deep Classification of Time-Series Data with Learned Prototypes

Alan H. Gee∗,1,2, Diego Garcia-Olano∗,1,2, Joydeep Ghosh1 and David Paydarfar2
1 Electrical and Computer Engineering, The University of Texas at Austin
2 Neurology, Dell Medical School, The University of Texas at Austin
{alangee, diegoolano}@utexas.edu, ghosh@ece.utexas.edu, david.paydarfar@austin.utexas.edu
∗ Equal Contribution

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The emergence of deep learning networks raises a need for explainable AI so that users and domain experts can be confident applying them to high-risk decisions. In this paper, we leverage data from the latent space induced by deep learning models to learn stereotypical representations or "prototypes" during training to elucidate the algorithmic decision-making process. We study how leveraging prototypes affects classification decisions of two-dimensional time-series data in a few different settings: (1) electrocardiogram (ECG) waveforms to detect clinical bradycardia, a slowing of heart rate, in preterm infants, (2) respiration waveforms to detect apnea of prematurity, and (3) audio waveforms to classify spoken digits. We improve upon existing models by optimizing for increased prototype diversity and robustness, visualize how these prototypes in the latent space are used by the model to distinguish classes, and show that prototypes are capable of learning features on two-dimensional time-series data to produce explainable insights during classification tasks. We show that the prototypes are capable of learning real-world features - bradycardia in ECG, apnea in respiration, and articulation in speech - as well as features within sub-classes. Our novel work leverages a learned prototypical framework on two-dimensional time-series data to produce explainable insights during classification tasks.

1 Introduction

Despite the recent surge of machine learning, adoption of deep learning models in decision-critical domains, such as healthcare, has been slow because of limited transparency and explanations in black-box algorithms. This observation points to the critical need for black-box models to offer interpretable, faithful explanations of their decisions so that practitioners in high-risk domains can trust model outputs and leverage their results. One such high-risk domain is treating preterm infants (~10% of births worldwide) in the neonatal intensive care unit (NICU).

A common disorder observed in the majority of preterm infants is recurrent episodes of apnea (cessation of breathing) and bradycardia (slowing of heart rate). Both of these spontaneous events may cause end organ damage related to hypoxemia (low oxygenation of blood) and ischemia (reduced blood flow) [Martin and Wilson, 2012]. Early detection of apnea and bradycardia can help prevent hypoxic-ischemic injury in tissue with high metabolic demands [Schmid et al., 2015; Pichler et al., 2003] and prevent the cascade into intermittent hypoxia, which leads to complications of retinopathy, developmental delays, and neuropsychiatric disorders [Williamson et al., 2013; Poets et al., 2015; Di Fiore et al., 2015]. Leveraging explainability in deep neural network classification of these time series can reveal complex morphological and physiological features that clinicians may not readily see. Thus, machine learning algorithms need transparency in their decision-making process to highlight subtle patterns. One such technique in deep explainability is prototypes, a case-based reasoning technique.

Prototypes are representative examples, learned in-process during model training, that describe influential data regions in latent representations and provide insight into aggregated features across training data that are utilized by the model for classification. In contrast to post-hoc explainability, which trains a secondary model to infer decision reasoning from a primary model by only leveraging inputs and outputs, in-process explainable methods offer faithful explanations of a primary model's decisions [Rudin, 2018]. So, users who employ prototypes can confidently gain direct insight into the decisions algorithms are making for classification tasks.

On data with unclear class boundaries, in-process methods can misbehave. For example, when the model in [Li et al., 2017] is applied to the MNIST dataset, the prototypes easily separate in the latent space because the latent data representation is separable and well-structured (Fig. 1). However, when class boundaries and features do not form distinguishable clusters, learned prototypes become archetypes (extreme corner cases) that exist near the convex hull of the data in the latent space (Fig. 4). This phenomenon yields prototypes that represent extreme class types (i.e. archetypes) and can underperform on classifying data in overlapping class regions.

Figure 1: Learned prototypes of handwritten digits (MNIST) using the architecture from [Li et al., 2017]. While colors represent the handwritten digits 0-9, the labels represent the learned prototypes. Because the latent representation of MNIST clusters distinctly, the prototypes are diverse. This may not be true when classes overlap.

In this work, we provide a deep classification method with explainable insights for health time-series data. We introduce a prototype diversity penalty that explicitly accounts for prototype clustering and encourages the model to learn more diverse prototypes. These diverse prototypes will help focus on areas of the latent space where class separation is most difficult and least defined to improve classification accuracies. We show the utility of this approach on three tasks in two-dimensional time-series classification: (1) bradycardia from ECG; (2) apnea from respiration; and (3) spoken digits from audio waveforms. The two-dimensional representation of time-series provides an interpretable method for domain experts (e.g. clinicians) to understand the evolution of clinically relevant features based on visible phenotypes in time-series data. Our work enables a closed-loop collaboration between experts and machine learning algorithms to accelerate the efficacy of outcome predictions. The learning algorithms can find nuanced features through development of explainable prototypes, and the experts can fine-tune the algorithms by providing feedback through the regularization of the diversity penalty. This is especially important for clinician experts who need explainability in black-box models to understand and diagnose different pathological mechanisms. To the best of our knowledge this is the first application of prototypes and latent space analysis for health time-series data that could help reveal clinically relevant and explainable phenotypes to improve the baseline for standard of care with automatic monitoring and detection.

1.1 Relevant Work

Explainable methods [Ribeiro et al., 2016; Caruana et al., 2015; Zhou et al., 2015] have largely focused on labeled image and tabular data sets where classes are clearly separable and less so on time-series data in general. Recent work has focused on using prototypes to provide in-process explainability of classification models, either by learning meaningful pixels in the entire image [Li et al., 2017] or by applying attention through the use of sub-regions or patches over an image [Chen et al., 2018]. Class attention maps (CAMs) provide probability maps to highlight areas of images that lead to a certain prediction [Zhou et al., 2015], but do not give prototypical examples of the data or explanations of how the training data relates to the end result. We focus on the former work [Li et al., 2017] for example-based explainability where the generated prototypes are intended to look like global representations of the training data.

Time-series classification on 1-D data with deep neural networks is a rapidly growing field, with almost 9,000 deep learning models [Fawaz et al., 2018; Pons et al., 2017; Faust et al., 2018; Goodfellow et al., 2018]. One such example leverages global average pooling to produce CAMs to provide explainability for a deep CNN to classify atrial fibrillation in ECG data [Goodfellow et al., 2018]. However, the number of available healthcare datasets, specifically ECG waveforms, is limited [Fawaz et al., 2018]. Within this context, time-series classification on ECG waveforms has been done on a small scale, typically with single beat or short-duration (10 s) arrhythmia classification [Faust et al., 2018; Yildirim et al., 2018].

2 Methods

2.1 Time-Series Explanation via Prototypes

We adopt the autoencoder-prototype architecture from [Li et al., 2017]. Let $X = \{(x_i, y_i)\}_{i=1}^{n}$ be the training set with $x_i \in \mathbb{R}^p$ and class labels $y_i \in \{1, \ldots, K\}$ for each training point $i \in \{1, \ldots, n\}$. The front-end autoencoder network learns a lower-dimension latent representation of the data with an encoder network, $f : \mathbb{R}^p \to \mathbb{R}^q$. The latent space is then projected back to the original dimension using a decoder function, $g : \mathbb{R}^q \to \mathbb{R}^p$. The latent representation, $f(x)$, is also passed to a feed-forward prototype network, $h : \mathbb{R}^q \to \mathbb{R}^K$, for classification. The prototype network learns $m$ prototype vectors, $p_1, p_2, \ldots, p_m \in \mathbb{R}^q$, using a four-layer fully-connected network over the latent space that learns a probability distribution over the class labels $y_i$ (Fig. 2). The learned prototypes can then be decoded using $g$ and examined to infer what the network has learned. The choice of $m$ is determined a priori, with larger values allowing for higher throughput and model capacity.

Figure 2: Prototype Architecture from [Li et al., 2017]

We improve prior work by adding a penalty for learned prototypes in the objective function of the above network to increase prototype diversity and coverage of the data in latent representations. To align with the minimization of the objective function, this new prototype diversity penalty needs to be (1) small when prototypes are far apart, and (2) large when prototypes are close together. We can evaluate the feasibility of a set of prototypes by considering the distance of the two closest prototypes across all prototype combinations. So, we consider the average minimum squared L2 distance between any two prototypes, $p_i$, $p_j$, for our loss function. To achieve the desired property above, we take the inverse of this average distance:

$$\mathrm{PDL}(p_1, \ldots, p_m) = \frac{1}{\log\left(\frac{1}{m}\sum_{j=1}^{m} \min_{i>j \in [1,m]} \|p_i - p_j\|_2^2\right) + \epsilon} \tag{1}$$

The logarithm function tapers large distances so that the penalty does not quickly vanish, and the $\epsilon$ term is for numeric stability. By taking the inverse of the log of the prototype distances, we penalize prototypes that are close in distance while making sure the minimum distance between prototypes does not get too large. This prototype diversity loss (PDL) promotes coverage over the latent space. We update the objective function to:

$$L((f, g, h), X) = E(h \circ f, X) + \lambda_R R(g \circ f, X) + \lambda_1 R_1(p_1, \ldots, p_m, X) + \lambda_2 R_2(p_1, \ldots, p_m, X) + \lambda_{pd}\,\mathrm{PDL}(p_1, \ldots, p_m) \tag{2}$$

where $E$ is the classification (cross entropy) loss, $R$ is the reconstruction loss of the autoencoder (i.e. L2 norm), and $R_1$ and $R_2$ are the loss terms that relate the distances of the feature vectors to the prototype vectors in latent space [Li et al., 2017]:

$$R_1(p_1, \ldots, p_m, X) = \frac{1}{m}\sum_{j=1}^{m} \min_{i \in [1,n]} \|p_j - f(x_i)\|_2^2 \tag{3}$$

$$R_2(p_1, \ldots, p_m, X) = \frac{1}{n}\sum_{i=1}^{n} \min_{j \in [1,m]} \|f(x_i) - p_j\|_2^2 \tag{4}$$

The minimization of the $R_1$ loss term promotes each prototype vector to learn one of the encoded training examples, while the minimization of the $R_2$ loss promotes encoded training examples to be close to one of the prototypes. This balance gives meaningful pixel-to-pixel representations between the prototypes and training data.

We train our models with a randomly shuffled batch size of 100 (ECG, Speech) and 125 (Respiration). We vary the number of prototypes (see supplement) and the regularization term $\lambda_{pd}$ for the classification tasks while keeping the other hyperparameters as in [Li et al., 2017].

2.2 Prototype Diversity Score

We adopt a version of the group fairness metric presented in [Mehrotra et al., 2018] and refer to it as the prototype diversity score, $\Psi$:

$$\Psi = \frac{1}{Z}\sum_{i=1}^{t} |\phi_i| \tag{5}$$

where $\phi_i$, $i \in \{1, \ldots, t\}$, is defined for a specific metric and $Z$ is the normalization constant. For the neighbor diversity metric $\Psi_N$, $\phi_i$ is the set of prototypes that have nearest neighbor $i$ and $Z$ is the number of prototypes $m$. For the class diversity metric $\Psi_C$, $\phi_i$ is the set of prototypes that are from class $i$ and $Z$ is the number of classes $K$. Higher scores will occur when prototypes have more unique elements. Thus, $\max(\Psi_D) = 1$.

2.3 Datasets

The neonatal intensive care unit (NICU) dataset is composed of two sources: (1) ECG and respiration waveforms from PhysioNet's PICS database [Gee et al., 2017; Goldberger et al., 2000]; and (2) ECG waveforms (500 Hz, Intellivue MP450) collected from a preterm infant over their entire stay (~10 weeks) at Seton Medical Center Austin. The inclusion of (2) helps supplement the ECG events from (1). The image data used in this study are made publicly available [footnote 1].

Footnote 1: https://physionet.org/physiobank/database/picsdb

The inter-breath intervals (IBIs) from the respiration were extracted using a standard peak finder. The respiration signals were clipped into 60 second segments that were normalized to zero-mean, unit variance. The R-R intervals for the ECG of the NICU dataset were extracted using a Morlet wavelet transformation of the ECG signal. An open-source peak finder was applied to the wavelet scale range (0.01 to 0.04 scales) related to QRS complex formation in the spectrogram. The ECG waveforms were clipped at 15 seconds with the event in the middle. All ECG segments were band-pass filtered from 3 to 45 Hz, scaled to zero-mean, unit-variance, and scaled to the median QRS complex amplitude. Images were then captured to mimic what a clinician would see upon investigation of an ECG signal. Waveforms with no visibly distinguishable QRS complexes or respiratory peaks were discarded because these waveforms are too obscure for even a clinician expert to evaluate.

Figure 3: Examples of waveforms for each task: (A) Electrocardiogram (ECG) waveforms related to bradycardia classification, (B) Respiration waveforms related to apnea classification, and (C) Speech waveforms for a particular speaker (Jackson). For (A) and (B) we classify the segments based on severity (i.e. time difference between peaks), and for (C) we classify based on digit class.

Class breakdowns for bradycardia in the ECG signal follow clinical thresholds [Perlman and Volpe, 1985]: $X_{ECG}$ = { normal (>100 beats per minute (bpm)): 1039, mild (100-80 bpm): 634, moderate (80-60 bpm): 306, severe (<60 bpm): 132 }. Moderate and severe events were combined into a single class. The class breakdown for apneas in respiration is: $X_{RESP}$ = { normal (1-3 s): 1939, mild (4-6 s): 1921, moderate/severe (>6 s): 1487 }.

The Free Spoken Digit Dataset [Jackson et al., 2018] consists of 2000 audio clips (8 kHz) of four speakers repeating the digits 0 through 9, 50 times each. Each segment was normalized to zero-mean, unit-variance and clipped for white space (Fig. 3). This data can be thought of as "spoken MNIST". We perform speaker classification and digit classification within a speaker.

2.4 Visualization of Latent Space

We use PCA to reduce the latent space vectors to a dimension of 500, which retains 98% of the variability. We then calculate the cosine similarity between these 500-dimensional vectors to produce a similarity matrix and use t-distributed stochastic neighbor embedding (t-SNE) from [Van der Maaten and Hinton, 2008] to reduce the 500 x 500 similarity matrix down to three dimensions for visualization purposes. This technique minimizes the KL divergence between neighbor-similarity distributions in the higher-dimensional latent space and in the lower-dimensional space used to represent the former visually. This approach is non-deterministic, so the global position in the lower space is uninformative and instead proximity to neighbors is the key insight to gain. Additionally, while the first two dimensions of the projection show the general spread of information, the second and third dimensions may be useful for visualizing within-group information. Thus, we use the second and third dimensions for our visualizations.

Figure 4: Effect of loss regularization on the latent space and spread of prototypes for the NICU classification task using 10 prototypes with $\lambda_{pd} = 0$ (baseline) and $\lambda_{pd} = 10^3$. The second and third dimensions of a t-SNE projection on each space show prototypes with more coverage and diversity in the latter case.

3 Results

3.1 Classification of ECG with 2-D Prototypes

We test our prototype implementation on ECG waveforms related to bradycardia using the NICU data for a 3-class classification task using 10 prototypes. We treat the input waveforms as 2-D images and use a four-layer autoencoder to learn complex representations over the data.

We observe more diverse prototypes and comparable or better test accuracy with our model (93.1±0.4%) compared with the baseline model in [Li et al., 2017] (92.1±0.1%) (Table 1). Both models perform well on the classification of the normal class, as expected since normal waveforms have near-constant phase. Both models additionally have difficulty separating between the mild and moderate/severe classes, often confusing the classification between these two (see supplement). This behavior is expected since data near these two class boundaries are difficult to discern, even for domain experts, due to events existing in both classes with possible subtle time differences in cardiac firing.

Table 1: Diversity score for neighbors $\Psi_N$ and class $\Psi_C$. We report $\Psi$'s related to the epoch with the highest test accuracy. Our model, $\lambda_{pd} > 0$, returns better accuracies and diversity scores (bold) than the baseline model, which is row $\lambda_{pd} = 0$, across ECG and Respiration waveforms. (Model details: 3-class, 10-prototypes, learning rate = 0.002).

ECG: Bradycardia
$\lambda_{pd}$   Acc.           $\Psi_N$       $\Psi_C$
0      92.1 ± 0.1%    0.83 ± 0.04    0.78 ± 0.19
500    92.7 ± 1.0%    0.86 ± 0.07    0.89 ± 0.19
1e3    92.4 ± 1.3%    0.87 ± 0.11    0.89 ± 0.19
2e3    93.1 ± 0.4%    0.90 ± 0.04    1.00 ± 0.00

Respiration: Apnea
$\lambda_{pd}$   Acc.           $\Psi_N$       $\Psi_C$
0      81.4 ± 3.6%    0.96 ± 0.07    1.00 ± 0.00
500    82.3 ± 3.8%    0.94 ± 0.09    1.00 ± 0.00
1e3    77.1 ± 0.6%    1.00 ± 0.00    1.00 ± 0.00
2e3    80.2 ± 2.5%    0.97 ± 0.04    0.84 ± 0.23
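The quantities behind Table 1 can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `prototype_diversity_loss` and `diversity_scores` are hypothetical, the $\epsilon$ term is assumed to sit outside the logarithm in Eq. 1, the empty $j = m$ term of the sum is skipped, and the diversity scores are read as "fraction of unique nearest neighbors" ($\Psi_N$) and "fraction of classes covered by the prototypes' nearest neighbors" ($\Psi_C$), which is one plausible interpretation of Eq. 5.

```python
import numpy as np


def prototype_diversity_loss(prototypes, eps=1e-8):
    """Sketch of Eq. 1: inverse log of the mean minimum squared L2 distance.

    prototypes: array of shape (m, q) with m >= 2.
    Assumption: eps is added outside the log, and the j = m term
    (which has no partner i > j) contributes nothing to the sum.
    """
    m = prototypes.shape[0]
    diff = prototypes[:, None, :] - prototypes[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)  # (m, m) squared L2 distances
    # For each j, minimum distance to any prototype with a larger index i > j.
    min_dists = [sq_dist[j, j + 1:].min() for j in range(m - 1)]
    avg_min = np.sum(min_dists) / m
    return 1.0 / (np.log(avg_min) + eps)


def diversity_scores(prototypes, latents, labels, num_classes):
    """Hypothetical reading of Eq. 5 for Psi_N and Psi_C.

    Each prototype is mapped to its nearest encoded training example;
    Psi_N counts unique neighbors, Psi_C counts unique neighbor classes.
    """
    diff = prototypes[:, None, :] - latents[None, :, :]
    nn_idx = np.argmin(np.sum(diff ** 2, axis=-1), axis=1)  # (m,)
    psi_n = len(np.unique(nn_idx)) / prototypes.shape[0]
    psi_c = len(np.unique(labels[nn_idx])) / num_classes
    return psi_n, psi_c


# Toy usage: widely spread prototypes receive a smaller penalty.
close = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
far = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]])
print(prototype_diversity_loss(close), prototype_diversity_loss(far))
```

Note that the logarithm in Eq. 1 is only positive when the average minimum squared distance exceeds 1, so the toy prototypes above are deliberately spaced more than one unit apart.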
Our model also improves prototype diversity (Table 1) over the baseline model. This result suggests that the prototype diversity loss encourages exploration, through learning diverse prototypes, within the data represented in the latent space. As a result, our model finds more helpful features and prototypes and thus improves classification results.

Because prototypes are generated during training, we infer features that the algorithm utilized to classify waveforms at different points during training (Fig. 5). For example, by epoch 100, we see that some of the prototypes exhibit global morphological features of the normal waveform class after random initialization at epoch 0. As training progresses, we observe other complex phenotypes emerging: one prototype learns that large gaps in cardiac firings are important for identifying severe cases and another prototype learns that the consistent pattern of spikes is important for mild cases. Since the mild class shares mixed features of both normal and positive events, it is not surprising that more prototypes are needed in this class to learn subtleties of the class features (see supplement). Thus, prototypes highlight waveform structures that the algorithm deemed as important when trying to learn the classification of bradycardia. This finding aligns with the idea of clinicians using visible features present in a bradycardia (i.e. the increasing distance between QRS complexes) to decide whether or not a bradycardia exists in an image.

Figure 5: Prototype evolution with in-process explainability over training time. High level features are easily learned in early epochs of training, while more complex features are developed over time. The final nearest neighbors are depicted on the right. The prototypes correspond to a subset of the $\lambda_{pd} = 10^3$ latent space cloud in Figure 4. Model details: 3-class, 10-prototypes.

We compare the latent space of [Li et al., 2017] to the latent space of our model with prototype diversity loss via t-SNE projections, where proximity in 2-D space suggests that points are "close" in distance in the original latent space. We represent the learned prototypes by mapping each prototype to its nearest neighbor (Fig. 4). We find that by increasing our loss term, PDL, our model increases the local coverage of the prototypes compared with the baseline model (i.e. $\lambda_{pd} = 0$). However, if we regularize our loss term too much (i.e. $\lambda_{pd} > 10^4$), we begin to introduce clustering of prototypes and diversity suffers. Thus with the additional prototype distance penalty, we achieve higher diversity scores and classification accuracies for various hyperparameters (Fig. 9).

3.2 Case Study with Prototypes: Exploring ECG Morphology and Bradycardia Classification

We observe that ECG events in a local neighborhood share similar QRS complex morphology, despite having different class labels and cardiac firing periods (Fig. 6, bottom). Even though we did not impose a class constraint, we observe that the algorithm found two separate features within the moderate/severe class that were important in the classification task (i.e. prototypes 2 and 10 shown at the top of Fig. 6). These two prototypes explore two different cardiac timings: prototype 2 exhibits a progressive delay in cardiac firing, while prototype 10 exhibits a large spontaneous delay. The incorporation of the prototype diversity loss encouraged this exploration of the latent space. These results suggest that there are physiologic dependencies (i.e. clustering based on cardiac morphology and function) that can be learned using our model to investigate physiological phenomena, and possibly applied to other clinical areas, like cardiac ischemia or apnea of prematurity in respiration - both exhibit visible, abnormal waveform behavior. This work provides a visualization tool for clinician experts to evaluate different morphologies of physiological time-series data [footnote 2].

Footnote 2: https://github.com/alangee/ijcai19-ts-prototypes

Figure 6: Learned prototypes showcase the diversity of features that are important for understanding ECG morphology while classifying bradycardia events. (10-prototypes, $\lambda_{pd} = 10^4$).

3.3 Classification of Apnea in Respiration

Apnea of prematurity is common among preterm infants, and is visually apparent as a pause of inhalation and exhalation (i.e. absence of sinusoidal behavior) in the respiratory signal. We next test our prototype implementation on respiration waveforms that are related to apnea in a 3-class classification task. We treat the input waveforms as 2-D images again, since clinicians evaluate apneas through visual inspection of the respiration signal.

We observe more diverse prototypes and comparable or better test accuracy with our model (82.3±3.8%) compared with the baseline model (81.4±3.6%), and with overall unique nearest neighbors ($\Psi_N = 1$) and class diversity ($\Psi_C = 1$) (Table 1). Both models have difficulty separating between the event classes because data near these two class boundaries are difficult to visually discern (i.e. 6 second gap versus 7 second gap) and have common behavior with regular respiratory function that is found in the normal class. We find that the addition of a prototype diversity loss maintains or improves performance and yields more diverse prototypes (Table 1).

We also note that the algorithm is able to discern physiological examples and generate learned prototypes that distinctly relate to physiological behavior. For example, in Fig. 7, we see that the algorithm finds segments that are related to periodic breathing of 9 second duration (moderate/severe). These segments are physiologically different from normal apneas of 6 seconds (mild), and clearly different from normal breathing with periodicity of 1 second (Fig. 7). In the set of eight learned prototypes, the algorithm finds three different classes easily, each with different respiratory phenomena, that are critical in classifying various types of apneas.

Figure 7: Learned prototypes showcase the diversity of features across classes that are important for understanding respiration morphology while classifying apnea events. For this classification task, we observe a variety of prototypes (at epoch 500) that learn various cases with cessation of breathing (6 and 9 second gaps) and the global features within the segment that are important for the model's classification. (8-prototypes, $\lambda_{pd} = 500$).

3.4 Spoken Digits Classification and Analysis

Speech abnormalities can be suggestive of underlying pathological dysfunction, and common features that clinicians visibly discern in waveforms to assess speech include cadence, prosody, and syllable articulation. To aid in speech feature detection, we assess our model on high-frequency audio waveforms of spoken digits (FSDD) from medically-normal individuals. These digits are treated as 2-D images for 4-class speaker and 10-digit classification tasks with 4 and 10 prototypes, respectively. The waveform envelope and syllables of these spoken digits are discernible to the eye (see "six" and "se-ven" in Fig. 3) and, as such, make good candidates for our image-based explainability model. We demonstrate some of the learned prototypes in Fig. 8, which show representations the model finds useful in classifying digits for a given speaker. Experiments show that by varying regularization of the prototype diversity penalty, we observe slightly better or similar accuracies when compared to the baseline model (Fig. 9). With a fine-tuned $\lambda_{pd}$ we can increase diversity of the prototypes and correspondingly see improved accuracy and data coverage (see supplement). For example, $\lambda_{pd} = 500$ gives a higher diversity score across all tasks, indicating prototypes with more unique nearest neighbors as compared with the baseline model (Fig. 9).

Figure 8: Learned prototypes from audio waveforms of spoken digits by Nicolas from the FSDD ($\lambda_{pd} = 500$).

Figure 9: Accuracy and diversity metrics for the spoken digits experiments using the FSDD. We divide this dataset into two tasks: (1) classifying the person speaking and (2) classifying the digit spoken within each person.

Experiments show that increasing the depth of the network and fine-tuning the learning rate lead to both increased accuracy and diversity over all tasks. Similarly, recent data augmentation techniques in medical [Bahadori and Lipton, 2019] and speech recognition [Park et al., 2019] domains could help further improve performance. The purpose of this work, however, is not to obtain the best performance on these tasks, but rather to show the utility of learned prototypes as faithful explanations of decisions made by a model.

4 Discussion

We presented a new autoencoder-prototype model that promotes diversity in learned prototypes by penalizing prototypes that are too close in squared L2 distance in the latent space. The new term, $\lambda_{pd}\,\mathrm{PDL}(p_1, \ldots, p_m)$, in the loss function (Eq. 2) promotes prototype diversity while improving classification accuracy and prototype coverage of data represented in the latent space. These prototypes help explain which global features and representative segments in the training data are most useful for deep time-series classification. This in-process generation of prototypes offers explainable insights into deep classifiers.

Our model and results offer a capability that previous works lack. Depending on the clinical context of the case, experts may want to either trivialize big differences in the time series features, or conversely accentuate nuanced differences in learned prototypes as clinically important signs of impending adverse outcomes. Therefore, our implementation offers a collaborative method for clinician experts to use their insight interactively with machine learning algorithms: increasing $\lambda_{pd}$ promotes large observable differences in the prototypes, while decreasing $\lambda_{pd}$ promotes diverse features and prototypes. In turn, our model enables a closed-loop feedback framework to accelerate phenotype discovery to lead clinicians to better-informed decisions.

We evaluate the performance of our model on increasingly difficult physiological datasets to demonstrate the effect of $\lambda_{pd}$. The ECG signal is more robust against movement artifact and produces a cleaner signal for the 2-D visualization task, whereas the respiration signal, which is the resultant voltage change across diaphragm movement, is highly susceptible to signal artifact. Additionally, speech waveforms are compressed, high-frequency waveforms (kHz), which makes it difficult to visibly extract high-resolution features. We find that our model allocates more prototypes to learn the intricacies of the more indistinguishable classes (i.e. mild and moderate/severe) that are hard for a human to discern, especially the mild cases because this class is a mixture and intermediary of the two extreme classes.

We observe, however, that the high number of loss terms creates a trade-off between prototype interpretability and model accuracy. For example, we observe that for a small number of prototypes, we achieve near-perfect prototype reconstruction but at the cost of classification accuracy. When the number of prototypes is large, we achieve higher accuracy but obtain noisy prototypes. In future implementations, we can replace the front-end autoencoder with a model that operates well on 1-D time series, like a recurrent neural network, to balance accuracy and prototype interpretability.

There has also been work on computing prototypical patches over 2-D images to generate explainable sub-features [Chen et al., 2018]. Extending the idea of patches to 1-D time-series signals would allow for parsing the signal for sub-frequencies and features that could better explain how events are triggered. Nonetheless, the work presented in this paper provides a more robust prototype model to help explain algorithmic behavior and decision-making in deep time-series classification tasks with promising results in clinically relevant datasets.

Acknowledgement

The authors would like to thank Sinead Williamson and the three reviewers for providing helpful feedback and critical reviews of our work.

References

[Bahadori and Lipton, 2019] Mohammad Taha Bahadori and Zachary Chase Lipton. Temporal-clustering invariance in irregular healthcare time series. arXiv preprint arXiv:1904.12206, 2019.

[Caruana et al., 2015] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 1721–1730, New York, NY, USA, 2015. ACM.

[Chen et al., 2018] Chaofan Chen, Oscar Li, Alina Barnett, Jonathan Su, and Cynthia Rudin. This looks like that: deep learning for interpretable image recognition. CoRR, abs/1806.10574, 2018.

[Di Fiore et al., 2015] J.M. Di Fiore, E Gauda, R.J. Martin, and P MacFarlane. Cardiorespiratory events in preterm infants: interventions and consequences. Journal of Perinatology, 36(251), 2015.

[Faust et al., 2018] Oliver Faust, Yuki Hagiwara, Tan Jen Hong, Oh Shu Lih, and U Rajendra Acharya. Deep learning for healthcare applications based on physiological signals: A review. Computer Methods and Programs in Biomedicine, 161:1–13, 2018.

[Fawaz et al., 2018] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. Deep learning for time series classification: a review. CoRR, abs/1809.04356, 2018.

[Gee et al., 2017] A. H. Gee, R. Barbieri, D. Paydarfar, and P. Indic. Predicting bradycardia in preterm infants using point process analysis of heart rate. IEEE Transactions on Biomedical Engineering, 64(9):2300–2308, 2017.

[Goldberger et al., 2000] Ary L. Goldberger, Luis A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and H. Eugene Stanley. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, June 2000.

[Goodfellow et al., 2018] Sebastian Goodfellow, Andrew Goodwin, Danny Eytan, Robert Greer, Mjaye Mazwi, and Peter Laussen. Towards understanding ecg rhythm classification using convolutional neural networks and attention mappings. In Proceedings of Machine Learning for Healthcare, MLHC '18, pages 2243–2251, 08 2018.

[Jackson et al., 2018] Zohar Jackson, César Souza, Jason Flaks, Yuxin Pan, Hereman Nicolas, and Adhish Thite. Free spoken digit dataset (fsdd). 2018.

[Li et al., 2017] Oscar Li, Hao Liu, Chaofan Chen, and Cynthia Rudin. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. CoRR, abs/1710.04806, 2017.

[Park et al., 2019] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.

[Perlman and Volpe, 1985] Jeffrey M. Perlman and Joseph J. Volpe. Episodes of apnea and bradycardia in the preterm newborn: Impact on cerebral circulation. Pediatrics, 76(3):333–338, 1985.

[Pichler et al., 2003] G. Pichler, B. Urlesberger, and W. Muller. Impact of bradycardia on cerebral oxygenation and cerebral blood volume using apnoea in preterm infants. Physio. Measurement, 24(3):671–680, 2003.

[Poets et al., 2015] Christian F. Poets, Robin S. Roberts, Barbara Schmidt, Robin K. Whyte, Elizabeth V. Asztalos, David Bader, Aida Bairam, Diane Moddemann, Abraham Peliowski, Yacov Rabi, Alfonso Solimano, and Harvey Nelson. Association between intermittent hypoxemia or bradycardia and late death or disability in extremely preterm infants. JAMA, 314(6):595–603, 08 2015.

[Pons et al., 2017] Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann, and Xavier Serra. End-to-end learning for music audio tagging at scale. CoRR, abs/1711.02520, 2017.

[Ribeiro et al., 2016] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. CoRR, abs/1602.04938, 2016.

[Rudin, 2018] Cynthia Rudin. Please stop explaining black box models for high stakes decisions. CoRR, abs/1811.10154, 11 2018.

[Schmid et al., 2015] M.B. Schmid, R.J. Hopfner, S. Lenhof, H.D. Hummler, and H. Fuchs. Cerebral oxygenation during intermittent hypoxemia and bradycardia in preterm infants. Neonatology, 107:137–146, 2015.

[Van der Maaten and Hinton, 2008] L. Van der Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9:2579–2605, 2008.

[Williamson et al., 2013] James R. Williamson, Daniel W. Bliss, and David Paydarfar. Forecasting respiratory collapse: Theory and practice for averting life-threatening infant apneas. Respiratory Physiology & Neurobiology, 189(2):223–231, 2013.

[Yildirim et al., 2018] Ozal Yildirim, Pawel Plawiak, Ru-San Tan, and U. Rajendra Acharya.
Arrhythmia detec- [Martin and Wilson, 2012] Richard J. Martin and Christo- tion using deep convolutional neural network with long pher G. Wilson. Apnea of prematurity. pages 2923–2931, duration ecg signals. Computers in Biology and Medicine, 2012. 102:411 – 420, 2018. [Mehrotra et al., 2018] Rishabh Mehrotra, James McIner- [Zhou et al., 2015] Bolei Zhou, Aditya Khosla, Àgata ney, Hugues Bouchard, Mounia Lalmas, and Fernando Lapedriza, Aude Oliva, and Antonio Torralba. Learn- Diaz. Towards a fair marketplace: Counterfactual eval- ing deep features for discriminative localization. CoRR, uation of the trade-off between relevance, fairness & satis- abs/1512.04150, 2015. faction in recommendation systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 2243–2251, 2018. 22
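As a rough illustration of the two quantities the Discussion refers to, the sketch below computes a penalty that grows when prototypes sit close together in squared L2 distance, and a diversity score based on unique nearest neighbors. Both definitions are assumptions made for illustration: the inverse-mean form of `diversity_penalty` is a stand-in and not the PDL term of Eq. 2, and `diversity_score`, `squared_l2`, and the toy vectors are hypothetical names and data, not taken from our implementation.

```python
from itertools import combinations


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def diversity_penalty(prototypes, eps=1e-8):
    """Assumed stand-in penalty (not Eq. 2): inverse of the mean squared
    L2 distance from each prototype to its nearest other prototype."""
    m = len(prototypes)
    if m < 2:
        return 0.0
    nearest = [float("inf")] * m
    for i, j in combinations(range(m), 2):
        d = squared_l2(prototypes[i], prototypes[j])
        nearest[i] = min(nearest[i], d)
        nearest[j] = min(nearest[j], d)
    return 1.0 / (sum(nearest) / m + eps)


def diversity_score(prototypes, encodings):
    """Assumed score: number of distinct nearest encodings across all
    prototypes, divided by the number of prototypes."""
    nearest_ids = {
        min(range(len(encodings)), key=lambda k: squared_l2(p, encodings[k]))
        for p in prototypes
    }
    return len(nearest_ids) / len(prototypes)


# Toy latent vectors (hypothetical data).
crowded = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
spread = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
encodings = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

print(diversity_penalty(crowded) > diversity_penalty(spread))
print(diversity_score(crowded, encodings), diversity_score(spread, encodings))
```

In a training loop, a term of this kind would be scaled by λpd and added to the loss, so that crowded prototypes are pushed apart; the score would be evaluated afterwards on the encoded training data.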