<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Features from Pre-trained TimeNet for Clinical Predictions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Priyanka Gupta</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pankaj Malhotra</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lovekesh Vig</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gautam Shroff</string-name>
        </contrib>
        <aff>TCS Research, New Delhi, India
          <email>{priyanka.g, malhotra.pankaj, lovekesh.vig, gautam.shroff}@tcs.com</email>
        </aff>
      </contrib-group>
      <abstract>
        <p>Predictive models based on Recurrent Neural Networks (RNNs) for clinical time series have been successfully used for various tasks such as phenotyping, in-hospital mortality prediction, and diagnostics. However, RNNs require large labeled datasets for training and are computationally expensive to train. Pre-training a network on some supervised or unsupervised task on a dataset, and then fine-tuning it via transfer learning for a related end-task, can be an efficient way to leverage deep models in scenarios that lack computational resources or labeled data, or both. In this work, we consider an approach that leverages a deep RNN, namely TimeNet [Malhotra et al., 2017], pre-trained on a large number of diverse publicly available time series from the UCR Repository [Chen et al., 2015]. TimeNet maps varying-length time series to fixed-dimensional feature vectors and acts as an off-the-shelf feature extractor. The TimeNet-based approach overcomes the need for hand-crafted features, and allows the use of traditional easy-to-train and interpretable linear models for the end-task, while still leveraging the features from a deep neural network. Empirical evaluation of the proposed approach on MIMIC-III data suggests a promising direction for future exploration: our results are comparable to existing benchmarks while our models require less training and hyperparameter tuning effort.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>There has been a growing interest in using deep learning models for various clinical prediction tasks from Electronic Health Records (EHRs), e.g. Doctor AI [Choi et al., 2016] for medical diagnosis, Deep Patient [Miotto et al., 2016] to predict future diseases in patients, DeepR [Nguyen et al., 2017] to predict unplanned readmission after discharge, etc. With various medical parameters being recorded over a period of time in EHR databases, Recurrent Neural Networks (RNNs) can be an effective way to model the sequential aspects of EHR data, e.g. diagnoses [Lipton et al., 2015; Che et al., 2016; Choi et al., 2016], and mortality prediction and estimation of length of stay [Harutyunyan et al., 2017; Purushotham et al., 2017; Rajkomar et al., 2018]. (TimeNet-based features for MIMIC-III time series are available on request from the authors.)</p>
      <p>However, like any other deep learning approach, training RNNs requires large labeled training data, and can be computationally inefficient because of the sequential nature of the computations. On the other hand, a deep network trained on diverse instances can provide generic features for unseen instances, e.g. VGGNet [Simonyan and Zisserman, 2014] for images. Also, fine-tuning a pre-trained network with transfer learning is often faster and easier than constructing and training a new network from scratch [Bengio, 2012]. The advantage of learning in this manner is that the pre-trained network has already learned a rich set of features that can then be applied to a wide range of other, similar tasks.</p>
      <p>Deep RNNs have been shown to perform hierarchical processing of time series, with different layers tackling different time scales [Hermans and Schrauwen, 2013; Malhotra et al., 2015]. TimeNet [Malhotra et al., 2017] is a general-purpose multi-layered RNN trained on a large number of diverse time series from the UCR Time Series Archive [Chen et al., 2015] (refer Section 3 for details) that has been shown to be useful as an off-the-shelf feature extractor for time series. TimeNet was trained on 18 different datasets simultaneously via an RNN autoencoder in an unsupervised manner on a reconstruction task. Features extracted from TimeNet have been found useful for classification on 25 datasets not seen during TimeNet's training, demonstrating its ability to provide meaningful features for unseen datasets.</p>
      <p>In this work, we provide an efficient way to learn prediction models for clinical time series by leveraging general-purpose features via TimeNet. TimeNet maps variable-length clinical time series to fixed-dimensional feature vectors that are subsequently used for patient phenotyping and in-hospital mortality prediction tasks on the MIMIC-III database [Johnson et al., 2016] via easily trainable non-temporal linear classification models. We observe that TimeNet-based features can be used to build such classification models with very little training effort while yielding performance comparable to models with hand-crafted features or carefully trained domain-specific RNNs, as benchmarked in [Harutyunyan et al., 2017; Song et al., 2017]. Further, we propose a simple mechanism to leverage the weights of the linear classification models to provide insights into the relevance of each raw input feature (physiological parameter) for a given phenotype (discussed in Section 4.2).</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>TimeNet-based features have been shown to be useful for various tasks including ECG classification [Malhotra et al., 2017]. In this work, we consider the application of TimeNet to phenotyping and in-hospital mortality tasks for multivariate clinical time series classification. Deep Patient [Miotto et al., 2016] proposes leveraging features from a pre-trained stacked autoencoder for EHR data. However, it does not leverage the temporal aspect of the data, and uses a non-temporal model based on stacked autoencoders. Our approach extracts temporal features via TimeNet, incorporating the sequential nature of EHR data. Doctor AI [Choi et al., 2016] uses discretized medical codes (e.g. diagnosis, medication, procedure) from longitudinal patient visits in a purely supervised setting, while we use real-valued time series. While approaches like Doctor AI require training a deep RNN from scratch, our approach leverages a general-purpose RNN for feature extraction.</p>
      <p>[Harutyunyan et al., 2017] consider training a deep RNN model for multiple prediction tasks simultaneously, including phenotyping and in-hospital mortality, to learn a general-purpose deep RNN for clinical time series. They show that it is possible to train a single network for multiple tasks simultaneously by capturing generic features that work across different tasks. We also consider leveraging generic features for clinical time series, but using an RNN that is pre-trained on diverse time series across domains, making our approach more efficient. Further, we provide an approach to rank the raw input features in order of their relevance, which helps validate the learned models.</p>
    </sec>
    <sec id="sec-3">
      <title>Background: TimeNet</title>
      <p>TimeNet [Malhotra et al., 2017] is a pre-trained off-the-shelf feature extractor for univariate time series, with three recurrent layers having 60 Gated Recurrent Units (GRUs) [Cho et al., 2014] each. TimeNet is an RNN trained via an autoencoder consisting of an encoder RNN and a decoder RNN trained simultaneously using the sequence-to-sequence learning framework [Sutskever et al., 2014; Bahdanau et al., 2014], as shown in Figure 1(a). The RNN autoencoder is trained to obtain the parameters W_E of the encoder RNN f_E via a reconstruction task such that for input x_{1...T} = x_1, x_2, ..., x_T (x_i ∈ ℝ), the target output time series x_{T...1} = x_T, x_{T-1}, ..., x_1 is the reverse of the input.</p>
      <p>
        The RNN encoder f_E provides a non-linear mapping of the univariate input time series to a fixed-dimensional vector representation z_T: z_T = f_E(x_{1...T}; W_E), followed by an RNN decoder f_D based non-linear mapping of z_T back to the univariate time series: x̂_{T...1} = f_D(z_T; W_D), where W_E and W_D are the parameters of the encoder and decoder, respectively. The model is trained to minimize the average squared reconstruction error. Training on 18 diverse datasets simultaneously results in robust time series features being captured in z_T: the decoder relies on z_T as the only input to reconstruct the time series, forcing the encoder to capture all the relevant information in the time series in the fixed-dimensional vector z_T. This vector z_T is used as the feature vector for input x_{1...T}. This feature vector is then used to train a simpler classifier
        <xref ref-type="bibr" rid="ref12 ref15 ref16 ref19 ref7">(e.g. SVM, as used in [Malhotra et al., 2017])</xref>
        for the end task. TimeNet maps a univariate input time series to a 180-dimensional feature vector, where each dimension corresponds to the final output of one of the 60 GRUs in each of the 3 recurrent layers.
      </p>
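      <p>As a concrete illustration of the encoder mapping z_T = f_E(x_{1...T}; W_E), the following is a minimal single-layer GRU encoder in NumPy with randomly initialized weights. This is only a sketch of the GRU recurrence, not TimeNet itself, which has three layers of 60 GRUs each and pre-trained weights that are not reproduced here.</p>
      <preformat>
```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_encode(x, params):
    """Run a single-layer GRU over a univariate series x and return the
    final hidden state z_T as the fixed-dimensional feature vector."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    h = np.zeros(Uz.shape[0])
    for x_t in x:
        z = sigmoid(Wz * x_t + Uz @ h)              # update gate
        r = sigmoid(Wr * x_t + Ur @ h)              # reset gate
        h_cand = np.tanh(Wh * x_t + Uh @ (r * h))   # candidate state
        h = (1.0 - z) * h + z * h_cand              # blend old and new state
    return h

def init_params(hidden, seed=0):
    """Random weights: input weights are vectors (univariate input),
    recurrent weights are hidden x hidden matrices."""
    rng = np.random.default_rng(seed)
    shapes = [(hidden,), (hidden, hidden)] * 3
    return tuple(rng.normal(0.0, 0.1, s) for s in shapes)

params = init_params(hidden=60)
z_T = gru_encode(np.sin(np.linspace(0.0, 6.0, 48)), params)  # 48-step series
```
      </preformat>
      <p>Regardless of the input length, the final hidden state has a fixed dimension (60 here, one per GRU), which is what makes such an encoder usable as an off-the-shelf feature extractor.</p>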
    </sec>
    <sec id="sec-4">
      <title>TimeNet Features for Clinical Time Series</title>
      <p>Consider a set D of labeled time series instances from an EHR database: D = {(x⁽ⁱ⁾, y⁽ⁱ⁾)}_{i=1}^{N}, where x⁽ⁱ⁾ is a multivariate time series, y⁽ⁱ⁾ ∈ {y_1, ..., y_C}, C is the number of classes, and N is the number of unique patients (in our experiments, we consider each episode of hospital stay for a patient as a separate data instance). In this work, we consider the presence or absence of a phenotype as a binary classification task such that C = 2. We learn an independent model for each phenotype
        <xref ref-type="bibr" rid="ref12 ref15 ref16 ref19 ref7">(unlike [Harutyunyan et al., 2017] which consider
phenotyping as a multi-label classification problem)</xref>
        . This allows us to build simple linear binary classification models, as described next in Section 4.1. In practice, the outputs of these binary classifiers can be considered together to estimate the set of phenotypes present in a patient. Similarly, mortality prediction is considered a binary classification task where the goal is to predict whether the patient will survive after admission to the ICU or not.
      </p>
      <sec id="sec-4-1">
        <title>Classification using TimeNet features</title>
      </sec>
      <sec id="sec-4-2">
        <title>Feature Extraction for Multivariate Clinical Time Series</title>
        <p>For a multivariate time series x = x_1 x_2 ... x_T, where x_t ∈ ℝⁿ, we consider the time series for each of the n raw input features (physiological parameters, e.g. glucose level, heart rate, etc.) independently, to obtain univariate time series x_j = x_{j1} x_{j2} ... x_{jT}, j = 1...n. (Note: we use x instead of x⁽ⁱ⁾ and omit the superscript (i) for ease of notation.) We obtain the vector representation z_{jT} = f_E(x_j; W_E) for x_j, where z_{jT} ∈ ℝᶜ, using TimeNet as f_E with c = 180 (as described in Section 3). In general, the time series length T also depends on i, e.g. based on the length of stay in hospital; we omit this for the sake of clarity, without loss of generality. In practice, we convert each time series to equal length T by suitable pre-/post-padding with 0s. We concatenate the TimeNet features z_{jT} for each raw input feature j to get the final feature vector z_T = [z_{1T}; z_{2T}; ...; z_{nT}] for time series x, where z_T ∈ ℝᵐ, m = n × c, as illustrated in Figure 1(b).</p>
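        <p>A minimal sketch of this per-feature extraction and concatenation follows. The `toy_encode` stand-in is hypothetical; in the actual pipeline, TimeNet's encoder f_E produces each c = 180 dimensional embedding.</p>
        <preformat>
```python
import numpy as np

def timenet_features(x, encode, T=48):
    """Map a (t, n) multivariate series to one m = n*c feature vector by
    encoding each raw feature's univariate series independently, then
    concatenating the per-feature embeddings."""
    t, n = x.shape
    if t < T:                                    # pre-pad with zeros
        x = np.vstack([np.zeros((T - t, n)), x])
    else:                                        # or truncate to first T steps
        x = x[:T]
    # z_T = [z_1T; z_2T; ...; z_nT]
    return np.concatenate([encode(x[:, j]) for j in range(n)])

def toy_encode(series, c=180):
    """Hypothetical stand-in for TimeNet's f_E: any function mapping a
    length-T series to a c-dimensional vector (here a crude segment-mean
    summary, used only to exercise the plumbing)."""
    segs = np.array_split(series, c)
    return np.array([s.mean() if s.size else 0.0 for s in segs])

z = timenet_features(np.random.rand(30, 76), toy_encode)  # m = 76 * 180
```
        </preformat>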
      </sec>
      <sec id="sec-4-3">
        <title>Using TimeNet-based Features for Classification</title>
        <p>The final concatenated feature vector z_T is used as input for the phenotyping and mortality prediction classification tasks. We note that since c = 180 is large, z_T has a large number of features m ≫ 180. We consider a linear mapping from the input TimeNet features z_T to the target label y such that the estimate is ŷ = w · z_T, where w ∈ ℝᵐ. We constrain the linear model with weights w to use only a few of this large number of features. The weights are obtained by minimizing the LASSO-regularized squared error:

L(w) = (1/N) Σ_{i=1}^{N} (y⁽ⁱ⁾ − ŷ⁽ⁱ⁾)² + α ‖w‖₁   (1)

where y⁽ⁱ⁾ ∈ {0, 1}, ‖w‖₁ = Σ_{j=1}^{n} Σ_{k=1}^{c} |w_{jk}| is the L1-norm, w_{jk} represents the weight assigned to the k-th TimeNet feature for the j-th raw feature, and α controls the extent of sparsity, with higher α implying more sparsity, i.e. fewer TimeNet features are selected for the final classifier.</p>
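        <p>For illustration, a sparse linear model of this kind can be fit by proximal gradient descent (ISTA) on the squared error with an L1 penalty. This is a sketch on synthetic data; the exact solver used for our experiments is not specified here, and any standard LASSO implementation would serve.</p>
        <preformat>
```python
import numpy as np

def lasso_ista(Z, y, alpha=0.1, lr=None, iters=500):
    """Minimize (1/N)*||y - Z w||^2 + alpha*||w||_1 via ISTA:
    a gradient step on the squared error, then soft-thresholding
    (the proximal operator of the L1 penalty)."""
    N, m = Z.shape
    if lr is None:
        lr = 1.0 / (2.0 * np.linalg.norm(Z, 2) ** 2 / N)  # 1/L step size
    w = np.zeros(m)
    for _ in range(iters):
        grad = -2.0 / N * Z.T @ (y - Z @ w)
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * alpha, 0.0)
    return w

# synthetic data with a sparse ground-truth weight vector
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 50))
w_true = np.zeros(50)
w_true[:3] = [2.0, -1.5, 1.0]
y = Z @ w_true + 0.01 * rng.standard_normal(200)

w = lasso_ista(Z, y, alpha=0.05)
sparsity = np.mean(np.abs(w) < 1e-8)  # fraction of exactly-zero weights
```
        </preformat>
        <p>The soft-thresholding step is what drives most weights to exactly zero, which is the property Section 5 exploits to select a few hundred useful features out of thousands.</p>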
      </sec>
      <sec id="sec-4-4">
        <title>Obtaining Relevance Scores for Raw Features</title>
        <p>
          Determining the relevance of the n raw input features for a given phenotype is potentially useful for obtaining insights into the learned classification model. The sparse weights w are easy to interpret and can give interesting insights into the relevant features for a classification task
          <xref ref-type="bibr" rid="ref13">(e.g. as used in [Micenkova´ et
al., 2013])</xref>
          . We obtain the relevance r_j of the j-th raw input feature as the sum of the absolute values of the weights w_{jk} assigned to the corresponding TimeNet features z_{jT}, as shown in Figure 1(c):
        </p>
        <p>r_j = Σ_{k=1}^{c} |w_{jk}|,   j = 1...n.   (2)

Further, r_j is normalized using min-max normalization such that r′_j = (r_j − r_min)/(r_max − r_min) ∈ [0, 1], where r_min is the minimum and r_max is the maximum of {r_1, ..., r_n}. In practice, such relevance scores for the raw features help to interpret and validate the overall model. For example, one would expect the blood glucose level feature to have a high relevance score when learning a model to detect the diabetes mellitus phenotype (we provide such insights later in Section 5).</p>
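        <p>Equation (2) and the subsequent min-max normalization can be computed directly from the learned weight vector; a small sketch with made-up numbers:</p>
        <preformat>
```python
import numpy as np

def relevance_scores(w, n, c):
    """r_j = sum_k |w_jk| over the c TimeNet features of raw feature j
    (Equation 2), then min-max normalized to [0, 1]."""
    r = np.abs(w.reshape(n, c)).sum(axis=1)     # per-raw-feature relevance
    return (r - r.min()) / (r.max() - r.min())  # min-max normalization

# toy weights: n = 3 raw features, c = 2 TimeNet features each
w = np.array([1.0, -2.0, 0.0, 0.5, 3.0, -3.0])  # raw sums: 3.0, 0.5, 6.0
r_norm = relevance_scores(w, n=3, c=2)
```
        </preformat>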
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Evaluation</title>
      <sec id="sec-5-1">
        <title>Dataset Details</title>
        <p>We use the MIMIC-III (v1.4) clinical database [Johnson et al., 2016], which consists of over 60,000 ICU stays across 40,000 critical care patients. We use the same experimental setup as [Harutyunyan et al., 2017], with the same train, validation and test splits and features, based on 17 physiological time series (12 real-valued and 5 categorical), sampled at 1-hour intervals. The categorical variables are converted to one-hot vectors such that the final multivariate time series has n = 76 raw input features (59 actual features and 17 masking features to denote missing values).</p>
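        <p>The one-hot conversion with masking features can be sketched for a single time step as follows. The variable values and category lists here are hypothetical; the benchmark code's exact encoding may differ in details.</p>
        <preformat>
```python
import numpy as np

def encode_step(real_vals, cat_vals, categories):
    """Encode one time step: real values pass through, each categorical
    variable becomes a one-hot sub-vector, and one mask bit per variable
    (real and categorical) records whether it was observed (None = missing)."""
    feats, masks = [], []
    for v in real_vals:
        feats.append(0.0 if v is None else v)
        masks.append(0.0 if v is None else 1.0)
    for v, cats in zip(cat_vals, categories):
        onehot = [0.0] * len(cats)
        if v is not None:
            onehot[cats.index(v)] = 1.0
        feats.extend(onehot)
        masks.append(0.0 if v is None else 1.0)
    return np.array(feats + masks)

# e.g. 2 real variables (one missing) and 1 categorical with 3 levels
# -> 2 + 3 feature dims plus 3 mask dims = 8 dims total
x = encode_step([98.6, None], ["mid"], [["low", "mid", "high"]])
```
        </preformat>
        <p>Scaled up to 12 real-valued and 5 categorical variables, this scheme yields the 59 actual features plus 17 masks (n = 76) used above.</p>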
        <p>For the phenotyping task, the goal is to classify 25 phenotypes common in adult ICUs. For the in-hospital mortality task, the goal is to predict whether the patient will survive or not given the time series observations up to 48 hours. In all our experiments, we restrict the training time series data to up to the first 48 hours of the ICU stay, such that T = 48 while training all models, to imitate the practical scenario where early predictions are important, unlike [Harutyunyan et al., 2017; Song et al., 2017], which use the entire time series for training the classifier for the phenotyping task.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Evaluation</title>
        <p>We have n = 76 raw input features, resulting in an m = 13,680-dimensional (m = 76 × 180) TimeNet feature vector for each admission. We use α = 0.0001 for the phenotype classifiers and α = 0.0003 for the in-hospital mortality classifier (α is chosen based on a hold-out validation set). Table 1 summarizes the results and provides comparison with existing benchmarks. Refer Table 2 for detailed phenotype-wise results.</p>
        <p>We consider two variants of classifier models for the phenotyping task: i) TimeNet-x, using data from the current episode only; ii) TimeNet-x-Eps, additionally using data from the previous episode of a patient (whenever available) via an extra input feature related to the presence or absence of the phenotype in the previous episode. Each classifier is trained using up to the first 48 hours of data after ICU admission. However, we consider two classifier variants depending upon the hours of data x used to estimate the target class at test time. For x = 48, data up to the first 48 hours after admission is used for determining the phenotype. For x = All, the learned classifier is applied to all 48-hour windows (overlapping, with a shift of 24 hours) over the entire ICU stay period of a patient, and the average phenotype probability across windows is used as the final estimate of the target class. In TimeNet-x-Eps, the additional feature is set to the presence (1) or absence (0) of the phenotype during the previous episode. We use the ground-truth value for this feature at training time, and the probability of presence of the phenotype during the previous episode (as given by the LASSO-based classifier) at test time. (Benchmark splits: https://github.com/yerevann/mimic3-benchmarks)</p>
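        <p>The x = All evaluation scheme, overlapping 48-hour windows shifted by 24 hours with averaged probabilities, can be sketched as follows; `predict` stands for a hypothetical classifier callable returning the phenotype probability for one window.</p>
        <preformat>
```python
def window_starts(stay_hours, width=48, shift=24):
    """Start times of the overlapping windows covering the ICU stay."""
    if stay_hours <= width:
        return [0]
    return list(range(0, stay_hours - width + 1, shift))

def episode_probability(stay_hours, predict, width=48, shift=24):
    """Average the classifier's probability over all 48-hour windows."""
    starts = window_starts(stay_hours, width, shift)
    probs = [predict(s, s + width) for s in starts]
    return sum(probs) / len(probs)

# e.g. a 120-hour stay yields windows starting at hours 0, 24, 48, 72
starts = window_starts(120)
```
        </preformat>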
      </sec>
      <sec id="sec-5-3">
        <title>Observations</title>
      </sec>
      <sec id="sec-5-4">
        <title>Classification Tasks</title>
        <p>For the phenotyping task, we make the following observations from Table 1:
1. TimeNet-48 vs LR: TimeNet-based features perform significantly better than the hand-crafted features used in LR (logistic regression), while using only the first 48 hours of data, unlike the LR approach which uses the entire episode's data. This demonstrates the effectiveness of TimeNet features for MIMIC-III data. Further, our approach requires tuning only a single hyperparameter for LASSO, unlike approaches such as LSTM [Harutyunyan et al., 2017] that involve tuning the number of hidden units, layers, learning rate, etc.
2. TimeNet-x vs TimeNet-x-Eps: Leveraging the previous episode's time series data for a patient significantly improves the classification performance.
3. TimeNet-48-Eps performs better than existing benchmarks, while being more practically feasible as it looks at only up to 48 hours of the current episode of a patient rather than the entire current episode. For the in-hospital mortality task, we observe performance comparable to existing benchmarks.</p>
        <p>Training the linear models is fast: obtaining any one of the binary classifiers took around 30 minutes, including tuning over α ∈ [10⁻⁵, 10⁻³] (five equally-spaced values), on a 32GB RAM machine with a Quad Core i7 2.7GHz processor.</p>
        <p>We observe that LASSO leads to 96.2 ± 0.8% sparsity (i.e. percentage of weights w_{jk} = 0) for all classifiers, leaving around 550 useful features (out of 13,680) for each phenotype classification task.</p>
      </sec>
      <sec id="sec-5-5">
        <title>Relevance Scores for Raw Input Features</title>
        <p>We observe intuitive interpretations for the relevance of raw input features using the weights assigned to the various TimeNet features (refer Equation 2). For example, as shown in Figure 2, we obtain the highest relevance scores for Glucose Level (feature 1) for Diabetes Mellitus with Complications (Figure 2(a)), and for Systolic Blood Pressure (feature 20) for Essential Hypertension (Figure 2(b)). Refer Supplementary Material Figure 3 for more details. We conclude that even though TimeNet was never trained on MIMIC-III data, it still provides meaningful general-purpose features from the time series of the raw input features, and LASSO helps select the most relevant ones for the end task using labeled data. Further, extracting features using a deep recurrent neural network for the time series of each raw input feature independently, rather than from the multivariate time series as a whole, allows us to easily assign relevance scores to raw features in the input domain, enabling a high-level, basic model validation by domain experts.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Discussion and Future Work</title>
      <p>In this work, we leverage deep learning models efficiently via TimeNet for phenotyping and mortality prediction tasks, with little hyperparameter tuning effort. TimeNet-based features can be efficiently transferred to train linear interpretable classifiers for the end tasks considered, while still achieving classification performance similar to more compute-intensive deep models trained from scratch. In the future, it will be interesting to evaluate a domain-specific TimeNet-like model for clinical time series (e.g. trained only on the MIMIC-III database).</p>
      <p>Figure 3: Feature relevance scores for 25 phenotypes. Refer Table 2 for names of phenotypes, and Table 3 for names of raw features.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Bahdanau et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <surname>Yoshua Bengio.</surname>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>arXiv preprint arXiv:1409.0473</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Bengio,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Deep learning of representations for unsupervised and transfer learning</article-title>
          .
          <source>In Proceedings of ICML Workshop on Unsupervised and Transfer Learning</source>
          , pages
          <fpage>17</fpage>
          -
          <lpage>36</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Che et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Zhengping</given-names>
            <surname>Che</surname>
          </string-name>
          , Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu.
          <article-title>Recurrent neural networks for multivariate time series with missing values</article-title>
          .
          <source>arXiv preprint arXiv:1606</source>
          .
          <year>01865</year>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Chen et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Yanping</given-names>
            <surname>Chen</surname>
          </string-name>
          , Eamonn Keogh, Bing Hu,
          <string-name>
            <given-names>Nurjahan</given-names>
            <surname>Begum</surname>
          </string-name>
          , et al.
          <article-title>The ucr time series classification archive</article-title>
          ,
          <year>July 2015</year>
          . www.cs.ucr.edu/~eamonn/time_series_data/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Cho et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart Van Merrie¨nboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1406.1078</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Choi et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Edward</given-names>
            <surname>Choi</surname>
          </string-name>
          , Mohammad Taha Bahadori, Andy Schuetz, Walter F Stewart,
          <string-name>
            <given-names>and Jimeng</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <article-title>Doctor ai: Predicting clinical events via recurrent neural networks</article-title>
          .
          <source>In Machine Learning for Healthcare Conference</source>
          , pages
          <fpage>301</fpage>
          -
          <lpage>318</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Harutyunyan et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Hrayr</given-names>
            <surname>Harutyunyan</surname>
          </string-name>
          , Hrant Khachatrian, David C Kale, and
          <string-name>
            <given-names>Aram</given-names>
            <surname>Galstyan</surname>
          </string-name>
          .
          <article-title>Multitask learning and benchmarking with clinical time series data</article-title>
          .
          <source>arXiv preprint arXiv:1703.07771</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Hermans and Schrauwen,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Michiel</given-names>
            <surname>Hermans</surname>
          </string-name>
          and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Schrauwen</surname>
          </string-name>
          .
          <article-title>Training and analysing deep recurrent neural networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>190</fpage>
          -
          <lpage>198</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Johnson et al.,
          <year>2016</year>
          ] Alistair EW Johnson
          , Tom J Pollard, Lu Shen,
          <string-name>
            <given-names>H Lehman</given-names>
            <surname>Li-wei</surname>
          </string-name>
          , Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark.
          <article-title>Mimic-iii, a freely accessible critical care database</article-title>
          .
          <source>Scientific data</source>
          ,
          <volume>3</volume>
          :
          <fpage>160035</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Lipton et al.,
          <year>2015</year>
          ] Zachary C Lipton, David C Kale, Charles Elkan
          , and
          <string-name>
            <given-names>Randall</given-names>
            <surname>Wetzel</surname>
          </string-name>
          .
          <article-title>Learning to diagnose with lstm recurrent neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1511.03677</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Malhotra et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Pankaj</given-names>
            <surname>Malhotra</surname>
          </string-name>
          , Lovekesh Vig, Gautam Shroff, and Puneet Agarwal.
          <article-title>Long Short Term Memory Networks for Anomaly Detection in Time Series</article-title>
          .
          <source>In ESANN, 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning</source>
          , pages
          <fpage>89</fpage>
          -
          <lpage>94</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Malhotra et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Pankaj</given-names>
            <surname>Malhotra</surname>
          </string-name>
          ,
          <string-name>
            <surname>Vishnu</surname>
            <given-names>TV</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lovekesh</surname>
            <given-names>Vig</given-names>
          </string-name>
          , Puneet Agarwal, and Gautam Shroff.
          <article-title>TimeNet: Pre-trained deep recurrent neural network for time series classification</article-title>
          .
          <source>In 25th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Micenková et al.,
          <year>2013</year>
          ] Barbora Micenková,
          <string-name>
            <surname>Xuan-Hong</surname>
            <given-names>Dang</given-names>
          </string-name>
          , Ira Assent, and Raymond T Ng.
          <article-title>Explaining outliers by subspace separability</article-title>
          .
          <source>In Data Mining (ICDM), 2013 IEEE 13th International Conference on</source>
          , pages
          <fpage>518</fpage>
          -
          <lpage>527</lpage>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Miotto et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Riccardo</given-names>
            <surname>Miotto</surname>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>Li</given-names>
          </string-name>
          ,
          Brian A Kidd, and Joel T Dudley.
          <article-title>Deep patient: an unsupervised representation to predict the future of patients from the electronic health records</article-title>
          .
          <source>Scientific reports</source>
          ,
          <volume>6</volume>
          :
          <fpage>26094</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Nguyen et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Phuoc</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Truyen Tran, Nilmini Wickramasinghe, and
          <string-name>
            <given-names>Svetha</given-names>
            <surname>Venkatesh</surname>
          </string-name>
          .
          <article-title>Deepr: A convolutional net for medical records</article-title>
          .
          <source>IEEE journal of biomedical and health informatics</source>
          ,
          <volume>21</volume>
          (
          <issue>1</issue>
          ):
          <fpage>22</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Purushotham et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Sanjay</given-names>
            <surname>Purushotham</surname>
          </string-name>
          , Chuizheng Meng, Zhengping Che, and Yan Liu.
          <article-title>Benchmark of deep learning models on large healthcare mimic datasets</article-title>
          .
          <source>arXiv preprint arXiv:1710.08531</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Rajkomar et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Alvin</given-names>
            <surname>Rajkomar</surname>
          </string-name>
          , Eyal Oren, Kai Chen, Andrew M Dai,
          <string-name>
            <given-names>Nissan</given-names>
            <surname>Hajaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter J</given-names>
            <surname>Liu</surname>
          </string-name>
          , Xiaobing Liu, Mimi Sun, Patrik Sundberg,
          <string-name>
            <given-names>Hector</given-names>
            <surname>Yee</surname>
          </string-name>
          , et al.
          <article-title>Scalable and accurate deep learning for electronic health records</article-title>
          .
          <source>arXiv preprint arXiv:1801.07860</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Simonyan and Zisserman,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Song et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Huan</given-names>
            <surname>Song</surname>
          </string-name>
          , Deepta Rajan,
          <string-name>
            <given-names>Jayaraman J</given-names>
            <surname>Thiagarajan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Spanias</surname>
          </string-name>
          .
          <article-title>Attend and diagnose: Clinical time series analysis using attention models</article-title>
          .
          <source>arXiv preprint arXiv:1711.03905</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Sutskever et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , Oriol Vinyals, and Quoc V Le.
          <article-title>Sequence to sequence learning with neural networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Tibshirani,
          <year>1996</year>
          ]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          .
          <article-title>Regression shrinkage and selection via the lasso</article-title>
          .
          <source>Journal of the Royal Statistical Society. Series B (Methodological)</source>
          , pages
          <fpage>267</fpage>
          -
          <lpage>288</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>Glascow coma scale motor response ! 4 Flex-withdraws Glascow coma scale motor response ! No response Glascow coma scale eye opening ! Spontaneously Glascow coma scale verbal response ! 4 Confused Capillary refill rate ! 0.0 Glascow coma scale total ! 13 Glascow coma scale eye opening ! 1 No Response Glascow coma scale motor response ! Abnormal extension Glascow coma scale total ! 11</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <article-title>Glascow coma scale verbal response ! 2 Incomp sounds Glascow coma scale total ! 9</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <article-title>Glascow coma scale motor response ! Abnormal Flexion Glascow coma scale verbal response ! 1 No Response Glascow coma scale motor response ! 2 Abnorm extensn pH Glascow coma scale eye opening ! 4 Spontaneously Oxygen saturation</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>