Using Features from Pre-trained TimeNet for Clinical Predictions

Priyanka Gupta, Pankaj Malhotra, Lovekesh Vig, Gautam Shroff
TCS Research, New Delhi, India
{priyanka.g35, malhotra.pankaj, lovekesh.vig, gautam.shroff}@tcs.com

Abstract

Predictive models based on Recurrent Neural Networks (RNNs) for clinical time series have been successfully used for various tasks such as phenotyping, in-hospital mortality prediction, and diagnostics. However, RNNs require large labeled data for training and are computationally expensive to train. Pre-training a network on some supervised or unsupervised task on a dataset, and then fine-tuning it via transfer learning for a related end-task, can be an efficient way to leverage deep models in scenarios that lack computational resources or labeled data, or both. In this work, we consider an approach that leverages a deep RNN, namely TimeNet [Malhotra et al., 2017], pre-trained on a large number of diverse, publicly available time series from the UCR Repository [Chen et al., 2015]. TimeNet maps varying-length time series to fixed-dimensional feature vectors and acts as an off-the-shelf feature extractor. The TimeNet-based approach overcomes the need for hand-crafted features and allows the use of traditional, easy-to-train, and interpretable linear models for the end-task, while still leveraging features from a deep neural network. Empirical evaluation of the proposed approach on MIMIC-III data¹ suggests a promising direction for future exploration: our results are comparable to existing benchmarks, while our models require less training and hyperparameter tuning effort.

¹ TimeNet-based features for MIMIC-III time series are available on request from the authors.

1 Introduction

There has been growing interest in using deep learning models for various clinical prediction tasks from Electronic Health Records (EHR), e.g. Doctor AI [Choi et al., 2016] for medical diagnosis, Deep Patient [Miotto et al., 2016] to predict future diseases in patients, and DeepR [Nguyen et al., 2017] to predict unplanned readmission after discharge. With various medical parameters being recorded over time in EHR databases, Recurrent Neural Networks (RNNs) can be an effective way to model the sequential aspects of EHR data, e.g. for diagnoses [Lipton et al., 2015; Che et al., 2016; Choi et al., 2016], and for mortality prediction and length-of-stay estimation [Harutyunyan et al., 2017; Purushotham et al., 2017; Rajkomar et al., 2018].

However, training RNNs requires large labeled training data, like any other deep learning approach, and can be computationally inefficient because of the sequential nature of the computations. On the other hand, a deep network trained on diverse instances can provide generic features for unseen instances, e.g. VGGNet [Simonyan and Zisserman, 2014] for images. Also, fine-tuning a pre-trained network via transfer learning is often faster and easier than constructing and training a new network from scratch [Bengio, 2012]: the pre-trained network has already learned a rich set of features that can then be applied to a wide range of other, similar tasks.

Deep RNNs have been shown to perform hierarchical processing of time series, with different layers tackling different time scales [Hermans and Schrauwen, 2013; Malhotra et al., 2015]. TimeNet [Malhotra et al., 2017] is a general-purpose, multi-layered RNN trained on a large number of diverse time series from the UCR Time Series Archive [Chen et al., 2015] (refer to Section 3 for details) that has been shown to be useful as an off-the-shelf feature extractor for time series. TimeNet was trained on 18 different datasets simultaneously via an RNN autoencoder, in an unsupervised manner, on a reconstruction task. Features extracted from TimeNet have been found useful for classification on 25 datasets not seen during TimeNet's training, demonstrating its ability to provide meaningful features for unseen datasets.
In this work, we provide an efficient way to learn prediction models for clinical time series by leveraging general-purpose features from TimeNet. TimeNet maps variable-length clinical time series to fixed-dimensional feature vectors, which are subsequently used for patient phenotyping and in-hospital mortality prediction on the MIMIC-III database [Johnson et al., 2016] via easily trainable, non-temporal, linear classification models. We observe that TimeNet-based features can be used to build such classification models with very little training effort, while yielding performance comparable to models with hand-crafted features or carefully trained domain-specific RNNs, as benchmarked in [Harutyunyan et al., 2017; Song et al., 2017]. Further, we propose a simple mechanism that leverages the weights of the linear classification models to provide insights into the relevance of each raw input feature (physiological parameter) for a given phenotype (discussed in Section 4.2).

2 Related Work

TimeNet-based features have been shown to be useful for various tasks, including ECG classification [Malhotra et al., 2017]. In this work, we consider the application of TimeNet to phenotyping and in-hospital mortality prediction for multivariate clinical time series. Deep Patient [Miotto et al., 2016] leverages features from a pre-trained stacked autoencoder for EHR data; however, it does not exploit the temporal aspect of the data and uses a non-temporal model based on stacked autoencoders. Our approach extracts temporal features via TimeNet, incorporating the sequential nature of EHR data. Doctor AI [Choi et al., 2016] uses discretized medical codes (e.g. diagnosis, medication, procedure) from longitudinal patient visits in a purely supervised setting, while we use real-valued time series. Moreover, while approaches like Doctor AI require training a deep RNN from scratch, our approach leverages a general-purpose RNN for feature extraction.

[Harutyunyan et al., 2017] train a deep RNN on multiple prediction tasks simultaneously, including phenotyping and in-hospital mortality, to obtain a general-purpose deep RNN for clinical time series. They show that a single network trained on multiple tasks can capture generic features that work across tasks. We also leverage generic features for clinical time series, but using an RNN pre-trained on diverse time series across domains, which makes our approach more efficient. Further, we provide an approach to rank the raw input features in order of their relevance, which helps validate the learned models.
3 Background: TimeNet

TimeNet [Malhotra et al., 2017] is a pre-trained, off-the-shelf feature extractor for univariate time series, with three recurrent layers of 60 Gated Recurrent Units (GRUs) [Cho et al., 2014] each. TimeNet is an RNN trained via an autoencoder consisting of an encoder RNN and a decoder RNN, trained simultaneously using the sequence-to-sequence learning framework [Sutskever et al., 2014; Bahdanau et al., 2014], as shown in Figure 1(a). The RNN autoencoder is trained to obtain the parameters W_E of the encoder RNN f_E via a reconstruction task: for an input x_{1...T} = x_1, x_2, ..., x_T (x_i ∈ R), the target output time series x_{T...1} = x_T, x_{T−1}, ..., x_1 is the reverse of the input. The RNN encoder f_E provides a non-linear mapping of the univariate input time series to a fixed-dimensional vector representation z_T, i.e. z_T = f_E(x_{1...T}; W_E), followed by an RNN decoder f_D that provides a non-linear mapping of z_T back to a univariate time series, x̂_{T...1} = f_D(z_T; W_D), where W_E and W_D are the parameters of the encoder and decoder, respectively. The model is trained to minimize the average squared reconstruction error. Training on 18 diverse datasets simultaneously results in robust time series features being captured in z_T: the decoder relies on z_T as its only input to reconstruct the time series, forcing the encoder to capture all the relevant information in the time series into the fixed-dimensional vector z_T. This vector z_T is used as the feature vector for the input x_{1...T}, and is then used to train a simpler classifier (e.g. an SVM, as used in [Malhotra et al., 2017]) for the end task. TimeNet maps a univariate input time series to a 180-dimensional feature vector, where each dimension corresponds to the final output of one of the 60 GRUs in the 3 recurrent layers.

Figure 1: (a) TimeNet trained via an RNN encoder-decoder with three hidden GRU layers. (b) TimeNet-based feature extraction; TimeNet is shown unrolled over time. (c) Obtaining relevance scores for raw input features. Here, T: time series length; n: number of raw input features.
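The pre-trained TimeNet weights are not reproduced here, but the encoder side described above is simple to sketch. The following is a minimal, illustrative PyTorch version (the class name and the random toy input are ours, and a real TimeNet would load the weights learned on the UCR datasets): the 180-dimensional vector z_T is obtained by concatenating the final hidden state of each of the three 60-GRU layers.

```python
# Minimal sketch of a TimeNet-style encoder (architecture only; the actual
# pre-trained TimeNet weights are not reproduced here). A univariate series
# of length T is mapped to a 180-dim vector by concatenating the final
# hidden states of the 3 recurrent layers (3 x 60 GRUs).
import torch
import torch.nn as nn

class TimeNetStyleEncoder(nn.Module):
    def __init__(self, hidden_size=60, num_layers=3):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden_size,
                          num_layers=num_layers, batch_first=True)

    def forward(self, x):                    # x: (batch, T, 1)
        _, h_n = self.gru(x)                 # h_n: (num_layers, batch, hidden)
        # Concatenate the last hidden state of each layer -> (batch, 180)
        return h_n.permute(1, 0, 2).reshape(x.size(0), -1)

encoder = TimeNetStyleEncoder()
z_T = encoder(torch.randn(8, 48, 1))         # 8 toy series of length T = 48
print(z_T.shape)                             # torch.Size([8, 180])
```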
4 TimeNet Features for Clinical Time Series

Consider a set D of labeled time series instances from an EHR database: D = {(x^(i), y^(i))}_{i=1}^{N}, where x^(i) is a multivariate time series, y^(i) ∈ {y_1, ..., y_C}, C is the number of classes, and N is the number of unique patients (in our experiments, we consider each episode of hospital stay of a patient as a separate data instance). In this work, we treat the presence or absence of a phenotype as a binary classification task, such that C = 2, and learn an independent model for each phenotype (unlike [Harutyunyan et al., 2017], which treats phenotyping as a multi-label classification problem). This allows us to build simple linear binary classification models, as described next in Section 4.1. In practice, the outputs of these binary classifiers can be considered together to estimate the set of phenotypes present in a patient. Similarly, mortality prediction is treated as a binary classification task, where the goal is to classify whether the patient will survive after admission to the ICU or not.

4.1 Classification using TimeNet Features

Feature Extraction for Multivariate Clinical Time Series. For a multivariate time series x = x_1 x_2 ... x_T, where x_t ∈ R^n, we consider the time series of each of the n raw input features (physiological parameters, e.g. glucose level, heart rate) independently, obtaining univariate time series x^j = x^j_1 x^j_2 ... x^j_T, j = 1...n. (Note: we use x instead of x^(i) and omit the superscript (i) for ease of notation.) We obtain the vector representation z^j_T = f_E(x^j; W_E) for x^j, where z^j_T ∈ R^c, using TimeNet as f_E with c = 180 (as described in Section 3). In general, the time series length T also depends on i, e.g. on the length of stay in hospital; we omit this for the sake of clarity, without loss of generality. In practice, we convert each time series to a common length T by suitable pre/post-padding with 0s. We concatenate the TimeNet features z^j_T of each raw input feature j to get the final feature vector z_T = [z^1_T, z^2_T, ..., z^n_T] for the time series x, where z_T ∈ R^m, m = n × c, as illustrated in Figure 1(b).

Using TimeNet-based Features for Classification. The final concatenated feature vector z_T is used as input for the phenotyping and mortality prediction classification tasks. We note that since c = 180 is large, z_T has a large number of features, m ≥ 180. We consider a linear mapping from the input TimeNet features z_T to the target label y, such that the estimate is ŷ = w · z_T, where w ∈ R^m, and constrain the linear model to use only a few of this large number of features. The weights w are obtained by minimizing the LASSO-regularized loss function [Tibshirani, 1996]:

    min_w  (1/N) Σ_{i=1}^{N} ( y^(i) − w · z_T^(i) )² + α ||w||_1        (1)

where y^(i) ∈ {0, 1}, ||w||_1 = Σ_{j=1}^{n} Σ_{k=1}^{c} |w_jk| is the L1-norm, w_jk is the weight assigned to the k-th TimeNet feature of the j-th raw feature, and α controls the extent of sparsity, with higher α implying more sparsity, i.e. fewer TimeNet features are selected for the final classifier.
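As a concrete illustration of Section 4.1, the sketch below strings together the per-feature extraction and the LASSO fit. It assumes the `encoder` from the previous snippet and scikit-learn; the helper name `timenet_features`, the toy shapes, and the use of `sklearn.linear_model.Lasso` (whose objective matches Equation (1) up to a constant factor) are our choices, not the authors' released code.

```python
# Sketch of Section 4.1, assuming `encoder` from the previous snippet and a
# dataset X of shape (N, T, n) already pre/post-padded to a common length T.
import numpy as np
import torch
from sklearn.linear_model import Lasso

def timenet_features(X):
    """Concatenate per-feature TimeNet embeddings: (N, T, n) -> (N, n*180)."""
    N, T, n = X.shape
    feats = []
    with torch.no_grad():
        for j in range(n):  # univariate series of raw feature j
            xj = torch.tensor(X[:, :, j:j + 1], dtype=torch.float32)
            feats.append(encoder(xj).numpy())      # (N, 180)
    return np.concatenate(feats, axis=1)           # z_T, shape (N, n*180)

# Toy data (the real inputs are the 76 MIMIC-III channels over T = 48 hours).
X = np.random.randn(32, 48, 4)
y = np.random.randint(0, 2, size=32)

Z = timenet_features(X)
clf = Lasso(alpha=1e-4)   # alpha plays the role of the sparsity weight in (1)
clf.fit(Z, y)             # minimizes a squared loss + alpha*||w||_1, as in (1)
y_hat = clf.predict(Z)    # continuous score; threshold for a binary decision
```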
4.2 Obtaining Relevance Scores for Raw Features

Determining the relevance of the n raw input features for a given phenotype is useful for gaining insight into the learned classification model. The sparse weights w are easy to interpret and can reveal which features matter for a classification task (e.g. as used in [Micenková et al., 2013]). We obtain the relevance r_j of the j-th raw input feature as the sum of the absolute values of the weights w_jk assigned to the corresponding TimeNet features z^j_T, as shown in Figure 1(c):

    r_j = Σ_{k=1}^{c} |w_jk|,  j = 1...n.        (2)

Further, r_j is min-max normalized as r'_j = (r_j − r_min)/(r_max − r_min) ∈ [0, 1], where r_min is the minimum of {r_1, ..., r_n} and r_max the maximum. In practice, such relevance scores help to interpret and validate the overall model; for example, one would expect the blood glucose level to have a high relevance score in a model that detects the diabetes mellitus phenotype (we provide such insights in Section 5).
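Equation (2) and the min-max normalization translate directly into a few lines of code. This sketch assumes the `clf` fitted above and the same column ordering as `timenet_features` (feature j occupies columns j*c to (j+1)*c − 1).

```python
# Sketch of Section 4.2: per-raw-feature relevance from the LASSO weights,
# assuming `clf` from the previous snippet, n raw features, c = 180 each.
import numpy as np

def relevance_scores(clf, n, c=180):
    W = clf.coef_.reshape(n, c)      # row j holds w_jk, k = 1..c, as in Eq. (2)
    r = np.abs(W).sum(axis=1)        # r_j = sum_k |w_jk|
    # Min-max normalize to [0, 1]; epsilon guards against a constant r.
    return (r - r.min()) / (r.max() - r.min() + 1e-12)

scores = relevance_scores(clf, n=4)          # n = 4 for the toy data above
print(np.argsort(scores)[::-1])              # raw features, most relevant first
```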
5 Experimental Evaluation

5.1 Dataset Details

We use the MIMIC-III (v1.4) clinical database [Johnson et al., 2016], which covers over 60,000 ICU stays of about 40,000 critical care patients. We use the same experimental setup as [Harutyunyan et al., 2017], with the same train, validation, and test splits and the same features (https://github.com/yerevann/mimic3-benchmarks), based on 17 physiological time series (12 real-valued and 5 categorical), sampled at 1-hour intervals. The categorical variables are converted to one-hot vectors, so that the final multivariate time series has n = 76 raw input features (59 actual features and 17 masking features denoting missing values).

For the phenotyping task, the goal is to classify 25 phenotypes common in adult ICUs. For the in-hospital mortality task, the goal is to predict whether the patient will survive or not, given the time series observations of the first 48 hours. In all our experiments, we restrict the training time series to the first 48 hours of the ICU stay, such that T = 48 when training all models, to imitate the practical scenario in which early predictions are important, unlike [Harutyunyan et al., 2017; Song et al., 2017], which use the entire time series to train the phenotyping classifier.

5.2 Evaluation

With n = 76 raw input features, we obtain an m = 13,680-dimensional (m = 76 × 180) TimeNet feature vector for each admission. We use α = 0.0001 for the phenotype classifiers and α = 0.0003 for the in-hospital mortality classifier (α is chosen on a hold-out validation set). Table 1 summarizes the results and compares them with existing benchmarks; refer to Table 2 for detailed phenotype-wise results.

We consider two variants of classifier models for the phenotyping task: i) TimeNet-x, using data from the current episode only, and ii) TimeNet-x-Eps, additionally using data from the previous episode of a patient (whenever available) via an extra input feature. Each classifier is trained using up to the first 48 hours of data after ICU admission; the two variants differ in the number of hours x of data used to estimate the target class at test time. For x = 48, data up to the first 48 hours after admission is used to determine the phenotype. For x = All, the learned classifier is applied to all 48-hour windows (overlapping, with a shift of 24 hours) over the entire ICU stay of a patient, and the average phenotype probability across windows is used as the final estimate of the target class (see the sketch below). In TimeNet-x-Eps, the additional feature indicates the presence (1) or absence (0) of the phenotype during the previous episode; we use the ground-truth value of this feature at training time, and the probability of the phenotype's presence during the previous episode (as given by the LASSO-based classifier) at test time.
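One possible reading of the x = All scheme described above, as a sketch: the window length (48 hours) and shift (24 hours) are from the text, while the helper name and the use of the raw LASSO output as a per-window phenotype score (the paper averages phenotype probabilities) are our assumptions.

```python
# Sketch of the TimeNet-All scheme: score every 48-hour window (24-hour
# shift) of a full ICU stay and average the per-window estimates.
# `timenet_features` and `clf` are from the earlier snippets.
import numpy as np

def phenotype_score_all(x_full, win=48, shift=24):
    """x_full: (T_stay, n) multivariate series for one full ICU stay."""
    T_stay = x_full.shape[0]
    starts = range(0, max(T_stay - win, 0) + 1, shift)
    scores = [
        float(clf.predict(timenet_features(x_full[s:s + win][None]))[0])
        for s in starts
    ]
    return float(np.mean(scores))   # final estimate = average over windows

x_full = np.random.randn(120, 4)    # a 120-hour toy stay with n = 4 channels
print(phenotype_score_all(x_full))
```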
Table 1: Classification performance comparison. LR: logistic regression; LSTM-Multi: LSTM-based multitask model; SAnD (Simply Attend and Diagnose): fully attention-based model; SAnD-Multi: SAnD-based multitask model. (*For phenotyping, we compare TimeNet-48-Eps, rather than TimeNet-All-Eps, with existing benchmarks, as it is more applicable in practical scenarios. **Only the TimeNet-48 variant is applicable to the in-hospital mortality task.)

                [Harutyunyan et al., 2017]      [Song et al., 2017]     Proposed (features using [Malhotra et al., 2017])
Metric          LR      LSTM    LSTM-Multi      SAnD    SAnD-Multi      TimeNet-48  TimeNet-All  TimeNet-48-Eps  TimeNet-All-Eps*
Task 1: Phenotyping
Micro AUC       0.801   0.821   0.817           0.816   0.819           0.812       0.813        0.820           0.822
Macro AUC       0.741   0.770   0.766           0.766   0.771           0.761       0.764        0.772           0.775
Weighted AUC    0.732   0.757   0.753           0.754   0.759           0.751       0.754        0.765           0.768
Task 2: In-Hospital Mortality Prediction**
AUROC           0.845   0.854   0.863           0.857   0.859           0.852       -            -               -
AUPRC           0.472   0.516   0.517           0.518   0.519           0.519       -            -               -
min(Se, +P)     0.469   0.491   0.499           0.500   0.504           0.486       -            -               -

Training the linear models is fast: it took around 30 minutes to obtain any of the binary classifiers, including tuning over α ∈ [10^−5, 10^−3] (five equally spaced values), on a 32 GB RAM machine with a quad-core i7 2.7 GHz processor. We observe that LASSO leads to 96.2 ± 0.8% sparsity (i.e. percentage of weights w_jk ≈ 0) across all classifiers, leaving around 550 useful features (out of 13,680) for each phenotype classifier.

5.3 Observations

Classification Tasks. For the phenotyping task, we make the following observations from Table 1:
1. TimeNet-48 vs LR: TimeNet-based features perform significantly better than the hand-crafted features used in LR (logistic regression), while using only the first 48 hours of data, unlike the LR approach, which uses the entire episode's data. This demonstrates the effectiveness of TimeNet features for MIMIC-III data. Further, our approach requires tuning only a single hyperparameter α for LASSO, unlike approaches such as the LSTM of [Harutyunyan et al., 2017], which involve tuning the number of hidden units, the number of layers, the learning rate, etc.
2. TimeNet-x vs TimeNet-x-Eps: leveraging the previous episode's time series data for a patient significantly improves classification performance.
3. TimeNet-48-Eps performs better than existing benchmarks, while remaining practically more feasible, as it looks only at the first 48 hours of the current episode rather than the entire episode.
For the in-hospital mortality task, we observe performance comparable to existing benchmarks.

Relevance Scores for Raw Input Features. The weights assigned to the TimeNet features (refer to Equation 2) yield intuitively interpretable relevance scores for the raw input features. For example, as shown in Figure 2, we obtain the highest relevance score for Glucose Level (feature 1) for Diabetes Mellitus with Complications (Figure 2(a)), and for Systolic Blood Pressure (feature 20) for Essential Hypertension (Figure 2(b)). Refer to Figure 3 in the supplementary material for more details. We conclude that even though TimeNet was never trained on MIMIC-III data, it still provides meaningful general-purpose features from the time series of the raw input features, and LASSO helps to select the most relevant ones for the end task using labeled data. Further, extracting features with a deep recurrent network from the time series of each raw input feature independently, rather than from the multivariate time series as a whole, makes it easy to assign relevance scores to raw features in the input domain, allowing high-level model validation by domain experts.

Figure 2: Feature relevance after LASSO. x-axis: feature number; y-axis: relevance score. Here, P1: Diabetes Mellitus with Complications; P2: Essential Hypertension. [Plots omitted.]

6 Discussion and Future Work

In this work, we leverage deep learning models efficiently via TimeNet for phenotyping and mortality prediction tasks, with little hyperparameter tuning effort. TimeNet-based features can be efficiently transferred to train linear, interpretable classifiers for the end tasks considered, while still achieving classification performance similar to that of more compute-intensive deep models trained from scratch. In the future, it will be interesting to evaluate a domain-specific TimeNet-like model for clinical time series (e.g. trained only on the MIMIC-III database).
References

[Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[Bengio, 2012] Yoshua Bengio. Deep learning of representations for unsupervised and transfer learning. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, pages 17–36, 2012.
[Che et al., 2016] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. arXiv preprint arXiv:1606.01865, 2016.
[Chen et al., 2015] Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, et al. The UCR time series classification archive, July 2015. www.cs.ucr.edu/~eamonn/time_series_data/.
[Cho et al., 2014] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[Choi et al., 2016] Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. Doctor AI: Predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pages 301–318, 2016.
[Harutyunyan et al., 2017] Hrayr Harutyunyan, Hrant Khachatrian, David C. Kale, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017.
[Hermans and Schrauwen, 2013] Michiel Hermans and Benjamin Schrauwen. Training and analysing deep recurrent neural networks. In Advances in Neural Information Processing Systems, pages 190–198, 2013.
[Johnson et al., 2016] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
[Lipton et al., 2015] Zachary C. Lipton, David C. Kale, Charles Elkan, and Randall Wetzel. Learning to diagnose with LSTM recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015.
[Malhotra et al., 2015] Pankaj Malhotra, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal. Long Short Term Memory networks for anomaly detection in time series. In ESANN, 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pages 89–94, 2015.
[Malhotra et al., 2017] Pankaj Malhotra, Vishnu TV, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. TimeNet: Pre-trained deep recurrent neural network for time series classification. In 25th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2017.
[Micenková et al., 2013] Barbora Micenková, Xuan-Hong Dang, Ira Assent, and Raymond T. Ng. Explaining outliers by subspace separability. In 2013 IEEE 13th International Conference on Data Mining (ICDM), pages 518–527. IEEE, 2013.
[Miotto et al., 2016] Riccardo Miotto, Li Li, Brian A. Kidd, and Joel T. Dudley. Deep Patient: An unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports, 6:26094, 2016.
[Nguyen et al., 2017] Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, and Svetha Venkatesh. Deepr: A convolutional net for medical records. IEEE Journal of Biomedical and Health Informatics, 21(1):22–30, 2017.
[Purushotham et al., 2017] Sanjay Purushotham, Chuizheng Meng, Zhengping Che, and Yan Liu. Benchmark of deep learning models on large healthcare MIMIC datasets. arXiv preprint arXiv:1710.08531, 2017.
[Rajkomar et al., 2018] Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M. Dai, Nissan Hajaj, Peter J. Liu, Xiaobing Liu, Mimi Sun, Patrik Sundberg, Hector Yee, et al. Scalable and accurate deep learning for electronic health records. arXiv preprint arXiv:1801.07860, 2018.
[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[Song et al., 2017] Huan Song, Deepta Rajan, Jayaraman J. Thiagarajan, and Andreas Spanias. Attend and diagnose: Clinical time series analysis using attention models. arXiv preprint arXiv:1711.03905, 2017.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[Tibshirani, 1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

Table 2: Phenotype-wise classification performance in terms of AUROC.

S.No.  Phenotype                                                     LSTM-Multi  TimeNet-48  TimeNet-All  TimeNet-48-Eps  TimeNet-All-Eps
1      Acute and unspecified renal failure                           0.8035      0.7861      0.7887       0.7912          0.7941
2      Acute cerebrovascular disease                                 0.9089      0.8989      0.9031       0.8986          0.9033
3      Acute myocardial infarction                                   0.7695      0.7501      0.7478       0.7533          0.7509
4      Cardiac dysrhythmias                                          0.6840      0.6853      0.7005       0.7096          0.7239
5      Chronic kidney disease                                        0.7771      0.7764      0.7888       0.7960          0.8061
6      Chronic obstructive pulmonary disease and bronchiectasis      0.6786      0.7096      0.7236       0.7460          0.7605
7      Complications of surgical procedures or medical care          0.7176      0.7061      0.6998       0.7092          0.7029
8      Conduction disorders                                          0.7260      0.7070      0.7111       0.7286          0.7324
9      Congestive heart failure; nonhypertensive                     0.7608      0.7464      0.7541       0.7747          0.7805
10     Coronary atherosclerosis and other heart disease              0.7922      0.7764      0.7760       0.8007          0.8016
11     Diabetes mellitus with complications                          0.8738      0.8748      0.8800       0.8856          0.8887
12     Diabetes mellitus without complication                        0.7897      0.7749      0.7853       0.7904          0.8000
13     Disorders of lipid metabolism                                 0.7213      0.7055      0.7119       0.7217          0.7280
14     Essential hypertension                                        0.6779      0.6591      0.6650       0.6757          0.6825
15     Fluid and electrolyte disorders                               0.7405      0.7351      0.7301       0.7377          0.7328
16     Gastrointestinal hemorrhage                                   0.7413      0.7364      0.7309       0.7386          0.7343
17     Hypertension with complications and secondary hypertension    0.7600      0.7606      0.7700       0.7792          0.7871
18     Other liver diseases                                          0.7659      0.7358      0.7332       0.7573          0.7530
19     Other lower respiratory disease                               0.6880      0.6847      0.6897       0.6896          0.6922
20     Other upper respiratory disease                               0.7599      0.7515      0.7565       0.7595          0.7530
21     Pleurisy; pneumothorax; pulmonary collapse                    0.7027      0.6900      0.6882       0.6909          0.6997
22     Pneumonia                                                     0.8082      0.7857      0.7916       0.7890          0.7943
23     Respiratory failure; insufficiency; arrest (adult)            0.9015      0.8815      0.8856       0.8834          0.8876
24     Septicemia (except in labor)                                  0.8426      0.8276      0.8140       0.8296          0.8165
25     Shock                                                         0.8760      0.8764      0.8564       0.8763          0.8562

Figure 3: Feature relevance scores for the 25 phenotypes. Refer to Table 2 for the names of the phenotypes, and to Table 3 for the names of the raw features. [Heatmap omitted.]
Table 3: List of raw input features (feature names as in the benchmark setup).

1   Glucose
2   Glascow coma scale total → 7
3   Glascow coma scale verbal response → Incomprehensible sounds
4   Diastolic blood pressure
5   Weight
6   Glascow coma scale total → 8
7   Glascow coma scale motor response → Obeys Commands
8   Glascow coma scale eye opening → None
9   Glascow coma scale eye opening → To Pain
10  Glascow coma scale total → 6
11  Glascow coma scale verbal response → 1.0 ET/Trach
12  Glascow coma scale total → 5
13  Glascow coma scale verbal response → 5 Oriented
14  Glascow coma scale total → 3
15  Glascow coma scale verbal response → No Response
16  Glascow coma scale motor response → 3 Abnorm flexion
17  Glascow coma scale verbal response → 3 Inapprop words
18  Capillary refill rate → 1.0
19  Glascow coma scale verbal response → Inappropriate Words
20  Systolic blood pressure
21  Glascow coma scale motor response → Flex-withdraws
22  Glascow coma scale total → 10
23  Glascow coma scale motor response → Obeys Commands
24  Glascow coma scale verbal response → No Response-ETT
25  Glascow coma scale eye opening → 2 To pain
26  Heart Rate
27  Respiratory rate
28  Glascow coma scale verbal response → Oriented
29  Glascow coma scale motor response → Localizes Pain
30  Temperature
31  Glascow coma scale eye opening → 3 To speech
32  Height
33  Glascow coma scale motor response → 5 Localizes Pain
34  Glascow coma scale total → 14
35  Fraction inspired oxygen
36  Glascow coma scale total → 12
37  Glascow coma scale verbal response → Confused
38  Glascow coma scale motor response → 1 No Response
39  Mean blood pressure
40  Glascow coma scale total → 4
41  Glascow coma scale eye opening → To Speech
42  Glascow coma scale total → 15
43  Glascow coma scale motor response → 4 Flex-withdraws
44  Glascow coma scale motor response → No response
45  Glascow coma scale eye opening → Spontaneously
46  Glascow coma scale verbal response → 4 Confused
47  Capillary refill rate → 0.0
48  Glascow coma scale total → 13
49  Glascow coma scale eye opening → 1 No Response
50  Glascow coma scale motor response → Abnormal extension
51  Glascow coma scale total → 11
52  Glascow coma scale verbal response → 2 Incomp sounds
53  Glascow coma scale total → 9
54  Glascow coma scale motor response → Abnormal Flexion
55  Glascow coma scale verbal response → 1 No Response
56  Glascow coma scale motor response → 2 Abnorm extensn
57  pH
58  Glascow coma scale eye opening → 4 Spontaneously
59  Oxygen saturation