Dilated Recurrent Neural Network for Short-Time Prediction of Glucose Concentration

Jianwei Chen∗, Kezhi Li∗, Pau Herrero, Taiyu Zhu, Pantelis Georgiou
Department of Electronic and Electrical Engineering, Imperial College London, London SW7 2AZ, UK
jianwei.chen17@imperial.ac.uk, kezhi.li@imperial.ac.uk

∗ This work is submitted to the Blood Glucose Level Prediction Challenge, the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence (IJCAI-ECAI 2018), International Workshop on Knowledge Discovery in Healthcare Data.
† This work is supported by EPSRC, the ARISES project. J. Chen and K. Li are the main contributors to the paper.

Abstract

Diabetes is a disease affecting 415 million people in the world. Developing a robust blood glucose (BG) prediction model is especially important for diabetes management: subjects with diabetes need to adjust their insulin doses according to their blood glucose levels in order to keep blood glucose within a target range. An accurate glucose level prediction provides subjects with diabetes with their future glucose levels, so that proper actions can be taken to avoid short-term dangerous consequences and long-term complications. With the development of continuous glucose monitoring (CGM) systems, the accuracy of glucose level prediction can be improved using machine learning techniques. In this paper, a new deep learning technique based on the Dilated Recurrent Neural Network (DRNN) model is proposed to predict future glucose levels for a prediction horizon (PH) of 30 minutes, and the method can also be implemented for real-time prediction. The results show that the dilated connections in the RNN significantly improve the accuracy of short-time glucose prediction (RMSE = 19.04 in the blood glucose level prediction (BGLP) challenge, evaluated on and only on all the data points provided).

1 Introduction

The prediction of BG levels has always been a challenge because of the difficulty of modelling its nonlinearity and of accounting for the effect of different life events. Machine learning (ML) offers a new approach to modelling BG levels compared with traditional approaches, such as the AR model [Sparacino et al., 2007], the ARMA model [Sparacino et al., 2008; Eren-Oruklu et al., 2009], their extrapolation algorithmic derivatives [Eren-Oruklu et al., 2009; Gani et al., 2009], and methods based on neural networks [Zecchin et al., 2012].

In particular, the blood glucose level prediction (BGLP) challenge provides a platform for artificial intelligence (AI) researchers to evaluate the performance of different types of ML approaches on real data. The OhioT1DM dataset provided by the BGLP challenge records eight weeks of CGM data, together with the corresponding daily events, from six type 1 diabetes patients, referred to by the IDs 559, 563, 570, 575, 588 and 591 [Marling and Bunescu, 2018]. The CGM data are sampled every 5 minutes. Since BGLP can be regarded as a time series prediction problem, the natural structure of recurrent neural networks (RNNs) gives remarkable performance in the prediction of BG levels [Alanis et al., 2011]. Moreover, BG levels are affected by different daily events, such as insulin injections, meals and exercise, and different types of events may have different temporal effects on the change of BG levels. The solution is therefore inspired by the recent work of [Shiyu Chang and Huang, 2017], which shows that a Dilated RNN (DRNN) with multi-resolution dilated recurrent skip connections allows the network to learn different temporal dependencies at different layers. Lastly, before the data are fed into the model, interpolation, extrapolation and filtering techniques are applied in order to fill the missing data in the training and testing sets and to remove potential noise. Please note that, in the testing dataset, future glucose data points are not used in the extrapolation, so the algorithm is also useful in real-time applications.
2 Data Processing

2.1 Preprocessing

Firstly, according to [Zecchin et al., 2012], the accuracy of prediction based on neural networks can be improved by exploiting information on meals. Thus, each input batch of the DRNN model consists of the past 1 hour (12 data points) of CGM readings, insulin doses, carbohydrate intake and a time index. The CGM readings, insulin and carbohydrate intake correspond to the fields 'glucose_level', 'bolus' and 'bwz_carb_input', respectively, in the OhioT1DM dataset. The time index represents the position of each CGM data point within a day. Other fields in the dataset have also been tried in the experiments, such as exercise, heart rate and skin temperature; they do not have a significant effect on the accuracy of the model, but increase its variance. It is worth noting that, for some insulin and carbohydrate intake records, the timestamps cannot be matched exactly to the timestamps in the CGM data. Such records are associated with the CGM timestamp with the smallest time difference.

Secondly, the output of the model is the difference between the 6th future point and the 12th point of the CGM data in the input batch, which corresponds to the BG change for PH = 30 minutes.

Lastly, in order to improve the performance of the DRNN model, first-order linear interpolation and first-order extrapolation are applied to the training and testing sets, respectively, and a median filter is used only on the training set. These techniques are explained in detail in the following sections. Since the data of subjects 591 and 575 contain a considerable amount of missing CGM data, a combination of training data from all patients, in different proportions, is used in the training process. The idea comes from the transfer learning technique in machine learning, and the results obtained during the experiments show that it improves the model performance.
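To make the batch construction concrete, the following is a minimal NumPy sketch of the preprocessing just described. The helper names (`align_events`, `build_windows`) and the exact array layout are illustrative assumptions rather than the paper's code; only the windowing rule (12 past points; target = 6th future point minus the 12th input point) comes from the text.

```python
import numpy as np

def align_events(event_times, event_values, cgm_times):
    """Attach each insulin/carb record to the CGM timestamp with the
    smallest time difference, as described above (summing collisions
    into the same slot is an assumption)."""
    aligned = np.zeros(len(cgm_times))
    for ts, val in zip(event_times, event_values):
        aligned[np.argmin(np.abs(np.asarray(cgm_times) - ts))] += val
    return aligned

def build_windows(cgm, insulin, carbs, time_idx, history=12, horizon=6):
    """Pair each 1-hour window (12 samples at 5 min) with the BG change
    at PH = 30 min: the 6th future point minus the last window point."""
    X, y = [], []
    for t in range(history, len(cgm) - horizon + 1):
        features = np.stack([cgm[t - history:t],
                             insulin[t - history:t],
                             carbs[t - history:t],
                             time_idx[t - history:t]], axis=-1)  # (12, 4)
        X.append(features)
        y.append(cgm[t + horizon - 1] - cgm[t - 1])  # 30-min BG change
    return np.asarray(X), np.asarray(y)
```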
2.2 Interpolation and Extrapolation

There is a large amount of missing CGM data for the different patients, in both the training and the testing sets. Without interpolation or extrapolation, the missing data cause discontinuities in the CGM curve, and the fake peaks created by these discontinuities highly degrade the performance of the model. Thus, first-order linear interpolation and a first-order extrapolation algorithm are applied to the training set and the testing set, respectively. Based on the experimental results, the performance of first-order interpolation and first-order extrapolation is similar on the testing set. Since the extrapolation technique does not use information from future values to fill the missing data, it is the one applied to the testing set, which enables real-time prediction. Different interpolation algorithms have been tested, such as cubic interpolation, but first-order linear interpolation provides the best results for the given data.

Figure 1 shows an example of the linear interpolation; the zero values in the original CGM data represent missing data. However, the missing data are discarded when the missing time interval is significantly large. The purpose of this step is to prevent the model from attempting to learn the interpolated segments instead of the real trend of the CGM data. The data before a long missing interval are connected to the next nearest value in order to prevent any fake peaks. Furthermore, the insulin and carbohydrate intake within a missing CGM interval are set to zero.

Figure 1: CGM data after linear interpolation.

2.3 Median Filtering

A median filter is employed, only on the training set, after the interpolation step, in order to remove part of the fast variations and some small spikes in the linear regions of the curve, which are likely to be noise in the data. As a result, the curve becomes smoother and the trend in the CGM data becomes more obvious, as shown in Figure 2. However, the length of the filter window needs to be set carefully, otherwise the data will be ruined and the model cannot properly learn from them. The window size is set to 5 after comparing the results for different window sizes.

Figure 2: CGM data after linear interpolation and filtering.
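The sketch below shows one way to implement the gap filling and smoothing of Sections 2.2 and 2.3, assuming missing samples are marked as NaN. The paper does not spell out its extrapolation formula; the last-trend rule in `fill_test` is one plausible reading of "first-order extrapolation" and should be treated as an assumption.

```python
import numpy as np
from scipy.signal import medfilt

def fill_train(cgm):
    """Training set: first-order (linear) interpolation across gaps.
    Missing CGM samples are assumed to be marked as NaN."""
    idx = np.arange(len(cgm))
    known = ~np.isnan(cgm)
    return np.interp(idx, idx[known], cgm[known])

def fill_test(cgm):
    """Testing set: fill gaps forward using only past samples, so no
    future information is used (real-time compatible). Assumes the
    series does not begin with a gap."""
    out = np.asarray(cgm, dtype=float).copy()
    for t in range(1, len(out)):
        if np.isnan(out[t]):
            # extend the most recent linear trend (assumed formula)
            slope = out[t - 1] - out[t - 2] if t >= 2 else 0.0
            out[t] = out[t - 1] + slope
    return out

# Toy usage: interpolate, then smooth the training curve with the
# window-5 median filter chosen in Section 2.3.
train_cgm = np.array([120.0, 122.0, np.nan, 128.0, 131.0, np.nan, np.nan, 138.0])
smoothed = medfilt(fill_train(train_cgm), kernel_size=5)
```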
2.4 Using Data from Multiple Subjects

For subject 575 there are 72 missing gaps, and many of them are significantly large. Since the large gaps are discarded, as discussed in the previous section, the training data of subject 575 are not long enough for the model to learn with ML techniques, and could also easily lead to overfitting. Therefore, a mixture of several patients' training sets is introduced: it enlarges the training set by combining the data from different patients with different contributions, and the generalization of the model can thereby be improved. This idea comes from the transfer learning technique, popular in deep learning, which makes use of other related datasets to train the target neural network. In this work we first train the model using 50% of the target subject's data plus 10% of the other subjects' data, and then train the final model on the whole training set of the target subject.

For example, for subject 575, 50% of its training data were used in the first phase, together with different proportions of the training data from the other subjects (normally 10% of the training data from each of them). In the second phase, all of the training data for subject 575 were used to train the final model. By using this transfer learning technique, the RMSE for subject 575 is decreased further, by about 0.7, compared with the result obtained using only its own data. Moreover, it is found that this approach can also be applied to subject 591 to improve the result.

2.5 Evaluation Metric

The overall performance of the model is evaluated by the root-mean-square error (RMSE) between the prediction curve and the original testing curve. Since the output of the model is the change of BG after 30 minutes, the prediction curve first has to be reconstructed from the model output. The RMSE is computed as in (1),

RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 }    (1)

where \hat{y} is the predicted value, y is the original value and N is the total number of points. However, since interpolation/extrapolation is applied to both the training and the testing data, the imputed values must be removed when evaluating the RMSE in the testing phase; this guarantees that the prediction curve is compared with the original test data of the same length. The total number of testing points for each patient is summarized in Table 1.

Patients    Total Testing Points
591         2760
570         2745
563         2570
559         2514
588         2791
575         2590

Table 1: Total testing points for each patient.
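As a concrete reading of this evaluation rule, the sketch below computes the RMSE of (1) only over the points that were present in the original test data; the boolean `imputed` mask and the variable names are illustrative.

```python
import numpy as np

def masked_rmse(y_pred, y_true, imputed):
    """RMSE of Eq. (1) restricted to original test points: `imputed` is
    True where a test value was filled by extrapolation, and those
    points are excluded, as required in Section 2.5."""
    keep = ~np.asarray(imputed)
    err = np.asarray(y_pred)[keep] - np.asarray(y_true)[keep]
    return np.sqrt(np.mean(err ** 2))

# The network predicts 30-minute BG *changes*; the prediction curve is
# reconstructed by adding each change to the last point of its input
# window before the RMSE is computed (names here are illustrative):
# y_pred = last_window_points + predicted_deltas
```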
3 DRNN Model

With the processed training and testing data, the remaining work is to design the structure of the DRNN model and to tune the hyperparameters to obtain the best results. In this section, the DRNN model is briefly introduced, the training and testing phases are explained, and the effect of different hyperparameters is investigated. In terms of software implementation, the model is built on TensorFlow 1.6.0 and runs under Python 3.6.4.

3.1 Model Structure

The DRNN model is characterised by its multi-resolution dilated recurrent skip connections, as shown in Figure 3. The cell state of layer l at time t, c_t^{(l)}, depends on the current input sample x_t^{(l)} and on the cell state c_{t-s}^{(l)}, as summarized in (2),

c_t^{(l)} = f( x_t^{(l)}, c_{t-s}^{(l)} ),    (2)

where x_t^{(l)} is the input to layer l at time t, s is the dilation, and f represents the output function of the chosen type of RNN cell, namely vanilla RNN, long short-term memory (LSTM) or gated recurrent unit (GRU). The multi-resolution dilated recurrent skip connections enable the model to capture information at different temporal dependencies and alleviate the vanishing gradient problem; the dilation is usually set to increase exponentially [Shiyu Chang and Huang, 2017]. Therefore, the DRNN provides a powerful approach for processing long sequences. In this project, a 3-layer DRNN model is used, with 32 cells in each layer. Since the dilation is recommended to increase exponentially, dilations of 1, 2 and 4 are implemented for the three layers, from bottom to top.

Figure 3: Structure of the DRNN model [Shiyu Chang and Huang, 2017].

3.2 Hyperparameters

Three different RNN cells were experimented with, and the results show that the vanilla RNN cell achieves better results than the LSTM and GRU cells. Moreover, the training and testing times when using LSTM and GRU cells are significantly larger than with the vanilla RNN cell, because the structures of the LSTM and GRU cells are much more complex. Therefore, by implementing the vanilla RNN cell, better results can be obtained efficiently.

The effect of the number of cells per layer and of the number of layers has also been investigated. It is found that the performance degrades as the number of cells and layers increases, because a larger model requires a relatively larger dataset to converge. The number of training data points for each patient is around 10,000, which is not sufficient to train a large model properly. Therefore, the relatively small model described above is found to give better performance.

3.3 Training and Testing

At each training epoch, the output of the model is used to reconstruct the prediction curve, and the RMSE is computed between the prediction curve and the original curve. The RMSProp optimizer [Ruder, 2017] with learning rate 0.001 is applied. A fixed batch size is set for all subjects; varying the batch size also affects the accuracy of the model, and in the experiments it is found that a larger batch size helps to improve the prediction results in terms of RMSE.

When running the algorithm, an evaluation step is performed every 500 epochs. This provides a convenient way to monitor the training process and follow the trend of the accuracy, so that an appropriate number of training epochs can be decided. Since the test data for each patient comprise around 2,000 points, the computational cost of the testing phase is relatively small. In this project, 4,000 to 5,000 epochs are used in the training process. It should be noted that, since the algorithm uses the past 12 points of data to predict the 6th point ahead (PH = 30), the last 17 points of the original training data are appended at the beginning of the test set; this guarantees that the length of the prediction curve equals the original length of the test data (this procedure has been approved by the BGLP challenge).
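To make Eq. (2) concrete, here is a minimal NumPy sketch of the dilated recurrence with the configuration used in this project (3 layers, 32 cells, dilations 1, 2 and 4) and a tanh vanilla-RNN transition. The paper's actual model is built in TensorFlow 1.6.0; the weight shapes, initialization and function names below are illustrative assumptions. A training loop with RMSProp (learning rate 0.001) would wrap this forward pass.

```python
import numpy as np

def drnn_forward(x, weights, dilations=(1, 2, 4)):
    """Dilated vanilla-RNN stack per Eq. (2): at layer l with dilation s,
    c_t = tanh(W x_t + U c_{t-s} + b), i.e. the recurrent connection
    skips back s steps instead of one."""
    h = x
    for (W, U, b), s in zip(weights, dilations):
        T = h.shape[0]
        out = np.zeros((T, U.shape[0]))
        for t in range(T):
            c_prev = out[t - s] if t >= s else np.zeros(U.shape[0])
            out[t] = np.tanh(h[t] @ W + c_prev @ U + b)
        h = out  # each layer feeds the next, coarser-dilation layer
    return h  # (T, 32); a linear head on h[-1] would predict the BG change

# Toy usage: 12 time steps, 4 features (CGM, insulin, carbs, time index).
rng = np.random.default_rng(0)
dims = [4, 32, 32]  # input size seen by each of the 3 layers
weights = [(rng.normal(0, 0.1, (dims[l], 32)),   # W: input-to-hidden
            rng.normal(0, 0.1, (32, 32)),        # U: dilated recurrent
            np.zeros(32))                        # b: bias
           for l in range(3)]
hidden = drnn_forward(rng.normal(size=(12, 4)), weights)
```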
4 Experimental Results

With the data processed as described in Section 2 and the model built as described in Section 3, the RMSE results on the test data are summarized in Table 2, where SD denotes the standard deviation. The best RMSE and average RMSE results are both based on 10 simulation runs.

Patients     Best RMSE    Average RMSE ± SD
591          21.1392      21.3354 ± 0.0981
570          15.2995      15.4638 ± 0.0907
563          18.0470      18.1153 ± 0.0400
559          18.6158      18.7769 ± 0.0672
588          17.6275      17.7270 ± 0.0701
575          22.7104      22.8316 ± 0.0941
Mean RMSE    18.9066      19.0417

Table 2: RMSE results of the DRNN model.

As can be seen from Table 2, the RMSE varies between about 15 and 22 across subjects. The best RMSE is obtained for subject 570, while the RMSE is relatively large for subjects 591 and 575. There are two reasons. Firstly, the training data of subject 570 contain relatively few missing points: there are 782 and 1309 missing data points for subjects 591 and 575 respectively, whereas subject 570 has 649. Secondly, from observing the training and testing curves of all subjects, the data of subject 570 contain fewer fluctuations. Large fluctuations and continuous peaks in both the training and testing datasets increase the difficulty of prediction and degrade the model's learning capability, as can be observed in Figure 5.

Figure 4 and Figure 5 show the prediction results for patients 570 and 575, corresponding to the best RMSE values shown in Table 2. As one can see, the test data of 575 fluctuate much more than those of 570, especially in the second half of the curve.

Figure 4: Predicted and original test data of patient 570.
Figure 5: Predicted and original test data of patient 575.

More specifically, as shown in Figure 6, the relatively linear regions of the curve can be predicted with small error. However, the fast and continuous variations in the curve are almost impossible to predict, and they contribute a significant proportion of the error in terms of RMSE. Furthermore, a slight time delay in the prediction curve is observed, which is also a primary contributor to the error.

Figure 6: An example of prediction result.

5 Conclusion

This project aims to design an accurate short-time BG prediction model with PH = 30 minutes. The recent DRNN model has been exploited and applied: its multi-resolution dilated recurrent skip connections enable the network to learn different temporal dependencies in sequential data. The data processing techniques of first-order linear interpolation, median filtering and mixing of training sets have been investigated, and the results show the effectiveness of these methods. With the DRNN model and the data processing techniques, the performance of the whole algorithm is evaluated on the OhioT1DM dataset; the RMSE results vary from 15.299 to 22.710 across the different subjects with diabetes. More specifically, the missing data in the training and testing sets, together with the fast continuous fluctuations in the data, are the two main factors that degrade the accuracy of the model. In terms of improvement, there is still a large number of unused fields in the dataset; how to use these data properly and feed them into the model remains a challenge.

References

[Alanis et al., 2011] A. Y. Alanis, E. N. Sanchez, E. Ruiz-Velazquez, and B. S. Leon. Neural model of blood glucose level for type 1 diabetes mellitus patients. In The 2011 International Joint Conference on Neural Networks, pages 2018–2023, July 2011.

[Eren-Oruklu et al., 2009] Meriyan Eren-Oruklu, Ali Cinar, Lauretta Quinn, and Donald Smith. Estimation of future glucose concentrations with subject-specific recursive linear models. Diabetes Technology and Therapeutics, 11(4):243–253, Apr. 2009.

[Gani et al., 2009] A. Gani, A. V. Gribok, S. Rajaraman, W. K. Ward, and J. Reifman. Predicting subcutaneous glucose concentration in humans: Data-driven glucose modeling. IEEE Transactions on Biomedical Engineering, 56(2):246–254, Feb. 2009.

[Marling and Bunescu, 2018] Cindy Marling and Razvan Bunescu. The OhioT1DM dataset for blood glucose level prediction. 2018.

[Ruder, 2017] S. Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747v2, 2017.

[Shiyu Chang and Huang, 2017] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, and Thomas S. Huang. Dilated recurrent neural networks. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 2017.

[Sparacino et al., 2008] Giovanni Sparacino, Andrea Facchinetti, Alberto Maran, and Claudio Cobelli. Continuous glucose monitoring time series and hypo/hyperglycemia prevention: Requirements, methods, open problems. Current Diabetes Reviews, 4(3):181, 2008.

[Sparacino et al., 2007] G. Sparacino, F. Zanderigo, S. Corazza, A. Maran, A. Facchinetti, and C. Cobelli. Glucose concentration can be predicted ahead in time from continuous glucose monitoring sensor time-series. IEEE Transactions on Biomedical Engineering, 54(5):931–937, May 2007.

[Zecchin et al., 2012] C. Zecchin, A. Facchinetti, G. Sparacino, G. De Nicolao, and C. Cobelli. Neural network incorporating meal information improves accuracy of short-time prediction of glucose concentration. IEEE Transactions on Biomedical Engineering, 59(6):1550–1560, Jun. 2012.