Dilated Recurrent Neural Network for Short-Time Prediction of Glucose Concentration

Jianwei Chen∗, Kezhi Li∗, Pau Herrero, Taiyu Zhu, Pantelis Georgiou
Department of Electronic and Electrical Engineering, Imperial College London, London SW7 2AZ, UK
jianwei.chen17@imperial.ac.uk, kezhi.li@imperial.ac.uk

∗ This work is submitted to the Blood Glucose Level Prediction Challenge, the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence (IJCAI-ECAI 2018), International Workshop on Knowledge Discovery in Healthcare Data.
† This work is supported by EPSRC, the ARISES project. J. Chen and K. Li are the main contributors to the paper.

Abstract

Diabetes is a disease affecting 415 million people in the world. Developing a robust blood glucose (BG) prediction model is especially important for diabetes management: subjects with diabetes need to adjust their insulin doses according to their blood glucose levels in order to keep blood glucose within a target range. An accurate glucose level prediction provides subjects with diabetes with their future glucose levels, so that proper actions can be taken to avoid short-term dangerous consequences and long-term complications. With the development of continuous glucose monitoring (CGM) systems, the accuracy of glucose level prediction can be improved using machine learning techniques. In this paper, a new deep learning technique based on the Dilated Recurrent Neural Network (DRNN) model is proposed to predict future glucose levels for a prediction horizon (PH) of 30 minutes, and the method can also be implemented for real-time prediction. The results show that the dilated connections in the RNN significantly improve the accuracy of short-time glucose prediction (RMSE = 19.04 in the blood glucose level prediction (BGLP) challenge, evaluated on and only on all the data points provided).

1 Introduction

The prediction of BG levels has always been a challenge because of the difficulty of modelling its nonlinearity and of accounting for the effect of different life events. Machine learning (ML) offers a new approach to modelling BG levels compared with traditional approaches, such as the AR model [Sparacino et al., 2007], the ARMA model [Sparacino et al., 2008; Eren-Oruklu et al., 2009], their extrapolation algorithmic derivatives [Eren-Oruklu et al., 2009; Gani et al., 2009], and methods based on neural networks [Zecchin et al., 2012].

In particular, the blood glucose level prediction (BGLP) challenge provides a platform for artificial intelligence (AI) researchers to evaluate the performance of different types of ML approaches on real data. The OhioT1DM dataset provided by the BGLP challenge records eight weeks of CGM data, together with the corresponding daily events, from six type 1 diabetes patients, referred to by the IDs 559, 563, 570, 575, 588 and 591 [Marling and Bunescu, 2018]. The CGM data are sampled every 5 minutes. Since BGLP can be regarded as a time series prediction problem, the natural structure of recurrent neural networks (RNNs) gives remarkable performance in the prediction of BG levels [Alanis et al., 2011]. Moreover, BG levels are affected by different daily events, such as insulin injections, meals and exercise, and different types of events may have different temporal effects on the change of BG levels. The solution is therefore inspired by the recent work of [Shiyu Chang and Huang, 2017], which shows that a Dilated RNN (DRNN) with multi-resolution dilated recurrent skip connections allows the network to learn different temporal dependencies at different layers. Lastly, before the data are fed into the model, interpolation, extrapolation and filtering techniques are applied in order to fill the missing data in the training and testing sets and to remove potential noise. Please note that, in the testing dataset, future glucose data points are not used in the extrapolation, so the algorithm is also useful in real-time applications.
2 Data Processing

2.1 Preprocessing

Firstly, according to [Zecchin et al., 2012], the accuracy of prediction based on neural networks can be improved by exploiting information on meals. Thus, each input batch of the DRNN model consists of the past 1 hour (12 data points) of CGM readings, insulin doses, carbohydrate intake and a time index. The CGM readings, insulin and carbohydrate intake correspond to the fields 'glucose_level', 'bolus' and 'bwz_carb_input', respectively, in the OhioT1DM dataset. The time index represents the position of each CGM data point within a day. Other fields in the dataset have also been tried in the experiments, such as exercise, heart rate and skin temperature; they do not have a significant effect on the accuracy of the model, but increase its variance. It is worth noting that, for some insulin and carbohydrate intake records, the timestamps cannot be matched exactly to the timestamps in the CGM data. Such records are associated with the CGM timestamp with the smallest time difference.

Secondly, the output of the model is the difference between the 6th future point and the 12th point of the CGM data in the input batch, which corresponds to the BG change for PH = 30 minutes.

Lastly, in order to improve the performance of the DRNN model, first-order linear interpolation and first-order extrapolation are applied to the training and testing sets, respectively, and a median filter is used only on the training set. These techniques are explained in detail in the following sections. Since the data of subjects 591 and 575 contain a considerable amount of missing CGM data, a combination of training data from all patients, in different proportions, is used in the training process. The idea comes from the transfer learning technique in machine learning, and the results obtained during the experiments show that it improves the model performance.
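To make the batch construction concrete, the following is a minimal NumPy sketch of the preprocessing just described. The helper names (`align_events`, `build_windows`) and the exact array layout are illustrative assumptions rather than the paper's code; only the windowing rule (12 past points; target = 6th future point minus the 12th input point) comes from the text.

```python
import numpy as np

def align_events(event_times, event_values, cgm_times):
    """Attach each insulin/carb record to the CGM timestamp with the
    smallest time difference, as described above (summing collisions
    into the same slot is an assumption)."""
    aligned = np.zeros(len(cgm_times))
    for ts, val in zip(event_times, event_values):
        aligned[np.argmin(np.abs(np.asarray(cgm_times) - ts))] += val
    return aligned

def build_windows(cgm, insulin, carbs, time_idx, history=12, horizon=6):
    """Pair each 1-hour window (12 samples at 5 min) with the BG change
    at PH = 30 min: the 6th future point minus the last window point."""
    X, y = [], []
    for t in range(history, len(cgm) - horizon + 1):
        features = np.stack([cgm[t - history:t],
                             insulin[t - history:t],
                             carbs[t - history:t],
                             time_idx[t - history:t]], axis=-1)  # (12, 4)
        X.append(features)
        y.append(cgm[t + horizon - 1] - cgm[t - 1])  # 30-min BG change
    return np.asarray(X), np.asarray(y)
```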
2.2 Interpolation and Extrapolation

There is a large amount of missing CGM data for the different patients, in both the training and the testing sets. Without interpolation or extrapolation, the missing data cause discontinuities in the CGM curve, and the fake peaks created by these discontinuities highly degrade the performance of the model. Thus, first-order linear interpolation and a first-order extrapolation algorithm are applied to the training set and the testing set, respectively. Based on the experimental results, the performance of first-order interpolation and first-order extrapolation is similar on the testing set. Since the extrapolation technique does not use information from future values to fill the missing data, it is the one applied to the testing set, which enables real-time prediction. Different interpolation algorithms have been tested, such as cubic interpolation, but first-order linear interpolation provides the best results for the given data.

Figure 1 shows an example of the linear interpolation; the zero values in the original CGM data represent missing data. However, the missing data are discarded when the missing time interval is significantly large. The purpose of this step is to prevent the model from attempting to learn the interpolated segments instead of the real trend of the CGM data. The data before a long missing interval are connected to the next nearest value in order to prevent any fake peaks. Furthermore, the insulin and carbohydrate intake within a missing CGM interval are set to zero.

Figure 1: CGM data after linear interpolation.

2.3 Median Filtering

A median filter is employed, only on the training set, after the interpolation step, in order to remove part of the fast variations and some small spikes in the linear regions of the curve, which are likely to be noise in the data. As a result, the curve becomes smoother and the trend in the CGM data becomes more obvious, as shown in Figure 2. However, the length of the filter window needs to be set carefully, otherwise the data will be ruined and the model cannot properly learn from them. The window size is set to 5 after comparing the results for different window sizes.

Figure 2: CGM data after linear interpolation and filtering.
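The sketch below shows one way to implement the gap filling and smoothing of Sections 2.2 and 2.3, assuming missing samples are marked as NaN. The paper does not spell out its extrapolation formula; the last-trend rule in `fill_test` is one plausible reading of "first-order extrapolation" and should be treated as an assumption.

```python
import numpy as np
from scipy.signal import medfilt

def fill_train(cgm):
    """Training set: first-order (linear) interpolation across gaps.
    Missing CGM samples are assumed to be marked as NaN."""
    idx = np.arange(len(cgm))
    known = ~np.isnan(cgm)
    return np.interp(idx, idx[known], cgm[known])

def fill_test(cgm):
    """Testing set: fill gaps forward using only past samples, so no
    future information is used (real-time compatible). Assumes the
    series does not begin with a gap."""
    out = np.asarray(cgm, dtype=float).copy()
    for t in range(1, len(out)):
        if np.isnan(out[t]):
            # extend the most recent linear trend (assumed formula)
            slope = out[t - 1] - out[t - 2] if t >= 2 else 0.0
            out[t] = out[t - 1] + slope
    return out

# Toy usage: interpolate, then smooth the training curve with the
# window-5 median filter chosen in Section 2.3.
train_cgm = np.array([120.0, 122.0, np.nan, 128.0, 131.0, np.nan, np.nan, 138.0])
smoothed = medfilt(fill_train(train_cgm), kernel_size=5)
```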
2.4 Using Data from Multiple Subjects

For subject 575 there are 72 missing gaps, and many of them are significantly large. Since the large gaps are discarded, as discussed in the previous section, the training data of subject 575 are not long enough for the model to learn with ML techniques, and could also easily lead to overfitting. Therefore, a mixture of several patients' training sets is introduced: it enlarges the training set by combining the data from different patients with different contributions, and the generalization of the model can thereby be improved. This idea comes from the transfer learning technique, popular in deep learning, which makes use of other related datasets to train the target neural network. In this work we first train the model using 50% of the target subject's data plus 10% of the other subjects' data, and then train the final model on the whole training set of the target subject.

For example, for subject 575, 50% of its training data were used in the first phase, together with different proportions of the training data from the other subjects (normally 10% of the training data from each of them). In the second phase, all of the training data for subject 575 were used to train the final model. By using this transfer learning technique, the RMSE for subject 575 is decreased further, by about 0.7, compared with the result obtained using only its own data. Moreover, it is found that this approach can also be applied to subject 591 to improve the result.

2.5 Evaluation Metric

The overall performance of the model is evaluated by the root-mean-square error (RMSE) between the prediction curve and the original testing curve. Since the output of the model is the change of BG after 30 minutes, the prediction curve first has to be reconstructed from the model output. The RMSE is computed as in (1),

RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 }    (1)

where \hat{y} is the predicted value, y is the original value and N is the total number of points. However, since interpolation/extrapolation is applied to both the training and the testing data, the imputed values must be removed when evaluating the RMSE in the testing phase; this guarantees that the prediction curve is compared with the original test data of the same length. The total number of testing points for each patient is summarized in Table 1.

Patients    Total Testing Points
591         2760
570         2745
563         2570
559         2514
588         2791
575         2590

Table 1: Total testing points for each patient.
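As a concrete reading of this evaluation rule, the sketch below computes the RMSE of (1) only over the points that were present in the original test data; the boolean `imputed` mask and the variable names are illustrative.

```python
import numpy as np

def masked_rmse(y_pred, y_true, imputed):
    """RMSE of Eq. (1) restricted to original test points: `imputed` is
    True where a test value was filled by extrapolation, and those
    points are excluded, as required in Section 2.5."""
    keep = ~np.asarray(imputed)
    err = np.asarray(y_pred)[keep] - np.asarray(y_true)[keep]
    return np.sqrt(np.mean(err ** 2))

# The network predicts 30-minute BG *changes*; the prediction curve is
# reconstructed by adding each change to the last point of its input
# window before the RMSE is computed (names here are illustrative):
# y_pred = last_window_points + predicted_deltas
```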
3 DRNN Model

With the processed training and testing data, the remaining work is to design the structure of the DRNN model and to tune the hyperparameters to obtain the best results. In this section, the DRNN model is briefly introduced, the training and testing phases are explained, and the effect of different hyperparameters is investigated. In terms of software implementation, the model is built on TensorFlow 1.6.0 and runs under Python 3.6.4.

3.1 Model Structure

The DRNN model is characterised by its multi-resolution dilated recurrent skip connections, as shown in Figure 3. The cell state of layer l at time t, c_t^{(l)}, depends on the current input sample x_t^{(l)} and on the cell state c_{t-s}^{(l)}, as summarized in (2),

c_t^{(l)} = f( x_t^{(l)}, c_{t-s}^{(l)} ),    (2)

where x_t^{(l)} is the input to layer l at time t, s is the dilation, and f represents the output function of the chosen type of RNN cell, namely vanilla RNN, long short-term memory (LSTM) or gated recurrent unit (GRU). The multi-resolution dilated recurrent skip connections enable the model to capture information at different temporal dependencies and alleviate the vanishing gradient problem; the dilation is usually set to increase exponentially [Shiyu Chang and Huang, 2017]. Therefore, the DRNN provides a powerful approach for processing long sequences. In this project, a 3-layer DRNN model is used, with 32 cells in each layer. Since the dilation is recommended to increase exponentially, dilations of 1, 2 and 4 are implemented for the three layers, from bottom to top.

Figure 3: Structure of the DRNN model [Shiyu Chang and Huang, 2017].

3.2 Hyperparameters

Three different RNN cells were experimented with, and the results show that the vanilla RNN cell achieves better results than the LSTM and GRU cells. Moreover, the training and testing times when using LSTM and GRU cells are significantly larger than with the vanilla RNN cell, because the structures of the LSTM and GRU cells are much more complex. Therefore, by implementing the vanilla RNN cell, better results can be obtained efficiently.

The effect of the number of cells per layer and of the number of layers has also been investigated. It is found that the performance degrades as the number of cells and layers increases, because a larger model requires a relatively larger dataset to converge. The number of training data points for each patient is around 10,000, which is not sufficient to train a large model properly. Therefore, the relatively small model described above is found to give better performance.

3.3 Training and Testing

At each training epoch, the output of the model is used to reconstruct the prediction curve, and the RMSE is computed between the prediction curve and the original curve. The RMSProp optimizer [Ruder, 2017] with learning rate 0.001 is applied. A fixed batch size is set for all subjects; varying the batch size also affects the accuracy of the model, and in the experiments it is found that a larger batch size helps to improve the prediction results in terms of RMSE.

When running the algorithm, an evaluation step is performed every 500 epochs. This provides a convenient way to monitor the training process and follow the trend of the accuracy, so that an appropriate number of training epochs can be decided. Since the test data for each patient comprise around 2,000 points, the computational cost of the testing phase is relatively small. In this project, 4,000 to 5,000 epochs are used in the training process. It should be noted that, since the algorithm uses the past 12 points of data to predict the 6th point ahead (PH = 30), the last 17 points of the original training data are appended at the beginning of the test set; this guarantees that the length of the prediction curve equals the original length of the test data (this procedure has been approved by the BGLP challenge).
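To make Eq. (2) concrete, here is a minimal NumPy sketch of the dilated recurrence with the configuration used in this project (3 layers, 32 cells, dilations 1, 2 and 4) and a tanh vanilla-RNN transition. The paper's actual model is built in TensorFlow 1.6.0; the weight shapes, initialization and function names below are illustrative assumptions. A training loop with RMSProp (learning rate 0.001) would wrap this forward pass.

```python
import numpy as np

def drnn_forward(x, weights, dilations=(1, 2, 4)):
    """Dilated vanilla-RNN stack per Eq. (2): at layer l with dilation s,
    c_t = tanh(W x_t + U c_{t-s} + b), i.e. the recurrent connection
    skips back s steps instead of one."""
    h = x
    for (W, U, b), s in zip(weights, dilations):
        T = h.shape[0]
        out = np.zeros((T, U.shape[0]))
        for t in range(T):
            c_prev = out[t - s] if t >= s else np.zeros(U.shape[0])
            out[t] = np.tanh(h[t] @ W + c_prev @ U + b)
        h = out  # each layer feeds the next, coarser-dilation layer
    return h  # (T, 32); a linear head on h[-1] would predict the BG change

# Toy usage: 12 time steps, 4 features (CGM, insulin, carbs, time index).
rng = np.random.default_rng(0)
dims = [4, 32, 32]  # input size seen by each of the 3 layers
weights = [(rng.normal(0, 0.1, (dims[l], 32)),   # W: input-to-hidden
            rng.normal(0, 0.1, (32, 32)),        # U: dilated recurrent
            np.zeros(32))                        # b: bias
           for l in range(3)]
hidden = drnn_forward(rng.normal(size=(12, 4)), weights)
```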
4 Experimental Results

With the data processed as described in Section 2 and the model built as described in Section 3, the RMSE results on the test data are summarized in Table 2, where SD denotes the standard deviation. The best RMSE and average RMSE results are both based on 10 simulation runs.

Patients     Best RMSE    Average RMSE ± SD
591          21.1392      21.3354 ± 0.0981
570          15.2995      15.4638 ± 0.0907
563          18.0470      18.1153 ± 0.0400
559          18.6158      18.7769 ± 0.0672
588          17.6275      17.7270 ± 0.0701
575          22.7104      22.8316 ± 0.0941
Mean RMSE    18.9066      19.0417

Table 2: RMSE results of the DRNN model.

As can be seen from Table 2, the RMSE varies between about 15 and 22 across subjects. The best RMSE is obtained for subject 570, while the RMSE is relatively large for subjects 591 and 575. There are two reasons. Firstly, the training data of subject 570 contain relatively few missing points: there are 782 and 1309 missing data points for subjects 591 and 575 respectively, whereas subject 570 has 649. Secondly, from observing the training and testing curves of all subjects, the data of subject 570 contain fewer fluctuations. Large fluctuations and continuous peaks in both the training and testing datasets increase the difficulty of prediction and degrade the model's learning capability, as can be observed in Figure 5.

Figure 4 and Figure 5 show the prediction results for patients 570 and 575, corresponding to the best RMSE values shown in Table 2. As one can see, the test data of 575 fluctuate much more than those of 570, especially in the second half of the curve.

Figure 4: Predicted and original test data of patient 570.
Figure 5: Predicted and original test data of patient 575.

More specifically, as shown in Figure 6, the relatively linear regions of the curve can be predicted with small error. However, the fast and continuous variations in the curve are almost impossible to predict, and they contribute a significant proportion of the error in terms of RMSE. Furthermore, a slight time delay in the prediction curve is observed, which is also a primary contributor to the error.

Figure 6: An example of prediction result.

5 Conclusion

This project aims to design an accurate short-time BG prediction model with PH = 30 minutes. The recent DRNN model has been exploited and applied: its multi-resolution dilated recurrent skip connections enable the network to learn different temporal dependencies in sequential data. The data processing techniques of first-order linear interpolation, median filtering and mixing of training sets have been investigated, and the results show the effectiveness of these methods. With the DRNN model and the data processing techniques, the performance of the whole algorithm is evaluated on the OhioT1DM dataset; the RMSE results vary from 15.299 to 22.710 across the different subjects with diabetes. More specifically, the missing data in the training and testing sets, together with the fast continuous fluctuations in the data, are the two main factors that degrade the accuracy of the model. In terms of improvement, there is still a large number of unused fields in the dataset; how to use these data properly and feed them into the model remains a challenge.

References

[Alanis et al., 2011] A. Y. Alanis, E. N. Sanchez, E. Ruiz-Velazquez, and B. S. Leon. Neural model of blood glucose level for type 1 diabetes mellitus patients. In The 2011 International Joint Conference on Neural Networks, pages 2018–2023, July 2011.

[Eren-Oruklu et al., 2009] Meriyan Eren-Oruklu, Ali Cinar, Lauretta Quinn, and Donald Smith. Estimation of future glucose concentrations with subject-specific recursive linear models. Diabetes Technology and Therapeutics, 11(4):243–253, Apr. 2009.

[Gani et al., 2009] A. Gani, A. V. Gribok, S. Rajaraman, W. K. Ward, and J. Reifman. Predicting subcutaneous glucose concentration in humans: Data-driven glucose modeling. IEEE Transactions on Biomedical Engineering, 56(2):246–254, Feb. 2009.

[Marling and Bunescu, 2018] Cindy Marling and Razvan Bunescu. The OhioT1DM dataset for blood glucose level prediction. 2018.

[Ruder, 2017] S. Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747v2, 2017.

[Shiyu Chang and Huang, 2017] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, and Thomas S. Huang. Dilated recurrent neural networks. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 2017.

[Sparacino et al., 2008] Giovanni Sparacino, Andrea Facchinetti, Alberto Maran, and Claudio Cobelli. Continuous glucose monitoring time series and hypo/hyperglycemia prevention: Requirements, methods, open problems. Current Diabetes Reviews, 4(3):181, 2008.

[Sparacino et al., 2007] G. Sparacino, F. Zanderigo, S. Corazza, A. Maran, A. Facchinetti, and C. Cobelli. Glucose concentration can be predicted ahead in time from continuous glucose monitoring sensor time-series. IEEE Transactions on Biomedical Engineering, 54(5):931–937, May 2007.

[Zecchin et al., 2012] C. Zecchin, A. Facchinetti, G. Sparacino, G. De Nicolao, and C. Cobelli. Neural network incorporating meal information improves accuracy of short-time prediction of glucose concentration. IEEE Transactions on Biomedical Engineering, 59(6):1550–1560, Jun. 2012.