A Deep Learning Algorithm For Personalized Blood Glucose Prediction

Taiyu Zhu∗, Kezhi Li∗, Pau Herrero, Jianwei Chen, Pantelis Georgiou
Department of Electronic and Electrical Engineering, Imperial College London, London SW7 2AZ, UK
taiyu.zhu17@imperial.ac.uk, kezhi.li@imperial.ac.uk

∗ This work is submitted to the Blood Glucose Level Prediction Challenge, the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence (IJCAI-ECAI 2018), International Workshop on Knowledge Discovery in Healthcare Data.
† This work is supported by EPSRC, the ARISES project. T. Zhu and K. Li are the main contributors to the paper.

Abstract

A convolutional neural network (CNN) model is presented to forecast the future glucose levels of patients with type 1 diabetes. The model is a modified version of WaveNet, a recently proposed model that has proven very useful in acoustic signal processing. By recasting the task as a classification problem, the model is built mainly from causal dilated CNN layers and employs the fast WaveNet algorithm. The OhioT1DM dataset is the source of the four input fields: glucose levels, insulin events, carbohydrate intake and time index. The data is fed into the network along with targets describing the glucose change over 30 minutes. Several pre-processing approaches, such as interpolation, combination and filtering, are used to fill in the missing data in the training sets, and they improve the performance. Finally, we obtain predictions on the testing dataset and evaluate the results by the root mean squared error (RMSE). The mean value of the best RMSE over the six patients is 21.72.

1 Introduction

With an increasing incidence worldwide, type 1 diabetes is a severe chronic disease that requires long-term management of blood glucose, which relies on glucose predictions [Daneman, 2006]. Aiming to improve the accuracy of these predictions, artificial intelligence researchers have been investigating machine learning approaches to develop efficient forecasting models. In this paper, the main system is constructed from a deep convolutional neural network (CNN). It originates from WaveNet, a model first developed by DeepMind to process raw audio signals [Van Den Oord et al., 2016]. The glucose data of the patients are obtained sequentially by continuous glucose monitoring (CGM) every five minutes. The change between the current glucose value and the future glucose value is quantised into 256 target categories. Under this framing, the prediction problem is converted into a classification task, which can then be solved directly. After pre-processing the datasets and building the modified WaveNet model, prediction results for a 30-minute prediction horizon (PH) are obtained.
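To make the classification framing concrete, the following is a minimal sketch of how such a quantisation could be implemented. The function name `quantise_change`, the gap-free input and the clipping of out-of-range changes are our illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def quantise_change(cgm, horizon=6, n_classes=256):
    """Map the glucose change over `horizon` steps (30 min at 5-min
    sampling) to integer class labels in [0, n_classes), with 1 mg/dl
    per class, covering changes of roughly +/-128 mg/dl."""
    delta = cgm[horizon:] - cgm[:-horizon]  # future minus current value
    # Shift so that a change of -128 mg/dl maps to class 0, and clip
    # the rare changes that fall outside the covered range (assumption).
    labels = np.clip(np.round(delta) + n_classes // 2, 0, n_classes - 1)
    return labels.astype(np.int64)

cgm = np.array([120.0, 124.0, 131.0, 140.0, 152.0, 158.0, 161.0, 160.0])
print(quantise_change(cgm))  # a change of +41 mg/dl becomes class 169
```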
2 Data Pre-processing

2.1 Database

The training and testing data come from the OhioT1DM dataset developed by [Marling and Bunescu, 2018]. Six patients with type 1 diabetes wore Medtronic 530G insulin pumps and Medtronic Enlite CGM sensors to collect data over an 8-week period. Each patient also reported daily events via a smartphone app and wore a fitness band. In the OhioT1DM dataset, the patients are numbered 559, 563, 570, 575, 588 and 591. Two of them, 563 and 570, are male, while the others are female. Three of the nineteen data fields, namely previous CGM data ('glucose_level'), insulin value ('bolus') and carbohydrate intake ('bwz_carb_input'), together with a time index normalised to the unit interval for each day, are used as the four input channels. The meal information comes from the pump's Bolus Wizard (BWZ), into which patients enter carbohydrates to calculate the bolus. Other fields of the patient data, such as heart rate and skin temperature, were also added to the input batch; however, in our experiments these fields slightly degrade classification performance and add variance to the model.

2.2 Interpolation and Extrapolation

Inspection of the glucose data shows that values are missing over several intervals in both the training and testing sets. Since the targets of the CNN model rely on differences between current and future data points, these discontinuities can have a negative influence. We fill in the missing values of the training dataset by first-order interpolation. For the testing data, first-order extrapolation is used instead, to ensure that no future values are involved. The predictions over extrapolated intervals are then discarded, which guarantees that the result has the same length as the CGM testing data when evaluating performance. (A sketch of both steps is given after Figure 1 below.)

2.3 Combination

The training dataset contains data from six patients over roughly 40 days, about 115,200 CGM data points in total. An effective machine learning model usually needs a much larger training set, and missing intervals appear frequently throughout the dataset. In our model, data points around large missing gaps are discarded, and we interpolate values only across short intervals. To lengthen the training set, avoid overfitting and improve generalisation, we expand each subject's training data: we take the longest continuous segments from the other subjects and combine them with the current subject's data to form an extended training set. Notably, this strategy keeps half of the training data from the current subject, with the other five patients contributing 10% each. The segments from the other patients are selected, by inspection, as those with the fewest missing values. This method significantly improves the performance for patient 591, whose training set has a long missing interval (967 points).

2.4 Filtering

After the interpolation and combination, we find many small spikes near the peaks and turning points of the CGM training data. These spikes contribute variance when the batches are used to train the model. To remove this variance, we apply a median filter to suppress the noise, at the cost of slightly raising the bias. The window size needs to be chosen appropriately; it is five points in this work. The median filter is not applied to the testing data, so on-line prediction remains feasible. The outcome of the data pre-processing is shown in Figure 1.

[Figure 1: A part of the CGM training set after pre-processing, showing the original data, the linear interpolation and the median-filter output (glucose level versus time index).]
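As promised in Section 2.2, here is a minimal numpy sketch of the two gap-filling steps, assuming missing CGM samples are marked as NaN; the helper names are hypothetical, and the paper does not specify its exact implementation.

```python
import numpy as np

def fill_training_gaps(cgm):
    """First-order (linear) interpolation across NaN gaps in a training
    series, using known samples on both sides of each gap."""
    t = np.arange(len(cgm))
    ok = ~np.isnan(cgm)
    return np.interp(t, t[ok], cgm[ok])

def extrapolate_test_gaps(cgm):
    """First-order extrapolation for a test series: each NaN is filled
    by continuing the slope of the two preceding samples, so no future
    values are used. Assumes the first two samples are present; the
    predictions over these spans are later discarded, as in Sec. 2.2."""
    out = cgm.copy()
    for i in range(2, len(out)):
        if np.isnan(out[i]):
            out[i] = 2 * out[i - 1] - out[i - 2]  # continue the last slope
    return out
```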
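The five-point median filter of Section 2.4 is available off the shelf in scipy; a small demonstration on made-up values:

```python
import numpy as np
from scipy.signal import medfilt

train_cgm = np.array([118.0, 120.0, 155.0, 123.0, 126.0, 128.0, 131.0])
# A five-point median filter removes isolated spikes (here the 155.0)
# at the cost of a slight increase in bias; it is applied to the
# training series only, so online prediction remains feasible.
# Note that scipy's medfilt zero-pads at the boundaries.
smoothed = medfilt(train_cgm, kernel_size=5)
print(smoothed)  # the 155.0 spike is replaced by a local median
```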
3 Methodology

Predicting future values of a time series is one of the essential problems in data science. A conventional method is to model the patterns with an autoregressive moving-average (ARMA) process. However, the performance of the ARMA model is not satisfactory in this task since it cannot capture non-linearities [Hamilton, 1994]. Feed-forward neural networks can overcome this difficulty and learn the patterns of multivariate time series well when fed very large amounts of data [Zhang et al., 1998]. They are thus a more suitable model for our four-channel input and the non-linear glucose-insulin interaction. Moreover, our work is based on WaveNet, which is more time-efficient for training and testing, with smaller weights, than recurrent neural networks (RNNs) [Borovykh et al., 2017]. It focuses on the long-term relationships between channels and is conditioned on all previous samples [Van Den Oord et al., 2016], which is modelled as (1):

p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})    (1)

where \mathbf{x} = x_1, \ldots, x_T and p(\mathbf{x}) is the joint probability computed as the product of conditional probabilities. The output dimension is the same as the input because there are no pooling layers. Convolutional layers model the conditional probabilities, and a softmax layer is applied to maximise the log-likelihood.

3.1 The Causal CNN

The main components of WaveNet are causal convolutional layers. By shifting the outputs by several data points, 1-D causal convolution layers can be implemented. Causality is essential for a CNN forecasting time series, since it guarantees that the model cannot use any information from future timesteps. One special ingredient in particular is the dilated causal convolutional neural network (DCNN) layer, which greatly increases the receptive field over the input signal. The structure is shown in Figure 2. Compared with regular causal convolutional layers, the dilated ones cover a larger number of input nodes. This makes the system capable of learning long-term dependencies by skipping input steps, as determined by the dilation factor.

[Figure 2: One dilated causal convolution block with five layers; the dilation doubles from 1 at the input to 16 at the output layer.]

In this work there are three blocks, and each DCNN block contains five layers of dilated convolution. The dilation increases from one upwards within each block. The motivation behind this configuration is to grow the receptive field exponentially, covering more of the past: the block sees as much history as a 1 x 32 convolution but is more efficient.

3.2 System Architecture

We adopt the fast approach to implementing WaveNet, which removes redundant convolutional operations and reduces the time complexity of generation from O(2^L) to O(L) [Paine et al., 2016], where L is the total number of layers. The fundamental technique is to create convolution queues that divide the operations into pop and push phases. The model then functionally acts like a single step of an RNN. The system model for this project with fast WaveNet is shown in Figure 3.

[Figure 3: The system model with residual connections. The output from the previous layer (i-1) is the input of the subsequent dilated convolutional layer (i). The process is repeated until the final output layer is obtained. The output is then fed into a 1 x 1 convolutional layer, and a Softmax layer over the 256 classes computes the predictions.]
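For illustration, here is a sketch of one five-layer dilated causal block in the spirit of Figure 2, written in PyTorch. The 32-channel width and kernel size 2 are assumptions consistent with the 1 x 32 receptive field, not published hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedBlock(nn.Module):
    """Five stacked 1-D convolutions with dilations 1, 2, 4, 8, 16.
    Left-padding by (kernel_size - 1) * dilation keeps each layer
    causal: output t never sees inputs later than t. The receptive
    field of the block is 32 timesteps."""
    def __init__(self, channels=32, kernel_size=2):
        super().__init__()
        self.pads = [(kernel_size - 1) * d for d in (1, 2, 4, 8, 16)]
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=d)
            for d in (1, 2, 4, 8, 16))
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, time)
        for pad, conv in zip(self.pads, self.layers):
            x = self.relu(conv(F.pad(x, (pad, 0))))  # pad on the left only
        return x

x = torch.randn(1, 32, 100)        # the four input channels would be
y = CausalDilatedBlock()(x)        # projected to 32 beforehand (assumption)
print(y.shape)                     # torch.Size([1, 32, 100])
```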
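The pop/push convolution queues of the fast generation scheme [Paine et al., 2016] can be illustrated with a toy scalar version of a single kernel-size-2 layer; the class and weight names here are hypothetical.

```python
from collections import deque

class FastLayer:
    """One generation step of a fast-WaveNet layer (kernel size 2).
    A queue of length `dilation` caches past inputs, so producing a
    new sample costs O(1) per layer and O(L) overall."""
    def __init__(self, dilation, w_old, w_new, b=0.0):
        self.queue = deque([0.0] * dilation, maxlen=dilation)
        self.w_old, self.w_new, self.b = w_old, w_new, b

    def step(self, x_new):
        x_old = self.queue.popleft()  # pop: the input from `dilation` steps ago
        self.queue.append(x_new)      # push: cache the current input
        return max(self.w_old * x_old + self.w_new * x_new + self.b, 0.0)  # ReLU

layer = FastLayer(dilation=4, w_old=0.5, w_new=0.5)
print([round(layer.step(x), 2) for x in [1.0, 2.0, 3.0, 4.0, 5.0]])
# -> [0.5, 1.0, 1.5, 2.0, 3.0]; from step 5 on, x_old is the input 4 steps back
```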
Compared with the work in [Van Den Oord et al., 2016], we use the rectified linear unit (ReLU), denoted ReLU(x) := max(x, 0), as the activation function instead of the gated activation. How well the model learns the non-linearities of the data series largely depends on the activation function, and it is found in [Borovykh et al., 2017] that the ReLU is very efficient for non-stationary or noisy time series. Moreover, it also reduces the training time and further simplifies the model. The output of layer i is written as (2):

f^i = [\mathrm{ReLU}(w_1^i *_d f^{i-1}) + b, \ldots, \mathrm{ReLU}(w_{T_i}^i *_d f^{i-1}) + b]    (2)

where f^i is the output of CNN layer i after the dilated convolution *_d with weight filters w_l^i, l = 1, 2, \ldots, T_i, and b stands for the bias. To model the conditional probabilities, a Softmax is applied to calculate the training loss and output the predictions. The reason is its flexibility: it places no requirement on the data shape, so it works well for continuous 1-D data [Oord et al., 2016]. The input data has the same length as the outputs.

4 Training WaveNet

After pre-processing the patient data and constructing the WaveNet system, the next step is to train the network; the test data can then be fed into the trained model.

4.1 Make Batches

The inputs of the neural network are four channels: CGM data, insulin events, carbohydrate intake, and time index. The batches for the testing phase have the same structure. The PH is 30 minutes, so the model must forecast the CGM value 6 points ahead. We therefore calculate the glucose change between the current value and the value at the end of the PH. By quantisation, we place the change in glucose values into 256 classes/categories as targets, with a difference of 1 mg/dl between adjacent classes. The number of classes is chosen carefully: too few classes cannot distinguish the differences, while too many are not suitable for a small training dataset. After investigating the training dataset, we find that 256 classes cover more than 95% of the observed differences, corresponding to glucose changes in the range of ±128 mg/dl within 30 minutes.

4.2 Weight Optimisation

The training process finds the weights that minimise the cost function of the network. The cost function, one of the most important indicators in the training phase, represents the error between the targets and the network output. In the proposed system, sparse Softmax cross entropy is applied to optimise the model. The optimisation follows the gradient descent method, calculating weight updates through backward propagation after each iteration. Here we use the adaptive moment estimation (Adam) optimiser to adjust the training steps by minimising the mean cost, with the learning rate set to 0.0001. The Adam optimiser has promising performance on non-stationary time series [Kingma and Ba, 2014]; it uses first-order gradients and can be implemented with high computational efficiency. We set the total number of training iterations to 1,000 to avoid underfitting and overfitting. The cost function loss versus the global steps is shown in Figure 4.

[Figure 4: The training cost through the iterations (training cost versus global steps, 0 to 1,000).]

The curve decreases smoothly over the first 500 iterations, and some spikes appear thereafter, a consequence of the mini-batch operation of the Adam optimiser. The cost is still converging after 1,000 global steps, but training further causes over-fitting, which is likely here because the model is complex relative to the limited training data.
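The loss and optimiser named above (sparse softmax cross entropy, Adam with learning rate 0.0001, 1,000 iterations) translate directly into a training loop. The sketch below uses PyTorch, where CrossEntropyLoss on integer targets plays the role of sparse softmax cross entropy; the tiny stand-in model and random batch are placeholders for the real network and data.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: a toy 1x1-conv head and one random batch; the
# real model is the dilated causal CNN sketched earlier, trained over
# mini-batches of the pre-processed four-channel data.
model = nn.Sequential(nn.Conv1d(4, 32, 1), nn.ReLU(), nn.Conv1d(32, 256, 1))
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr from Sec. 4.2
loss_fn = nn.CrossEntropyLoss()  # integer targets: sparse softmax cross entropy

inputs = torch.randn(8, 4, 64)             # (batch, 4 channels, time)
targets = torch.randint(0, 256, (8, 64))   # one of 256 classes per timestep

for step in range(1000):                   # 1,000 iterations, as in Sec. 4.2
    optimiser.zero_grad()
    logits = model(inputs)                 # (batch, 256 classes, time)
    loss = loss_fn(logits, targets)        # cross entropy over the class dim
    loss.backward()
    optimiser.step()
```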
5 Performance

5.1 Results

The unit of glucose level in this paper is mg/dl, and we use one of the essential evaluation metrics, the root mean squared error (RMSE), to evaluate prediction performance:

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left( \hat{x}_{(t \mid t - PH)} - x_t \right)^2}.    (3)

We mainly focus on the RMSE for each patient and record the results over several runs; they are shown in Table 1. The mean of the best results is 21.7267, with a standard deviation of 2.5237. The performance varies between subjects, and the method performs best for subject 570.

Table 1: The RMSE results of six patients for a 30-minute PH.

Patient      P559    P563    P570    P575    P588    P591
Avg          22.48   20.35   18.26   25.65   21.69   24.59
Best         21.72   20.17   18.03   24.80   21.42   24.22
Points (#)   2514    2570    2745    2590    2791    2760
Best Avg     21.7267 ± 2.5237

[Figure 5: The forecast CGM series against the original CGM recordings. Top: the whole test set for patient 570, RMSE: 18.027, MARD: 6.545. Bottom: the predictions and test data for patient 570 on 21/01/2022, with insulin events (dashed lines) and meal events (red circles) marked.]

5.2 Analysis

The forecast curve and the original CGM data are plotted in Figure 5. Notably, the predicted curve fits the original CGM recordings with similar overall trends. However, differences remain, visible in the detailed view of one day of CGM data. First, there is an obvious delay between predictions and raw data, especially at turning points and peaks. In the bottom plot, the dashed lines stand for insulin events and the red circles for meal events; the curve clearly changes significantly after these events. In these periods the curve fluctuates intensely and the error is high. A possible reason is that it is difficult for the model to learn the underlying biological model explicitly, and these glucose level changes are not determined by the input data alone. Nonetheless, the RMSE improves by 0.8 mg/dl when feeding all four data channels instead of the CGM data alone.

Another finding concerns the effect of extrapolation. Some values are missing around 18:00, and we use first-order extrapolation to fill in the time series. However, the error remains high for the data just after these regions. Because the predicted curve is computed by adding the predicted differences onto previous values, it depends heavily on the data 6 timesteps before. We only extrapolate the CGM field, because the other three channels are discrete values. As Figure 5 shows, insulin and carbohydrate intake have significant impacts on future values, so missing data points there would cause additional error. Several other interpolation methods, such as cubic and spline, were also tested in the training phase. First-order interpolation performs best, reducing the mean RMSE by 2.1 mg/dl, because it captures the linearity of the 1-D signal well.

For the first six data points, the predictions are calculated by concatenating the last part of the training data at the front of the testing data. As long as the trend of the concatenated data matches the subsequent timesteps, this barely affects the RMSE results. The length of the predictions equals that of the testing dataset, as the point counts in Table 1 show.

The predictions for subjects 575 and 591 are much worse than for the other subjects, for two reasons. On the one hand, there is the large gap in the training dataset, as for subject 591: although the data combination approach compensates for this loss and successfully reduces the mean RMSE by around 0.2 mg/dl, the RMSE is still high compared with the others. On the other hand, there is the condition of the testing dataset. For subject 591, the test CGM data fluctuates strongly with many spikes, and the error is high near the turning points of the curve. These fluctuations are determined by the underlying biology and the patient's health, as captured for example by the plasma insulin model in [Lehmann and Deutsch, 1992].
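Equation (3), together with the step of adding the de-quantised change back onto the value six steps earlier (the dependency discussed above), is straightforward in numpy. The offset convention below mirrors the quantisation sketch in Section 1 and is our assumption:

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error, as in equation (3)."""
    return np.sqrt(np.mean((pred - actual) ** 2))

def reconstruct(cgm, labels, horizon=6, n_classes=256):
    """Turn predicted classes back into absolute glucose values by
    adding the de-quantised change onto the value `horizon` steps
    earlier; errors there propagate, as noted in Section 5.2."""
    delta = labels.astype(float) - n_classes // 2
    return cgm[:-horizon] + delta

labels = np.array([169, 164])  # predicted classes from the toy example
cgm = np.array([120.0, 124.0, 131.0, 140.0, 152.0, 158.0, 161.0, 160.0])
pred = reconstruct(cgm, labels)     # -> [161., 160.]
print(rmse(pred, cgm[6:]))          # 0.0 for this toy example
```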
Moreover, subjects 570 and 563 use Humalog insulin while the other four subjects use Novolog, and we found that the predictions are more accurate for the subjects on Humalog. These two patients also share the same gender. Data from a larger group of subjects would be required to confirm and explore such correlations.

Compared with existing models, this paper presents a novel deep learning model based on CNN layers. Its performance surpasses simple autoregressive models using the same CGM data alone, following the structure in [Sparacino et al., 2007]. The RMSE results of other deep learning models, such as [Mougiakakou et al., 2006], cannot be compared directly because of the different subjects and datasets. Nevertheless, the major advantages of this model are its higher efficiency, with fewer global training steps and smaller weights, and its fast algorithmic implementation with low time complexity O(L).

6 Conclusion

In this paper, the task of predicting the glucose level over a 30-minute PH is converted into a classification problem, and a new model based on a modified WaveNet is developed. With the pre-processed dataset and the causal DCNN system architecture, the network is trained to obtain its weights. Four channels are selected as network inputs because we find they correlate strongly with glucose levels. The mean value of the best RMSE for the six subjects is 21.7267, with a standard deviation of 2.5237. The model differs from existing RNN models and outperforms many current algorithms. The prediction performance is mainly affected by missing CGM values and the length of the training sets. Integrating other data fields with biological models is a potential approach to improving prediction accuracy in future work.

References

[Borovykh et al., 2017] Anastasia Borovykh, Sander Bohte, and Cornelis W. Oosterlee. Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691, 2017.

[Daneman, 2006] Denis Daneman. Type 1 diabetes. The Lancet, 367(9513):847–858, 2006.

[Hamilton, 1994] James Douglas Hamilton. Time Series Analysis, volume 2. Princeton University Press, Princeton, 1994.

[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Lehmann and Deutsch, 1992] E. D. Lehmann and T. Deutsch. A physiological model of glucose-insulin interaction in type 1 diabetes mellitus. Journal of Biomedical Engineering, 14(3):235–242, 1992.
[Marling and Bunescu, 2018] Cindy Marling and Razvan Bunescu. The OhioT1DM dataset for blood glucose level prediction. In The 3rd International Workshop on Knowledge Discovery in Healthcare Data, Stockholm, Sweden, July 2018. CEUR proceedings in press, available at http://smarthealth.cs.ohio.edu/bglp/OhioT1DM-dataset-paper.pdf.

[Mougiakakou et al., 2006] Stavroula G. Mougiakakou, Aikaterini Prountzou, Dimitra Iliopoulou, Konstantina S. Nikita, Andriani Vazeou, and Christos S. Bartsocas. Neural network based glucose-insulin metabolism models for children with type 1 diabetes. In Engineering in Medicine and Biology Society, 2006. EMBS'06. 28th Annual International Conference of the IEEE, pages 3545–3548. IEEE, 2006.

[Oord et al., 2016] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[Paine et al., 2016] Tom Le Paine, Pooya Khorrami, Shiyu Chang, Yang Zhang, Prajit Ramachandran, Mark A. Hasegawa-Johnson, and Thomas S. Huang. Fast WaveNet generation algorithm. arXiv preprint arXiv:1611.09482, 2016.

[Sparacino et al., 2007] Giovanni Sparacino, Francesca Zanderigo, Stefano Corazza, Alberto Maran, Andrea Facchinetti, and Claudio Cobelli. Glucose concentration can be predicted ahead in time from continuous glucose monitoring sensor time-series. IEEE Transactions on Biomedical Engineering, 54(5):931–937, 2007.

[Van Den Oord et al., 2016] Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[Zhang et al., 1998] Guoqiang Zhang, B. Eddy Patuwo, and Michael Y. Hu. Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting, 14(1):35–62, 1998.