A Deep Learning Algorithm For Personalized Blood Glucose Prediction

Taiyu Zhu∗, Kezhi Li∗, Pau Herrero, Jianwei Chen, Pantelis Georgiou
Department of Electronic and Electrical Engineering, Imperial College London, London SW7 2AZ, UK
taiyu.zhu17@imperial.ac.uk, kezhi.li@imperial.ac.uk

∗ This work is submitted to the Blood Glucose Level Prediction Challenge, the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence (IJCAI-ECAI 2018), International Workshop on Knowledge Discovery in Healthcare Data.
† This work is supported by EPSRC, the ARISES project. T. Zhu and K. Li are the main contributors to the paper.

Abstract

A convolutional neural network (CNN) model is presented to forecast the future glucose levels of patients with type 1 diabetes. The model is a modified version of WaveNet, a recently proposed model that has proven very useful in acoustic signal processing. By recasting the task as a classification problem, the model is built mainly from causal dilated CNN layers and employs the fast WaveNet algorithm. The OhioT1DM dataset is the source of the four input fields: glucose levels, insulin events, carbohydrate intake and time index. The data is fed into the network along with targets describing the glucose change over 30 minutes. Several pre-processing approaches, such as interpolation, combination and filtering, are used to fill in the missing data in the training sets, and they improve the performance. Finally, we obtain predictions on the testing dataset and evaluate the results by the root mean squared error (RMSE). The mean value of the best RMSE over the six patients is 21.72.

1 Introduction

With an increasing incidence worldwide, type 1 diabetes is a severe chronic disease that requires long-term management of blood glucose, which relies on glucose predictions [Daneman, 2006]. Aiming to improve the accuracy of these predictions, artificial intelligence researchers have been investigating machine learning approaches to develop efficient forecasting models. In this paper, the main system is constructed from a deep convolutional neural network (CNN). It originates from WaveNet, a model first developed by DeepMind to process raw audio signals [Van Den Oord et al., 2016]. The glucose data of the patients are obtained sequentially by continuous glucose monitoring (CGM) every five minutes. The change between the current glucose value and the future glucose value is quantised into 256 target categories. Under this framing, the prediction problem is converted into a classification task, which can then be solved directly. After pre-processing the datasets and building the modified WaveNet model, prediction results for a 30-minute prediction horizon (PH) are obtained.
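To make the classification framing concrete, the following is a minimal sketch of how such a quantisation could be implemented. The function name `quantise_change`, the gap-free input and the clipping of out-of-range changes are our illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def quantise_change(cgm, horizon=6, n_classes=256):
    """Map the glucose change over `horizon` steps (30 min at 5-min
    sampling) to integer class labels in [0, n_classes), with 1 mg/dl
    per class, covering changes of roughly +/-128 mg/dl."""
    delta = cgm[horizon:] - cgm[:-horizon]  # future minus current value
    # Shift so that a change of -128 mg/dl maps to class 0, and clip
    # the rare changes that fall outside the covered range (assumption).
    labels = np.clip(np.round(delta) + n_classes // 2, 0, n_classes - 1)
    return labels.astype(np.int64)

cgm = np.array([120.0, 124.0, 131.0, 140.0, 152.0, 158.0, 161.0, 160.0])
print(quantise_change(cgm))  # a change of +41 mg/dl becomes class 169
```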
2 Data Pre-processing

2.1 Database

The training and testing data come from the OhioT1DM dataset developed by [Marling and Bunescu, 2018]. Six patients with type 1 diabetes wore Medtronic 530G insulin pumps and Medtronic Enlite CGM sensors to collect data over an 8-week period. Each patient also reported daily events via a smartphone app and wore a fitness band. In the OhioT1DM dataset, the patients are numbered 559, 563, 570, 575, 588 and 591. Two of them, 563 and 570, are male, while the others are female. Three of the nineteen data fields, namely previous CGM data ('glucose_level'), insulin value ('bolus') and carbohydrate intake ('bwz_carb_input'), together with a time index normalised to the unit interval for each day, are used as the four input channels. The meal information comes from the pump's Bolus Wizard (BWZ), into which patients enter carbohydrates to calculate the bolus. Other fields of the patient data, such as heart rate and skin temperature, were also added to the input batch; however, in our experiments these fields slightly degrade classification performance and add variance to the model.

2.2 Interpolation and Extrapolation

Inspection of the glucose data shows that values are missing over several intervals in both the training and testing sets. Since the targets of the CNN model rely on differences between current and future data points, these discontinuities can have a negative influence. We fill in the missing values of the training dataset by first-order interpolation. For the testing data, first-order extrapolation is used instead, to ensure that no future values are involved. The predictions over extrapolated intervals are then discarded, which guarantees that the result has the same length as the CGM testing data when evaluating performance. (A sketch of both steps is given after Figure 1 below.)

2.3 Combination

The training dataset contains data from six patients over roughly 40 days, about 115,200 CGM data points in total. An effective machine learning model usually needs a much larger training set, and missing intervals appear frequently throughout the dataset. In our model, data points around large missing gaps are discarded, and we interpolate values only across short intervals. To lengthen the training set, avoid overfitting and improve generalisation, we expand each subject's training data: we take the longest continuous segments from the other subjects and combine them with the current subject's data to form an extended training set. Notably, this strategy keeps half of the training data from the current subject, with the other five patients contributing 10% each. The segments from the other patients are selected, by inspection, as those with the fewest missing values. This method significantly improves the performance for patient 591, whose training set has a long missing interval (967 points).

2.4 Filtering

After the interpolation and combination, we find many small spikes near the peaks and turning points of the CGM training data. These spikes contribute variance when the batches are used to train the model. To remove this variance, we apply a median filter to suppress the noise, at the cost of slightly raising the bias. The window size needs to be chosen appropriately; it is five points in this work. The median filter is not applied to the testing data, so on-line prediction remains feasible. The outcome of the data pre-processing is shown in Figure 1.

[Figure 1: A part of the CGM training set after pre-processing, showing the original data, the linear interpolation and the median-filter output (glucose level versus time index).]
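As promised in Section 2.2, here is a minimal numpy sketch of the two gap-filling steps, assuming missing CGM samples are marked as NaN; the helper names are hypothetical, and the paper does not specify its exact implementation.

```python
import numpy as np

def fill_training_gaps(cgm):
    """First-order (linear) interpolation across NaN gaps in a training
    series, using known samples on both sides of each gap."""
    t = np.arange(len(cgm))
    ok = ~np.isnan(cgm)
    return np.interp(t, t[ok], cgm[ok])

def extrapolate_test_gaps(cgm):
    """First-order extrapolation for a test series: each NaN is filled
    by continuing the slope of the two preceding samples, so no future
    values are used. Assumes the first two samples are present; the
    predictions over these spans are later discarded, as in Sec. 2.2."""
    out = cgm.copy()
    for i in range(2, len(out)):
        if np.isnan(out[i]):
            out[i] = 2 * out[i - 1] - out[i - 2]  # continue the last slope
    return out
```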
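The five-point median filter of Section 2.4 is available off the shelf in scipy; a small demonstration on made-up values:

```python
import numpy as np
from scipy.signal import medfilt

train_cgm = np.array([118.0, 120.0, 155.0, 123.0, 126.0, 128.0, 131.0])
# A five-point median filter removes isolated spikes (here the 155.0)
# at the cost of a slight increase in bias; it is applied to the
# training series only, so online prediction remains feasible.
# Note that scipy's medfilt zero-pads at the boundaries.
smoothed = medfilt(train_cgm, kernel_size=5)
print(smoothed)  # the 155.0 spike is replaced by a local median
```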
3 Methodology

Predicting future values of a time series is one of the essential problems in data science. A conventional method is to model the patterns with an autoregressive moving-average (ARMA) process. However, the performance of the ARMA model is not satisfactory in this task since it cannot capture non-linearities [Hamilton, 1994]. Feed-forward neural networks can overcome this difficulty and learn the patterns of multivariate time series well when fed very large amounts of data [Zhang et al., 1998]. They are thus a more suitable model for our four-channel input and the non-linear glucose-insulin interaction. Moreover, our work is based on WaveNet, which is more time-efficient for training and testing, with smaller weights, than recurrent neural networks (RNNs) [Borovykh et al., 2017]. It focuses on the long-term relationships between channels and is conditioned on all previous samples [Van Den Oord et al., 2016], which is modelled as (1):

p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})    (1)

where \mathbf{x} = x_1, \ldots, x_T and p(\mathbf{x}) is the joint probability computed as the product of conditional probabilities. The output dimension is the same as the input because there are no pooling layers. Convolutional layers model the conditional probabilities, and a softmax layer is applied to maximise the log-likelihood.

3.1 The Causal CNN

The main components of WaveNet are causal convolutional layers. By shifting the outputs by several data points, 1-D causal convolution layers can be implemented. Causality is essential for a CNN forecasting time series, since it guarantees that the model cannot use any information from future timesteps. One special ingredient in particular is the dilated causal convolutional neural network (DCNN) layer, which greatly increases the receptive field over the input signal. The structure is shown in Figure 2. Compared with regular causal convolutional layers, the dilated ones cover a larger number of input nodes. This makes the system capable of learning long-term dependencies by skipping input steps, as determined by the dilation factor.

[Figure 2: One dilated causal convolution block with five layers; the dilation doubles from 1 at the input to 16 at the output layer.]

In this work there are three blocks, and each DCNN block contains five layers of dilated convolution. The dilation increases from one upwards within each block. The motivation behind this configuration is to grow the receptive field exponentially, covering more of the past: the block sees as much history as a 1 x 32 convolution but is more efficient.

3.2 System Architecture

We adopt the fast approach to implementing WaveNet, which removes redundant convolutional operations and reduces the time complexity of generation from O(2^L) to O(L) [Paine et al., 2016], where L is the total number of layers. The fundamental technique is to create convolution queues that divide the operations into pop and push phases. The model then functionally acts like a single step of an RNN. The system model for this project with fast WaveNet is shown in Figure 3.

[Figure 3: The system model with residual connections. The output from the previous layer (i-1) is the input of the subsequent dilated convolutional layer (i). The process is repeated until the final output layer is obtained. The output is then fed into a 1 x 1 convolutional layer, and a Softmax layer over the 256 classes computes the predictions.]
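For illustration, here is a sketch of one five-layer dilated causal block in the spirit of Figure 2, written in PyTorch. The 32-channel width and kernel size 2 are assumptions consistent with the 1 x 32 receptive field, not published hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedBlock(nn.Module):
    """Five stacked 1-D convolutions with dilations 1, 2, 4, 8, 16.
    Left-padding by (kernel_size - 1) * dilation keeps each layer
    causal: output t never sees inputs later than t. The receptive
    field of the block is 32 timesteps."""
    def __init__(self, channels=32, kernel_size=2):
        super().__init__()
        self.pads = [(kernel_size - 1) * d for d in (1, 2, 4, 8, 16)]
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=d)
            for d in (1, 2, 4, 8, 16))
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, time)
        for pad, conv in zip(self.pads, self.layers):
            x = self.relu(conv(F.pad(x, (pad, 0))))  # pad on the left only
        return x

x = torch.randn(1, 32, 100)        # the four input channels would be
y = CausalDilatedBlock()(x)        # projected to 32 beforehand (assumption)
print(y.shape)                     # torch.Size([1, 32, 100])
```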
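The pop/push convolution queues of the fast generation scheme [Paine et al., 2016] can be illustrated with a toy scalar version of a single kernel-size-2 layer; the class and weight names here are hypothetical.

```python
from collections import deque

class FastLayer:
    """One generation step of a fast-WaveNet layer (kernel size 2).
    A queue of length `dilation` caches past inputs, so producing a
    new sample costs O(1) per layer and O(L) overall."""
    def __init__(self, dilation, w_old, w_new, b=0.0):
        self.queue = deque([0.0] * dilation, maxlen=dilation)
        self.w_old, self.w_new, self.b = w_old, w_new, b

    def step(self, x_new):
        x_old = self.queue.popleft()  # pop: the input from `dilation` steps ago
        self.queue.append(x_new)      # push: cache the current input
        return max(self.w_old * x_old + self.w_new * x_new + self.b, 0.0)  # ReLU

layer = FastLayer(dilation=4, w_old=0.5, w_new=0.5)
print([round(layer.step(x), 2) for x in [1.0, 2.0, 3.0, 4.0, 5.0]])
# -> [0.5, 1.0, 1.5, 2.0, 3.0]; from step 5 on, x_old is the input 4 steps back
```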
Compared with the work in [Van Den Oord et al., 2016], we use the rectified linear unit (ReLU), denoted ReLU(x) := max(x, 0), as the activation function instead of the gated activation. How well the model learns the non-linearities of the data series largely depends on the activation function, and it is found in [Borovykh et al., 2017] that the ReLU is very efficient for non-stationary or noisy time series. Moreover, it also reduces the training time and further simplifies the model. The output of layer i is written as (2):

f^i = [\mathrm{ReLU}(w_1^i *_d f^{i-1}) + b, \ldots, \mathrm{ReLU}(w_{T_i}^i *_d f^{i-1}) + b]    (2)

where f^i is the output of CNN layer i after the dilated convolution *_d with weight filters w_l^i, l = 1, 2, \ldots, T_i, and b stands for the bias. To model the conditional probabilities, a Softmax is applied to calculate the training loss and output the predictions. The reason is its flexibility: it places no requirement on the data shape, so it works well for continuous 1-D data [Oord et al., 2016]. The input data has the same length as the outputs.

4 Training WaveNet

After pre-processing the patient data and constructing the WaveNet system, the next step is to train the network; the test data can then be fed into the trained model.

4.1 Make Batches

The inputs of the neural network are four channels: CGM data, insulin events, carbohydrate intake, and time index. The batches for the testing phase have the same structure. The PH is 30 minutes, so the model must forecast the CGM value 6 points ahead. We therefore calculate the glucose change between the current value and the value at the end of the PH. By quantisation, we place the change in glucose values into 256 classes/categories as targets, with a difference of 1 mg/dl between adjacent classes. The number of classes is chosen carefully: too few classes cannot distinguish the differences, while too many are not suitable for a small training dataset. After investigating the training dataset, we find that 256 classes cover more than 95% of the observed differences, corresponding to glucose changes in the range of ±128 mg/dl within 30 minutes.

4.2 Weight Optimisation

The training process finds the weights that minimise the cost function of the network. The cost function, one of the most important indicators in the training phase, represents the error between the targets and the network output. In the proposed system, sparse Softmax cross entropy is applied to optimise the model. The optimisation follows the gradient descent method, calculating weight updates through backward propagation after each iteration. Here we use the adaptive moment estimation (Adam) optimiser to adjust the training steps by minimising the mean cost, with the learning rate set to 0.0001. The Adam optimiser has promising performance on non-stationary time series [Kingma and Ba, 2014]; it uses first-order gradients and can be implemented with high computational efficiency. We set the total number of training iterations to 1,000 to avoid underfitting and overfitting. The cost function loss versus the global steps is shown in Figure 4.

[Figure 4: The training cost through the iterations (training cost versus global steps, 0 to 1,000).]

The curve decreases smoothly over the first 500 iterations, and some spikes appear thereafter, a consequence of the mini-batch operation of the Adam optimiser. The cost is still converging after 1,000 global steps, but training further causes over-fitting, which is likely here because the model is complex relative to the limited training data.
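The loss and optimiser named above (sparse softmax cross entropy, Adam with learning rate 0.0001, 1,000 iterations) translate directly into a training loop. The sketch below uses PyTorch, where CrossEntropyLoss on integer targets plays the role of sparse softmax cross entropy; the tiny stand-in model and random batch are placeholders for the real network and data.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: a toy 1x1-conv head and one random batch; the
# real model is the dilated causal CNN sketched earlier, trained over
# mini-batches of the pre-processed four-channel data.
model = nn.Sequential(nn.Conv1d(4, 32, 1), nn.ReLU(), nn.Conv1d(32, 256, 1))
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr from Sec. 4.2
loss_fn = nn.CrossEntropyLoss()  # integer targets: sparse softmax cross entropy

inputs = torch.randn(8, 4, 64)             # (batch, 4 channels, time)
targets = torch.randint(0, 256, (8, 64))   # one of 256 classes per timestep

for step in range(1000):                   # 1,000 iterations, as in Sec. 4.2
    optimiser.zero_grad()
    logits = model(inputs)                 # (batch, 256 classes, time)
    loss = loss_fn(logits, targets)        # cross entropy over the class dim
    loss.backward()
    optimiser.step()
```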
5 Performance

5.1 Results

The unit of glucose level in this paper is mg/dl, and we use one of the essential evaluation metrics, the root mean squared error (RMSE), to evaluate prediction performance:

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left( \hat{x}_{(t \mid t - PH)} - x_t \right)^2}.    (3)

We mainly focus on the RMSE for each patient and record the results over several runs; they are shown in Table 1. The mean of the best results is 21.7267, with a standard deviation of 2.5237. The performance varies between subjects, and the method performs best for subject 570.

Table 1: The RMSE results of six patients for a 30-minute PH.

Patient      P559    P563    P570    P575    P588    P591
Avg          22.48   20.35   18.26   25.65   21.69   24.59
Best         21.72   20.17   18.03   24.80   21.42   24.22
Points (#)   2514    2570    2745    2590    2791    2760
Best Avg     21.7267 ± 2.5237

[Figure 5: The forecast CGM series against the original CGM recordings. Top: the whole test set for patient 570, RMSE: 18.027, MARD: 6.545. Bottom: the predictions and test data for patient 570 on 21/01/2022, with insulin events (dashed lines) and meal events (red circles) marked.]

5.2 Analysis

The forecast curve and the original CGM data are plotted in Figure 5. Notably, the predicted curve fits the original CGM recordings with similar overall trends. However, differences remain, visible in the detailed view of one day of CGM data. First, there is an obvious delay between predictions and raw data, especially at turning points and peaks. In the bottom plot, the dashed lines stand for insulin events and the red circles for meal events; the curve clearly changes significantly after these events. In these periods the curve fluctuates intensely and the error is high. A possible reason is that it is difficult for the model to learn the underlying biological model explicitly, and these glucose level changes are not determined by the input data alone. Nonetheless, the RMSE improves by 0.8 mg/dl when feeding all four data channels instead of the CGM data alone.

Another finding concerns the effect of extrapolation. Some values are missing around 18:00, and we use first-order extrapolation to fill in the time series. However, the error remains high for the data just after these regions. Because the predicted curve is computed by adding the predicted differences onto previous values, it depends heavily on the data 6 timesteps before. We only extrapolate the CGM field, because the other three channels are discrete values. As Figure 5 shows, insulin and carbohydrate intake have significant impacts on future values, so missing data points there would cause additional error. Several other interpolation methods, such as cubic and spline, were also tested in the training phase. First-order interpolation performs best, reducing the mean RMSE by 2.1 mg/dl, because it captures the linearity of the 1-D signal well.

For the first six data points, the predictions are calculated by concatenating the last part of the training data at the front of the testing data. As long as the trend of the concatenated data matches the subsequent timesteps, this barely affects the RMSE results. The length of the predictions equals that of the testing dataset, as the point counts in Table 1 show.

The predictions for subjects 575 and 591 are much worse than for the other subjects, for two reasons. On the one hand, there is the large gap in the training dataset, as for subject 591: although the data combination approach compensates for this loss and successfully reduces the mean RMSE by around 0.2 mg/dl, the RMSE is still high compared with the others. On the other hand, there is the condition of the testing dataset. For subject 591, the test CGM data fluctuates strongly with many spikes, and the error is high near the turning points of the curve. These fluctuations are determined by the underlying biology and the patient's health, as captured for example by the plasma insulin model in [Lehmann and Deutsch, 1992].
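Equation (3), together with the step of adding the de-quantised change back onto the value six steps earlier (the dependency discussed above), is straightforward in numpy. The offset convention below mirrors the quantisation sketch in Section 1 and is our assumption:

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error, as in equation (3)."""
    return np.sqrt(np.mean((pred - actual) ** 2))

def reconstruct(cgm, labels, horizon=6, n_classes=256):
    """Turn predicted classes back into absolute glucose values by
    adding the de-quantised change onto the value `horizon` steps
    earlier; errors there propagate, as noted in Section 5.2."""
    delta = labels.astype(float) - n_classes // 2
    return cgm[:-horizon] + delta

labels = np.array([169, 164])  # predicted classes from the toy example
cgm = np.array([120.0, 124.0, 131.0, 140.0, 152.0, 158.0, 161.0, 160.0])
pred = reconstruct(cgm, labels)     # -> [161., 160.]
print(rmse(pred, cgm[6:]))          # 0.0 for this toy example
```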
Moreover, subjects 570 and 563 use Humalog insulin while the other four subjects use Novolog, and we found that the predictions are more accurate for the subjects on Humalog. These two patients also share the same gender. Data from a larger group of subjects would be required to confirm and explore such correlations.

Compared with existing models, this paper presents a novel deep learning model based on CNN layers. Its performance surpasses simple autoregressive models using the same CGM data alone, following the structure in [Sparacino et al., 2007]. The RMSE results of other deep learning models, such as [Mougiakakou et al., 2006], cannot be compared directly because of the different subjects and datasets. Nevertheless, the major advantages of this model are its higher efficiency, with fewer global training steps and smaller weights, and its fast algorithmic implementation with low time complexity O(L).

6 Conclusion

In this paper, the task of predicting the glucose level over a 30-minute PH is converted into a classification problem, and a new model based on a modified WaveNet is developed. With the pre-processed dataset and the causal DCNN system architecture, the network is trained to obtain its weights. Four channels are selected as network inputs because we find they correlate strongly with glucose levels. The mean value of the best RMSE for the six subjects is 21.7267, with a standard deviation of 2.5237. The model differs from existing RNN models and outperforms many current algorithms. The prediction performance is mainly affected by missing CGM values and the length of the training sets. Integrating other data fields with biological models is a potential approach to improving prediction accuracy in future work.

References

[Borovykh et al., 2017] Anastasia Borovykh, Sander Bohte, and Cornelis W. Oosterlee. Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691, 2017.

[Daneman, 2006] Denis Daneman. Type 1 diabetes. The Lancet, 367(9513):847–858, 2006.

[Hamilton, 1994] James Douglas Hamilton. Time Series Analysis, volume 2. Princeton University Press, Princeton, 1994.

[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Lehmann and Deutsch, 1992] E. D. Lehmann and T. Deutsch. A physiological model of glucose-insulin interaction in type 1 diabetes mellitus. Journal of Biomedical Engineering, 14(3):235–242, 1992.
[Marling and Bunescu, 2018] Cindy Marling and Razvan Bunescu. The OhioT1DM dataset for blood glucose level prediction. In The 3rd International Workshop on Knowledge Discovery in Healthcare Data, Stockholm, Sweden, July 2018. CEUR proceedings in press, available at http://smarthealth.cs.ohio.edu/bglp/OhioT1DM-dataset-paper.pdf.

[Mougiakakou et al., 2006] Stavroula G. Mougiakakou, Aikaterini Prountzou, Dimitra Iliopoulou, Konstantina S. Nikita, Andriani Vazeou, and Christos S. Bartsocas. Neural network based glucose-insulin metabolism models for children with type 1 diabetes. In Engineering in Medicine and Biology Society, 2006. EMBS'06. 28th Annual International Conference of the IEEE, pages 3545–3548. IEEE, 2006.

[Oord et al., 2016] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[Paine et al., 2016] Tom Le Paine, Pooya Khorrami, Shiyu Chang, Yang Zhang, Prajit Ramachandran, Mark A. Hasegawa-Johnson, and Thomas S. Huang. Fast WaveNet generation algorithm. arXiv preprint arXiv:1611.09482, 2016.

[Sparacino et al., 2007] Giovanni Sparacino, Francesca Zanderigo, Stefano Corazza, Alberto Maran, Andrea Facchinetti, and Claudio Cobelli. Glucose concentration can be predicted ahead in time from continuous glucose monitoring sensor time-series. IEEE Transactions on Biomedical Engineering, 54(5):931–937, 2007.

[Van Den Oord et al., 2016] Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[Zhang et al., 1998] Guoqiang Zhang, B. Eddy Patuwo, and Michael Y. Hu. Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting, 14(1):35–62, 1998.