Automatic blood glucose prediction with confidence using recurrent neural networks

John Martinsson¹, Alexander Schliep², Björn Eliasson³, Christian Meijner¹, Simon Persson¹, Olof Mogren⁴

¹ Chalmers University of Technology
² Gothenburg University
³ Sahlgrenska University Hospital
⁴ RISE AI

john.martinsson@gmail.com, alexander@schlieplab.org, bjorn.eliasson@gu.se, olof@mogren.one
Abstract

Low-cost sensors continuously measuring blood glucose levels in intervals of a few minutes, combined with mobile platforms and machine-learning (ML) solutions, enable personalized precision health and disease management. ML solutions must be adapted to different sensor technologies, analysis tasks and individuals. This raises the issue of scale in creating such adapted ML solutions. We present an approach for predicting blood glucose levels for diabetics up to one hour into the future. The approach is based on recurrent neural networks trained in an end-to-end fashion, requiring nothing but the glucose level history of the patient. The model outputs the prediction along with an estimate of its certainty, helping users to interpret the predicted levels. The approach needs no feature engineering or data pre-processing, and is computationally inexpensive.

1   Introduction

Our future will be recorded and quantified in unprecedented temporal resolution and with a rapidly increasing variety of variables describing the activities we engage in as well as physiologically and medically relevant phenomena. One example is the increasingly wide adoption of continuous glucose monitoring (CGM) systems, which has given type-1 diabetics (T1D) a valuable tool for closely monitoring and reacting to their current blood glucose levels and trends. Blood glucose levels adhere to complex dynamics that depend on many different variables (such as carbohydrate intake, recent insulin injections, physical activity, stress levels, the presence of an infection in the body, sleeping patterns, hormonal patterns, etc.) [Bremer and Gough, 1999; Cryer et al., 2003]. This makes predicting short-term blood glucose changes (up to a few hours) a challenging task, and developing machine learning (ML) approaches an obvious route to improving patient care. Variations in sensor technologies must be reflected in the ML method. However, acquiring domain expertise, understanding sensors, and hand-crafting features is expensive and does not scale easily. Sometimes natural, obviously important and well-studied variables (e.g. caloric intake for diabetics) might be too inconvenient for end-users to measure. On the other hand, deep learning approaches are a step towards automated machine learning, as features, classifiers and predictors are learned simultaneously. Thus they present a possibly more scalable solution to the myriad of machine learning problems in precision health management resulting from technology changes alone.

The hypotheses underlying our approach are:

• It is feasible to predict glucose levels from glucose levels alone.
• Appropriate models can be trained by non-experts without feature engineering or complicated training procedures.
• Models can quantify the uncertainty in their predictions to alert users to the need for extra caution or additional input.
• Physiologically motivated loss functions improve the quality of predictions.

We trained and evaluated our method on the Ohio T1DM Dataset for Blood Glucose Level Prediction; see [Marling and Bunescu, 2018] for details.

2   Methodology

A recurrent neural network (RNN) is a feed-forward artificial neural network that can model a sequence of arbitrary length, using weight sharing between the positions in the sequence. In the basic RNN variant, the transition function is a linear transformation of the hidden state and the input, followed by a pointwise nonlinearity:

    h_t = tanh(W x_t + U h_{t-1} + b),

where W and U are weight matrices, b is a bias vector, and tanh is the selected nonlinearity. W, U, and b are typically trained using some variant of stochastic gradient descent (SGD).
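For concreteness, a single step of this recurrence can be sketched in a few lines of NumPy; the sizes and values below are hypothetical, chosen only for illustration and not taken from our model.

    import numpy as np

    def rnn_step(x_t, h_prev, W, U, b):
        """One step of the basic RNN recurrence: h_t = tanh(W x_t + U h_{t-1} + b)."""
        return np.tanh(W @ x_t + U @ h_prev + b)

    # Hypothetical sizes: scalar glucose input, hidden state of size 4.
    rng = np.random.default_rng(0)
    W, U, b = rng.normal(size=(4, 1)), rng.normal(size=(4, 4)), np.zeros(4)
    h = np.zeros(4)
    for x in [1.20, 1.15, 1.10]:      # a short, scaled glucose history
        h = rnn_step(np.array([x]), h, W, U, b)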
Basic RNNs struggle with learning long dependencies and suffer from the vanishing gradient problem. This makes them difficult to train [Hochreiter, 1998; Bengio et al., 1994], and has motivated the development of the Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997], which to some extent solves these shortcomings. An LSTM is an RNN where the cell at each step t contains an internal memory vector c_t, and three gates controlling what parts of the internal memory will be kept (the forget gate f_t), what parts of the input will be stored in the internal memory (the input gate i_t), and what will be included in the output (the output gate o_t). In essence, this means that the following expressions are evaluated at each step in the sequence to compute the new internal memory c_t and the cell output h_t. Here "⊙" represents element-wise multiplication and σ denotes the logistic sigmoid function.

    i_t = σ(W_i x_t + U_i h_{t-1} + b_i),
    f_t = σ(W_f x_t + U_f h_{t-1} + b_f),
    o_t = σ(W_o x_t + U_o h_{t-1} + b_o),
    u_t = tanh(W_u x_t + U_u h_{t-1} + b_u),
    c_t = i_t ⊙ u_t + f_t ⊙ c_{t-1},
    h_t = o_t ⊙ tanh(c_t).
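These gate equations map directly onto code. Below is a minimal NumPy sketch of one LSTM step; the parameter container p and all sizes are our own hypothetical conventions, and in practice we rely on the LSTM implementation shipped with TensorFlow rather than a hand-written cell.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, p):
        """One LSTM step following the equations above."""
        i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])  # input gate
        f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])  # forget gate
        o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])  # output gate
        u = np.tanh(p["Wu"] @ x_t + p["Uu"] @ h_prev + p["bu"])  # candidate update
        c = i * u + f * c_prev    # new internal memory; "*" is element-wise
        h = o * np.tanh(c)        # cell output
        return h, c

    # Hypothetical sizes: scalar input, hidden/memory size 4.
    rng = np.random.default_rng(0)
    p = {}
    for g in "ifou":
        p["W" + g] = rng.normal(size=(4, 1))  # input weights
        p["U" + g] = rng.normal(size=(4, 4))  # recurrent weights
        p["b" + g] = np.zeros(4)              # biases
    h, c = lstm_step(np.array([1.2]), np.zeros(4), np.zeros(4), p)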
We model the blood glucose levels using a recurrent neural network (see Fig. 1), working on the sequence of input data provided by the CGM sensor system. The network consists of long short-term memory (LSTM) cells [Hochreiter and Schmidhuber, 1997]. The whole model takes as input a sequence of blood glucose measurements from the CGM system and outputs one prediction of the blood glucose level after time T (we present an experimental evaluation for T ∈ {30, 60} minutes). An RNN is designed to take a vector of inputs at each time-step, but when the network is fed blood glucose measurements only, the input vectors are one-dimensional (effectively scalar valued).

Figure 1: High-level illustration of the LSTM network used in this work. Each cell updates the internal memory vector c_i with information from the current input and outputs a vector h_i. c_i and h_i are passed on to the next cell, and finally h_t is used as input to a fully connected output layer which applies a linear transformation and outputs the predicted µ, σ².

The output vector from the final LSTM cell (see h_t in Fig. 1) in the sequence is fed through a fully connected output layer with two outputs and a linear activation function,

    [µ, σ²] = W_l h_t + b_l.

The output is modeled as a univariate Gaussian distribution [Graves, 2013], using one value for the mean, µ, and one value for the variance, σ². This gives us an estimate of the confidence in the model's predictions.

The negative log-likelihood (NLL) loss function is based on the Gaussian probability density function,

    L = (1/k) Σ_{i=1}^{k} (−log N(y_i | µ_i, σ_i²)),

where y_i is the target value from the data, and µ_i, σ_i are the network's outputs given the input sequence x_i. Modeling the prediction in this way facilitates basing decisions on the predictions.
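A minimal sketch of this architecture and loss in the Keras API of TensorFlow (the framework used for our released software) might look as follows; note that predicting the log-variance instead of σ² directly is an assumption we make here for numerical stability, and the released code may differ.

    import math
    import tensorflow as tf

    def gaussian_nll(y_true, y_pred):
        # y_pred[:, 0] is the mean mu; y_pred[:, 1] is log(sigma^2).
        # Predicting the log-variance is our assumption, made for stability.
        mu, log_var = y_pred[:, 0], y_pred[:, 1]
        se = tf.square(y_true[:, 0] - mu)
        return tf.reduce_mean(
            0.5 * (log_var + se * tf.exp(-log_var) + math.log(2.0 * math.pi)))

    # 30 minutes of history = 6 scalar glucose measurements in; [mu, log_var] out.
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(128, input_shape=(6, 1)),
        tf.keras.layers.Dense(2, activation="linear"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss=gaussian_nll)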
Physiological loss function: We also trained the model with a glucose-specific loss function [Favero et al., 2012], a metric that combines the mean squared error with a penalty term for predictions that would lead to clinically dangerous treatments.

2.1   Experimental setup

The only preprocessing done on the glucose values is scaling by 0.01, as in [Mirshekarian et al., 2017], to get the glucose values into a range fit for training.

Hyperparameter selection was performed by selecting patients 559 and 591 in the Ohio T1DM Dataset for Blood Glucose Level Prediction [Marling and Bunescu, 2018], training on the first 60% of the training data for each patient, using the next 20% of the data for early stopping, and selecting the hyperparameters by the performance on the last 20% of the data. We then proceeded to train five models, with different random initializations, for each configuration, using 30, 120 and 240 minutes of history in combination with an LSTM state size of 8, 32, 96 or 128. Each model was allowed a maximum of 200 epochs, with an early stopping patience of 8. The configuration which generalized best for the two patients used 30 minutes of glucose level history and 128 LSTM states; this can be seen in Fig. 2 (note the blue line). Using 30 minutes of history in combination with few LSTM states results in a high RMSE score for both patients, but 30 minutes of history in combination with 128 LSTM states works well for both patients. The problem of selecting the proper model and the amount of glucose level history to use for prediction warrants further research, and should be addressed in future work.

Final models: The final models were trained using 30 minutes of glucose level history for predictions 30 and 60 minutes into the future, respectively. The setup for the final training was to train on the first 80% of the glucose level training data for each patient and validate on the last 20%. The final models were given a low learning rate of 10⁻⁵, a maximum of 10,000 epochs, and an early stopping patience of 256 to allow them more time to converge. These final models were then the only models run on the supplied test data. Note that there are values in the test data for which no predictions have been made.
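A sketch of this final training setup, reusing the model (and tf import) from the previous snippet, could look as follows. The arrays x and y are hypothetical stand-ins for one patient's windowed CGM data, and restore_best_weights is our own choice, not stated in the paper.

    import numpy as np

    # Hypothetical stand-ins for one patient's (x, y) pairs.
    x = np.random.default_rng(0).normal(1.4, 0.3, size=(1000, 6, 1))
    y = x[:, -1, :] + 0.05    # dummy targets, for illustration only

    split = int(0.8 * len(x))   # first 80% for training, last 20% for validation
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=256, restore_best_weights=True)

    model.fit(x[:split], y[:split],
              validation_data=(x[split:], y[split:]),
              epochs=10000,
              callbacks=[early_stop])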
[Figure 2: two panels (Patient 559, Patient 591) plotting validation RMSE against the number of LSTM states, with separate curves for 6, 24 and 48 past time-steps.]

Figure 2: We display the RMSE score on the validation data, used to select the number of LSTM states and the number of previous time-steps (history) of the blood glucose signal that should be used to predict the future value.

Table 1: Results per patient, and averages, for predicting glucose levels at a 30 and a 60 min horizon, respectively. The table shows the root mean squared error (RMSE) of the predictions when the LSTM is trained with the negative log-likelihood (NLL) loss function and the mean squared error (MSE) loss function, respectively. t0 refers to the naive baseline of predicting the last value.

                   30 min horizon         60 min horizon
    Patient ID    NLL    MSE    t0       NLL    MSE    t0
    559           19.5   19.5   23.4     34.8   34.4   39.7
    570           16.4   16.5   19.0     28.8   28.6   31.9
    588           19.3   19.2   21.8     32.5   33.1   35.8
    563           19.0   19.0   20.8     30.8   29.9   34.0
    575           24.8   24.2   25.6     38.4   37.3   39.7
    591           25.4   22.0   24.4     36.0   36.0   38.6
    µ             20.7   20.1   22.5     33.6   33.2   36.6
    σ             ±3.2   ±2.5   ±2.2     ±3.2   ±3.1   ±3.0
Missing data: The number of missing predictions depends on the number of gaps in the data, i.e., the number of pairwise consecutive measurements in the glucose level data where the time-step is not exactly five minutes. We do not interpolate to fill the missing values, since it is unclear how much bias this would introduce, and instead only use data for which it is possible to create the (x, y) pairs of glucose history and regression targets at the given horizon. The greatest number of gaps in the test data is 11, for patient 559. Using 30 minutes of history (6 time-steps) and predicting 30 minutes into the future (6 time-steps) results in 12 ∗ 11 = 132 values with no predictions, since we need at least 12 consecutive measurements to create an (x, y) pair. The test portions of the dataset contain 2514, 2570, 2745, 2590, 2791 and 2760 test points for the six patients, which gives an upper bound of roughly 5% missing predictions for each patient. See the discussion of missing data for further explanation.
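The pairing logic can be sketched as follows; the function name and indexing conventions are ours, chosen to match the description above (12 consecutive 5-minute measurements yield one pair for a 30 minute history and a 30 minute horizon).

    import numpy as np

    def make_pairs(times, glucose, history=6, horizon=6, step_minutes=5):
        """Create (x, y) pairs only from runs of consecutive measurements.
        `times` are measurement times in minutes, `glucose` the CGM values."""
        xs, ys = [], []
        n = history + horizon                      # e.g. 6 + 6 = 12 measurements
        for start in range(len(times) - n + 1):
            window_t = times[start:start + n]
            # Skip windows containing a gap (time-step not exactly 5 minutes).
            if np.any(np.diff(window_t) != step_minutes):
                continue
            xs.append(glucose[start:start + history])
            ys.append(glucose[start + n - 1])      # target at the given horizon
        return np.array(xs)[..., None], np.array(ys)[:, None]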
Computational requirements: In our experimental setup, training of the model could be performed on a commodity laptop. The model is small enough to fit in, and be used on, mobile devices (e.g. mobile phones, blood glucose monitoring devices, etc.). Training could initially be performed offline, and incremental training would then be light enough to run either on the devices or offline.
3   Results

The results presented in Table 1 are the root mean squared errors (RMSE) for the model when trained with the mean squared error (MSE) loss function and the negative log-likelihood (NLL) loss function. The results indicate that the model performs comparably when trained with NLL and MSE, but with the added benefit of estimating the variance of the prediction.

The glucose level of patient 591 is harder to predict than the glucose level of patient 570, which can be seen in Table 1, where the 30 minute NLL RMSE for patient 570 is 16.4 and for patient 591 is 25.4. Fig. 3 indicates that the model is able to learn this by assigning a higher variance to the predictions for patient 591 than for patient 570. The standard deviation is illustrated by the pink shaded region in the figure. This is further illustrated by the Clarke error grid plots in Fig. 4, where we can see that for patient 570 most of the predictions are in region A, which is considered a clinically safe region, while for patient 591 more predictions fall in the B region, which is still considered non-critical, but also in the more critical D region. That is, the variance of the error in the predictions is higher for patient 591 than for patient 570. In particular, the model has a hard time predicting hypoglycemic events.

[Figure 3: two panels plotting glucose level against time for patient 570 (top) and patient 591 (bottom).]

Figure 3: We display the prediction (purple) and standard deviation (shaded pink) compared to the ground truth (green) for patient 570 (top) and 591 (bottom). Note the much larger uncertainty for patient 591.
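For reference, the naive t0 baseline from Table 1 simply repeats the last observed value; a sketch using the hypothetical windowed arrays x and y from the earlier snippets:

    import numpy as np

    def rmse(y_true, y_pred):
        return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

    baseline_pred = x[:, -1, 0]        # last glucose value in each history window
    baseline_rmse = rmse(y[:, 0], baseline_pred)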
4   Discussion

As the competition will provide the benchmarking, we focus on particular insights gained during the development of the method.

Minimalistic ML: Compared to results in the literature for other datasets, our system based on recurrent neural networks can predict blood glucose levels in type-1 diabetes for horizons of up to 60 minutes into the future using only blood glucose levels as input. Generally, the minor improvement over a naive baseline algorithm demonstrates that the prediction problem is a rather difficult one, partly due to large intra- and inter-patient variation. Nevertheless, our results suggest that a substantially reduced human effort (avoiding labor-intensive prior work by experts hand-crafting features based on extensive domain knowledge) in designing and training machine learning methods for precision health management can be feasible.

Quantifying uncertainty: Our model also outputs an estimate of the variance of the prediction, thus measuring the uncertainty in the prediction. This is a useful aspect for a system that will be used by continuous glucose monitoring users to make decisions about the administration of insulin and/or caloric intake. We expect that large-scale collection of data from many users will further improve results. The results in Fig. 3 show the two ends of the spectrum in this uncertainty quantification.

One principal problem is that disambiguating between intra-patient variation and sensor errors is unlikely to be feasible. An interesting research question concerns methods which can detect sensor degradation over time, or identify defects by comparing sensors for the same patient in long-term physiological studies; it is unclear if the often smoothed data supplied by sensors is sufficient for that.

[Figure 4: Clarke error grids (prediction vs. reference concentration in mg/dl, regions A–E) for patient 570 (top) and patient 591 (bottom).]

Figure 4: We show the Clarke error grid plots for patient 570 (top) and patient 591 (bottom). Note that the variance of the error in the predictions is higher for patient 591 than for patient 570.

Physiological loss function: To our surprise, we did not see improvements when using a physiologically motivated loss function [Favero et al., 2012] (results not shown), essentially a smoothed version of the Clarke error grid [Clarke et al., 1987]. Of course, our findings are not proof that such loss functions cannot improve results. Possibly a larger-scale investigation, exploring in particular a larger area of the parameter space and different training regimes, might provide further insights. Penalizing errors in hypo- or hyperglycemic states should lead to better real-world performance, as we observed comparatively larger deviations at minima and maxima. One explanation for this is the relative class imbalance, as extrema are rare. This could be countered with data augmentation techniques.

Model selection: The large inter-patient variation also suggests that selecting one model for all patients might yield suboptimal results, see Fig. 2. Consequently, precision health apps should not only adapt parameters to individuals, but also entertain increasing or decreasing model complexity. While this is clearly undesirable from a regulatory point of view (e.g., how to show efficacy in a trial), the differences we observed seem to suggest that adapting the model complexity improves the quality of care.

Missing data: There are gaps in the training data with missing values. Most of the gaps are less than 10 hours, but some are more than 24 hours. The missing data points account for roughly 23 out of 263 days, or 9% of the data. The gaps could be filled using interpolation, but it is not immediately clear how this would affect either the training or the evaluation of the models, since it would introduce artificial values. Filling a gap of 24 hours using interpolation would not result in realistic data. Instead, we have chosen not to fill the gaps with artificial values, and limit our models to be trained and evaluated only on real data. This has its own limitations, since we cannot predict the initial values after a gap, but the advantage is that model training and evaluation are not biased by the introduction of artificial values.

Conclusion: The field is certainly in desperate need of larger data sets and standards for evaluation. Crowd-sourcing from patient associations would be one possibility, but differences in sensor types and sensor revisions, life styles, and genetic makeup are all obvious confounding factors. Understanding sensor errors by measuring glucose levels in vivo, for example in diabetes animal models, with several sensors simultaneously would be very insightful, and would likely improve prediction quality. Another question concerns pre-processing in the sensors, which might be another confounding factor in the prediction. While the protection of proprietary intellectual property is necessary, there have been examples, e.g. DNA microarray technology, where only a completely open analysis process, from the initial steps usually performed with the vendor's software tools to the final result, helped to realize the full potential of the technology.

Software

The software, including all scripts to reproduce the computational experiments, is released under an open-source license and available from https://github.com/johnmartinsson/blood-glucose-prediction. We have used Google's TensorFlow framework, in particular the Keras API of TensorFlow, which allows for rapid prototyping of deep learning models, to implement our model and loss functions.

References

[Bengio et al., 1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.

[Bremer and Gough, 1999] Troy Bremer and David A Gough. Is blood glucose predictable from previous values? A solicitation for data. Diabetes, 48(3):445–451, 1999.

[Clarke et al., 1987] William L Clarke, Daniel Cox, Linda A Gonder-Frederick, William Carter, and Stephen L Pohl. Evaluating clinical accuracy of systems for self-monitoring of blood glucose. Diabetes Care, 10(5):622–628, 1987.

[Cryer et al., 2003] Philip E Cryer, Stephen N Davis, and Harry Shamoon. Hypoglycemia in diabetes. Diabetes Care, 26(6):1902–1912, 2003.

[Favero et al., 2012] Simone Del Favero, Andrea Facchinetti, and Claudio Cobelli. A glucose-specific metric to assess predictors and identify models. IEEE Transactions on Biomedical Engineering, 59(5):1281–1290, 2012.

[Graves, 2013] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Hochreiter, 1998] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.

[Marling and Bunescu, 2018] Cindy Marling and Razvan Bunescu. The OhioT1DM dataset for blood glucose level prediction. In Proceedings of the 3rd International Workshop on Knowledge Discovery in Healthcare Data, 2018.

[Mirshekarian et al., 2017] Sadegh Mirshekarian, Razvan Bunescu, Cindy Marling, and Frank Schwartz. Using LSTMs to learn physiological models of blood glucose behavior. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS, pages 2887–2891, 2017.