=Paper=
{{Paper
|id=Vol-2148/paper10
|storemode=property
|title=Automatic Blood Glucose Prediction with Confidence Using Recurrent Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-2148/paper10.pdf
|volume=Vol-2148
|authors=John Martinsson,Alexander Schliep,Bjorn Eliasson,Christian Meijner,Simon Persson,Olof Mogren
|dblpUrl=https://dblp.org/rec/conf/ijcai/MartinssonSEMPM18
}}
==Automatic Blood Glucose Prediction with Confidence Using Recurrent Neural Networks==
John Martinsson (1), Alexander Schliep (2), Björn Eliasson (3), Christian Meijner (1), Simon Persson (1), Olof Mogren (4)

(1) Chalmers University of Technology
(2) Gothenburg University
(3) Sahlgrenska University Hospital
(4) RISE AI

john.martinsson@gmail.com, alexander@schlieplab.org, bjorn.eliasson@gu.se, olof@mogren.one
Abstract

Low-cost sensors continuously measuring blood glucose levels in intervals of a few minutes and mobile platforms combined with machine-learning (ML) solutions enable personalized precision health and disease management. ML solutions must be adapted to different sensor technologies, analysis tasks and individuals. This raises the issue of scale for creating such adapted ML solutions. We present an approach for predicting blood glucose levels for diabetics up to one hour into the future. The approach is based on recurrent neural networks trained in an end-to-end fashion, requiring nothing but the glucose level history for the patient. The model outputs the prediction along with an estimate of its certainty, helping users to interpret the predicted levels. The approach needs no feature engineering or data pre-processing, and is computationally inexpensive.
1 Introduction

Our future will be recorded and quantified in unprecedented temporal resolution and with a rapidly increasing variety of variables describing activities we engage in as well as physiologically and medically relevant phenomena. One example is the increasingly wide adoption of continuous blood glucose monitoring systems (CGM), which has given type-1 diabetics (T1D) a valuable tool for closely monitoring and reacting to their current blood glucose levels and trends.

Blood glucose levels adhere to complex dynamics that depend on many different variables (such as carbohydrate intake, recent insulin injections, physical activity, stress levels, the presence of an infection in the body, sleeping patterns, hormonal patterns, etc.) [Bremer and Gough, 1999; Cryer et al., 2003]. This makes predicting short-term blood glucose changes (up to a few hours) a challenging task, and makes machine learning (ML) an obvious approach for improving patient care. Variations in sensor technologies must be reflected in the ML method. However, acquiring domain expertise, understanding sensors, and hand-crafting features is expensive and does not scale. Sometimes natural, obviously important and well-studied variables (e.g. caloric intake for diabetics) might be too inconvenient to measure for end-users. On the other hand, deep learning approaches are a step towards automated machine learning, as features, classifiers and predictors are learned simultaneously. Thus they present a possibly more scalable solution to the myriad of machine learning problems in precision health management resulting from technology changes alone.

The hypotheses underlying our approach are:

• It is feasible to predict glucose levels from glucose levels alone.
• Appropriate models can be trained by non-experts without feature engineering or complicated training procedures.
• Models can quantify uncertainty in their predictions to alert users to the need for extra caution or additional input.
• Physiologically motivated loss functions improve the quality of predictions.

We trained and evaluated our method on the Ohio T1DM Dataset for Blood Glucose Level Prediction; see [Marling and Bunescu, 2018] for details.

2 Methodology

A recurrent neural network (RNN) is a feed-forward artificial neural network that can model a sequence of arbitrary length, using weight sharing between each position in the sequence. In the basic RNN variant, the transition function is a linear transformation of the hidden state and the input, followed by a pointwise nonlinearity:

h_t = tanh(W x_t + U h_{t−1} + b),

where W and U are weight matrices, b is a bias vector, and tanh is the selected nonlinearity. W, U, and b are typically trained using some variant of stochastic gradient descent (SGD).
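To make the recurrence concrete, the following is a minimal NumPy sketch of a single step of the basic RNN transition above. The variable names mirror the symbols in the equation; the dimensions and the toy input sequence are our own illustrative choices.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One step of a basic RNN: h_t = tanh(W x_t + U h_{t-1} + b)."""
    return np.tanh(W @ x_t + U @ h_prev + b)

# Toy dimensions: scalar glucose input, hidden size 4.
input_dim, hidden_dim = 1, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden_dim, input_dim))
U = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                          # initial hidden state
for x_t in np.array([[1.10], [1.25], [1.18]]):    # a short, scaled glucose history
    h = rnn_step(x_t, h, W, U, b)
```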
Basic RNNs struggle with learning long dependencies and suffer from the vanishing gradient problem. This makes them difficult to train [Hochreiter, 1998; Bengio et al., 1994], and has motivated the development of the Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997], which to some extent solves these shortcomings. An LSTM is an RNN where the cell at each step t contains an internal memory vector c_t, and three gates controlling what parts of the internal memory will be kept (the forget gate f_t), what parts of the input will be stored in the internal memory (the input gate i_t), as well as what will be included in the output (the output gate o_t). In essence, this means that the following expressions are evaluated at each step in the sequence, to compute the new internal memory c_t and the cell output h_t. Here "⊙" represents element-wise multiplication.

i_t = σ(W_i x_t + U_i h_{t−1} + b_i),
f_t = σ(W_f x_t + U_f h_{t−1} + b_f),
o_t = σ(W_o x_t + U_o h_{t−1} + b_o),
u_t = tanh(W_u x_t + U_u h_{t−1} + b_u),
c_t = i_t ⊙ u_t + f_t ⊙ c_{t−1},
h_t = o_t ⊙ tanh(c_t).
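As an illustration only, a direct NumPy transcription of one LSTM step follows. The parameter shapes are arbitrary, and this is not the implementation used in the paper, which relies on the Keras LSTM layer (see the Software section).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the gate equations above.

    p holds the weight matrices W_*, U_* and biases b_* for the
    input (i), forget (f), output (o) and update (u) transforms.
    """
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])
    u = np.tanh(p["W_u"] @ x_t + p["U_u"] @ h_prev + p["b_u"])
    c = i * u + f * c_prev          # new internal memory
    h = o * np.tanh(c)              # cell output
    return h, c

# Toy shapes: scalar input, hidden size 3.
rng = np.random.default_rng(1)
p = {}
for g in ("i", "f", "o", "u"):
    p[f"W_{g}"] = rng.normal(size=(3, 1))
    p[f"U_{g}"] = rng.normal(size=(3, 3))
    p[f"b_{g}"] = np.zeros(3)
h, c = lstm_step(np.array([1.2]), np.zeros(3), np.zeros(3), p)
```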
We model the blood glucose levels using a recurrent neural network (see Fig. 1), working on the sequence of input data provided by the CGM sensor system. The network consists of Long Short-Term Memory (LSTM) cells [Hochreiter and Schmidhuber, 1997]. The whole model takes as input a sequence of blood glucose measurements from the CGM system and outputs one prediction regarding the blood glucose level after time T (we present experimental evaluation for T ∈ {30, 60} minutes). An RNN is designed to take a vector of inputs at each timestep, but in the case of feeding the network with blood glucose measurements only, the input vectors are one-dimensional (effectively scalar valued).

Figure 1: High-level illustration of the LSTM network used in this work. Each cell updates the internal memory vector c_i with information from the current input, and outputs a vector h_i. c_i and h_i are passed on to the next cell, and finally h_t is used as input to a fully connected output layer which applies a linear transformation and outputs the predicted µ, σ².

The output vector from the final LSTM cell (see h_t in Fig. 1) in the sequence is fed through a fully connected output layer having two outputs with a linear activation function,

[µ, σ²] = W_l h_t + b_l.

The output is modeled as a univariate Gaussian distribution [Graves, 2013], using one value for the mean, µ, and one value for the variance, σ². This gives us an estimate of the confidence in the model's predictions.

The negative log-likelihood (NLL) loss function is based on the Gaussian probability density function,

L = −(1/k) Σ_{i=0}^{k} log N(y_i | µ_i, σ_i²),

where y_i is the target value from the data, and µ_i, σ_i are the network's output given the input sequence x_i. This way of modeling the prediction facilitates basing decisions on the predictions.
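A minimal sketch of how such a model and loss could be written with the Keras API of TensorFlow (the framework named in the Software section) is shown below. The layer sizes, the softplus used to keep the predicted variance positive, and the function names are our own illustrative choices, not necessarily those of the released code.

```python
import math
import tensorflow as tf

def build_model(history_len=6, lstm_states=128):
    """LSTM over the glucose history, followed by a fully connected
    layer producing the mean and a positive variance of a Gaussian."""
    inputs = tf.keras.Input(shape=(history_len, 1))     # one scalar glucose value per time-step
    h = tf.keras.layers.LSTM(lstm_states)(inputs)        # h_t of the final cell
    mu = tf.keras.layers.Dense(1)(h)                      # predicted mean (linear)
    # The paper states a linear output layer for [mu, sigma^2]; mapping the
    # variance through softplus to keep it positive is our own assumption.
    var = tf.keras.layers.Dense(1, activation="softplus")(h)
    return tf.keras.Model(inputs, tf.keras.layers.Concatenate()([mu, var]))

def gaussian_nll(y_true, y_pred):
    """Mean negative log-likelihood of y_true under N(mu, sigma^2)."""
    mu, var = y_pred[:, 0:1], y_pred[:, 1:2] + 1e-6       # small floor for numerical stability
    return tf.reduce_mean(0.5 * tf.math.log(2.0 * math.pi * var)
                          + 0.5 * tf.square(y_true - mu) / var)

model = build_model()
model.compile(optimizer="adam", loss=gaussian_nll)        # optimizer choice is illustrative
```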
Physiological loss function: We also trained the model with a glucose-specific loss function [Favero et al., 2012], which is a metric that combines the mean squared error with a penalty term for predictions that would lead to clinically dangerous treatments.
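For intuition only, a hypothetical loss in the same spirit is sketched below: a mean squared error that is up-weighted when the error points in a clinically risky direction. This is not the metric of [Favero et al., 2012], whose exact functional form we do not reproduce here, and it assumes a model that outputs a point prediction (or is applied to the predicted mean).

```python
import tensorflow as tf

def clinically_weighted_mse(y_true, y_pred, low=0.7, high=1.8, penalty=4.0):
    """Illustrative only: up-weight squared errors whose direction is
    clinically risky (predicting too high when the true value is low,
    or too low when it is high). Thresholds correspond to 70/180 mg/dl
    on the 0.01-scaled values; the penalty weight is arbitrary."""
    err = y_pred - y_true
    risky = tf.logical_or(
        tf.logical_and(y_true < low, err > 0.0),     # misses an impending low
        tf.logical_and(y_true > high, err < 0.0),    # misses an impending high
    )
    weight = tf.where(risky, penalty, 1.0)
    return tf.reduce_mean(weight * tf.square(err))
```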
2.1 Experimental setup

The only preprocessing done on the glucose values is a scaling by 0.01, as in [Mirshekarian et al., 2017], to get the glucose values into a range fit for training.

Hyperparameter selection was performed by selecting patients 559 and 591 in the Ohio T1DM Dataset for Blood Glucose Level Prediction [Marling and Bunescu, 2018], training on the first 60% of the training data for each patient, using the next 20% of the data for early stopping, and selecting the hyperparameters by the performance on the last 20% of the data. We then proceeded to train five models, with different random initializations, on a set of different configurations using 30, 120 and 240 minutes of history in combination with an LSTM state size of 8, 32, 96 and 128. Each model was allowed a maximum of 200 epochs and early stopping with a patience of 8. The configuration which generalized best for the two patients used 30 minutes of glucose level history and 128 LSTM states. This can be seen in Fig. 2; note the blue line. Using 30 minutes of history in combination with few LSTM states results in a high RMSE score for both patients, but 30 minutes of history in combination with 128 LSTM states works well for both patients. The problem of selecting the proper model and the amount of glucose level history that the model should use to make the future prediction warrants further research and should be addressed in future work.

Figure 2: We display the RMSE score on the validation data to select the number of LSTM states and the number of previous time-steps (history) of the blood glucose signal that should be used to predict the future value. (Two panels, for patients 559 and 591: validation RMSE against the number of LSTM states, with one curve each for 6, 24 and 48 past time-steps.)

Final models: The final models were trained using 30 minutes of glucose level history for predictions 30 and 60 minutes into the future, respectively. The setup for the final training was to train on the first 80% of the glucose level training data for each patient, and validate on the last 20%. The final models were given a low learning rate of 10^−5, a maximum number of 10,000 epochs, and an early stopping patience of 256 to allow them more time to converge. These final models were then the only models run on the supplied test data. Note that there are values in the test data for which no predictions have been made.
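A sketch of this final training setup follows, reusing the hypothetical build_model and gaussian_nll from the earlier sketch. The data here is a synthetic stand-in; only the scaling by 0.01, the 80/20 split, the learning rate of 10^−5, the 10,000 epoch cap and the early-stopping patience of 256 are taken from the text.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data; in practice the windows would be built from the
# OhioT1DM glucose series: (n, 6, 1) histories and (n, 1) targets.
rng = np.random.default_rng(0)
x = rng.uniform(50, 300, size=(1000, 6, 1)).astype("float32") * 0.01   # scale glucose by 0.01
y = rng.uniform(50, 300, size=(1000, 1)).astype("float32") * 0.01

n_train = int(0.8 * len(x))                       # first 80% train, last 20% validation
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=256, restore_best_weights=True)

model = build_model(history_len=6, lstm_states=128)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss=gaussian_nll)
model.fit(x[:n_train], y[:n_train],
          validation_data=(x[n_train:], y[n_train:]),
          epochs=10_000, callbacks=[early_stop], verbose=0)
```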
Missing data: The number of missing predictions depends on the number of gaps in the data, i.e., the number of pairwise consecutive measurements in the glucose level data where the time-step is not exactly five minutes. We do not interpolate to fill the missing values since it is unclear how much bias this would introduce, and instead only use data for which it is possible to create the (x, y) pairs of glucose history and regression targets at the given horizon. The greatest number of gaps in the test data is 11 for patient 559. Using 30 minutes of history (6 time-steps) and predicting 30 minutes into the future (6 time-steps) results in 12 ∗ 11 = 132 values which have no predictions, since we need at least 12 consecutive measurements to create an (x, y) pair. The test portions of the dataset contain 2514, 2570, 2745, 2590, 2791 and 2760 test points, which gives us an upper bound of roughly 5% of missing predictions for each patient. See the discussion of missing data for further explanation.
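A sketch of the windowing just described: (x, y) pairs are built only from runs of measurements spaced exactly five minutes apart, so gaps produce missing predictions rather than interpolated values. The function name and the timestamps-and-values input format are assumed for illustration.

```python
import numpy as np

def make_pairs(times_min, values, history=6, horizon=6, step=5):
    """Create (x, y) pairs from a CGM series.

    times_min: measurement times in minutes; values: glucose levels.
    A pair is created only when history + horizon consecutive samples
    are each exactly `step` minutes apart (12 samples for 30/30 min).
    """
    xs, ys = [], []
    n = history + horizon
    for i in range(len(values) - n + 1):
        window_t = times_min[i:i + n]
        if np.all(np.diff(window_t) == step):      # no gap inside the window
            xs.append(values[i:i + history])
            ys.append(values[i + n - 1])           # target at the horizon
    return np.array(xs)[..., None], np.array(ys)[:, None]

# Tiny example: a 5-minute grid with one missing sample creates a gap.
t = np.array([0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 65, 70])
v = np.linspace(100, 165, len(t))
x, y = make_pairs(t, v)                            # yields a single valid pair
```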
Computational requirements: In our experimental setup, training of the model could be performed on a commodity laptop. The model is small enough to fit in, and be used on, mobile devices (e.g. mobile phones, blood glucose monitoring devices, etc.). Training could initially be performed offline, and incremental training would then be light enough to be done either on the devices or offline.

3 Results

The results presented in Table 1 are the root mean squared error (RMSE) for the model when trained with the mean squared error (MSE) loss function and the negative log-likelihood (NLL) loss function. The results indicate that the model performs comparably when trained with NLL and MSE, but with the added benefit of estimating the variance of the prediction.

Table 1: Results shown individually per patient and as averages for predicting glucose levels at 30 and 60 minute horizons. The table shows the root mean squared error (RMSE) of the predictions when the LSTM is trained with the negative log-likelihood (NLL) loss function and the mean squared error (MSE) loss function, respectively. t0 refers to the naive baseline of predicting the last value.

            RMSE, 30 min horizon      RMSE, 60 min horizon
Patient ID  NLL    MSE    t0          NLL    MSE    t0
559         19.5   19.5   23.4        34.8   34.4   39.7
570         16.4   16.5   19.0        28.8   28.6   31.9
588         19.3   19.2   21.8        32.5   33.1   35.8
563         19.0   19.0   20.8        30.8   29.9   34.0
575         24.8   24.2   25.6        38.4   37.3   39.7
591         25.4   22.0   24.4        36.0   36.0   38.6
µ           20.7   20.1   22.5        33.6   33.2   36.6
σ           ±3.2   ±2.5   ±2.2        ±3.2   ±3.1   ±3.0
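For reference, the t0 baseline in Table 1 (predicting that the glucose level stays at its last observed value) can be computed in a couple of lines; the function and array names are illustrative, and the example reuses the (x, y) shapes produced by the make_pairs sketch above.

```python
import numpy as np

def naive_t0_rmse(history_windows, targets):
    """RMSE of predicting the last observed value in each window (the t0 baseline)."""
    last_values = history_windows[:, -1, 0]        # last glucose value in each window
    return float(np.sqrt(np.mean((last_values - targets[:, 0]) ** 2)))

# Example with pairs produced by make_pairs above:
# rmse = naive_t0_rmse(x, y)
```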
The glucose level of patient 591 is harder to predict than the glucose level of patient 570, which can be seen in Table 1, where the RMSE for patient 570 is 16.3 and the RMSE for patient 591 is 24.6. Fig. 3 indicates that the model is able to learn this by assigning a higher variance to the predictions for patient 591 than for patient 570. The standard deviation is illustrated by the pink shaded region in the figure. This is further illustrated in the Clarke error grid plots in Fig. 4, where we can see that for patient 570 most of the predictions are in region A, which is considered a clinically safe region, but for patient 591 more predictions are in the B region, which is still considered non-critical, but also in the more critical D region. That is, the variance of the error in the predictions is higher for patient 591 than for patient 570. In particular, the model has a hard time predicting hypoglycemic events.

Figure 3: We display the prediction (purple) and standard deviation (shaded pink) compared to the ground truth (green) for patient 570 (top) and 591 (bottom). Note the much larger uncertainty for patient 591. (Both panels plot glucose level against time.)

Figure 4: We show the Clarke error grid plots for patient 570 (top) and patient 591 (bottom), with prediction concentration (mg/dl) against reference concentration (mg/dl) and regions A–E marked. Note that the variance of the error in the predictions is higher for patient 591 than for patient 570.
4 Discussion

As the competition will provide the benchmarking, we focus on particular insights we have gained during the development of the method.

Minimalistic ML: Compared to results in the literature for other datasets, our system based on recurrent neural networks can predict blood glucose levels in type-1 diabetes for horizons of up to 60 minutes into the future using only blood glucose levels as inputs. Generally, the minor improvement over a naive baseline algorithm demonstrates that the prediction problem is a rather difficult one, partly due to large intra- and inter-patient variation. Nevertheless, our results suggest that a substantially reduced human effort (avoiding labor-intensive prior work by experts hand-crafting features based on extensive domain knowledge) in designing and training machine learning methods for precision health management can be feasible.

Quantifying uncertainty: Our model also outputs an estimate of the variance of the prediction, thus measuring uncertainty in the prediction. This is a useful aspect for a system which will be used by continuous glucose monitoring users for making decisions about administration of insulin and/or caloric intake. We expect that large-scale collection of data from many users will further improve results. The results in Fig. 3 show the two ends of the spectrum in this uncertainty quantification.
One principal problem is that disambiguating between intra-patient variation and sensor errors is unlikely to be feasible. An interesting research question concerns methods which can detect sensor degradation over time or identify defects by comparing sensors for the same patient in long-term physiological monitoring; it is unclear if the often smoothed data supplied by sensors is sufficient for that.
Physiological loss function: To our surprise we did not see improvements when using a physiologically motivated loss function [Favero et al., 2012] (results not shown), essentially a smoothed version of the Clarke error grid [Clarke et al., 1987]. Of course our findings are not proof that such loss functions cannot improve results. Possibly a larger-scale investigation, exploring in particular a larger area of the parameter space and different training regimes, might provide further insights. Penalizing errors for hypo- or hyperglycemic states should lead to better real-world performance, as we observed comparatively larger deviations at minima and maxima. One explanation for that is the relative class imbalance, as extrema are rare. This could be countered with data augmentation techniques.

Model selection: The large inter-patient variation also suggests that selecting one model for all patients might yield sub-optimal results, see Fig. 2. Consequently, precision health apps should not only adapt parameters to individuals, but also entertain increasing or decreasing model complexity. While this is clearly undesirable from a regulatory point of view (e.g., how to show efficacy in a trial), the differences we observed seemed to suggest that adaptation of complexity improves quality of care.

Missing data: There are gaps in the training data with missing values. Most of the gaps are less than 10 hours, but some of the gaps are more than 24 hours. The missing data points account for roughly 23 out of 263 days, or 9% of the data. The gaps could be filled using interpolation, but it is not immediately clear how this would affect either the training of the models or the evaluation of the models, since this would introduce artificial values. Filling a gap of 24 hours using interpolation would not result in realistic data. Instead we have chosen not to fill the gaps with artificial values and to limit our models to be trained and evaluated only on real data. This has its own limitations, since we can not predict the initial values after a gap, but the advantage is that model training and evaluation are not biased by the introduction of artificial values.

Conclusion: The field is certainly in desperate need of larger data sets and standards for the evaluation. Crowd-sourcing from patient associations would be one possibility, but differences in sensor types and sensor revisions, life styles, and genetic makeup are all obvious confounding factors. Understanding sensor errors by measuring glucose levels in vivo, for example in diabetes animal models, with several sensors simultaneously would be very insightful, and would likely improve prediction quality. Another question concerns preprocessing in the sensors, which might be another confounding factor in the prediction. While protection of proprietary intellectual property is necessary, there have been examples, e.g. DNA microarray technology, where only a completely open analysis process, from the initial steps usually performed with the vendor's software tools to the final result, helped to realize the full potential of the technology.
Software

The software, including all scripts to reproduce the computational experiments, is released under an open-source license and available from https://github.com/johnmartinsson/blood-glucose-prediction. We have used Google's TensorFlow framework, in particular the Keras API of TensorFlow, which allows for rapid prototyping of deep learning models, to implement our model and loss functions.

References

[Bengio et al., 1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[Bremer and Gough, 1999] Troy Bremer and David A. Gough. Is blood glucose predictable from previous values? A solicitation for data. Diabetes, 48(3):445–451, 1999.

[Clarke et al., 1987] William L. Clarke, Daniel Cox, Linda A. Gonder-Frederick, William Carter, and Stephen L. Pohl. Evaluating clinical accuracy of systems for self-monitoring of blood glucose. Diabetes Care, 10(5):622–628, 1987.

[Cryer et al., 2003] Philip E. Cryer, Stephen N. Davis, and Harry Shamoon. Hypoglycemia in diabetes. Diabetes Care, 26(6):1902–1912, 2003.

[Favero et al., 2012] Simone Del Favero, Andrea Facchinetti, and Claudio Cobelli. A glucose-specific metric to assess predictors and identify models. IEEE Transactions on Biomedical Engineering, 59(5):1281–1290, 2012.

[Graves, 2013] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Hochreiter, 1998] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.

[Marling and Bunescu, 2018] Cindy Marling and Razvan Bunescu. The OhioT1DM dataset for blood glucose level prediction. Glucose Prediction News, 2018.

[Mirshekarian et al., 2017] Sadegh Mirshekarian, Razvan Bunescu, Cindy Marling, and Frank Schwartz. Using LSTMs to learn physiological models of blood glucose behavior. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 2887–2891, 2017.