=Paper=
{{Paper
|id=Vol-2820/SP4HC_paper5
|storemode=property
|title=Model-Based Reinforcement Learning for Type 1 Diabetes Blood Glucose Control

|pdfUrl=https://ceur-ws.org/Vol-2820/AAI4H-14.pdf
|volume=Vol-2820
|authors=Taku Yamagata,Aisling O’Kane,Amid Ayobi,Dmitri Katz,Katarzyna Stawarz,Paul Marshall,Peter Flach,Raúl Santos-Rodríguez
|dblpUrl=https://dblp.org/rec/conf/ecai/YamagataOAKSMFS20
}}
==Model-Based Reinforcement Learning for Type 1 Diabetes Blood Glucose Control==
<pdf width="1500px">https://ceur-ws.org/Vol-2820/AAI4H-14.pdf</pdf>
<pre>
           Model-Based Reinforcement Learning for Type 1
                  Diabetes Blood Glucose Control
    Taku Yamagata1 and Aisling O’Kane2 and Amid Ayobi3 and Dmitri Katz4 and Katarzyna Stawarz5
                    and Paul Marshall6 and Peter Flach7 and Raúl Santos-Rodríguez8


Abstract. In this paper we investigate the use of model-based rein-            [18, 24, 25, 26, 29], which creates challenges to developing diabetes
forcement learning to assist people with Type 1 Diabetes with insulin          self-management technologies.
dose decisions. The proposed architecture consists of multiple Echo               In this paper we consider the benefits of using model-based rein-
State Networks to predict blood glucose levels combined with Model             forcement learning (MBRL) to assist decisions about bolus insulin
Predictive Controller for planning. Echo State Network is a version            injections. The goal of reinforcement learning (RL) is to learn se-
of recurrent neural networks which allows us to learn long term de-            quences of actions in an unknown environment [30]. The learner
pendencies in the input of time series data in an online manner. Ad-           (Agent) interacts with the environment, observes its consequences,
ditionally, we address the quantification of uncertainty for a more            and receives a reward (or a cost) signal, which is a number
robust control. Here, we used ensembles of Echo State Networks to              assessing the current situation. The agent decides a sequence of
capture model (epistemic) uncertainty. We evaluated the approach               actions to maximize the reward (or minimize the cost) as shown in
with the FDA-approved UVa/Padova Type 1 Diabetes simulator and                 Fig.1. RL is well-suited to this task because it can learn the model
compared the results against baseline algorithms such as Basal-Bolus           in an online manner with minimal assumptions about the underlying
controller and Deep Q-learning. The results suggest that the model-            process of the blood glucose behaviour and hence can adapt to dif-
based reinforcement learning algorithm can perform equally or better           ferent individuals or changes over time. MBRL is particularly well
than the baseline algorithms for the majority of virtual Type 1 Dia-           suited to this objective because it is more sample-efficient than alter-
betes person profiles tested.                                                  native RL approaches (model-free reinforcement learning (MFRL))
                                                                               and also allows us to generate predictions for consequences of coun-
                                                                               terfactual actions that can be used as explanations of the suggestion.
1    Introduction
                                                                               In our MBRL setting, we also can estimate the confidence level of the
Type 1 Diabetes is a chronic condition that is characterized by a              predictions by using the prediction uncertainty. It is very important to
lack of insulin secretion, resulting in uncontrolled blood glucose             show the explanation for the suggestion together with its confidence
level increase [1, 9]. High blood glucose levels for extended peri-            level so that the person that receives the suggestion can make a deci-
ods of time can result in permanent damage to the eyes, nerves, kid-           sion whether they would follow the recommended course of action.
neys and blood vessels, while low blood glucose levels can lead to
death [19, 20, 23]. To manage blood glucose level, people on multi-
dose injection (MDI) therapy usually take two types of insulin injec-
tions: basal and bolus. The basal is long-acting insulin, which pro-
vides a constant supply of insulin over 24-48 hours, helping main-
tain resting blood glucose levels. The bolus is fast-acting insulin
which helps to suppress the peak of the blood glucose levels caused
by meals or to counteract hyperglycemia [23]. People with diabetes
must make constant decisions of the timing and amount of these in-
sulin injections, which is often challenging as insulin requirements
for meals can change depending upon many factors such as exercise,
                                                                                       Figure 1. Reinforcement learning framework overview.
sleep, or stress. The idiosyncratic nature of the condition means that
triggers, symptoms and even treatments are often quite individual
1 University of Bristol, UK, email: taku.yamagata@bristol.ac.uk
2 University of Bristol, UK, email: a.okane@bristol.ac.uk
3 University of Bristol, UK, email: amid.ayobi@bristol.ac.uk
                                                                                  As a first step towards realising such a recommender system, we
4 The Open University, UK, email: dmitri.katz@open.ac.uk                       investigated how well MBRL can learn the insulin injection deci-
5 Cardiff University, UK, email: StawarzK@cardiff.ac.uk                        sion and compared it with both a typical MFRL algorithm (deep
6 University of Bristol, UK, email: p.marshall@bristol.ac.uk                   Q-Learning (DQN)) and an algorithm that mimics human decision-
7 University of Bristol, UK, email: Peter.Flach@bristol.ac.uk
                                                                               making (Basal-Bolus controller (BBController)). We used an FDA-
8 University of Bristol, UK, email: enrsr@bristol.ac.uk
                                                                               approved Type 1 Diabetes computer simulator and let the algorithms
 Copyright © 2020 for this paper by its authors. Use permitted under Cre-
  ative Commons License Attribution 4.0 International (CC BY 4.0). This
                                                                               decide the insulin injections, then evaluated the resulting blood glucose level
  volume is published and copyrighted by its editors. Advances in Artificial   behaviours.
  Intelligence for Healthcare, September 4, 2020, Virtual Workshop.               Our MBRL approach builds upon previous work on Echo State
Networks (ESNs) [13, 14], the ensembles of models for MBRL [5]             3.1    Cost function
and model predictive controller (MPC) for artificial pancreas [3, 4].
However we believe this is the first attempt to combine these algo-        For our task, it is natural to use as cost function a measure of risk
rithms for the Type 1 Diabetes blood glucose level control task, and       associated with the given blood glucose level. However it is not
evaluate its performance against non-MBRL algorithms.                      straightforward to define such a measure, as it presents different
   This paper is organized as follows. Section 2 introduces related        scales of risks between higher than normal blood glucose levels (hy-
work regarding the blood glucose control task. Section 3 describes         perglycemia) and lower than normal blood glucose levels (hypo-
our MBRL method. Section 4 presents our evaluation method, bench-          glycemia). Kovatchev et al. proposed the following expression to
mark algorithms and the evaluation results. Finally, Section 5 con-        symmetrize the risks of hyper and hypoglycemia [16]. This blood
cludes with a summary and possible future work.                            glucose risk function fr is defined as in Eq. 1. The blood glucose
                                                                           level transition from 180 to 250mg/dl would appear threefold larger
                                                                           than a transition from 70 to 50mg/dl, whereas these are similar in
2   Related Work                                                           terms of the risk function variations.
Several attempts have been made for a closed-loop artificial pancreas,             f_r(BGL) = 10 (1.509 ((log BGL)^{1.084} − 5.381))^2          (1)
especially in the control system society using MPC [3], proportional-
                                                                           where BGL is the blood glucose level in mg/dl. Fig. 2 shows the
integral-derivative control [28] and fuzzy logic [2].
                                                                           mapping between blood glucose level (x-axis) to the risk function
   However, there are relatively few studies on the blood glucose lev-
                                                                           (y-axis). We used the risk function value as the cost function, hence
els control task using RL approaches. Most of the early works em-
ploy compartmental blood glucose and insulin models to infer some
of insulin/glucose related internal states of human body, and then
learn its insulin injection policy with relatively simple MFRL algo-
rithms such as Q-Learning [21, 22] or Actor-Critic [7, 8]. Fox et
al. employed more recent RL techniques [12], such as deep neural
networks for the Q-Learning algorithm – arguably the most com-
mon MFRL algorithm. They showed that although the agent was not
given any prior knowledge of the blood glucose/insulin relations, it
learns its insulin injection policy and achieves performance compa-
rable with existing algorithms.
   In the field of model-based system control several approaches ex-
ist – we refer the reader to [3] and the references therein. The clos-
est to our work is [4], where the authors use a linear compartmental
model for predicting the mean and variance of the future blood glu-
cose levels. It exploits MPC for planning by taking into account the         Figure 2. Risk function proposed by Kovatchev et al. [16]. The figure
variance of the blood glucose level prediction. The main differences         shows the relationship between blood glucose level [mg/dl] and its risk
from our work are: (1) they employ a linear compartmental model                                         function value.
which has a small number of parameters and is hence easier to learn,
whereas we use more generic recurrent neural networks, which have          our RL agent searches for a policy minimising the total risk values over
greater flexibility to adapt to any personal blood glucose level be-       an episode.
haviour; (2) their model parameters are learnt off-line, whereas ours
are adjusted online; and (3) the handling of uncertainty – we measure
the model’s uncertainty while they measure the uncertainty involved        3.2    Echo State Networks
in meal events.                                                            ESNs were proposed as an alternative structure of standard recurrent
                                                                           neural networks in machine learning [14]. They are also called liquid
3   Methods                                                                state machine in computational neuroscience [13]. ESNs take an in-
                                                                           put sequence u = (u(1), u(2), ..., u(T )) by recursively processing
In order to apply RL algorithms to this problem, we formulate the          each symbol while maintaining its internal hidden state x. At each
task as a Markov Decision Process (MDP), defined by the 4-tuple            time step t, the ESN takes input u(t) ∈ R^K and updates its hidden
(S, A, p, c) where S is a set of states, A is a set of actions, p is the   state x(t) ∈ R^N by:
state transition probabilities and c is a cost function. Essentially the
                                                                                       x̃(t)   = f (Win · u(t) + W · x(t − 1))                     (2)
blood glucose control task is a Partially Observable MDP, however
we see it as an MDP by defining state S as all history of insulin doses                x(t)     = (1 − α) · x(t − 1) + α · x̃(t),                  (3)
and carbohydrate intakes.                                                  where f is the internal unit activation function, which is tanh in our
   More precisely, the overall pipeline makes use of ESNs to store         model, Win ∈ R^{N×K} is the input weight matrix, W ∈ R^{N×N} is
the history in its hidden states, shown in Section 3.2. The corre-         the internal connections weight matrix and α ∈ (0, 1] is the leak-
sponding actions A are the dosages of bolus insulin. We exploit the        age rate, which controls how quickly the hidden states change and
risk function introduced in [16] as our cost function c, described         hence the smoothness of the output.
in Section 3.1. While we use the model-based reinforcement learn-             The output at time step t, y(t) ∈ R^L, is obtained from the hidden
ing (MBRL) algorithm with ESNs for the prediction of blood glucose         states and the inputs by:
levels, MPC generates the insulin dose suggestions from the blood
glucose level predictions (Section 3.4) and their uncertainty estima-            y(t) = f_out(W_out · [x(t)^T, u(t)^T]^T),                 (4)
tions (Section 3.3).
where f out is the output unit activation function (which is the iden-    A positive (negative) risk margin means our metric E[c(BGL)] dis-
tity function in our model as we are dealing with a regression task)      courages (encourages) taking risks. If we use a convex cost function
and Wout ∈ R^{L×(N+K)} is the output weights matrix.                      as described in Section 3.1, RM is positive according to Jensen’s
   The matrices for updating the hidden states, Win and W, are ran-       inequality, hence it discourages risks.
domly initialized and fixed (not updated during the learning process);
only the output weights matrix Wout is learned to obtain the target       3.4    Model Predictive Controller
output sequences. As it only learns the output weights, it doesn’t re-
quire back propagation through the network nor time, hence it learns      Model predictive controller (MPC) is a planning method to facilitate
much faster than the normal recurrent neural networks. The downside       control of systems with a long time delay and non-linear characteris-
of using ESN is that it requires a much higher number of hidden states    tics. The MPC uses a prediction model to estimate the consequences
to achieve good performance, hence it requires more computational         of a sequence of actions and repeats the process for many action se-
power for inference.                                                      quences. Then it picks the sequence of actions that gives the best
   To make ESNs work properly, the fixed weights must satisfy             consequence and applies the first action of the sequence. In the next
the so-called echo state property: the internal states x(t) should be     time step this process is repeated. This effectively means it re-plans
uniquely defined only by the past inputs u(k)|k=...t [14]. The ac-        the sequence of actions based on the latest state information from the
tual method to initialise the weights can be found in [17], which also    environment, which makes the algorithm robust against any noise or
gives useful guidance for using ESNs.                                     prediction errors.
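
As an illustration of Eqs. (2)-(4), the following is a minimal Python sketch
of one leaky ESN update and the linear readout. It is not the authors'
implementation: the helper names (make_matrix, matvec, esn_step, esn_output),
the sizes N, K and L, the leakage rate 0.3 and the uniform weight ranges are
all assumptions made for the example. W is drawn with small entries so that
its absolute row sums stay below 1, which is one simple sufficient condition
for a contractive update; the initialisation recommended in [17] is not
reproduced here, and W_out is random rather than fitted.

```python
import math
import random


def make_matrix(rows, cols, scale, rng):
    """Random matrix with entries drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def esn_step(x_prev, u, w_in, w, alpha):
    """One leaky update of the hidden state: Eq. (2) followed by Eq. (3)."""
    pre = [a + b for a, b in zip(matvec(w_in, u), matvec(w, x_prev))]
    x_tilde = [math.tanh(p) for p in pre]  # Eq. (2), f = tanh
    return [(1 - alpha) * xp + alpha * xt for xp, xt in zip(x_prev, x_tilde)]  # Eq. (3)


def esn_output(x, u, w_out):
    """Linear readout on the concatenation [x; u]: Eq. (4) with identity f_out."""
    return matvec(w_out, x + u)


rng = random.Random(0)
N, K, L = 20, 2, 1  # hidden units, inputs (insulin, carbohydrate), outputs (BGL)
w_in = make_matrix(N, K, 0.5, rng)
w = make_matrix(N, N, 0.04, rng)         # absolute row sums are at most 0.8 (assumed scaling)
w_out = make_matrix(L, N + K, 0.1, rng)  # placeholder; normally fitted by linear regression

x = [0.0] * N
for u in ([0.0, 0.0], [1.0, 45.0], [0.0, 0.0]):  # made-up input sequence
    x = esn_step(x, u, w_in, w, alpha=0.3)
y = esn_output(x, [0.0, 0.0], w_out)
```

Only w_out would be trained in practice, which is why the update itself needs
no gradient computation.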
                                                                             There are several algorithms to generate the sequence of actions to
                                                                          test – such as random shooting [27] and cross entropy method [10].
   ESNs for the blood glucose level prediction task In our work,          In our work, we use a fixed table for the sequence of actions to test.
the ESN takes a sequence of bolus insulin injection and carbohydrate      The table has six action sequences, each of which takes a different
intakes as inputs, and predicts the blood glucose level.                  amount of bolus injection as its first action. The amount of bolus in-
   To learn the ESN output weights we use the Mean Squared Error          jection at the first action is {0, 5, 10, 20, 40, 80} times of the person’s
between predicted and observed blood glucose levels as loss func-         basal infusion rate. Following the approach of [12], the basal infusion
tion.                                                                     rate is given for each virtual person’s model, and we use it to scale
                                                                          the bolus injection. While our model generates suggestions for bolus
        L_d(θ) = (1/T) \sum_{t=1}^{T} (µ_θ(t) − BGL(t))^2        (5)
                                                                          injections, for the basal injections, it assumes the person is taking the
                                                                          given basal infusion rate. The action sequence length (time horizon)
Here, µθ (t) is the predicted blood glucose level by ESN at time          is set to 48 time steps, which is 4 hours long as each time step repre-
step t, where θ is the optimization parameter (here it is Wout ) and      sents a five-minute period. Each action sequence has a bolus injection
BGL(t) is the observed blood glucose level. As it can be seen as a lin-   as the first action of the sequence. We believe this is sensible because
ear regression problem, the output weights are derived by solving the     bolus injections are normally taken just after or before a meal and
normal equation [17].                                                     there is no meal announcement in our system at the moment (the algo-
   To capture model (epistemic) uncertainty, we use multiple in-          rithm does not know the meal event until it happens). Therefore, the
stances of ESNs, each of which has different input and internal           best time to take a bolus injection would be immediately after detect-
connection weights. ESNs are well suited for the ensemble approach        ing the meal event, which is the first action in the sequence. A proper
as they have fixed random internal weights which project the input se-    meal announcement mechanism is left for future work.
quence into different hidden states. So naturally they output different
values where there is no training data, capturing higher epistemic un-    4     Evaluation
certainty. In our evaluation, we employ five instances of ESNs, as
suggested by [5].                                                         We empirically evaluated how well the model-based reinforcement
                                                                          learning (MBRL) can learn insulin injection decisions and compared
                                                                          it with a typical model-free reinforcement learning (MFRL) algo-
3.3    Uncertainty quantification                                         rithm and also with a non-RL algorithm designed to mimic human
                                                                          decision-making. In this paper, we did not compare the blood glucose
We employ multiple ESNs to capture the uncertainty in predicted           level prediction accuracy with other prediction models. Instead, we
blood glucose level. They produce multiple predictions of the blood       focused on evaluating the performance of the agents. The overview of
glucose levels from the ESN models for each action sequence. To           the evaluation system is shown in Fig. 3. We used an FDA-approved
quantify the cost (risk) of uncertainty, we take the mean of the          Type 1 Diabetes simulator, which takes meal and insulin injection in-
cost of the predicted blood glucose levels for each action se-            formation, then outputs a blood glucose level (BGL) as a continuous
quence,
    \frac{1}{MT} \sum_{t=n}^{n+T-1} \sum_{m=1}^{M} c(BGL_t^m),
where c(.) is a cost function, BGL_t^m is the blood glucose level         glucose monitor (CGM) reading at each time step. The algorithms
prediction from ESN model m at time step t, and M and T are the           (agents) receive the meal, insulin and blood glucose level informa-
number of ESN models and the number of time steps in the action           tion and decide the amount of insulin to take in the next time step.
sequence. We then select the action sequence which minimises              We simulated the algorithms together with the Type 1 Diabetes sim-
this mean cost.                                                           ulator, and evaluated how well the blood glucose levels are managed.
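
The planning step of Section 3.4 combined with the ensemble mean cost can be
sketched in Python as follows. This is an illustration under stated
assumptions, not the authors' code: toy_model is a hypothetical stand-in for a
trained ESN, the basal value 0.02 is arbitrary, and filling the remaining 47
steps of each candidate sequence with zero bolus is our reading of the text.
The multipliers, the 48-step horizon and the risk function of Eq. (1) are
taken from the paper.

```python
import math

MULTIPLIERS = (0, 5, 10, 20, 40, 80)  # first-step bolus as a multiple of the basal rate
HORIZON = 48                          # 48 steps of five minutes = 4 hours


def risk(bgl):
    """Blood glucose risk function of Eq. (1); bgl in mg/dl, natural logarithm."""
    return 10.0 * (1.509 * (math.log(bgl) ** 1.084 - 5.381)) ** 2


def action_table(basal):
    """Six candidate sequences: a bolus at the first step, no bolus afterwards (assumed)."""
    return [[m * basal] + [0.0] * (HORIZON - 1) for m in MULTIPLIERS]


def mean_cost(models, actions):
    """Average of c(BGL_t^m) over every ensemble member m and time step t."""
    total = sum(risk(b) for model in models for b in model(actions))
    return total / (len(models) * len(actions))


def plan(models, basal):
    """Return the first action of the candidate sequence with the lowest mean cost."""
    best = min(action_table(basal), key=lambda seq: mean_cost(models, seq))
    return best[0]


def toy_model(start, drop_per_unit):
    """Hypothetical predictor: a flat BGL trace lowered in proportion to the bolus."""
    def predict(actions):
        level = max(start - drop_per_unit * actions[0], 30.0)
        return [level] * len(actions)
    return predict


# Five ensemble members that disagree about insulin sensitivity (made-up numbers).
models = [toy_model(250.0, d) for d in (60.0, 70.0, 80.0, 90.0, 100.0)]
bolus = plan(models, basal=0.02)
```

In a closed loop, plan would be called again at every time step with the
latest state, and only the returned first action would be applied.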
   We encourage (optimistic or exploratory approach) or discour-
age (pessimistic or safe approach) taking risks by designing the cost
function accordingly. Here we define a risk margin RM as the dif-
                                                                          4.1    UVa/Padova Type 1 Diabetes simulator
ference between the averaged cost function and cost of the averaged
blood glucose level predictions.                                          The UVa/Padova Type 1 Diabetes Simulator [6] was the first com-
                                                                          puter model accepted by the FDA as a substitute for preclinical tri-
                RM = E[c(BGL)] − c(E[BGL]).                        (6)    als of certain insulin treatments, including closed-loop algorithms.
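
To make Eq. (6) concrete, the short Python sketch below evaluates the risk
margin for two hypothetical sets of five ensemble predictions that share the
same mean. The prediction values and the function names are assumptions made
for the example; the cost is the risk function of Eq. (1).

```python
import math


def risk(bgl):
    """Blood glucose risk function of Eq. (1); bgl in mg/dl, natural logarithm."""
    return 10.0 * (1.509 * (math.log(bgl) ** 1.084 - 5.381)) ** 2


def risk_margin(predictions, cost=risk):
    """RM = E[c(BGL)] - c(E[BGL]) over a set of ensemble predictions, Eq. (6)."""
    mean_bgl = sum(predictions) / len(predictions)
    mean_cost = sum(cost(b) for b in predictions) / len(predictions)
    return mean_cost - cost(mean_bgl)


agree = [110.0, 110.0, 110.0, 110.0, 110.0]    # ensemble members agree
disagree = [60.0, 85.0, 110.0, 135.0, 160.0]   # same mean, wide spread

rm_agree = risk_margin(agree)
rm_disagree = risk_margin(disagree)
```

When the members agree, the margin is zero up to rounding. When they disagree
around the same mean, the margin is positive for these values, so a plan
whose outcome the ensemble is unsure about is scored as more costly.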
                                                                         4.3    Simulation Conditions
                                                                         Each episode lasts 24 hours, starting at 6am and finishing at 6am
                                                                         the next day. Three meals and three snack events are simulated with
                                                                         some randomness in terms of amount, timing and also whether they
                                                                         take the meal/snack. The timing follows a truncated normal distribu-
                                                                         tion and the amount is normally distributed. The meal parameters are
                                                                         shown in Table 1. The agent receives information from the environ-

                                                                                      Table 1.   Parameters for meal event generator.
            Figure 3. Evaluation system top level diagram.

                                                                                                          Time [hours]               Carbs. [g]
                                                                          Meal type     Prob.    lower    upper    mean      std.   mean     std.
The model takes carbohydrate intakes and insulin injection as inputs,                            bound    bound
simulates human body insulin/blood glucose behaviours and outputs
                                                                          Breakfast     0.95       5         9         7      1         45   10
the blood glucose level measurements. It has gastro-intestinal tract,     Snack#1        0.3        9        10       9.5    0.5        10    5
glucose kinetics and insulin kinetics sub models. Each of these sub       Lunch         0.95       10        14       12      1         70   10
models is defined with differential equations with parameters to sim-     Snack#2        0.3       14        16       15     0.5        10    5
ulate different individuals. Our simulator is based on an open source     Dinner        0.95       16        20       18      1         80   10
                                                                          Snack#3        0.3       20        23      21.5    0.5        10    5
implementation of the UVa/Padova Type 1 Diabetes simulator [15],
which comes with different profiles for 30 virtual people with type 1
diabetes – ten each for children, adolescents and adults. Our experi-    ment such as the meal (carbohydrate), insulin and blood glucose lev-
ments use nine virtual people, three of each age group.                  els, and decides the insulin dose for the next time step. Each time step
                                                                         is set to five minutes in length. In this evaluation, the person does not
4.2    Benchmark algorithms                                              take food to compensate for low blood glucose levels (the meal event
                                                                         always follows a pre-defined order as described above). While this
We used two benchmark algorithms to compare the proposed ap-             is not realistic, it is a good way to measure how well the algorithm
proach against, one from RL algorithms (GRU-DQN) and the other           works because ultimately we would like to develop an algorithm that
one from non-RL approaches (BBController). These are described           does not require any corrections from the user. The episode is termi-
below.                                                                   nated if the blood glucose level goes below 20 mg/dl or beyond 600
                                                                         mg/dl, as these limits are extreme and lie outside the possible
    GRU-DQN Deep Q-Learning (DQN) is a common MFRL al-                   blood glucose level range considered by [16].
gorithm, which learns the action-value function Q(s, a) – expected
cumulative future rewards starting with state s and action a. It then
                                                                         4.4    Results
uses the learned action value function to decide which action to take
at time step t by a_t = argmax_{a∈A} Q(s_t, a). In our work, the agent   We train MBRL for 200 episodes and GRU-DQN for 1000 episodes,
observes the blood glucose levels from a CGM, carbohydrate intakes       then use the last 30 episodes to measure the percentage of episodes
and insulin injections, and infers the action value function. It is a    completed without termination due to extreme blood glucose levels.
partially observable model so we used gated recurrent units (GRU)        For BBController, we just run 30 episodes to measure, as it has pre-
to infer the hidden states and approximate the action value function.    optimized model parameters and no training is required.
GRU-DQN was successfully applied to this problem before [12], so            The results are given in Table 2. MBRL gives better results than
we followed the same setup, which involves two GRU recurrent             GRU-DQN and results comparable with BBController. MBRL struggles
layers of 128 hidden states followed by a fully connected output         with child#002, #003 and adolescent#002. By looking into these
layer of size 128. However, our states (the input of GRU-DQN)            cases, we found that MBRL fails due to the MPC time horizon not be-
include carbohydrate information, whereas [12] does not. We include      ing long enough. The MPC time horizon is set to 4 hours, hence the
it here to make our comparison fair against the MBRL algorithm,          agent could not foresee a possible hypoglycemia event in the early
which has access to the carbohydrate information.                        morning after the person takes an evening meal. The agent suggests
                                                                         too much insulin, and it causes hypoglycemia in the early morning.
    BBController Basal-Bolus Controller mimics how an individ-           This can be fixed by increasing the MPC time horizon, but requires
ual with Type 1 Diabetes controls their blood glucose levels. The        some additional consideration as it might lead to inappropriate sug-
UVa/Padova simulator comes with the necessary parameters for this        gestions during the day.
algorithm for each of the virtual people with Type 1 Diabetes models,       Table 3 shows the percentage of time spent in a target blood glu-
such as basal insulin rates bas, a correction factor CF and a carbohy-   cose level range (70-180 mg/dl). These are measured in the last 10 of
drate ratio CR. The simulator decides the amount of insulin injection    the completed episodes (i.e., not terminated). Here MBRL gives the
by bas + (c_t > 0) · (c_t/CR + (b_t > 150) · (b_t − b_tgt)/CF), where    best overall results compared to the other agents. Note that no data
c_t is carbohydrate intake at time step t, b_t is the blood glucose mea- is available for adolescent#002, as it fails to get any non-terminated
surement, b_tgt is a target blood glucose level. The last term is only   episode (due to the reason described above).
applied when the blood glucose measurement exceeds 150 mg/dl.               We also evaluated the effect of the uncertainty estimation by com-
We use the implemented model that comes with the Type 1 Diabetes         paring the results from MBRL with/without it. For MBRL without
simulator [15].                                                          uncertainty, we take an average over multiple ESNs predictions to
 Table 2. % of number of completed episodes without termination due to
                  extreme blood glucose level value.

    Person Profile          BBCont.        GRU-DQN               MBRL
    child#001                 30.0               3.3              100
    child#002                  90               23.3              53.3
    child#003                 66.7              43.3              30.0
    adolescent#001            100               100               100
    adolescent#002            66.7              56.7               0.0
    adolescent#003             90                20               100
    adult#001                 100               70.0              96.7
    adult#002                 100               100               100
    adult#003                 96.7              16.7              100
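
The metric in Table 2 can be computed along the following lines. This Python
sketch is an assumption about the bookkeeping, not the authors' evaluation
script: the function names and the toy episodes are hypothetical, while the
20 and 600 mg/dl termination limits come from Section 4.3.

```python
def is_extreme(bgl, low=20.0, high=600.0):
    """True when a reading would terminate the episode (limits from Section 4.3)."""
    return bgl < low or bgl > high


def episode_completed(readings):
    """An episode counts as completed if no reading crosses the termination limits."""
    return not any(is_extreme(b) for b in readings)


def percent_completed(episodes):
    """Percentage of episodes that ran without early termination."""
    done = sum(1 for readings in episodes if episode_completed(readings))
    return 100.0 * done / len(episodes)


# 30 toy evaluation episodes: 29 stay within the limits, one drops below 20 mg/dl.
episodes = [[120.0, 95.0, 70.0]] * 29 + [[120.0, 40.0, 15.0]]
rate = percent_completed(episodes)
```

With 30 evaluation episodes, 29 completed episodes give a value that rounds
to 96.7, which is consistent with entries such as 96.7 and 3.3 in the table.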


Table 3.     % of time spent in the target blood glucose level range (70 - 180
                                     mg/dl).                                         Figure 4. Comparison between MBRL with uncertainty and without
                                                                                   uncertainty models. The upper plot shows the learning curve for simulated
                                                                                 period for each episode, which goes up to 24 hours if the blood glucose level
    Person Profile          BBCont.        GRU-DQN               MBRL
                                                                                  is controlled well. The lower plot shows % of time spent in the target blood
    child#001                 44.0              28.3              59.6                                    glucose range (70-180mg/dl)
    child#002                 42.6              38.2              55.3
    child#003                 40.7              36.0              45.1
    adolescent#001            85.8              81.4             100.0           GRU-DQN and BBController. The results suggest that the MBRL
    adolescent#002            49.0              39.8               n/a           approach works better than the GRU-DQN algorithm and similar or
    adolescent#003            46.7              42.4              66.1           slightly better than the BBController. Also, our results show that tak-
    adult#001                 60.1              50.3             56.8            ing into account the model uncertainty improves its performance in
    adult#002                 73.3              66.9              73.3           the early stages of learning.
    adult#003                 58.7              46.9              68.8
                                                                                    There are several avenues for future work. At the present stage we
                                                                                 only tested our algorithms with the UVa/Padova Type 1 Diabetes sim-
come up with a single blood glucose prediction, and then we calcu-               ulator, which is good for single meal scenarios but not for multiple
late its cost. Whereas MBRL with uncertainty computes the cost of                meals [6]. This is primarily because the model has fixed parameters
the all predictions, then takes average of the costs as described in             for each person and does not simulate meal-by-meal nor day-by-day
Section 3.3.                                                                     parameter drifting. In addition, our current learning method must be
   Figure 4 shows the learning curves for these two MBRL algo-                   extended to adapt to parameter drifts. A possible approach for such
rithms with adult#001. The upper plot shows the episode period,                  an extension would be to introduce meta-learning [11].
which goes up to 24 hours if there is no termination, and the bot-                  Another area for further work relates to meal information. We as-
tom plot shows % of time spent in the target blood glucose range.                sumed all meal events are correctly given by the person when the
From the upper plot, the algorithm with uncertainty achieves “no                 event is happening; however, this may not be very realistic as it is
episode termination” (24 hours episode) much earlier than the one                a considerable burden for a person to put every single meal event
without estimating uncertainty. At an early stage of the learning pro-           into the algorithm. It is also hard to know the exact carbohydrate
cess, the prediction model is not very accurate, so it is much better            count of each meal. Some researchers therefore structure the blood
by taking into account its uncertainty. For the later stages, the pre-           glucose predictor without having a meal input. Another alternative
dictions become more accurate, hence it shows similar performance                would be to have a model to back-predict a meal event from the ob-
in both cases. Table 4 shows asymptotic results of the percentage of             served blood glucose levels. We think it is possible to learn the meal
time spent in the target blood glucose range, indicating that both have          event in conjunction with the blood glucose level prediction model
similar asymptotic performances.                                                 with occasional human inputs.

     Table 4.   % of time spent in the target blood glucose range (70 - 180
                                     mg/dl).                                     ACKNOWLEDGEMENTS
                                                                                 This project is funded by the Innovate UK Digital Catalyst Award –
    Person Profile              MBRL                         MBRL                Digital Health and is in partnership with Quin Technology.
                           (with uncertainty)          (without uncertainty)
    child#001                    59.6                          57.5
    adolescent#001               100.0                         95.9              REFERENCES
    adult#001                     56.8                         56.7              [1] Kurt George Matthew Mayer Alberti and Paul Z Zimmet, ‘Definition,
                                                                                     diagnosis and classification of diabetes mellitus and its complications.
                                                                                     part 1: diagnosis and classification of diabetes mellitus. provisional re-
                                                                                     port of a who consultation’, Diabetic medicine, 15(7), 539–553, (1998).
                                                                                 [2] Eran Atlas, Revital Nimri, Shahar Miller, Eli A. Grunberg, and Moshe
5      Conclusions and Future Work                                                   Phillip, ‘MD-logic artificial pancreas system: A pilot study in adults
                                                                                     with type 1 diabetes’, Diabetes Care, (2010).
We investigated the use of MBRL to assist Type 1 Diabetes decision-              [3] B. Wayne Bequette, ‘Algorithms for a closed-loop artificial pancreas:
making by evaluating MBRL with the FDA-approved UVa/Padova                           The case for model predictive control’, Journal of Diabetes Science and
simulator. We compared the results with two baseline algorithms,                     Technology, 7(6), 1632–1643, (2013).
 [4] Fraser Cameron, B. Wayne Bequette, Darrell M. Wilson, Bruce A. Buckingham,
     Hyunjin Lee, and Günter Niemeyer, ‘A closed-loop artificial pancreas based on
     risk management’, Journal of Diabetes Science and Technology, 5(2), 368–379,
     (2011).
 [5] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine, ‘Deep
     Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics
     Models’, in Advances in Neural Information Processing Systems, volume
     2018-Decem, (2018).
 [6] Chiara Dalla Man, Francesco Micheletto, Dayu Lv, Marc Breton, Boris Kovatchev,
     and Claudio Cobelli, ‘The UVA/PADOVA type 1 diabetes simulator: New features’,
     Journal of Diabetes Science and Technology, 8(1), 26–34, (2014).
 [7] Elena Daskalaki, Peter Diem, and Stavroula G. Mougiakakou, ‘An Actor-Critic
     based controller for glucose regulation in type 1 diabetes’, Computer Methods
     and Programs in Biomedicine, 109(2), 116–125, (2013).
 [8] Elena Daskalaki, Peter Diem, and Stavroula G. Mougiakakou, ‘Personalized tuning
     of a reinforcement learning control algorithm for glucose regulation’, in 2013
     35th Annual International Conference of the IEEE Engineering in Medicine and
     Biology Society (EMBC), pp. 3487–3490. IEEE, (2013).
 [9] Asa K Davis, Stephanie N DuBose, Michael J Haller, Kellee M Miller, Linda A
     DiMeglio, Kathleen E Bethin, Robin S Goland, Ellen M Greenberg, David R
     Liljenquist, Andrew J Ahmann, et al., ‘Prevalence of detectable c-peptide
     according to age at diagnosis and duration of type 1 diabetes’, Diabetes Care,
     38(3), 476–481, (2015).
[10] Pieter Tjerk De Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein,
     ‘A tutorial on the cross-entropy method’, Annals of Operations Research,
     (2005).
[11] Chelsea Finn, Pieter Abbeel, and Sergey Levine, ‘Model-agnostic meta-learning
     for fast adaptation of deep networks’, 34th International Conference on
     Machine Learning, ICML 2017, 3, 1856–1868, (2017).
[12] Ian Fox and Jenna Wiens, ‘Reinforcement Learning for Blood Glucose Control:
     Challenges and Opportunities’, (2019).
[13] Nurdan Gürbilek, ‘Real-Time Computing Without Stable States: A New Framework
     for Neural Computation Based on Perturbations’, Journal of Chemical
     Information and Modeling, 53(9), 1689–1699, (2013).
[14] Herbert Jaeger, ‘The “echo state” approach to analysing and training recurrent
     neural networks – with an Erratum note 1’, GMD Report, (148), 1–47, (2010).
[15] Jinyu Xie. Simglucose v0.2.1 https://github.com/jxx123/simglucose, 2018.
[16] Boris P. Kovatchev, Daniel J. Cox, Linda A. Gonder-Frederick, and William
     Clarke, ‘Symmetrization of the blood glucose measurement scale and its
     applications’, Diabetes Care, 20(11), 1655–1658, (1997).
[17] Mantas Lukoševičius, ‘A practical guide to applying echo state networks’,
     Lecture Notes in Computer Science (including subseries Lecture Notes in
     Artificial Intelligence and Lecture Notes in Bioinformatics), 7700 LECTU,
     659–686, (2012).
[18] B Mianowska, W Fendler, A Szadkowska, A Baranowska, E Grzelak-Agaciak, J Sadon,
     Hillary Keenan, and W Mlynarski, ‘HbA1c levels in schoolchildren with type 1
     diabetes are seasonally variable and dependent on weather conditions’,
     Diabetologia, 54(4), 749–756, (2011).
[19] Annemarie Mol and John Law, ‘Embodied action, enacted bodies: The example of
     hypoglycaemia’, Body & Society, 10(2-3), 43–62, (2004).
[20] Elizabeth D Mynatt, Gregory D Abowd, Lena Mamykina, and Julie A Kientz,
     ‘Understanding the potential of ubiquitous computing for chronic disease
     management’, Health Informatics: A Patient-Centered Approach to Diabetes.
     Health Informatics, 85–106, (2010).
[21] Phuong D Ngo, Susan Wei, Anna Holubová, Jan Muzik, and Fred Godtliebsen,
     ‘Control of blood glucose for type-1 diabetes by using reinforcement learning
     with feedforward algorithm’, Computational and Mathematical Methods in
     Medicine, 2018, (2018).
[22] Phuong D Ngo, Susan Wei, Anna Holubová, Jan Muzik, and Fred Godtliebsen,
     ‘Reinforcement-learning optimal control for type-1 diabetes’, in 2018 IEEE
     EMBS International Conference on Biomedical & Health Informatics (BHI),
     pp. 333–336. IEEE, (2018).
[23] NHS Choices. Type 1 diabetes https://www.nhs.uk/conditions/type-1-diabetes/,
     2018.
[24] Aisling Ann O’Kane, Yi Han, and Rosa I Arriaga, ‘Varied & bespoke caregiver
     needs: organizing and communicating diabetes care for children in the diy
     era’, in Proceedings of the 10th EAI International Conference on Pervasive
     Computing Technologies for Healthcare, pp. 9–12, (2016).
[25] Aisling Ann O’Kane, Sun Young Park, Helena Mentis, Ann Blandford, and Yunan
     Chen, ‘Turning to peers: integrating understanding of the self, the condition,
     and others’ experiences in making sense of complex chronic conditions’,
     Computer Supported Cooperative Work (CSCW), 25(6), 477–501, (2016).
[26] Peter Pesl, Pau Herrero, Monika Reddy, Nick Oliver, Desmond G Johnston,
     Christofer Toumazou, and Pantelis Georgiou, ‘Case-based reasoning for insulin
     bolus advice: evaluation of case parameters in a six-week pilot study’,
     Journal of Diabetes Science and Technology, 11(1), 37–42, (2017).
[27] Anil V. Rao, ‘A survey of numerical methods for optimal control’, in Advances
     in the Astronautical Sciences, (2010).
[28] Garry M. Steil, ‘Algorithms for a closed-loop artificial pancreas: The case
     for proportional-integral-derivative control’, Journal of Diabetes Science and
     Technology, 7(6), 1621–1631, (2013).
[29] Cristiano Storni, ‘Complexity in an uncertain and cosmopolitan world.
     Rethinking personal health technology in diabetes with the tag-it-yourself.’,
     PsychNology Journal, 9(2), (2011).
[30] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning, The MIT Press,
     1998.
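
The evaluation before the conclusions compares two ways of turning ESN ensemble
predictions into a cost: averaging the predictions and costing the result, or costing
each prediction and averaging the costs. The following is a minimal Python sketch of
that difference. The function names, the (n_models, horizon) array layout, the sample
values, and the Kovatchev-style risk function used as a stand-in cost are all
assumptions for illustration; the cost actually used is the one defined in Section 3.3.

```python
import numpy as np


def risk_cost(bg_mgdl):
    """Stand-in cost (assumption): Kovatchev-style blood glucose risk, mg/dl input."""
    # Clip very low values so the logarithm stays positive before the power.
    bg = np.clip(np.asarray(bg_mgdl, dtype=float), 10.0, None)
    f = 1.509 * (np.log(bg) ** 1.084 - 5.381)
    return 10.0 * f ** 2


def cost_without_uncertainty(ensemble_preds):
    """Average the ensemble predictions first, then cost the single trajectory."""
    mean_pred = np.asarray(ensemble_preds, dtype=float).mean(axis=0)  # (horizon,)
    return float(risk_cost(mean_pred).sum())


def cost_with_uncertainty(ensemble_preds):
    """Cost each ensemble member's trajectory, then average the costs."""
    per_model = risk_cost(ensemble_preds).sum(axis=1)  # (n_models,)
    return float(per_model.mean())


# Hypothetical predictions (mg/dl) from three models over a two-step horizon.
preds = np.array([[60.0, 55.0], [120.0, 110.0], [200.0, 240.0]])
print("without uncertainty:", cost_without_uncertainty(preds))
print("with uncertainty:   ", cost_with_uncertainty(preds))
```

When the ensemble members agree, the two costs coincide. When they disagree, the
averaged prediction can look safe even though individual members predict low or high
glucose, which is one plausible reading of why the uncertainty-aware variant helps
while the prediction model is still inaccurate.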

</pre>