=Paper=
{{Paper
|id=Vol-2820/SP4HC_paper5
|storemode=property
|title=Model-Based Reinforcement Learning for Type 1 Diabetes Blood Glucose Control
|pdfUrl=https://ceur-ws.org/Vol-2820/AAI4H-14.pdf
|volume=Vol-2820
|authors=Taku Yamagata,Aisling O’Kane,Amid Ayobi,Dmitri Katz,Katarzyna Stawarz,Paul Marshall,Peter Flach,Raúl Santos-Rodríguez
|dblpUrl=https://dblp.org/rec/conf/ecai/YamagataOAKSMFS20
}}
==Model-Based Reinforcement Learning for Type 1 Diabetes Blood Glucose Control==
Taku Yamagata, Aisling O’Kane, Amid Ayobi, Dmitri Katz, Katarzyna Stawarz, Paul Marshall, Peter Flach and Raúl Santos-Rodríguez

* Taku Yamagata, University of Bristol, UK, email: taku.yamagata@bristol.ac.uk
* Aisling O’Kane, University of Bristol, UK, email: a.okane@bristol.ac.uk
* Amid Ayobi, University of Bristol, UK, email: amid.ayobi@bristol.ac.uk
* Dmitri Katz, The Open University, UK, email: dmitri.katz@open.ac.uk
* Katarzyna Stawarz, Cardiff University, UK, email: StawarzK@cardiff.ac.uk
* Paul Marshall, University of Bristol, UK, email: p.marshall@bristol.ac.uk
* Peter Flach, University of Bristol, UK, email: Peter.Flach@bristol.ac.uk
* Raúl Santos-Rodríguez, University of Bristol, UK, email: enrsr@bristol.ac.uk

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. Advances in Artificial Intelligence for Healthcare, September 4, 2020, Virtual Workshop.

Abstract. In this paper we investigate the use of model-based reinforcement learning to assist people with Type 1 Diabetes with insulin dose decisions. The proposed architecture consists of multiple Echo State Networks to predict blood glucose levels, combined with a Model Predictive Controller for planning. The Echo State Network is a variant of recurrent neural network that allows us to learn long-term dependencies in time-series input data in an online manner. Additionally, we address the quantification of uncertainty for more robust control: we use ensembles of Echo State Networks to capture model (epistemic) uncertainty. We evaluated the approach with the FDA-approved UVa/Padova Type 1 Diabetes simulator and compared the results against baseline algorithms such as a Basal-Bolus controller and Deep Q-learning. The results suggest that the model-based reinforcement learning algorithm can perform equally well or better than the baseline algorithms for the majority of virtual Type 1 Diabetes person profiles tested.

==1 Introduction==

Type 1 Diabetes is a chronic condition characterized by the lack of insulin secretion, resulting in uncontrolled blood glucose level increases [1, 9]. High blood glucose levels for extended periods of time can result in permanent damage to the eyes, nerves, kidneys and blood vessels, while low blood glucose levels can lead to death [19, 20, 23]. To manage blood glucose levels, people on multi-dose injection (MDI) therapy usually take two types of insulin injections: basal and bolus. Basal is long-acting insulin, which provides a constant supply of insulin over 24-48 hours, helping maintain resting blood glucose levels. Bolus is fast-acting insulin, which helps to suppress the peak of the blood glucose levels caused by meals or to counteract hyperglycemia [23]. People with diabetes must make constant decisions about the timing and amount of these insulin injections, which is often challenging, as insulin requirements for meals can change depending upon many factors such as exercise, sleep, or stress. The idiosyncratic nature of the condition means that triggers, symptoms and even treatments are often quite individual [18, 24, 25, 26, 29], which creates challenges for developing diabetes self-management technologies.

In this paper we consider the benefits of using model-based reinforcement learning (MBRL) to assist decisions about bolus insulin injections. The goal of reinforcement learning (RL) is to learn sequences of actions in an unknown environment [30]. The learner (agent) interacts with the environment, observes its consequences, and receives a reward (or a cost) signal, which is a numerical value assessing the current situation. The agent decides a sequence of actions to maximize the reward (or minimize the cost), as shown in Fig. 1. RL is well-suited to this task because it can learn the model in an online manner with minimal assumptions about the underlying process of the blood glucose behaviour, and hence can adapt to different individuals or to changes over time. MBRL is particularly well suited to this objective because it is more sample-efficient than alternative RL approaches (model-free reinforcement learning (MFRL)) and also allows us to generate predictions of the consequences of counterfactual actions, which can be used as explanations of the suggestion. In our MBRL setting, we can also estimate the confidence level of the predictions by using the prediction uncertainty. It is very important to show the explanation for the suggestion together with its confidence level, so that the person who receives the suggestion can decide whether to follow the recommended course of action.

Figure 1. Reinforcement learning framework overview.

As a first step towards realising such a recommender system, we investigated how well MBRL can learn the insulin injection decision and compared it with both a typical MFRL algorithm (deep Q-Learning (DQN)) and an algorithm that mimics human decision-making (Basal-Bolus controller (BBController)). We used an FDA-approved Type 1 Diabetes computer simulator, let the algorithms decide the insulin injections, and evaluated the resulting blood glucose level behaviours.

Our MBRL approach builds upon previous work on Echo State Networks (ESNs) [13, 14], ensembles of models for MBRL [5] and model predictive controllers (MPC) for the artificial pancreas [3, 4].
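The agent-environment loop of Fig. 1 can be sketched in a few lines; the toy environment and policy below are purely illustrative assumptions (the paper's actual environment is the UVa/Padova simulator described later), but they show the interaction pattern: observe, act, receive a cost, and try to minimise the total cost.

```python
# Illustrative sketch of the RL interaction loop in Fig. 1 (not the paper's code).
import random

class ToyGlucoseEnv:
    """Toy stand-in for a blood glucose environment (hypothetical dynamics)."""
    def __init__(self):
        self.bgl = 140.0                       # blood glucose level [mg/dl]

    def step(self, dose):
        # Toy dynamics: random meal effect raises glucose, insulin lowers it.
        self.bgl += random.uniform(0.0, 10.0) - 12.0 * dose
        cost = abs(self.bgl - 112.5)           # toy cost: distance from target
        return self.bgl, cost

def toy_policy(bgl, actions):
    # Pick the dose whose predicted next level is closest to the target,
    # assuming the (known) toy dynamics with an average meal effect of 5.
    return min(actions, key=lambda a: abs(bgl + 5.0 - 12.0 * a - 112.5))

env = ToyGlucoseEnv()
actions = [0.0, 0.5, 1.0, 2.0]
total_cost = 0.0
for t in range(12):                            # one hour of 5-minute steps
    dose = toy_policy(env.bgl, actions)
    _, cost = env.step(dose)                   # observe consequence and cost
    total_cost += cost                         # the agent minimises this sum
```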
However, we believe this is the first attempt to combine these algorithms for the Type 1 Diabetes blood glucose level control task and to evaluate their performance against non-MBRL algorithms.

This paper is organized as follows. Section 2 introduces related work on the blood glucose control task. Section 3 describes our MBRL method. Section 4 presents our evaluation method, benchmark algorithms and the evaluation results. Finally, Section 5 concludes with a summary and possible future work.

==2 Related Work==

Several attempts have been made at a closed-loop artificial pancreas, especially in the control systems community, using MPC [3], proportional-integral-derivative control [28] and fuzzy logic [2]. However, there are relatively few studies on the blood glucose control task using RL approaches. Most of the early works employ compartmental blood glucose and insulin models to infer some insulin/glucose-related internal states of the human body, and then learn an insulin injection policy with relatively simple MFRL algorithms such as Q-Learning [21, 22] or Actor-Critic [7, 8]. Fox et al. employed more recent RL techniques [12], such as deep neural networks for the Q-Learning algorithm – arguably the most common MFRL algorithm. They showed that although the agent was not given any prior knowledge of the blood glucose/insulin relations, it learns an insulin injection policy and achieves performance comparable with existing algorithms.

In the field of model-based system control several approaches exist – we refer the reader to [3] and the references therein. The closest to our work is [4], where the authors use a linear compartmental model to predict the mean and variance of future blood glucose levels, and exploit MPC for planning by taking the variance of the blood glucose level prediction into account. The main differences from our work are: (1) they employ a linear compartmental model, which has a small number of parameters and hence is easier to learn, whereas we use more generic recurrent neural networks, which have greater flexibility to adapt to any personal blood glucose behaviour; (2) their model parameters are learnt off-line, whereas ours are adjusted online; and (3) the handling of uncertainty – we measure the model's uncertainty, while they measure the uncertainty involved in meal events.

==3 Methods==

In order to apply RL algorithms to this problem, we formulate the task as a Markov Decision Process (MDP), given by the four-tuple (S, A, p, c), where S is a set of states, A is a set of actions, p is the state transition probability and c is a cost function. Essentially the blood glucose control task is a Partially Observable MDP; however, we treat it as an MDP by defining the state S as the full history of insulin doses and carbohydrate intakes.

More precisely, the overall pipeline makes use of ESNs to store the history in their hidden states, as shown in Section 3.2. The corresponding actions A are the dosages of bolus insulin. We exploit the risk function introduced in [16] as our cost function c, described in Section 3.1. While we use the model-based reinforcement learning (MBRL) algorithm with ESNs for the prediction of blood glucose levels, MPC generates the insulin dose suggestions from the blood glucose level predictions (Section 3.4) and their uncertainty estimations (Section 3.3).

===3.1 Cost function===

For our task, it is natural to use as cost function a measure of the risk associated with a given blood glucose level. However, it is not straightforward to define such a measure, as the risk scales differently for higher than normal blood glucose levels (hyperglycemia) and lower than normal blood glucose levels (hypoglycemia): a blood glucose level transition from 180 to 250 mg/dl would appear threefold larger than a transition from 70 to 50 mg/dl, whereas these are similar in terms of risk. Kovatchev et al. proposed the following expression to symmetrize the risks of hyper- and hypoglycemia [16]. The blood glucose risk function fr is defined as

fr(BGL) = 10 [1.509 (log(BGL)^1.084 − 5.381)]^2,   (1)

where BGL is the blood glucose level in mg/dl. Fig. 2 shows the mapping from blood glucose level (x-axis) to the risk function value (y-axis). We use the risk function value as the cost function; hence our RL agent searches for a policy minimising the total risk over an episode.

Figure 2. Risk function proposed by Kovatchev et al. [16]. The figure shows the relationship between blood glucose level [mg/dl] and its risk function value.

===3.2 Echo State Networks===

ESNs were proposed as an alternative structure to standard recurrent neural networks in machine learning [14]; they are also called liquid state machines in computational neuroscience [13]. ESNs take an input sequence u = (u(1), u(2), ..., u(T)), recursively processing each symbol while maintaining an internal hidden state x. At each time step t, the ESN takes the input u(t) ∈ R^K and updates its hidden state x(t) ∈ R^N by:

x̃(t) = f(Win · u(t) + W · x(t − 1)),   (2)
x(t) = (1 − α) · x(t − 1) + α · x̃(t),   (3)

where f is the internal unit activation function (tanh in our model), Win ∈ R^(N×K) is the input weight matrix, W ∈ R^(N×N) is the internal connection weight matrix and α ∈ (0, 1] is the leakage rate, which controls the speed of hidden state change and hence the smoothness of the output.

The output at time step t, y(t) ∈ R^L, is obtained from the hidden states and the inputs by:

y(t) = f_out(Wout · [x(t)^T, u(t)^T]^T),   (4)

where f_out is the output unit activation function (the identity function in our model, as we are dealing with a regression task) and Wout ∈ R^(L×(N+K)) is the output weight matrix.

The matrices for updating the hidden states, Win and W, are randomly initialized and fixed (not updated during the learning process); only the output weight matrix Wout is learned to obtain the target output sequences. As only the output weights are learned, an ESN requires no back-propagation through the network or through time, and hence learns much faster than standard recurrent neural networks. The downside of using ESNs is that they require a much larger number of hidden states to achieve good performance, and hence more computational power for inference.

To make ESNs work properly, the fixed weights must satisfy the so-called echo state property: the internal states x(t) should be uniquely defined by the past inputs u(k)|k=...,t alone [14]. A method to initialise the weights accordingly can be found in [17], which also gives useful guidance for using ESNs.
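The state update and readout of Eqs. 2-4 can be sketched in a few lines of NumPy. The reservoir sizes, weight ranges and the spectral-radius rescaling below are illustrative assumptions rather than the paper's settings; rescaling W to a spectral radius below 1 is a common heuristic for encouraging the echo state property (see [17]).

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, L = 2, 100, 1           # input / hidden / output sizes (illustrative)
alpha = 0.5                   # leakage rate, in (0, 1]

# Fixed random weights (Eqs. 2-3); only W_out is ever trained.
W_in = rng.uniform(-0.5, 0.5, (N, K))
W = rng.uniform(-0.5, 0.5, (N, N))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))   # spectral radius < 1 (heuristic)
W_out = np.zeros((L, N + K))                # readout, learned by regression

def step(x, u):
    """One ESN update (Eqs. 2-3) followed by the readout (Eq. 4)."""
    x_tilde = np.tanh(W_in @ u + W @ x)         # Eq. 2
    x = (1 - alpha) * x + alpha * x_tilde       # Eq. 3 (leaky integration)
    y = W_out @ np.concatenate([x, u])          # Eq. 4, identity f_out
    return x, y

x = np.zeros(N)
for u in rng.random((10, K)):                   # a short dummy input sequence
    x, y = step(x, u)
```

Since tanh is bounded and the state is a leaky average of bounded terms, every hidden unit stays within [-1, 1], which is what makes the fixed reservoir stable enough to train only the readout.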
'''ESNs for the blood glucose level prediction task''' In our work, the ESN takes a sequence of bolus insulin injections and carbohydrate intakes as inputs, and predicts the blood glucose level. To learn the ESN output weights, we use the mean squared error between predicted and observed blood glucose levels as the loss function:

Ld(θ) = (1/T) Σ_{t=1}^{T} (µθ(t) − BGL(t))^2,   (5)

where µθ(t) is the blood glucose level predicted by the ESN at time step t, θ is the optimization parameter (here Wout) and BGL(t) is the observed blood glucose level. As this can be seen as a linear regression problem, the output weights are obtained by solving the normal equation [17].

To capture model (epistemic) uncertainty, we apply multiple instances of ESNs, each with different input and internal connection weights. ESNs are well suited to the ensemble approach, as their fixed random internal weights project the input sequence into different hidden states; they therefore naturally output different values where there is no training data, capturing higher epistemic uncertainty. In our evaluation, we employ five instances of ESNs, as suggested by [5].

===3.3 Uncertainty quantification===

We employ multiple ESNs to capture the uncertainty in the predicted blood glucose level: they produce multiple predictions of the blood glucose levels for each action sequence. To quantify the cost (risk) under uncertainty, we take the mean of the cost of the predicted blood glucose levels for each action sequence,

(1/(M·T)) Σ_{t=n}^{n+T−1} Σ_{m=1}^{M} c(BGL_t^m),

where c(·) is the cost function, BGL_t^m is the blood glucose level predicted by ESN model m at time step t, and M and T are the number of ESN models and the number of time steps in the action sequence. We then select the action sequence that minimises this mean cost.

We can encourage (optimistic or exploratory approach) or discourage (pessimistic or safe approach) taking risks by designing the cost function accordingly. Here we define a risk margin RM as the difference between the averaged cost and the cost of the averaged blood glucose level predictions:

RM = E[c(BGL)] − c(E[BGL]).   (6)

A positive (negative) risk margin means our metric E[c(BGL)] discourages (encourages) taking risks. If we use a convex cost function as described in Section 3.1, RM is positive by Jensen's inequality, hence it discourages risks.

===3.4 Model Predictive Controller===

A model predictive controller (MPC) is a planning method that facilitates the control of systems with long time delays and non-linear characteristics. The MPC uses a prediction model to estimate the consequences of a sequence of actions and repeats the process for many action sequences. It then picks the sequence of actions with the best consequence and applies the first action of that sequence. In the next time step this process is repeated. This effectively means it re-plans the sequence of actions based on the latest state information from the environment, which makes the algorithm robust against noise and prediction errors.

There are several algorithms to generate the sequences of actions to test, such as random shooting [27] and the cross-entropy method [10]. In our work, we use a fixed table of action sequences. The table has six action sequences, each of which takes a different amount of bolus injection as its first action. The amount of bolus injection at the first action is {0, 5, 10, 20, 40, 80} times the person's basal infusion rate. Following the approach of [12], the basal infusion rate is given for each virtual person's model, and we use it to scale the bolus injection. While our model generates suggestions for bolus injections, for the basal injections it assumes the person is taking the given basal infusion rate. The action sequence length (time horizon) is set to 48 time steps, which is 4 hours, as each time step represents a five-minute period. Each action sequence has a bolus injection only as the first action of the sequence. We believe this is sensible because a bolus injection is normally taken just before or after a meal, and there is no meal announcement in our system at the moment (the algorithm does not know about a meal event until it happens). Therefore, the best time to take a bolus injection is immediately after detecting the meal event, which is the first action in the sequence. A proper meal announcement mechanism is left for future work.
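Putting Sections 3.1, 3.3 and 3.4 together, one planning step scores each candidate action sequence by the mean Kovatchev risk of the ensemble's predictions and returns the first (bolus) action of the best sequence. The sketch below follows Eq. 1, the fixed bolus-multiplier table and the mean-cost selection rule from the text; the `predictors` interface (a list of callables returning a predicted BGL trajectory) is an illustrative assumption, not the paper's implementation.

```python
import math

def risk(bgl):
    """Kovatchev blood glucose risk function (Eq. 1); BGL in mg/dl."""
    return 10.0 * (1.509 * (math.log(bgl) ** 1.084 - 5.381)) ** 2

def plan_first_action(predictors, history, basal_rate):
    """One MPC step: score each fixed action sequence by the mean risk over
    all ensemble predictions (Section 3.3), return the first (bolus) action."""
    multipliers = [0, 5, 10, 20, 40, 80]   # bolus = multiplier x basal rate
    horizon = 48                           # 48 five-minute steps = 4 hours
    best_bolus, best_cost = None, float("inf")
    for mult in multipliers:
        bolus = mult * basal_rate
        # Bolus only as the first action; the basal rate covers the rest.
        actions = [bolus] + [0.0] * (horizon - 1)
        costs = []
        for predict in predictors:         # ensemble of M ESN models
            bgls = predict(history, actions)
            costs.extend(risk(b) for b in bgls)
        mean_cost = sum(costs) / len(costs)
        if mean_cost < best_cost:
            best_bolus, best_cost = bolus, mean_cost
    return best_bolus
```

A toy predictor such as `predict = lambda hist, acts: [120.0] * len(acts)` is enough to exercise the selection logic; note that the risk function is zero near 112.5 mg/dl, so trajectories kept close to that level are preferred.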
In our work, the agent We train MBRL for 200 episodes and GRU-DQN for 1000 episodes, observes the blood glucose levels from a CGM, carbohydrate intakes then use the last 30 episodes to measure the percentage of episodes and insulin injections, and infers the action value function. It is a completed without termination due to extreme blood glucose levels. partially observable model so we used gated recurrent units (GRU) For BBController, we just run 30 episodes to measure, as it has pre- to infer the hidden states and approximate the action value function. optimized model parameters and no training is required. GRU-DQN was successfully applied to this problem before [12] so The results are given in Table 2. MBRL gives better results than we followed their same set up which involves two GRU recurrent GRU-DQN and comparable with BBController. MBRL struggles layers of 128 hidden states and followed by a fully connected output with child#002, #003 and adolescent#002. By looking into these layer size of 128. However, our our states (the input of GRU-DQN) cases, we found that MBRL fails due to the MPC time horizon not be- include carbohydrate information, whereas [12] does not. We include ing long enough. The MPC time horizon is set to 4 hours, hence the it here to make our comparison fair against the MBRL algorithm, agent could not foresee a possible hypoglycemia event in the early which has acess to the carbohydrate information. morning after the person takes an evening meal. The agent suggests too much insulin, and it causes hypoglycemia in the early morning. BBController Basal-Bolus Controller mimics how an individ- This can be fixed by increasing the MPC time horizon, but requires ual with Type 1 Diabetes controls their blood glucose levels. The some additional consideration as it might lead to inappropriate sug- UVa/Padova simulator comes with the necessary parameters for this gestions during the day. 
algorithm for each of the virtual people with Type 1 Diabetes models, Table 3 shows the percentage of time spent in a target blood glu- such as basal insulin rates bas, a correction factor CF and a carbohy- cose level range (70-180 mg/dl.) These are measured in the last 10 of drate ratio CR. The simulator decides the amount of insulin injection the completed episodes(i.e., not terminated). Here MBRL gives the by bas + (ct > 0) · (ct /CR + (bt > 150) · (bt − btgt )/CF ), where best overall results compared to the other agents. Note that no data ct is carbohydrate intake at time step t, bt is the blood glucose mea- is available for adolescent#002, as it fails to get any non-terminated surements, btgt is a target blood glucose level. The last term is only episode (due to the reason described above). applied when the blood glucose measurement exceeds 150 mg/dl. We also evaluated the effect of the uncertainty estimation by com- We use the implemented model that comes with the Type 1 Diabetes paring the results from MBRL with/without it. For MBRL without simulator [15]. uncertainty, we take an average over multiple ESNs predictions to Table 2. % of number of completed episodes without termination due to extreme blood glucose level value. Person Profile BBCont. GRU-DQN MBRL child#001 30.0 3.3 100 child#002 90 23.3 53.3 child#003 66.7 43.3 30.0 adolescent#001 100 100 100 adolescent#002 66.7 56.7 0.0 adolescent#003 90 20 100 adult#001 100 70.0 96.7 adult#002 100 100 100 adult#003 96.7 16.7 100 Table 3. % of time spent in the target blood glucose level range (70 - 180 mg/dl). Figure 4. Comparison between MBRL with uncertainty and without uncertainty models. The upper plot shows the learning curve for simulated period for each episode, which goes up to 24 hours if the blood glucose level Person Profile BBCont. GRU-DQN MBRL is controlled well. 
===4.3 Simulation Conditions===

Each episode lasts 24 hours, starting at 6am and finishing at 6am the next day. Three meal and three snack events are simulated with some randomness in amount, timing, and whether the meal/snack is taken at all. The timing follows a truncated normal distribution and the amount is normally distributed. The meal parameters are shown in Table 1.

Table 1. Parameters for meal event generator.
{| class="wikitable"
|-
! rowspan="2" | Meal type !! rowspan="2" | Prob. !! colspan="4" | Time [hours] !! colspan="2" | Carbs. [g]
|-
! lower bound !! upper bound !! mean !! std. !! mean !! std.
|-
| Breakfast || 0.95 || 5 || 9 || 7 || 1 || 45 || 10
|-
| Snack#1 || 0.3 || 9 || 10 || 9.5 || 0.5 || 10 || 5
|-
| Lunch || 0.95 || 10 || 14 || 12 || 1 || 70 || 10
|-
| Snack#2 || 0.3 || 14 || 16 || 15 || 0.5 || 10 || 5
|-
| Dinner || 0.95 || 16 || 20 || 18 || 1 || 80 || 10
|-
| Snack#3 || 0.3 || 20 || 23 || 21.5 || 0.5 || 10 || 5
|}

The agent receives information from the environment, such as the meal (carbohydrate), insulin and blood glucose levels, and decides the insulin dose for the next time step. Each time step is five minutes long. In this evaluation, the person does not take food to compensate for low blood glucose levels (the meal events always follow the pre-defined schedule described above). While this is not realistic, it is a good way to measure how well the algorithm works, because ultimately we would like to develop an algorithm that does not require any corrections from the user. The episode is terminated if the blood glucose level goes below 20 mg/dl or above 600 mg/dl, as these limits are extreme and outside the blood glucose range considered by [16].

===4.4 Results===

We train MBRL for 200 episodes and GRU-DQN for 1000 episodes, then use the last 30 episodes to measure the percentage of episodes completed without termination due to extreme blood glucose levels. For BBController we simply run 30 episodes, as it has pre-optimized model parameters and requires no training.

The results are given in Table 2. MBRL gives better results than GRU-DQN and is comparable with BBController. MBRL struggles with child#002, child#003 and adolescent#002. Looking into these cases, we found that MBRL fails because the MPC time horizon is not long enough: with a 4-hour horizon, the agent could not foresee a possible hypoglycemia event in the early morning after the person takes an evening meal, so it suggests too much insulin, causing hypoglycemia in the early morning. This could be fixed by increasing the MPC time horizon, but that requires some additional consideration, as it might lead to inappropriate suggestions during the day.

Table 2. % of episodes completed without termination due to extreme blood glucose level values.
{| class="wikitable"
|-
! Person Profile !! BBCont. !! GRU-DQN !! MBRL
|-
| child#001 || 30.0 || 3.3 || 100
|-
| child#002 || 90 || 23.3 || 53.3
|-
| child#003 || 66.7 || 43.3 || 30.0
|-
| adolescent#001 || 100 || 100 || 100
|-
| adolescent#002 || 66.7 || 56.7 || 0.0
|-
| adolescent#003 || 90 || 20 || 100
|-
| adult#001 || 100 || 70.0 || 96.7
|-
| adult#002 || 100 || 100 || 100
|-
| adult#003 || 96.7 || 16.7 || 100
|}

Table 3 shows the percentage of time spent in a target blood glucose level range (70-180 mg/dl), measured over the last 10 completed (i.e., not terminated) episodes. Here MBRL gives the best overall results compared to the other agents. Note that no data is available for adolescent#002, as it fails to complete any episode (for the reason described above).

Table 3. % of time spent in the target blood glucose level range (70-180 mg/dl).
{| class="wikitable"
|-
! Person Profile !! BBCont. !! GRU-DQN !! MBRL
|-
| child#001 || 44.0 || 28.3 || 59.6
|-
| child#002 || 42.6 || 38.2 || 55.3
|-
| child#003 || 40.7 || 36.0 || 45.1
|-
| adolescent#001 || 85.8 || 81.4 || 100.0
|-
| adolescent#002 || 49.0 || 39.8 || n/a
|-
| adolescent#003 || 46.7 || 42.4 || 66.1
|-
| adult#001 || 60.1 || 50.3 || 56.8
|-
| adult#002 || 73.3 || 66.9 || 73.3
|-
| adult#003 || 58.7 || 46.9 || 68.8
|}

We also evaluated the effect of the uncertainty estimation by comparing the results of MBRL with and without it. For MBRL without uncertainty, we average the multiple ESN predictions into a single blood glucose prediction and then calculate its cost, whereas MBRL with uncertainty computes the cost of all predictions and then averages the costs, as described in Section 3.3.

Figure 4 shows the learning curves for these two MBRL variants with adult#001. The upper plot shows the episode length, which goes up to 24 hours if there is no termination, and the bottom plot shows the percentage of time spent in the target blood glucose range. From the upper plot, the algorithm with uncertainty achieves "no episode termination" (24-hour episodes) much earlier than the one without uncertainty estimation. At an early stage of the learning process the prediction model is not very accurate, so taking its uncertainty into account helps considerably; at later stages the predictions become more accurate, so both variants perform similarly. Table 4 shows asymptotic results for the percentage of time spent in the target blood glucose range, indicating similar asymptotic performance.

Figure 4. Comparison between MBRL with uncertainty and without uncertainty models. The upper plot shows the learning curve of the simulated period for each episode, which goes up to 24 hours if the blood glucose level is controlled well. The lower plot shows the % of time spent in the target blood glucose range (70-180 mg/dl).

Table 4. % of time spent in the target blood glucose range (70-180 mg/dl).
{| class="wikitable"
|-
! Person Profile !! MBRL (with uncertainty) !! MBRL (without uncertainty)
|-
| child#001 || 59.6 || 57.5
|-
| adolescent#001 || 100.0 || 95.9
|-
| adult#001 || 56.8 || 56.7
|}

==5 Conclusions and Future Work==

We investigated the use of MBRL to assist Type 1 Diabetes decision-making by evaluating MBRL with the FDA-approved UVa/Padova simulator. We compared the results with two baseline algorithms, GRU-DQN and BBController. The results suggest that the MBRL approach works better than the GRU-DQN algorithm and similarly to or slightly better than the BBController. Our results also show that taking the model uncertainty into account improves performance in the early stages of learning.

There are several avenues for future work. At the present stage we only tested our algorithms with the UVa/Padova Type 1 Diabetes simulator, which is good for single-meal scenarios but not for multiple meals [6], primarily because the model has fixed parameters for each person and does not simulate meal-by-meal or day-by-day parameter drift. In addition, our current learning method must be extended to adapt to such parameter drift; a possible approach would be to introduce meta-learning [11].

Another area for further work relates to meal information. We assumed all meal events are correctly reported by the person as they happen; however, this may not be very realistic, as it is a considerable burden for a person to enter every single meal event into the algorithm, and it is also hard to know the exact carbohydrate count of each meal. Some researchers therefore structure the blood glucose predictor without a meal input. Another alternative would be a model that back-predicts a meal event from the observed blood glucose levels. We think it is possible to learn the meal events in conjunction with the blood glucose level prediction model with occasional human input.

==Acknowledgements==

This project is funded by the Innovate UK Digital Catalyst Award – Digital Health and is in partnership with Quin Technology.

==References==

[1] Kurt George Matthew Mayer Alberti and Paul Z Zimmet, 'Definition, diagnosis and classification of diabetes mellitus and its complications. Part 1: diagnosis and classification of diabetes mellitus. Provisional report of a WHO consultation', Diabetic Medicine, 15(7), 539–553, (1998).

[2] Eran Atlas, Revital Nimri, Shahar Miller, Eli A. Grunberg, and Moshe Phillip, 'MD-logic artificial pancreas system: A pilot study in adults with type 1 diabetes', Diabetes Care, (2010).

[3] B. Wayne Bequette, 'Algorithms for a closed-loop artificial pancreas: The case for model predictive control', Journal of Diabetes Science and Technology, 7(6), 1632–1643, (2013).

[4] Fraser Cameron, B. Wayne Bequette, Darrell M. Wilson, Bruce A. Buckingham, Hyunjin Lee, and Günter Niemeyer, 'A closed-loop artificial pancreas based on risk management', Journal of Diabetes Science and Technology, 5(2), 368–379, (2011).

[5] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine, 'Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models', in Advances in Neural Information Processing Systems, (2018).

[6] Chiara Dalla Man, Francesco Micheletto, Dayu Lv, Marc Breton, Boris Kovatchev, and Claudio Cobelli, 'The UVA/PADOVA type 1 diabetes simulator: New features', Journal of Diabetes Science and Technology, 8(1), 26–34, (2014).

[7] Elena Daskalaki, Peter Diem, and Stavroula G. Mougiakakou, 'An Actor-Critic based controller for glucose regulation in type 1 diabetes', Computer Methods and Programs in Biomedicine, 109(2), 116–125, (2013).

[8] Elena Daskalaki, Peter Diem, and Stavroula G Mougiakakou, 'Personalized tuning of a reinforcement learning control algorithm for glucose regulation', in 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 3487–3490. IEEE, (2013).

[9] Asa K Davis, Stephanie N DuBose, Michael J Haller, Kellee M Miller, Linda A DiMeglio, Kathleen E Bethin, Robin S Goland, Ellen M Greenberg, David R Liljenquist, Andrew J Ahmann, et al., 'Prevalence of detectable c-peptide according to age at diagnosis and duration of type 1 diabetes', Diabetes Care, 38(3), 476–481, (2015).

[10] Pieter Tjerk De Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein, 'A tutorial on the cross-entropy method', Annals of Operations Research, (2005).

[11] Chelsea Finn, Pieter Abbeel, and Sergey Levine, 'Model-agnostic meta-learning for fast adaptation of deep networks', 34th International Conference on Machine Learning, ICML 2017, 3, 1856–1868, (2017).

[12] Ian Fox and Jenna Wiens, 'Reinforcement Learning for Blood Glucose Control: Challenges and Opportunities', (2019).

[13] Nurdan Gürbilek, 'Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations', Journal of Chemical Information and Modeling, 53(9), 1689–1699, (2013).

[14] Herbert Jaeger, 'The "echo state" approach to analysing and training recurrent neural networks – with an Erratum note', GMD Report, (148), 1–47, (2010).

[15] Jinyu Xie, 'Simglucose v0.2.1', https://github.com/jxx123/simglucose, 2018.

[16] Boris P. Kovatchev, Daniel J. Cox, Linda A. Gonder-Frederick, and William Clarke, 'Symmetrization of the blood glucose measurement scale and its applications', Diabetes Care, 20(11), 1655–1658, (1997).

[17] Mantas Lukoševičius, 'A practical guide to applying echo state networks', Lecture Notes in Computer Science, 7700, 659–686, (2012).

[18] B Mianowska, W Fendler, A Szadkowska, A Baranowska, E Grzelak-Agaciak, J Sadon, Hillary Keenan, and W Mlynarski, 'HbA1c levels in schoolchildren with type 1 diabetes are seasonally variable and dependent on weather conditions', Diabetologia, 54(4), 749–756, (2011).

[19] Annemarie Mol and John Law, 'Embodied action, enacted bodies: The example of hypoglycaemia', Body & Society, 10(2-3), 43–62, (2004).

[20] Elizabeth D Mynatt, Gregory D Abowd, Lena Mamykina, and Julie A Kientz, 'Understanding the potential of ubiquitous computing for chronic disease management', Health Informatics: A Patient-Centered Approach to Diabetes, 85–106, (2010).

[21] Phuong D Ngo, Susan Wei, Anna Holubová, Jan Muzik, and Fred Godtliebsen, 'Control of blood glucose for type-1 diabetes by using reinforcement learning with feedforward algorithm', Computational and Mathematical Methods in Medicine, 2018, (2018).

[22] Phuong D Ngo, Susan Wei, Anna Holubová, Jan Muzik, and Fred Godtliebsen, 'Reinforcement-learning optimal control for type-1 diabetes', in 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), pp. 333–336. IEEE, (2018).

[23] NHS Choices, 'Type 1 diabetes', https://www.nhs.uk/conditions/type-1-diabetes/, 2018.

[24] Aisling Ann O'Kane, Yi Han, and Rosa I Arriaga, 'Varied & bespoke caregiver needs: organizing and communicating diabetes care for children in the DIY era', in Proceedings of the 10th EAI International Conference on Pervasive Computing Technologies for Healthcare, pp. 9–12, (2016).

[25] Aisling Ann O'Kane, Sun Young Park, Helena Mentis, Ann Blandford, and Yunan Chen, 'Turning to peers: integrating understanding of the self, the condition, and others' experiences in making sense of complex chronic conditions', Computer Supported Cooperative Work (CSCW), 25(6), 477–501, (2016).

[26] Peter Pesl, Pau Herrero, Monika Reddy, Nick Oliver, Desmond G Johnston, Christofer Toumazou, and Pantelis Georgiou, 'Case-based reasoning for insulin bolus advice: evaluation of case parameters in a six-week pilot study', Journal of Diabetes Science and Technology, 11(1), 37–42, (2017).

[27] Anil V. Rao, 'A survey of numerical methods for optimal control', in Advances in the Astronautical Sciences, (2010).

[28] Garry M. Steil, 'Algorithms for a closed-loop artificial pancreas: The case for proportional-integral-derivative control', Journal of Diabetes Science and Technology, 7(6), 1621–1631, (2013).

[29] Cristiano Storni, 'Complexity in an uncertain and cosmopolitan world. Rethinking personal health technology in diabetes with the Tag-it-Yourself', PsychNology Journal, 9(2), (2011).

[30] Richard S Sutton and Andrew G Barto, Reinforcement Learning, The MIT Press, 1998.