A General Neural Architecture for Carbohydrate and Bolus Recommendations in Type 1 Diabetes Management

Jeremy Beauchamp, Razvan Bunescu, and Cindy Marling¹

Abstract. People with type 1 diabetes must constantly monitor their blood glucose levels and take actions to keep them from getting either too high or too low. Having a snack will raise blood glucose levels; however, the amount of carbohydrates that should be consumed to reach a target level depends on the recent history of blood glucose levels, meals, boluses, and the basal rate of insulin. Conversely, to lower the blood glucose level, one can administer a bolus of insulin; however, determining the right amount of insulin in the bolus can be cognitively demanding, as it depends on similar contextual factors. In this paper, we show that a generic neural architecture previously used for blood glucose prediction in a what-if scenario can be converted to make either carbohydrate or bolus recommendations. Initial experimental evaluations on the task of predicting the carbohydrate amounts necessary to reach a target blood glucose level demonstrate the feasibility and potential of this general approach.

¹ Ohio University, USA, email: {jb199113,bunescu,marling}@ohio.edu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction and Motivation

Type 1 diabetes is a disease in which the pancreas fails to produce insulin, which is required for blood sugar to be absorbed into cells. Without insulin, blood sugar remains in the bloodstream, leading to high blood glucose levels (BGLs). To manage type 1 diabetes, insulin must be administered from an external source, such as injections or an insulin pump. People with type 1 diabetes also need to monitor their BGLs closely throughout the day, by testing blood acquired through fingersticks and/or by using a continuous glucose monitoring (CGM) system. If the BGL gets too high (hyperglycemia) or too low (hypoglycemia), the individual responds by eating, taking insulin, or taking some other action to bring their BGL back within a healthy range. The drawback of this approach is that the person with diabetes must react to their BGL when, ideally, they would be able to control it proactively. There has been much work on BGL prediction (see [1] and [8], for example), with the aim of enabling preemptive actions to manage BGLs before individuals experience the negative symptoms of hypoglycemia or hyperglycemia. However, individuals still need to figure out how much to eat, how much insulin to take, and what other actions they can take in order to prevent hypoglycemia or hyperglycemia.
The broad goal of the research presented in this paper is to essentially reverse the blood glucose prediction problem: instead of predicting future BGLs, we predict how many carbohydrates an individual should eat, or how much insulin they should administer with a bolus, in order to bring their BGL to a desired target. We previously introduced in [6] an LSTM-based neural architecture trained to answer what-if questions of the type "What will my BGL be in 60 minutes if I eat a snack with 30 carbs 10 minutes from now?". We show that, by using the BGL target as a feature and the carbohydrates or insulin as labels, a similar architecture can be trained instead to predict the number of carbohydrates that need to be consumed, or the amount of insulin that needs to be delivered, during the prediction window in order to reach that BGL target.

The work by Mougiakakou and Nikita [7] represents one of the first attempts to use neural networks for recommending insulin regimens and dosages. Bolus calculators, introduced as early as 2003 [11], use a standard formula to calculate the amount of bolus insulin based on parameters such as carbohydrate intake, carbohydrate-to-insulin ratio, insulin on board, and target BGL. Walsh et al. [10] discuss major sources of error in these calculators and potential targets for improvement, such as utilizing the massive quantities of clinical data being collected by bolus advisors. As observed by Cappon et al. [2], the standard formula approach ignores potentially useful preprandial conditions, such as the glucose rate of change; they proposed a feed-forward fully connected neural network that exploits CGM information and some easily accessible patient parameters, with experimental evaluations on simulated data showing a small but statistically significant improvement in the blood glucose risk index. Simulated data is also used by Sun et al. [9], where a basal-bolus advisor is trained using reinforcement learning to provide personalized suggestions to people with type 1 diabetes under multiple injections therapy.

The data-driven architecture proposed in this paper is generic in the sense that it can be trained to make recommendations about any variable that can impact BG levels, in particular carbohydrates and insulin. The task of making carbohydrate recommendations is potentially useful in scenarios where patients want to prevent hypoglycemia well in advance, or where a person is interested in achieving a relatively higher target BGL in preparation for an exercise event that is expected to lower it. As a first step, in this paper we approach the problem of making carbohydrate recommendations.

The rest of this paper is organized as follows: Section 2 provides a more detailed description of the problem. Section 3 describes the model, as well as the baselines it is compared against. Section 4 describes the dataset and some features of the data. Section 5 discusses the training techniques that were used, together with the results of the experiments that motivated them. Section 6 contains the conclusion and plans for future work.

2 Three Carbohydrate Recommendation Scenarios

We assume that blood glucose levels are measured at 5 minute intervals through a CGM system. We also assume that discrete deliveries of insulin (boluses) and continuous infusions of insulin (basal rates) are recorded. Subjects provide the timing of meals, together with estimates of the amount of carbohydrates in each meal. Given the data available up to the present time $t$, the problem can be formally defined as predicting the number of grams of carbohydrates (number of carbs) $C_{t_m}$ in a meal that is to be consumed at time $t_m \in [t, t+\tau)$, such that the person's BGL reaches a specified target value $BG_{t+\tau}$ at time $t+\tau$ in the future. Without loss of generality, in this paper we set the prediction horizon $\tau$ to 30 or 60 minutes. We define three carbohydrate prediction scenarios, depending on whether events such as boluses or other meals happen inside the prediction window $[t, t+\tau)$:

1. Scenario S1 assumes that there are no events in the prediction window $[t, t+\tau)$. Training a model for this scenario can be difficult due to the scarcity of corresponding training examples, as meals are typically preceded by boluses. The example shown in Figure 1 would belong to this scenario if the orange and red outlined meals and boluses were not present.
2. Scenario S2 subsumes scenario S1 by also allowing events before the meal, i.e. in the time window $[t, t_m]$. The example shown in Figure 1 would be a scenario S2 example if the bolus outlined in red were not present, and would correspond to answering the following what-if question: how many carbs should be consumed at time $t_m$ to achieve the target $BG_{t+\tau}$, if the meal were to be preceded by another meal and a bolus?
3. Scenario S3 is the most general and allows events to happen anywhere in the prediction window $[t, t+\tau)$. The example in Figure 1 is a scenario S3 example, but not a scenario S1 or S2 example, because of the presence of the orange and red outlined meal and bolus.

Figure 1. The general neural network architecture for carbohydrate recommendation. The dashed blue line in the graph represents a subject's BGL, while the solid brown line represents the basal rate of insulin. The gray star represents the meal at $t_m$; other meals are represented by squares, and boluses by circles. Meals and boluses with a green outline are allowed in all three example scenarios, those with an orange outline are allowed in scenario S2 and scenario S3 examples, and those with a red outline are only allowed in scenario S3 examples. The blue units in LSTM1 receive input from time steps in the past, while the green units in LSTM2 receive input from the prediction window. The purple trapezoid represents the 5 fully connected layers, and the output node at the end computes the carbohydrate prediction.

We train and evaluate carbohydrate recommendation models for each scenario, using data acquired from 6 subjects with type 1 diabetes [5]. Given the scarcity of training examples for scenario S1, our starting hypothesis is that models trained on examples from scenario S3 will implicitly learn physiological patterns that improve performance on the fewer examples available in scenario S1.
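To make the three scenario definitions concrete, the sketch below restates them as a membership test in Python. The function name and the event-list representation are ours, for illustration only; the function returns the most specific scenario an example satisfies, reflecting the fact that S1 examples are also S2 examples, and S2 examples are also S3 examples.

```python
from typing import List

def scenario(event_times: List[float], t: float, t_m: float, tau: float) -> int:
    """Return the most specific scenario (1, 2, or 3) that an example
    satisfies. `event_times` holds the times (in minutes) of all meal
    and bolus events other than the meal to be predicted at t_m.
    """
    # Other events that fall inside the prediction window [t, t + tau).
    in_window = [e for e in event_times if t <= e < t + tau]
    if not in_window:
        return 1          # S1: no other events in the prediction window
    if all(e <= t_m for e in in_window):
        return 2          # S2: events occur only in [t, t_m]
    return 3              # S3: events may occur anywhere in the window
```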
3 Baseline Models and Neural Architecture

Given training data containing meals with their corresponding timestamps and carbohydrate amounts, we define the following baselines:

1. Global average: The average number of carbs over all of the meals in the subject's training data, $\mu$, is computed and used as the estimate for all future meals, irrespective of context. This is a fairly simple baseline, as it predicts the same value for every example.
2. ToD average: In this Time-of-Day (ToD) dependent baseline, an average number of carbs is computed for each of the following five time windows during a day:
   • 12am-6am: $\mu_1$ = early breakfast/late snacks.
   • 6am-10am: $\mu_2$ = breakfast.
   • 10am-2pm: $\mu_3$ = lunch.
   • 2pm-6pm: $\mu_4$ = dinner.
   • 6pm-12am: $\mu_5$ = late dinner/post-dinner snacks.

The average for each ToD interval is calculated over all of the meals appearing in the corresponding time frame in the subject's training data. At test time, to predict the number of carbs for a meal to be consumed at time $t_m$, we first determine the ToD interval that contains $t_m$ and output the corresponding ToD average. Given sufficient historical data, the ToD baseline is expected to perform well for individuals who eat consistently and have regular diets; conversely, it is expected to perform poorly for individuals whose diets vary considerably.
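As an illustration, here is a minimal sketch of both baselines in Python. The function names, the (hour-of-day, carbs) meal-log format, and the fallback to the global average for a ToD window with no training meals are our assumptions, not details from the paper.

```python
import statistics

# ToD windows in hours, matching the list above:
# [0,6), [6,10), [10,14), [14,18), [18,24).
TOD_BOUNDS = [(0, 6), (6, 10), (10, 14), (14, 18), (18, 24)]

def tod_index(hour: float) -> int:
    """Index of the ToD window containing the given hour of day."""
    for i, (lo, hi) in enumerate(TOD_BOUNDS):
        if lo <= hour < hi:
            return i
    raise ValueError(f"hour out of range: {hour}")

def fit_baselines(train_meals):
    """train_meals: list of (hour_of_day, carbs) pairs for one subject.
    Returns the global average mu and the per-window averages mu_1..mu_5.
    """
    mu = statistics.mean(c for _, c in train_meals)   # global average baseline
    tod_mu = []
    for i in range(len(TOD_BOUNDS)):
        carbs = [c for h, c in train_meals if tod_index(h) == i]
        tod_mu.append(statistics.mean(carbs) if carbs else mu)
    return mu, tod_mu

def predict_tod(tod_mu, hour: float) -> float:
    """ToD baseline: output the average of the window containing the meal."""
    return tod_mu[tod_index(hour)]
```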
While simple to compute and use at test time, the two baselines are likely to give suboptimal performance, as their predictions ignore the history of BG values, insulin (boluses and basal rates), and meals, all of which can significantly modulate the effect a future meal will have on the BGL. To exploit this information, we propose the general neural network architecture shown in Figure 1. The first component of the architecture is a recurrent neural network (RNN), instantiated using Long Short-Term Memory (LSTM) cells [3], which is run over the previous 6 hours of data, up to the present time $t$. At each time step (every 5 minutes), this LSTM network takes as input the BGL, the carbohydrates, and the insulin dosages recorded at that time step. While sufficient for processing data corresponding to scenario S1, this LSTM cannot be used to process events in the prediction window $[t, t+\tau)$ that may appear in scenarios S2 and S3, for which BGL values are not available. Therefore, in these scenarios, the final state computed by the first LSTM model (LSTM1) at time $t$ is projected and used as the initial state for a second LSTM model (LSTM2), which is run over the time steps in $(t, t+\tau)$. The final state computed either by LSTM1 (for scenario S1) or LSTM2 (for scenarios S2 and S3) is then used as input to a fully connected network (FCN) whose output node computes $\hat{C}_{t_m}$, an estimate of the carbohydrates at time $t_m$. Besides the LSTM final state, the input to the FCN contains the following additional features:

1. The target BGL at $\tau$ minutes into the future, i.e. $BG_{t+\tau}$.
2. The time interval $\Delta = t_m - t$ between the intended meal time and the present.
3. The ToD average from Baseline 2, corresponding to the time the meal is to be eaten.

The entire architecture is trained to minimize the mean squared error between the actual carbohydrates $C_{t_m}$ recorded in the training data and the estimate $\hat{C}_{t_m}$ computed by the output node of the FCN module. Each LSTM uses vectors of size 100 for its states and gates, whereas the FCN is built with 5 hidden layers, each consisting of 200 ReLU neurons, followed by one linear output node.
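A condensed PyTorch sketch of this architecture is given below. The state size (100) and the FCN shape (5 hidden layers of 200 ReLU units, one linear output) follow the text; the input channels for LSTM2 (carbs and insulin only, since BGL values are unavailable inside the prediction window) and the use of linear layers to project LSTM1's final state into LSTM2's initial state are our reading of the description, and the per-time-step dropout mentioned later is omitted for brevity.

```python
import torch
import torch.nn as nn

class CarbRecommender(nn.Module):
    """Two chained LSTMs followed by a fully connected network (FCN)."""

    def __init__(self, hidden: int = 100, n_extra: int = 3):
        super().__init__()
        # LSTM1 consumes 6 hours of history: BGL, carbs, insulin per step.
        self.lstm1 = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        # Linear projections of LSTM1's final (h, c) into LSTM2's initial state.
        self.proj_h = nn.Linear(hidden, hidden)
        self.proj_c = nn.Linear(hidden, hidden)
        # LSTM2 consumes prediction-window steps, where no BGL is available.
        self.lstm2 = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        layers, d = [], hidden + n_extra
        for _ in range(5):                      # 5 hidden layers of 200 ReLU units
            layers += [nn.Linear(d, 200), nn.ReLU()]
            d = 200
        layers.append(nn.Linear(d, 1))          # linear output node: carb estimate
        self.fcn = nn.Sequential(*layers)

    def forward(self, past, future, extra):
        # past:   (B, 72, 3) -- 6 h of 5-min steps with BGL, carbs, insulin
        # future: (B, W, 2)  -- window steps with carbs, insulin; W = 0 for S1
        # extra:  (B, 3)     -- target BG_{t+tau}, delta = t_m - t, ToD average
        _, (h, c) = self.lstm1(past)
        if future.size(1) > 0:                  # scenarios S2 and S3
            _, (h, _) = self.lstm2(future, (self.proj_h(h), self.proj_c(c)))
        return self.fcn(torch.cat([h[-1], extra], dim=1)).squeeze(1)

# Training minimizes the mean squared error against the recorded carbs:
# loss = nn.MSELoss()(model(past, future, extra), carbs)
```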
4 Dataset

The data used for the models was collected from 6 subjects with type 1 diabetes [5]. Information including the basal rate of insulin, boluses, meals, and BGL readings was collected over roughly 50 days, although the exact amount of time varies from subject to subject. This time series data is split into three sets, as follows: the last 10 days of data for each subject are used for testing, the previous 10 days are used for validation, and the remainder of the data is used for training.

4.1 From Meal Events to Examples

Since the total number of available examples is directly related to the number of meals, it is useful to know how many meals each subject had. This is shown in Table 1, together with the average number of carbs per meal (Avg) and the corresponding standard deviation (StdDev). Most subjects have a similar average number of carbohydrates in their meals, with the exception of subject 570, who has a significantly larger number of carbs per meal on average and, more importantly, a much higher standard deviation than the other subjects.

Table 1. Meal statistics, per subject and total.

                        Carbs Per Meal
  Subject    Meals      Avg      StdDev
  559         179        36.0     16.0
  563         153        29.9     16.3
  570         169       105.3     42.0
  575         284        40.6     22.9
  588         257        30.8     16.6
  591         249        31.6     14.2
  Total      1291        43.5     33.1

A meal event occurring at time $t_m$ may give rise to multiple examples, depending on the position of $t_m$ in the interval $[t, t+\tau)$. When $\tau = 30$ minutes, an example is created for every possible position of $t_m$ within $[t, t+\tau)$. However, when $\tau = 60$ minutes, an example is created only for every position of $t_m$ within $[t, t+30]$, to ensure that there are at least 30 minutes between the meal and the prediction horizon. Table 2 shows the resulting number of examples for $\tau$ = 30 and 60 minutes in each of the three scenarios. Note that there are fewer examples in scenarios S1 and S2 when $\tau$ = 60 vs. 30 minutes, despite there being more scenario S3 examples. This can be explained by the criteria for scenarios S1 and S2 being even more difficult to meet when $\tau$ = 60 minutes, i.e. there cannot be any event within $[t, t+60)$ for S1, or any event within $[t_m, t+60)$ for S2.

Table 2. Example counts by scenario, for 30 and 60 minutes.

               Scenario S1     Scenario S2     Scenario S3
  Dataset       30      60      30      60      30      60
  Training     2396    1923    3889    3491    5096    5931
  Validation    629     510    1061     981    1388    1626
  Testing       469     339     950     851    1236    1435
  Total        3494    2772    5900    5323    7720    8992
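Under our assumptions about the data layout (times in minutes, on the 5-minute CGM grid), the expansion of one meal event into multiple examples can be sketched as follows; the function name and the returned record format are illustrative.

```python
def examples_from_meal(t_m: int, tau: int, step: int = 5):
    """Enumerate the candidate present times t for the meal at t_m.
    For tau = 30, t_m may fall anywhere in [t, t + 30); for tau = 60,
    t_m is restricted to [t, t + 30], leaving at least 30 minutes
    between the meal and the prediction horizon.
    """
    max_offset = tau if tau == 30 else 30 + step
    for offset in range(0, max_offset, step):   # offset of t_m from t
        t = t_m - offset
        yield {"t": t, "t_m": t_m, "horizon": t + tau}
```

For instance, `list(examples_from_meal(t_m=600, tau=60))` yields seven candidate examples (offsets 0 through 30 minutes), each of which would then be assigned to S1, S2, or S3 depending on the other events it contains (cf. the scenario test in Section 2).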
5 Experimental Evaluation

The Adam [4] variant of gradient descent is used for training, with the learning rate and mini-batch size tuned on the validation data. In an effort to avoid overfitting, early stopping with a patience of 5 epochs and dropout with a rate of 10% are used for both models. Interestingly, dropout was found to help the model only when applied to the LSTM networks, at each time step, and not to the fully connected network.

Since the overall number of examples available in the dataset is low, performance was improved by first pretraining a generic model on the combined data from all 6 subjects. Then, for each subject, a new model is initialized with the weights of the generic model and fine-tuned on the subject's training data. For each subject, five models were trained, using different seedings of the random number generators. We also experimented with fine-tuning models on the union of the training and validation data, instead of the training data alone. When this combined data is used, the average carb values used in the baselines are recalculated over the union of the training and validation data for each subject.
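The following training-loop sketch puts these pieces together, assuming the CarbRecommender module from Section 3 and data loaders that yield (past, future, extra, carbs) batches; the learning rate shown is only a placeholder for the value tuned on validation data, while the optimizer (Adam), loss (MSE), and early-stopping criterion (validation loss, patience of 5 epochs) follow the text.

```python
import copy
import torch

def train_model(model, train_dl, val_dl, lr=1e-3, patience=5, max_epochs=100):
    """Adam + early stopping on validation loss, minimizing MSE."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    best_val, best_state, stale = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        for past, future, extra, carbs in train_dl:
            opt.zero_grad()
            loss_fn(model(past, future, extra), carbs).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(p, f, e), c).item() for p, f, e, c in val_dl)
        if val < best_val:
            best_val, best_state, stale = val, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:
                break
    model.load_state_dict(best_state)
    return model

# Pretrain a generic model on the pooled data from all 6 subjects, then
# fine-tune a copy of it on each subject's own data:
# generic = train_model(CarbRecommender(), pooled_train_dl, pooled_val_dl)
# for s in subjects:
#     m = CarbRecommender(); m.load_state_dict(generic.state_dict())
#     per_subject[s] = train_model(m, s.train_dl, s.val_dl)
```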
5.1 Results

The metrics used to evaluate the performance of the models are the root mean squared error (RMSE) and the mean absolute error (MAE), which is less sensitive to large errors. At the end of the training process, there are five fine-tuned models for each subject. The average RMSE and MAE of the five models are reported, as well as the RMSE and MAE of the best model, where the "best" model is defined as the one with the lowest MAE on the validation data. The results of the five models for each subject are also averaged across all subjects, to obtain one overall RMSE and one overall MAE value for both the average model and the best model scores. The baselines are treated much the same way: their RMSE and MAE values are averaged across all subjects to give one RMSE and one MAE score per baseline.
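For reference, these are the standard definitions of the two metrics, computed over the $N$ test examples of a subject, with $C_i$ the recorded carbs and $\hat{C}_i$ the model's estimate:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(C_i - \hat{C}_i\bigr)^2},
\qquad
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\bigl|C_i - \hat{C}_i\bigr| .
\]

Squaring the residuals makes the RMSE more sensitive to occasional large errors, which matters in the scenario S1 analysis below.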
Table 3 compares the validation results achieved in scenario S3 by models with and without pretraining, for $\tau = 30$ minutes. This experiment clearly shows the benefit of pretraining the models: both the RMSE and the MAE are noticeably lower for the pretrained models. As a result, pretraining is always used as part of the training process, for both values of $\tau$.

Table 3. Results with and without pretraining, $\tau$ = 30.

  Setting                RMSE    MAE
  Without Pretraining    22.2    15.5
  With Pretraining       20.7    14.5

Table 4 compares models that were fine-tuned on the union of the training and validation data with models fine-tuned solely on the training data, in scenario S3. The results show that the extra examples provided by the validation data proved helpful in improving performance. It is interesting to note that using the combined training and validation data only slightly helped the baselines, but helped the LSTM-based models by a noticeable margin.

Table 4. Fine-tuning on Training vs. Training ∪ Validation, $\tau$ = 30.

  Fine-tuning              Baselines & Models    RMSE    MAE
  Training                 Global Average        23.3    19.2
                           ToD Average           22.5    17.8
                           Average Model         21.3    16.0
                           Best Model            20.7    15.3
  Training ∪ Validation    Global Average        23.1    19.0
                           ToD Average           22.2    17.7
                           Average Model         20.1    15.0
                           Best Model            19.2    14.2

Table 5 compares the baselines (Global and ToD averages) with the trained models (Best and Average) in terms of their RMSE and MAE in the three scenarios.

Table 5. Results for scenarios S1, S2, and S3, for $\tau$ = 30 and 60 minutes.

       Baselines & Models    RMSE 30   RMSE 60   MAE 30   MAE 60
  S1   Global Average         19.7      18.4      15.7     15.0
       ToD Average            18.9      17.6      14.8     14.4
       Average Model          19.3      19.5      14.1     13.9
       Best Model             19.0      19.8      13.9     13.9
  S2   Global Average         18.4      17.1      14.5     13.8
       ToD Average            17.4      15.9      13.1     12.2
       Average Model          16.2      15.3      11.9     11.4
       Best Model             15.8      15.4      11.6     10.9
  S3   Global Average         18.5      18.6      14.6     14.7
       ToD Average            17.5      17.6      13.2     13.3
       Average Model          15.7      15.6      11.5     11.3
       Best Model             15.6      14.8      11.4     10.6

Overall, the LSTM-based models (Average or Best) had the best RMSE and MAE performance across all three scenarios, with the exception of the RMSE scores for scenario S1. Compared with the other two scenarios, both the LSTM models and the baselines perform worse in S1, and the decline is even more apparent for the LSTM models, which cannot beat the time-dependent baseline in terms of RMSE for either the 30 minute or the 60 minute prediction horizon. This can be explained by the limited number of examples for scenario S1: since there are so few test examples per subject in this scenario, one bad prediction can hurt the results significantly, more so for the RMSE than for the MAE. Furthermore, the trained models tend to make very similar predictions for all examples stemming from a given meal, meaning that if a model makes a bad prediction on one test example, it likely makes a series of similarly bad predictions.

To alleviate the scarcity of training examples in scenario S1, models trained on S3 examples, which are the most plentiful and subsume S1, were evaluated separately on test examples from S1. This gives an indication of whether any transfer learning is taking place. Table 6 shows the results of this transfer learning experiment, indicating that training on the additional examples from scenario S3 improves performance on scenario S1 to the point where the LSTM-based models now outperform both baselines.

Table 6. Comparative performance on scenario S1 test examples: Baselines vs. LSTM-based models trained on S1 and S3 examples.

                    Baselines & Models   RMSE on S1 30   RMSE on S1 60   MAE on S1 30   MAE on S1 60
                    Global Average           19.7            18.4            15.7           15.0
                    ToD Average              18.9            17.6            14.8           14.4
  Training on S1    Average Model            19.3            19.5            14.1           13.9
                    Best Model               19.0            19.8            13.9           13.9
  Training on S3    Average Model            18.2            17.6            13.6           13.3
                    Best Model               18.3            16.7            13.8           13.0

6 Conclusion and Future Work

We introduced a generic neural architecture, composed of two chained LSTMs and a fully connected network, for training data-driven models that make recommendations with respect to any type of quantitative event that may impact BG levels, in particular carbohydrate amounts and bolus insulin dosages. Experimental evaluations on the task of carbohydrate recommendation within a 30 or 60 minute prediction window demonstrate the feasibility and potential of the proposed architecture, as well as its ability to benefit from pretraining and transfer learning. Future plans include evaluating carbohydrate recommendations within larger prediction windows, as well as training the architecture for bolus recommendations.

ACKNOWLEDGEMENTS

This work was supported by grant 1R21EB022356 from the National Institutes of Health (NIH). Conversations with Josep Vehi helped shape the research directions presented herein. The contributions of physician collaborators Frank Schwartz, MD, and Amber Healy, DO, are gratefully acknowledged. We would also like to thank the anonymous people with type 1 diabetes who provided their blood glucose, insulin, and meal data.

REFERENCES

[1] R. Bunescu, N. Struble, C. Marling, J. Shubrook, and F. Schwartz, ‘Blood glucose level prediction using physiological models and support vector regression’, in Proceedings of the Twelfth International Conference on Machine Learning and Applications (ICMLA), pp. 135–140. IEEE Press, (2013).
[2] G. Cappon, M. Vettoretti, F. Marturano, A. Facchinetti, and G. Sparacino, ‘A neural-network-based approach to personalize insulin bolus calculation using continuous glucose monitoring’, Journal of Diabetes Science and Technology, 12(2), 265–272, (2018).
[3] S. Hochreiter and J. Schmidhuber, ‘Long short-term memory’, Neural Computation, 9(8), 1735–1780, (1997).
[4] D. P. Kingma and J. L. Ba, ‘Adam: A method for stochastic optimization’, in Third International Conference on Learning Representations (ICLR), San Diego, California, (2015).
[5] C. Marling and R. Bunescu, ‘The OhioT1DM dataset for blood glucose level prediction’, in The 3rd International Workshop on Knowledge Discovery in Healthcare Data, Stockholm, Sweden, (2018). Available at http://ceur-ws.org/Vol-2148/paper09.pdf.
[6] S. Mirshekarian, H. Shen, R. Bunescu, and C. Marling, ‘LSTMs and neural attention models for blood glucose prediction: Comparative experiments on real and synthetic data’, in Proceedings of the 41st International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2019), Berlin, Germany, (2019).
[7] S. G. Mougiakakou and K. S. Nikita, ‘A neural network approach for insulin regime and dose adjustment in type 1 diabetes’, Diabetes Technology & Therapeutics, 2(3), 381–389, (2000).
[8] K. Plis, R. Bunescu, C. Marling, J. Shubrook, and F. Schwartz, ‘A machine learning approach to predicting blood glucose levels for diabetes management’, in Modern Artificial Intelligence for Health Analytics: Papers Presented at the Twenty-Eighth AAAI Conference on Artificial Intelligence, pp. 35–39. AAAI Press, (2014).
[9] Q. Sun, M. V. Jankovic, J. Budzinski, B. Moore, P. Diem, C. Stettler, and S. G. Mougiakakou, ‘A dual mode adaptive basal-bolus advisor based on reinforcement learning’, IEEE Journal of Biomedical and Health Informatics, 23(6), 2633–2641, (2019).
[10] J. Walsh, R. Roberts, T. S. Bailey, and L. Heinemann, ‘Bolus advisors: Sources of error, targets for improvement’, Journal of Diabetes Science and Technology, 12(1), 190–198, (2018).
[11] H. C. Zisser, L. T. Robinson, W. Bevier, E. Dassau, C. L. Ellingsen, F. J. Doyle, and L. Jovanovic, ‘Bolus calculator: A review of four “smart” insulin pumps’, Diabetes Technology & Therapeutics, 10(6), 441–444, (2008).