A General Neural Architecture for Carbohydrate and Bolus Recommendations in Type 1 Diabetes Management

Jeremy Beauchamp, Razvan Bunescu, and Cindy Marling¹

Abstract. People with type 1 diabetes must constantly monitor their blood glucose levels and take actions to keep them from getting either too high or too low. Having a snack will raise blood glucose levels; however, the amount of carbohydrates that should be consumed to reach a target level depends on the recent history of blood glucose levels, meals, boluses, and the basal rate of insulin. Conversely, to lower the blood glucose level, one can administer a bolus of insulin; however, determining the right amount of insulin in the bolus can be cognitively demanding, as it depends on similar contextual factors. In this paper, we show that a generic neural architecture previously used for blood glucose prediction in a what-if scenario can be converted to make either carbohydrate or bolus recommendations. Initial experimental evaluations on the task of predicting the carbohydrate amounts necessary to reach a target blood glucose level demonstrate the feasibility and potential of this general approach.

¹ Ohio University, USA, email: {jb199113,bunescu,marling}@ohio.edu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction and Motivation

Type 1 diabetes is a disease in which the pancreas fails to produce insulin, which is required for blood sugar to be absorbed into cells. Without insulin, blood sugar remains in the bloodstream, leading to high blood glucose levels (BGLs). To manage type 1 diabetes, insulin must be administered from an external source, such as injections or an insulin pump. People with type 1 diabetes also need to monitor their BGLs closely throughout the day, by testing blood acquired through fingersticks and/or by using a continuous glucose monitoring (CGM) system. If the BGL gets too high (hyperglycemia) or too low (hypoglycemia), the individual responds by eating, taking insulin, or taking some other action to bring their BGL back within a healthy range. The drawback of this approach is that the person with diabetes must react to their BGL when, ideally, they would be able to control it proactively. There has been much work on BGL prediction (see [1] and [8], for example), with the aim of enabling preemptive actions to manage BGLs before individuals experience the negative symptoms of hypoglycemia or hyperglycemia. However, individuals still need to figure out how much to eat, how much insulin to take, and what other actions they can take in order to prevent hypoglycemia or hyperglycemia.
The broad goal of the research presented in this paper is to essentially reverse the blood glucose prediction problem: instead of predicting future BGLs, we predict how many carbohydrates an individual should eat, or how much insulin they should administer with a bolus, in order to bring their BGL to a desired target. We previously introduced in [6] an LSTM-based neural architecture trained to answer what-if questions of the type "What will my BGL be in 60 minutes if I eat a snack with 30 carbs 10 minutes from now?". We show that, by using the BGL target as a feature and the carbohydrates or insulin as labels, a similar architecture can be trained instead to predict the number of carbohydrates that need to be consumed, or the amount of insulin that needs to be delivered, during the prediction window in order to reach that BGL target.

The work by Mougiakakou and Nikita [7] represents one of the first attempts to use neural networks for recommending insulin regimens and dosages. Bolus calculators, introduced as early as 2003 [11], use a standard formula to calculate the amount of bolus insulin based on parameters such as carbohydrate intake, carbohydrate-to-insulin ratio, insulin on board, and target BGL. Walsh et al. [10] discuss major sources of error in these calculators and potential targets for improvement, such as utilizing the massive quantities of clinical data being collected by bolus advisors. As observed by Cappon et al. [2], the standard formula approach ignores potentially useful preprandial conditions, such as the glucose rate of change; they proposed a feed-forward fully connected neural network that exploits CGM information and some easily accessible patient parameters, with experimental evaluations on simulated data showing a small but statistically significant improvement in the blood glucose risk index. Simulated data is also used by Sun et al. [9], where a basal-bolus advisor is trained using reinforcement learning to provide personalized suggestions to people with type 1 diabetes under multiple injections therapy.

The data-driven architecture proposed in this paper is generic in the sense that it can be trained to make recommendations about any variable that can impact BG levels, in particular carbohydrates and insulin. The task of making carbohydrate recommendations is potentially useful in scenarios where patients want to prevent hypoglycemia well in advance, or where a person is interested in achieving a relatively higher target BGL in preparation for an exercise event that is expected to lower it. As a first step, in this paper we approach the problem of making carbohydrate recommendations.

The rest of this paper is organized as follows: Section 2 provides a more detailed description of the problem. Section 3 describes the model, as well as the baselines it is compared against. Section 4 describes the dataset and some features of the data. Section 5 discusses the training techniques that were used, together with the results of the experiments that motivated them. Section 6 contains the conclusion and plans for future work.

2 Three Carbohydrate Recommendation Scenarios

We assume that blood glucose levels are measured at 5 minute intervals through a CGM system. We also assume that discrete deliveries of insulin (boluses) and continuous infusions of insulin (basal rates) are recorded. Subjects provide the timing of meals, together with estimates of the amount of carbohydrates in each meal. Given the data available up to the present time $t$, the problem can be formally defined as predicting the number of grams of carbohydrates (number of carbs) $C_{t_m}$ in a meal that is to be consumed at time $t_m \in [t, t+\tau)$, such that the person's BGL reaches a specified target value $BG_{t+\tau}$ at time $t+\tau$ in the future. Without loss of generality, in this paper we set the prediction horizon $\tau$ to 30 or 60 minutes. We define three carbohydrate prediction scenarios, depending on whether events such as boluses or other meals happen inside the prediction window $[t, t+\tau)$:

1. Scenario S1 assumes that there are no events in the prediction window $[t, t+\tau)$. Training a model for this scenario can be difficult due to the scarcity of corresponding training examples, as meals are typically preceded by boluses. The example shown in Figure 1 would belong to this scenario if the orange and red outlined meals and boluses were not present.
2. Scenario S2 subsumes scenario S1 by also allowing events before the meal, i.e. in the time window $[t, t_m]$. The example shown in Figure 1 would be a scenario S2 example if the bolus outlined in red were not present, and would correspond to answering the following what-if question: how many carbs should be consumed at time $t_m$ to achieve the target $BG_{t+\tau}$, if the meal were to be preceded by another meal and a bolus?
3. Scenario S3 is the most general and allows events to happen anywhere in the prediction window $[t, t+\tau)$. The example in Figure 1 is a scenario S3 example, but not a scenario S1 or S2 example, because of the presence of the orange and red outlined meal and bolus.

Figure 1. The general neural network architecture for carbohydrate recommendation. The dashed blue line in the graph represents a subject's BGL, while the solid brown line represents the basal rate of insulin. The gray star represents the meal at $t_m$; other meals are represented by squares, and boluses by circles. Meals and boluses with a green outline are allowed in all three example scenarios, those with an orange outline are allowed in scenario S2 and scenario S3 examples, and those with a red outline are only allowed in scenario S3 examples. The blue units in LSTM1 receive input from time steps in the past, while the green units in LSTM2 receive input from the prediction window. The purple trapezoid represents the 5 fully connected layers, and the output node at the end computes the carbohydrate prediction.

We train and evaluate carbohydrate recommendation models for each scenario, using data acquired from 6 subjects with type 1 diabetes [5]. Given the scarcity of training examples for scenario S1, our starting hypothesis is that models trained on examples from scenario S3 will implicitly learn physiological patterns that improve performance on the fewer examples available in scenario S1.
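To make the three scenario definitions concrete, the sketch below restates them as a membership test in Python. The function name and the event-list representation are ours, for illustration only; the function returns the most specific scenario an example satisfies, reflecting the fact that S1 examples are also S2 examples, and S2 examples are also S3 examples.

```python
from typing import List

def scenario(event_times: List[float], t: float, t_m: float, tau: float) -> int:
    """Return the most specific scenario (1, 2, or 3) that an example
    satisfies. `event_times` holds the times (in minutes) of all meal
    and bolus events other than the meal to be predicted at t_m.
    """
    # Other events that fall inside the prediction window [t, t + tau).
    in_window = [e for e in event_times if t <= e < t + tau]
    if not in_window:
        return 1          # S1: no other events in the prediction window
    if all(e <= t_m for e in in_window):
        return 2          # S2: events occur only in [t, t_m]
    return 3              # S3: events may occur anywhere in the window
```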
3 Baseline Models and Neural Architecture

Given training data containing meals with their corresponding timestamps and carbohydrate amounts, we define the following baselines:

1. Global average: The average number of carbs over all of the meals in the subject's training data, $\mu$, is computed and used as the estimate for all future meals, irrespective of context. This is a fairly simple baseline, as it predicts the same value for every example.
2. ToD average: In this Time-of-Day (ToD) dependent baseline, an average number of carbs is computed for each of the following five time windows during a day:
   • 12am-6am: $\mu_1$ = early breakfast/late snacks.
   • 6am-10am: $\mu_2$ = breakfast.
   • 10am-2pm: $\mu_3$ = lunch.
   • 2pm-6pm: $\mu_4$ = dinner.
   • 6pm-12am: $\mu_5$ = late dinner/post-dinner snacks.

The average for each ToD interval is calculated over all of the meals appearing in the corresponding time frame in the subject's training data. At test time, to predict the number of carbs for a meal to be consumed at time $t_m$, we first determine the ToD interval that contains $t_m$ and output the corresponding ToD average. Given sufficient historical data, the ToD baseline is expected to perform well for individuals who eat consistently and have regular diets; conversely, it is expected to perform poorly for individuals whose diets vary considerably.
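As an illustration, here is a minimal sketch of both baselines in Python. The function names, the (hour-of-day, carbs) meal-log format, and the fallback to the global average for a ToD window with no training meals are our assumptions, not details from the paper.

```python
import statistics

# ToD windows in hours, matching the list above:
# [0,6), [6,10), [10,14), [14,18), [18,24).
TOD_BOUNDS = [(0, 6), (6, 10), (10, 14), (14, 18), (18, 24)]

def tod_index(hour: float) -> int:
    """Index of the ToD window containing the given hour of day."""
    for i, (lo, hi) in enumerate(TOD_BOUNDS):
        if lo <= hour < hi:
            return i
    raise ValueError(f"hour out of range: {hour}")

def fit_baselines(train_meals):
    """train_meals: list of (hour_of_day, carbs) pairs for one subject.
    Returns the global average mu and the per-window averages mu_1..mu_5.
    """
    mu = statistics.mean(c for _, c in train_meals)   # global average baseline
    tod_mu = []
    for i in range(len(TOD_BOUNDS)):
        carbs = [c for h, c in train_meals if tod_index(h) == i]
        tod_mu.append(statistics.mean(carbs) if carbs else mu)
    return mu, tod_mu

def predict_tod(tod_mu, hour: float) -> float:
    """ToD baseline: output the average of the window containing the meal."""
    return tod_mu[tod_index(hour)]
```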
While simple to compute and use at test time, the two baselines are likely to give suboptimal performance, as their predictions ignore the history of BG values, insulin (boluses and basal rates), and meals, all of which can significantly modulate the effect a future meal will have on the BGL. To exploit this information, we propose the general neural network architecture shown in Figure 1. The first component of the architecture is a recurrent neural network (RNN), instantiated using Long Short-Term Memory (LSTM) cells [3], which is run over the previous 6 hours of data, up to the present time $t$. At each time step (every 5 minutes), this LSTM network takes as input the BGL, the carbohydrates, and the insulin dosages recorded at that time step. While sufficient for processing data corresponding to scenario S1, this LSTM cannot be used to process events in the prediction window $[t, t+\tau)$ that may appear in scenarios S2 and S3, for which BGL values are not available. Therefore, in these scenarios, the final state computed by the first LSTM model (LSTM1) at time $t$ is projected and used as the initial state for a second LSTM model (LSTM2), which is run over the time steps in $(t, t+\tau)$. The final state computed either by LSTM1 (for scenario S1) or LSTM2 (for scenarios S2 and S3) is then used as input to a fully connected network (FCN) whose output node computes $\hat{C}_{t_m}$, an estimate of the carbohydrates at time $t_m$. Besides the LSTM final state, the input to the FCN contains the following additional features:

1. The target BGL at $\tau$ minutes into the future, i.e. $BG_{t+\tau}$.
2. The time interval $\Delta = t_m - t$ between the intended meal time and the present.
3. The ToD average from Baseline 2, corresponding to the time the meal is to be eaten.

The entire architecture is trained to minimize the mean squared error between the actual carbohydrates $C_{t_m}$ recorded in the training data and the estimate $\hat{C}_{t_m}$ computed by the output node of the FCN module. Each LSTM uses vectors of size 100 for its states and gates, whereas the FCN is built with 5 hidden layers, each consisting of 200 ReLU neurons, followed by one linear output node.
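A condensed PyTorch sketch of this architecture is given below. The state size (100) and the FCN shape (5 hidden layers of 200 ReLU units, one linear output) follow the text; the input channels for LSTM2 (carbs and insulin only, since BGL values are unavailable inside the prediction window) and the use of linear layers to project LSTM1's final state into LSTM2's initial state are our reading of the description, and the per-time-step dropout mentioned later is omitted for brevity.

```python
import torch
import torch.nn as nn

class CarbRecommender(nn.Module):
    """Two chained LSTMs followed by a fully connected network (FCN)."""

    def __init__(self, hidden: int = 100, n_extra: int = 3):
        super().__init__()
        # LSTM1 consumes 6 hours of history: BGL, carbs, insulin per step.
        self.lstm1 = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        # Linear projections of LSTM1's final (h, c) into LSTM2's initial state.
        self.proj_h = nn.Linear(hidden, hidden)
        self.proj_c = nn.Linear(hidden, hidden)
        # LSTM2 consumes prediction-window steps, where no BGL is available.
        self.lstm2 = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        layers, d = [], hidden + n_extra
        for _ in range(5):                      # 5 hidden layers of 200 ReLU units
            layers += [nn.Linear(d, 200), nn.ReLU()]
            d = 200
        layers.append(nn.Linear(d, 1))          # linear output node: carb estimate
        self.fcn = nn.Sequential(*layers)

    def forward(self, past, future, extra):
        # past:   (B, 72, 3) -- 6 h of 5-min steps with BGL, carbs, insulin
        # future: (B, W, 2)  -- window steps with carbs, insulin; W = 0 for S1
        # extra:  (B, 3)     -- target BG_{t+tau}, delta = t_m - t, ToD average
        _, (h, c) = self.lstm1(past)
        if future.size(1) > 0:                  # scenarios S2 and S3
            _, (h, _) = self.lstm2(future, (self.proj_h(h), self.proj_c(c)))
        return self.fcn(torch.cat([h[-1], extra], dim=1)).squeeze(1)

# Training minimizes the mean squared error against the recorded carbs:
# loss = nn.MSELoss()(model(past, future, extra), carbs)
```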
4 Dataset

The data used for the models was collected from 6 subjects with type 1 diabetes [5]. Information including the basal rate of insulin, boluses, meals, and BGL readings was collected over roughly 50 days, although the exact amount of time varies from subject to subject. This time series data is split into three sets, as follows: the last 10 days of data for each subject are used for testing, the previous 10 days are used for validation, and the remainder of the data is used for training.

4.1 From Meal Events to Examples

Since the total number of available examples is directly related to the number of meals, it is useful to know how many meals each subject had. This is shown in Table 1, together with the average number of carbs per meal (Avg) and the corresponding standard deviation (StdDev). Most subjects have a similar average number of carbohydrates in their meals, with the exception of subject 570, who has a significantly larger number of carbs per meal on average and, more importantly, a much higher standard deviation than the other subjects.

Table 1. Meal statistics, per subject and total.

                        Carbs Per Meal
  Subject    Meals      Avg      StdDev
  559         179        36.0     16.0
  563         153        29.9     16.3
  570         169       105.3     42.0
  575         284        40.6     22.9
  588         257        30.8     16.6
  591         249        31.6     14.2
  Total      1291        43.5     33.1

A meal event occurring at time $t_m$ may give rise to multiple examples, depending on the position of $t_m$ in the interval $[t, t+\tau)$. When $\tau = 30$ minutes, an example is created for every possible position of $t_m$ within $[t, t+\tau)$. However, when $\tau = 60$ minutes, an example is created only for every position of $t_m$ within $[t, t+30]$, to ensure that there are at least 30 minutes between the meal and the prediction horizon. Table 2 shows the resulting number of examples for $\tau$ = 30 and 60 minutes in each of the three scenarios. Note that there are fewer examples in scenarios S1 and S2 when $\tau$ = 60 vs. 30 minutes, despite there being more scenario S3 examples. This can be explained by the criteria for scenarios S1 and S2 being even more difficult to meet when $\tau$ = 60 minutes, i.e. there cannot be any event within $[t, t+60)$ for S1, or any event within $[t_m, t+60)$ for S2.

Table 2. Example counts by scenario, for 30 and 60 minutes.

               Scenario S1     Scenario S2     Scenario S3
  Dataset       30      60      30      60      30      60
  Training     2396    1923    3889    3491    5096    5931
  Validation    629     510    1061     981    1388    1626
  Testing       469     339     950     851    1236    1435
  Total        3494    2772    5900    5323    7720    8992
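Under our assumptions about the data layout (times in minutes, on the 5-minute CGM grid), the expansion of one meal event into multiple examples can be sketched as follows; the function name and the returned record format are illustrative.

```python
def examples_from_meal(t_m: int, tau: int, step: int = 5):
    """Enumerate the candidate present times t for the meal at t_m.
    For tau = 30, t_m may fall anywhere in [t, t + 30); for tau = 60,
    t_m is restricted to [t, t + 30], leaving at least 30 minutes
    between the meal and the prediction horizon.
    """
    max_offset = tau if tau == 30 else 30 + step
    for offset in range(0, max_offset, step):   # offset of t_m from t
        t = t_m - offset
        yield {"t": t, "t_m": t_m, "horizon": t + tau}
```

For instance, `list(examples_from_meal(t_m=600, tau=60))` yields seven candidate examples (offsets 0 through 30 minutes), each of which would then be assigned to S1, S2, or S3 depending on the other events it contains (cf. the scenario test in Section 2).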
5 Experimental Evaluation

The Adam [4] variant of gradient descent is used for training, with the learning rate and mini-batch size tuned on the validation data. In an effort to avoid overfitting, early stopping with a patience of 5 epochs and dropout with a rate of 10% are used for both models. Interestingly, dropout was found to help the model only when applied to the LSTM networks, at each time step, and not to the fully connected network.

Since the overall number of examples available in the dataset is low, performance was improved by first pretraining a generic model on the combined data from all 6 subjects. Then, for each subject, a new model is initialized with the weights of the generic model and fine-tuned on the subject's training data. For each subject, five models were trained, using different seedings of the random number generators. We also experimented with fine-tuning models on the union of the training and validation data, instead of the training data alone. When this combined data is used, the average carb values used in the baselines are recalculated over the union of the training and validation data for each subject.
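The following training-loop sketch puts these pieces together, assuming the CarbRecommender module from Section 3 and data loaders that yield (past, future, extra, carbs) batches; the learning rate shown is only a placeholder for the value tuned on validation data, while the optimizer (Adam), loss (MSE), and early-stopping criterion (validation loss, patience of 5 epochs) follow the text.

```python
import copy
import torch

def train_model(model, train_dl, val_dl, lr=1e-3, patience=5, max_epochs=100):
    """Adam + early stopping on validation loss, minimizing MSE."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    best_val, best_state, stale = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        for past, future, extra, carbs in train_dl:
            opt.zero_grad()
            loss_fn(model(past, future, extra), carbs).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(p, f, e), c).item() for p, f, e, c in val_dl)
        if val < best_val:
            best_val, best_state, stale = val, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:
                break
    model.load_state_dict(best_state)
    return model

# Pretrain a generic model on the pooled data from all 6 subjects, then
# fine-tune a copy of it on each subject's own data:
# generic = train_model(CarbRecommender(), pooled_train_dl, pooled_val_dl)
# for s in subjects:
#     m = CarbRecommender(); m.load_state_dict(generic.state_dict())
#     per_subject[s] = train_model(m, s.train_dl, s.val_dl)
```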
5.1 Results

The metrics used to evaluate the performance of the models are the root mean squared error (RMSE) and the mean absolute error (MAE), which is less sensitive to large errors. At the end of the training process, there are five fine-tuned models for each subject. The average RMSE and MAE of the five models are reported, as well as the RMSE and MAE of the best model, where the "best" model is defined as the one with the lowest MAE on the validation data. The results of the five models for each subject are also averaged across all subjects, to obtain one overall RMSE and one overall MAE value for both the average model and the best model scores. The baselines are treated much the same way: their RMSE and MAE values are averaged across all subjects to give one RMSE and one MAE score per baseline.
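For reference, these are the standard definitions of the two metrics, computed over the $N$ test examples of a subject, with $C_i$ the recorded carbs and $\hat{C}_i$ the model's estimate:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(C_i - \hat{C}_i\bigr)^2},
\qquad
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\bigl|C_i - \hat{C}_i\bigr| .
\]

Squaring the residuals makes the RMSE more sensitive to occasional large errors, which matters in the scenario S1 analysis below.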
Table 3 compares the validation results achieved in scenario S3 by models with and without pretraining, for $\tau = 30$ minutes. This experiment clearly shows the benefit of pretraining the models: both the RMSE and the MAE are noticeably lower for the pretrained models. As a result, pretraining is always used as part of the training process, for both values of $\tau$.

Table 3. Results with and without pretraining, $\tau$ = 30.

  Setting                RMSE    MAE
  Without Pretraining    22.2    15.5
  With Pretraining       20.7    14.5

Table 4 compares models that were fine-tuned on the union of the training and validation data with models fine-tuned solely on the training data, in scenario S3. The results show that the extra examples provided by the validation data proved helpful in improving performance. It is interesting to note that using the combined training and validation data only slightly helped the baselines, but helped the LSTM-based models by a noticeable margin.

Table 4. Fine-tuning on Training vs. Training ∪ Validation, $\tau$ = 30.

  Fine-tuning              Baselines & Models    RMSE    MAE
  Training                 Global Average        23.3    19.2
                           ToD Average           22.5    17.8
                           Average Model         21.3    16.0
                           Best Model            20.7    15.3
  Training ∪ Validation    Global Average        23.1    19.0
                           ToD Average           22.2    17.7
                           Average Model         20.1    15.0
                           Best Model            19.2    14.2

Table 5 compares the baselines (Global and ToD averages) with the trained models (Best and Average) in terms of their RMSE and MAE in the three scenarios.

Table 5. Results for scenarios S1, S2, and S3, for $\tau$ = 30 and 60 minutes.

       Baselines & Models    RMSE 30   RMSE 60   MAE 30   MAE 60
  S1   Global Average         19.7      18.4      15.7     15.0
       ToD Average            18.9      17.6      14.8     14.4
       Average Model          19.3      19.5      14.1     13.9
       Best Model             19.0      19.8      13.9     13.9
  S2   Global Average         18.4      17.1      14.5     13.8
       ToD Average            17.4      15.9      13.1     12.2
       Average Model          16.2      15.3      11.9     11.4
       Best Model             15.8      15.4      11.6     10.9
  S3   Global Average         18.5      18.6      14.6     14.7
       ToD Average            17.5      17.6      13.2     13.3
       Average Model          15.7      15.6      11.5     11.3
       Best Model             15.6      14.8      11.4     10.6

Overall, the LSTM-based models (Average or Best) had the best RMSE and MAE performance across all three scenarios, with the exception of the RMSE scores for scenario S1. Compared with the other two scenarios, both the LSTM models and the baselines perform worse in S1, and the decline is even more apparent for the LSTM models, which cannot beat the time-dependent baseline in terms of RMSE for either the 30 minute or the 60 minute prediction horizon. This can be explained by the limited number of examples for scenario S1: since there are so few test examples per subject in this scenario, one bad prediction can hurt the results significantly, more so for the RMSE than for the MAE. Furthermore, the trained models tend to make very similar predictions for all examples stemming from a given meal, meaning that if a model makes a bad prediction on one test example, it likely makes a series of similarly bad predictions.

To alleviate the scarcity of training examples in scenario S1, models trained on S3 examples, which are the most plentiful and subsume S1, were evaluated separately on test examples from S1. This gives an indication of whether any transfer learning is taking place. Table 6 shows the results of this transfer learning experiment, indicating that training on the additional examples from scenario S3 improves performance on scenario S1 to the point where the LSTM-based models now outperform both baselines.

Table 6. Comparative performance on scenario S1 test examples: Baselines vs. LSTM-based models trained on S1 and S3 examples.

                    Baselines & Models   RMSE on S1 30   RMSE on S1 60   MAE on S1 30   MAE on S1 60
                    Global Average           19.7            18.4            15.7           15.0
                    ToD Average              18.9            17.6            14.8           14.4
  Training on S1    Average Model            19.3            19.5            14.1           13.9
                    Best Model               19.0            19.8            13.9           13.9
  Training on S3    Average Model            18.2            17.6            13.6           13.3
                    Best Model               18.3            16.7            13.8           13.0

6 Conclusion and Future Work

We introduced a generic neural architecture, composed of two chained LSTMs and a fully connected network, for training data-driven models that make recommendations with respect to any type of quantitative event that may impact BG levels, in particular carbohydrate amounts and bolus insulin dosages. Experimental evaluations on the task of carbohydrate recommendation within a 30 or 60 minute prediction window demonstrate the feasibility and potential of the proposed architecture, as well as its ability to benefit from pretraining and transfer learning. Future plans include evaluating carbohydrate recommendations within larger prediction windows, as well as training the architecture for bolus recommendations.

ACKNOWLEDGEMENTS

This work was supported by grant 1R21EB022356 from the National Institutes of Health (NIH). Conversations with Josep Vehi helped shape the research directions presented herein. The contributions of physician collaborators Frank Schwartz, MD, and Amber Healy, DO, are gratefully acknowledged. We would also like to thank the anonymous people with type 1 diabetes who provided their blood glucose, insulin, and meal data.

REFERENCES

[1] R. Bunescu, N. Struble, C. Marling, J. Shubrook, and F. Schwartz, ‘Blood glucose level prediction using physiological models and support vector regression’, in Proceedings of the Twelfth International Conference on Machine Learning and Applications (ICMLA), pp. 135–140. IEEE Press, (2013).
[2] G. Cappon, M. Vettoretti, F. Marturano, A. Facchinetti, and G. Sparacino, ‘A neural-network-based approach to personalize insulin bolus calculation using continuous glucose monitoring’, Journal of Diabetes Science and Technology, 12(2), 265–272, (2018).
[3] S. Hochreiter and J. Schmidhuber, ‘Long short-term memory’, Neural Computation, 9(8), 1735–1780, (1997).
[4] D. P. Kingma and J. L. Ba, ‘Adam: A method for stochastic optimization’, in Third International Conference on Learning Representations (ICLR), San Diego, California, (2015).
[5] C. Marling and R. Bunescu, ‘The OhioT1DM dataset for blood glucose level prediction’, in The 3rd International Workshop on Knowledge Discovery in Healthcare Data, Stockholm, Sweden, (2018). Available at http://ceur-ws.org/Vol-2148/paper09.pdf.
[6] S. Mirshekarian, H. Shen, R. Bunescu, and C. Marling, ‘LSTMs and neural attention models for blood glucose prediction: Comparative experiments on real and synthetic data’, in Proceedings of the 41st International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2019), Berlin, Germany, (2019).
[7] S. G. Mougiakakou and K. S. Nikita, ‘A neural network approach for insulin regime and dose adjustment in type 1 diabetes’, Diabetes Technology & Therapeutics, 2(3), 381–389, (2000).
[8] K. Plis, R. Bunescu, C. Marling, J. Shubrook, and F. Schwartz, ‘A machine learning approach to predicting blood glucose levels for diabetes management’, in Modern Artificial Intelligence for Health Analytics: Papers Presented at the Twenty-Eighth AAAI Conference on Artificial Intelligence, pp. 35–39. AAAI Press, (2014).
[9] Q. Sun, M. V. Jankovic, J. Budzinski, B. Moore, P. Diem, C. Stettler, and S. G. Mougiakakou, ‘A dual mode adaptive basal-bolus advisor based on reinforcement learning’, IEEE Journal of Biomedical and Health Informatics, 23(6), 2633–2641, (2019).
[10] J. Walsh, R. Roberts, T. S. Bailey, and L. Heinemann, ‘Bolus advisors: Sources of error, targets for improvement’, Journal of Diabetes Science and Technology, 12(1), 190–198, (2018).
[11] H. C. Zisser, L. T. Robinson, W. Bevier, E. Dassau, C. L. Ellingsen, F. J. Doyle, and L. Jovanovic, ‘Bolus calculator: A review of four “smart” insulin pumps’, Diabetes Technology & Therapeutics, 10(6), 441–444, (2008).