=Paper= {{Paper |id=Vol-2675/paper14 |storemode=property |title=Investigating Potentials and Pitfalls of Knowledge Distillation Across Datasets for Blood Glucose Forecasting |pdfUrl=https://ceur-ws.org/Vol-2675/paper14.pdf |volume=Vol-2675 |authors=Hadia Hameed,Samantha Kleinberg |dblpUrl=https://dblp.org/rec/conf/ecai/HameedK20 }}
Investigating potentials and pitfalls of knowledge distillation across datasets for blood glucose forecasting

Hadia Hameed, Samantha Kleinberg1


Abstract. Individuals with Type 1 diabetes (T1D) must frequently monitor their blood glucose (BG) and deliver insulin to regulate it. New devices like continuous glucose monitors (CGMs) and insulin pumps have helped reduce this burden by facilitating closed-loop technologies like the artificial pancreas (AP) for delivering insulin automatically. As more people use AP systems, which rely on a CGM and an insulin pump, there has been a dramatic increase in the availability of large-scale patient-generated health data (PGHD) in T1D. This data can potentially be used to train robust, generalizable models for accurate BG forecasting, which can then be used to make forecasts for smaller datasets like OhioT1DM in real time. In this work, we investigate the potential and pitfalls of using knowledge distillation to transfer knowledge from a model learned on one dataset to another, and compare it with the baseline case of using either dataset alone. We show that using a pre-trained model to do BG forecasting for OhioT1DM from CGM data only (univariate setting) has performance comparable to training on OhioT1DM itself. Using a single-step, univariate recurrent neural network (RNN) trained on OhioT1DM data alone, we achieve an overall RMSE of 19.21 and 31.77 mg/dl for a prediction horizon (PH) of 30 and 60 minutes, respectively.

1 Introduction

Type 1 diabetes (T1D) is a chronic, lifelong disease that requires dozens of daily decisions to manage blood glucose (BG). While keeping BG in a healthy range is critical for avoiding complications, it is challenging, as meals and many other factors like exercise and stress can affect BG and insulin sensitivity. Closed-loop technologies, which connect a continuous glucose monitor (CGM) and insulin pump with a control algorithm, could relieve this burden by automatically dosing insulin. This requires an accurate forecast of where glucose is headed so the right amount of insulin can be delivered to keep BG within a target range dynamically.

Prior works include using system identification techniques to model glucose-insulin interactions [18, 3], using classic autoregressive models for time series forecasting [23, 1, 5, 6], or training deep neural networks to implicitly learn the changing glucose level patterns [16, 17, 4, 24]. Neural network architectures such as LSTMs have been used successfully for many time series forecasting problems [10, 8, 7, 19, 15], but require large amounts of training data. This is a challenge for BG forecasting, as it is time consuming and can be infeasible to collect such massive datasets. However, there are now large public datasets created by people with diabetes sharing their own data, which we believe could be leveraged. In particular, the open source artificial pancreas system (OAPS) [11], a collaborative project led by people with T1D, has data donated by individuals using the system. To date, there is open source diabetes data available for more than 100 subjects, collected over a period of 1-4 years (more than 1000 days' worth of data for some individuals). This patient-generated data is self-reported, noisy, heterogeneous, and irregularly sampled, but it is much larger than the datasets routinely collected in controlled studies.

We propose that large public datasets like OAPS can be used to pretrain models, allowing deep learning to be used on smaller curated datasets for forecasting BG. In particular, we show that by augmenting and distilling knowledge across models trained on data obtained from different sources using RNNs, we achieve an accuracy comparable to that achieved by using the OhioT1DM dataset alone in the univariate setting. We also compare the performance with a multi-output setting in which multiple BG values are estimated over the prediction horizon simultaneously. The code is available at https://github.com/health-ai-lab/BGLP_BG_forcasting.

2 Methodology

The task here is to forecast future values of BG. We compare single-step and multi-output forecasting. In the single-step setting, a single glucose value is estimated several minutes into the future, whereas in multi-output forecasting several future values are estimated simultaneously to model the signal trajectory over the prediction horizon. We begin by describing our time series forecasting approach, and later discuss the dataset-specific preprocessing.

2.1 Problem setup

We define the feature vector X_{0:t} = {x_0, x_1, ..., x_t} ∈ R^n, with n being the number of variables. We use only raw CGM values and do not incorporate additional features like carbohydrate intake and insulin dosage. We also have a corresponding output time series X'_{t+1:t+h} = {x'_{t+1}, x'_{t+2}, ..., x'_{t+h}} ∈ R^h representing multiple future glucose values across a given prediction horizon (PH) of 30 and 60 minutes. As CGM data is recorded at a frequency of 5 minutes, a PH of 30 and 60 minutes leads to h = 6 and h = 12 samples, respectively. For the single-step setting, this target vector becomes X'_{t+h} = {x'_{t+h}}, estimating only a single value h time instances in the future. Multi-output forecasting, on the other hand, aims to estimate the joint probability p(X'_{t+1:t+h} | X_{0:t}) simultaneously. However, root mean square error (RMSE) was calculated by comparing the actual future glucose level with the last future value in the estimated multi-output sequence, to accurately measure the performance of the forecasting model across the two output settings.

1 Stevens Institute of Technology, USA, email: hhameed@stevens.edu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
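The windowing described in Section 2.1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `make_windows` and the array layout are our own assumptions, using a one-hour history (12 samples) and h = 6 for PH = 30 minutes on 5-minute CGM data.

```python
import numpy as np

def make_windows(series, history=12, h=6, multi_output=False):
    """Slide a unit-stride window over a CGM series sampled every 5 minutes.

    Returns (X, y): each row of X holds `history` past samples; y is either
    the single value h steps ahead (single-step, x'_{t+h}) or all h future
    values (multi-output, x'_{t+1} .. x'_{t+h}).
    """
    X, y = [], []
    for t in range(history, len(series) - h + 1):
        X.append(series[t - history:t])
        if multi_output:
            y.append(series[t:t + h])        # whole future trajectory
        else:
            y.append(series[t + h - 1])      # only the value at t + h
    return np.array(X), np.array(y)

cgm = np.arange(100.0, 200.0, 5.0)           # toy CGM trace, 20 samples
X, y = make_windows(cgm, history=12, h=6)
print(X.shape, y.shape)                       # (3, 12) (3,)
```

With a unit stride the windows overlap heavily, which is how the paper partitions the data; the multi-output variant returns a (windows, h) target matrix instead of a vector.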
2.2 Learning Framework

Our proposed approach is to make glucose estimations for a small dataset by pre-training an RNN on a larger dataset and then re-training it using the smaller dataset. We compare four learning approaches for glucose forecasting, as shown in Fig. 1: I) training and testing an RNN on OhioT1DM only (red path); II) training an RNN on the OAPS dataset and testing on OhioT1DM without any re-training (blue path); III) training an RNN on the OAPS dataset, training it again on OhioT1DM, and then testing on OhioT1DM (purple path); and IV) the pre-trained RNN model makes intermediate estimates called soft predictions, which are given as target estimates to a student artificial neural network (ANN) model instead of the actual ground truth, as done for a classification task in [2]. As shown in the figure, the black edges from the two datasets to the teacher model show that it is pre-trained using the source data (OAPS here) but uses target data for making final predictions in Approach II, for re-training in Approach III, and for making soft estimations in Approach IV (mimic learning), thus always having access to the two datasets.

[Figure 1 diagram: the OhioT1DM and OpenAPS datasets feed a teacher model trained on OpenAPS; Pathway II makes final predictions Yn directly, Pathway III re-trains the teacher on OhioT1DM, and Pathway IV passes soft predictions to a student model that produces the final predictions.]

Figure 1: Four learning pipelines to estimate blood glucose levels in OhioT1DM test data. Approach I: Student model only; Approach II: Teacher model without re-training; Approach III: Teacher model with re-training; Approach IV: Mimic learning (teacher + student model).

2.3 Network Architecture

We use a vanilla RNN with a single hidden layer H(t) with 32 units, followed by a fully-connected output layer O(t). This was used as the teacher model in Approaches II, III and IV and trained on the source patient-generated data. In Approach I, where no teacher model was involved, the RNN was used as a student model trained only on the target OhioT1DM dataset, to observe the effects of using datasets of different sizes and from different sources with the same network architecture. In Approach IV, a teacher RNN model trained on OAPS was used to teach a student ANN model using OhioT1DM data, to study the effects of knowledge distillation between different kinds of networks. We use a simple, fully-connected ANN with a single hidden layer with 32 units. The number of units for both the RNN and ANN was chosen after trying [28, 32, 64, 128] and optimizing for the least RMSE. The output layer O(t) predicts the glucose value(s) 30 or 60 minutes into the future depending on the PH and the output setting (single-step or multi-output).

2.4 Training models

The teacher model was trained on the OAPS dataset, which was pre-processed the same way as OhioT1DM, as discussed in Section 4. Early stopping was used to halt the training process if validation loss was not improving significantly, with the maximum number of epochs being 1000 and a batch size of 248 and 128 for OAPS and OhioT1DM respectively, proportional to the size of each dataset. Glorot normal initialization [9] was used to initialize the weight matrix. For the OhioT1DM dataset, the same training configurations (maximum epochs, batch size, initialization technique, etc.) were used with all the learning approaches (i.e., student, teacher, retrained teacher, teacher-student) for a fair comparison. The experiments for Approaches I, III and IV were repeated 10 times, and the average RMSE and MAE were recorded for each subject, along with the standard deviation, as presented in Section 5.

3 Data

We aim to evaluate the impact of using a large noisy dataset for improving forecasting in a smaller, more controlled dataset. The larger (source) dataset from OAPS [20] was used to pre-train the model before it was trained on OhioT1DM [12] (target), which is much smaller in terms of the total number of subjects and days for each.

3.1 OAPS

The collection of OAPS data started in 2015 as part of an initiative to make APS technology more accessible and transparent for people with T1D and to enable them to create their own customized AP systems. Participants can voluntarily donate their data, including glucose levels recorded via CGM, insulin basal and bolus rates, carbs intake, physical activity, and other physiological data. Researchers can gain access to this dataset free of charge, provided they share their insights and research findings with the public within a reasonable frame of time [21]. For this work, we used a subset of the dataset from individuals with multiple calendar years of data (55 people total, 320±158.3 days of data on average). Since this data is largely self-reported, it is noisy, irregularly sampled, and heterogeneous in terms of the variables recorded, but because of its sheer size, it is highly useful for pre-training a robust machine learning model for accurate BG forecasting.

3.2 OhioT1DM

The training data consists of 12 subjects: six from the OhioT1DM dataset shared in 2018 for the first BGLP Challenge (Group I) [13], and six from the second BGLP Challenge in 2020 (Group II) [12]. The validation and test samples are drawn from the last 10 days of data for subjects in Group I and Group II, respectively. The dataset contains around 8 weeks of data for 20 variables, including raw CGM values, insulin basal and boluses, carbohydrate intake, exercise, and sleep.

4 Data pre-processing

For both OAPS and OhioT1DM, we use four recorded variables and one attribute derived from the raw glucose values. The list of features
used in the experiments includes raw CGM values (glucose level), insulin basal rate (basal and temp basal), bolus amount (bolus), carbs intake (meal), and the difference between consecutive glucose values calculated during data pre-processing (glucose diff). The first step in data pre-processing was to synchronize the multi-modality data by generating a single timestamp data field based on the timestamps for each of the four fields, generating an irregularly sampled multivariate time series.

In the OAPS dataset, there were two types of gaps present in the data: first, where both the timestamp and glucose value were missing, and second, where the timestamp was recorded but the corresponding glucose value was missing. In OhioT1DM, missing glucose values were identified once the multi-modality data was synchronized, since basal, bolus, and meals are not recorded at the same 5-minute frequency as glucose levels. When there was missing glucose data for more than 25 consecutive minutes, these times were not used during training. Each data segment (a series of points not separated by a gap longer than 25 minutes) was then imputed and windowed separately to maintain temporal continuity in the data.

For the rest of the data, which may contain shorter gaps, we used linear interpolation to impute missing glucose values in the training data. Missing values in the test data were imputed by extrapolation to avoid using data from the future. Basal rates were imputed with forward filling, meaning replacing missing values with the last recorded basal rate, since the value is only recorded when it changes, and thus a missing value means the last recorded one is still active. However, if the field "temp basal", recording a temporary basal infusion rate, was present for a given set of timestamps, it was used to replace the recorded basal rate [12] by evenly distributing the rate across the time duration, which was divided into 5-minute intervals, as implemented in [14, 22]. Bolus rates were imputed in a similar manner by calculating the rate for every 5-minute interval and distributing it evenly across the specified duration; the bolus was set to 0 when it was not recorded, indicating that insulin was not bolused at those time instances. Similarly, the data field "meal", which recorded the amount of carbohydrate intake, was set to 0 when it was missing.

In addition to missing data, the sensors are also noisy, leading to sudden changes in glucose levels, which can cause high variance in the learned model. To remove these spikes, the signal was passed through a median filter with a window size of 5 samples, as in [25]. This was only done for the training data and not for the validation and test sets, to test the robustness of the model.

A sliding window was used to split the data into fixed-size sequences for further downstream analysis. There are three parameters for the moving window configuration: history window size (the number of past samples to use for forecasting), prediction horizon (PH) and output window (how far into the future and how many future values to predict), and stride (the number of samples to skip while sliding the window). An hour (12 samples) of past values was used to predict the glucose levels 30 and 60 minutes into the future (PH = 30, 60) with a unit stride, which means overlapping windows were used to partition the data.

In the OhioT1DM train and test data, the raw CGM values range from 70-275 mg/dl and 75-290 mg/dl on average, respectively. To ensure that the values of all the features were in the same range, insulin basal, bolus rates and carbs intake were normalized based on the minimum and maximum value of glucose levels using Min-Max normalization.

5 Experiments

5.1 Experimental set up

The last ten days of data for subjects with IDs 559, 563, 570, 575, 588 and 591 were used as the validation set, and the test set was sampled from the data for subjects 540, 544, 552, 567, 584 and 596. The processing steps for the test data included linear extrapolation for imputing missing values, and normalization. The test data was not passed through a median filter like the training set, to see how robust the trained models were to unseen, noisy data. We use root mean square error (RMSE) and mean absolute error (MAE) to compare the predicted values with the actual ground truth to evaluate the model. MAE and RMSE can be expressed as

    MAE = (1/n) * Σ_{i=1}^{n} |y_i − ŷ_i|                          (1)

    RMSE = sqrt( (1/n) * Σ_{i=1}^{n} (y_i − ŷ_i)^2 )               (2)

where y_i is the true glucose level and ŷ_i is the estimated glucose level, both measured in mg/dl. We repeated the experiments 10 times and calculated the average RMSE and MAE for each subject across the ten trials. We also report the best, worst and mean RMSE (MAE) across all the subjects for each of the four pipelines, using both single-step and multi-output models.

5.2 Results

Table 1: RMSE (MAE) for single-step and multi-output forecasting with different learning pipelines for a PH of 30 minutes.

(a) Single-step

Subject ID |       I       |      II       |      III      |      IV
540        | 19.55 (14.00) | 20.32 (14.60) | 20.36 (14.69) | 20.46 (14.81)
544        | 16.56 (11.51) | 17.84 (12.51) | 17.50 (12.20) | 17.92 (12.53)
552        | 15.04 (11.14) | 16.17 (11.90) | 15.72 (11.63) | 16.20 (12.06)
567        | 23.07 (14.67) | 24.09 (15.38) | 23.91 (15.32) | 24.74 (15.65)
584        | 25.19 (16.16) | 26.47 (16.88) | 26.97 (16.65) | 26.83 (16.84)
596        | 15.85 (10.98) | 17.24 (12.06) | 16.50 (11.52) | 17.50 (12.12)
Best       | 15.04 (11.14) | 16.17 (11.90) | 15.72 (11.63) | 16.20 (12.06)
Worst      | 25.19 (16.16) | 26.47 (16.88) | 26.97 (16.65) | 26.83 (16.84)
Average    | 19.21 (13.07) | 20.36 (13.89) | 20.16 (13.67) | 20.61 (14.00)

(b) Multi-output

Subject ID |       I       |      II       |      III      |      IV
540        | 20.30 (14.64) | 20.41 (14.77) | 20.36 (14.68) | 20.55 (15.18)
544        | 17.61 (12.19) | 18.07 (12.59) | 17.68 (12.23) | 18.41 (12.91)
552        | 15.68 (11.57) | 15.98 (11.74) | 15.66 (11.54) | 16.06 (12.08)
567        | 23.94 (15.29) | 24.88 (15.58) | 23.66 (15.08) | 24.47 (15.55)
584        | 26.61 (16.65) | 26.29 (16.71) | 25.82 (16.43) | 26.70 (17.01)
596        | 16.46 (11.43) | 17.17 (11.86) | 16.54 (16.54) | 17.57 (12.21)
Best       | 15.68 (11.57) | 15.98 (11.74) | 15.66 (11.54) | 17.57 (12.21)
Worst      | 26.61 (16.65) | 26.29 (16.71) | 25.82 (16.43) | 26.70 (17.01)
Average    | 20.10 (13.63) | 20.46 (13.87) | 19.95 (13.57) | 20.63 (14.16)

I: Student model only, II: Teacher model without re-training, III: Teacher model with re-training, IV: Mimic learning (teacher + student model)

The results for a PH of 30 and 60 minutes are shown in Tables 1 and 2, respectively.
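Equations (1) and (2) can be computed directly from the paired true and predicted glucose values. The sketch below is ours, not the authors' code; variable names are assumptions.

```python
import numpy as np

def mae(y_true, y_pred):
    """Eq. (1): mean absolute error, in mg/dl."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Eq. (2): root mean square error, in mg/dl."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# toy example: three true/predicted glucose pairs
print(mae([100, 110, 120], [102, 108, 126]))   # 3.33...
print(rmse([100, 110, 120], [102, 108, 126]))  # 3.82...
```

For the multi-output setting, the paper compares only the last value of the predicted sequence against the true value at t + h before applying these metrics.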
Overall, approach I achieved the lowest RMSE (MAE), with 19.21 (13.07) for a PH of 30 minutes and 31.77 (23.09) for PH = 60 minutes. In this approach an RNN was trained only on the OhioT1DM data, using raw CGM values. The worst performance was from approach IV, where estimations made by a teacher model pre-trained on the OpenAPS dataset were given as ground truth to a student ANN model for training on OhioT1DM, as shown in Tables 1a and 2a. This approach did not improve the forecast accuracy as it did in [2]. This might be because [2] used the technique for a classification task of mortality prediction, which involved predicting hard labels and evaluated performance using misclassification error, instead of estimating continuous-valued deviations from the ground truth as is the case in BG forecasting.

For BG forecasting using the multi-output model, all approaches performed equally well, with approaches I, II, and IV (student model, teacher, and teacher-student model) giving the same RMSE on average. For approach II, the error did not worsen significantly, showing that pre-trained models can be used for making forecasts for OhioT1DM data in real time without having to set aside a portion of the dataset for retraining the model, an important consideration for smaller datasets. However, the RMSE improved slightly for approach II when the teacher model was retrained.

Approach IV

Table 2: RMSE (MAE) for single-step and multi-output forecasting with different learning pipelines for a PH of 60 minutes.

(a) Single-step

Subject ID |       I       |      II       |      III      |      IV
540        | 33.94 (25.40) | 35.54 (26.74) | 35.16 (26.94) | 35.84 (27.35)
544        | 27.79 (20.34) | 31.07 (22.45) | 30.79 (22.84) | 31.23 (22.69)
552        | 26.68 (20.15) | 28.36 (21.08) | 27.99 (21.33) | 28.54 (21.53)
567        | 37.99 (26.50) | 39.89 (27.76) | 40.63 (28.47) | 39.57 (28.09)
584        | 37.47 (27.00) | 39.74 (27.87) | 39.40 (27.60) | 39.36 (27.85)
596        | 26.72 (19.12) | 28.41 (20.42) | 27.89 (20.31) | 28.75 (20.83)
Best       | 26.68 (20.15) | 28.36 (21.08) | 27.89 (20.31) | 28.54 (21.53)
Worst      | 37.99 (26.50) | 39.89 (27.76) | 40.63 (28.47) | 39.57 (28.09)
Average    | 31.77 (23.09) | 33.84 (24.39) | 33.64 (24.58) | 33.88 (24.72)

(b) Multi-output

Subject ID |       I       |      II       |      III      |      IV

The multi-output model described above, trained only on univariate data from the OhioT1DM dataset, achieved the least RMSE of 19.21 and 31.77 mg/dl for a PH of 30 and 60 minutes, respectively.

ACKNOWLEDGEMENTS

We would like to thank the referees for their comments, which helped improve this paper considerably. This work was supported in part by the NSF under award number 1915182, the NIH under award number R01LM011826, and a Fulbright Scholarship.

REFERENCES

[1] Ransford Henry Botwey, Elena Daskalaki, Peter Diem, and Stavroula G. Mougiakakou, 'Multi-model data fusion to improve an early warning system for hypo-/hyperglycemic events', in 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 4843-4846. IEEE, (2014).
[2] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu, 'Interpretable deep models for ICU outcome prediction', in AMIA Annual Symposium Proceedings, volume 2016, p. 371. American Medical Informatics Association, (2016).
[3] C. Cobelli, G. Nucci, and S. Del Prato, 'A physiological simulation model of the glucose-insulin system', in Proceedings of the First Joint BMES/EMBS Conference. 1999 IEEE Engineering in Medicine and Biology 21st Annual Conference and the 1999 Annual Fall Meeting of the Biomedical Engineering Society, volume 2, pp. 999-vol. IEEE, (1999).
[4] J. Fernandez de Canete, S. Gonzalez-Perez, and J. C. Ramos-Diaz, 'Artificial neural networks for closed loop control of in silico and ad hoc type 1 diabetes', Computer Methods and Programs in Biomedicine, 106(1), 55-66, (2012).
[5] Meriyan Eren-Oruklu, Ali Cinar, Lauretta Quinn, and Donald Smith, 'Estimation of future glucose concentrations with subject-specific recursive linear models', Diabetes Technology & Therapeutics, 11(4), 243-253, (2009).
[6] Adiwinata Gani, Andrei V. Gribok, Yinghui Lu, W. Kenneth Ward, Robert A. Vigersky, and Jaques Reifman, 'Universal glucose models for predicting subcutaneous glucose concentration in humans', IEEE Transactions on Information Technology in Biomedicine, 14(1), 157-165, (2009).
[7] André Gensler, Janosch Henze, Bernhard Sick, and Nils Raabe, 'Deep learning for solar power forecasting—an approach using autoencoder and LSTM neural networks', in 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 002858-002865. IEEE, (2016).
          540         35.23 (26.94)    35.32 (26.99)    35.33 (27.06)     35.66 (27.05)
          544         30.68 (22.83)    31.17 (22.74)    30.73 (22.79)     31.23 (22.84)
                                                                                            [8] Felix A Gers, Douglas Eck, and Jürgen Schmidhuber, ‘Applying lstm
          552         28.22 (21.57)    28.57 (21.56)    28.13 (21.43)     29.23 (21.98)         to time series predictable through time-window approaches’, in Neural
          567         39.53 (28.03)    39.07 (27.95)    39.32 (21.43)     41.33 (28.57)         Nets WIRN Vietri-01, 193–200, Springer, (2002).
          584         39.60(27.71)     39.43 (27.91)    39.43 (21.43)     39.07 (27.37)
          596         27.92 (20.27)    28.48 (20.55)    28.13 (20.42)     28.81 (20.81)
                                                                                            [9] Xavier Glorot and Yoshua Bengio, ‘Understanding the difficulty of
         Best         27.92 (20.27)    28.48 (20.55)     28.13 (20.42)    28.81 (20.81)         training deep feedforward neural networks’, in Proceedings of the thir-
         Worst        39.60 (27.71)    39.43 (27.91)     39.43 (21.43)    41.33 (28.57)         teenth international conference on artificial intelligence and statistics,
         Average      33.53 (24.56)    33.67 (24.62)    33.516 (24.52)    34.22 (24.77)
                                                                                                pp. 249–256, (2010).
                                                                                           [10] Nikolay Laptev, Jason Yosinski, Li Erran Li, and Slawek Smyl, ‘Time-
       I: Student model only, II: Teacher model without re-training,                            series extreme event forecasting with neural networks at uber’, in Inter-
       III: Teacher model with re-training, IV: Mimic learning (teacher + student model)        national Conference on Machine Learning, volume 34, pp. 1–5, (2017).
                                                                                           [11] Dana Lewis, Scott Leibrand, and OpenAPS Community, ‘Real-world
                                                                                                use of open source artificial pancreas systems’, Journal of diabetes sci-
                                                                                                ence and technology, 10(6), 1411, (2016).
                                                                                           [12] Cindy Marling and Razvan Bunescu, ‘The ohiot1dm dataset for blood
                                                                                                glucose level prediction: Update 2020’, in KHD@ IJCAI, (2020).
6   Conclusion                                                                             [13] Cindy Marling and Razvan C Bunescu, ‘The ohiot1dm dataset for blood
                                                                                                glucose level prediction.’, in KHD@ IJCAI, pp. 60–63, (2018).
In this work we have compared four different learning strategies for                       [14] Cooper Midroni, Peter J Leimbigler, Gaurav Baruah, Maheedhar Kolla,
BG forecasting using two different datasets. We have shown that                                 Alfred J Whitehead, and Yan Fossat, ‘Predicting glycemia in type 1 di-
an RNN model pre-trained on a bigger dataset such as OpenAPS                                    abetes patients: experiments with xgboost’, heart, 60(90), 120, (2018).
can be used directly to do BG forecasting for a smaller dataset like                       [15] Sadegh Mirshekarian, Razvan Bunescu, Cindy Marling, and Frank
                                                                                                Schwartz, ‘Using lstms to learn physiological models of blood glucose
OhioT1DM when using CGM data only. We predicted BG levels 30
                                                                                                behavior’, in 2017 39th Annual International Conference of the IEEE
and 60 minutes into the future using single-step and multi-output                               Engineering in Medicine and Biology Society (EMBC), pp. 2887–2891.
models, using univariate BG data. Overall, a single-step RNN trained                            IEEE, (2017).
[16] Stavroula G Mougiakakou, Aikaterini Prountzou, Dimitra Iliopoulou,
     Konstantina S Nikita, Andriani Vazeou, and Christos S Bartsocas,
     ‘Neural network based glucose-insulin metabolism models for children
     with type 1 diabetes’, in 2006 International Conference of the IEEE
     Engineering in Medicine and Biology Society, pp. 3545–3548. IEEE,
     (2006).
[17] Carmen Pérez-Gandía, A Facchinetti, G Sparacino, C Cobelli,
     EJ Gómez, M Rigla, Alberto de Leiva, and ME Hernando, 'Artificial
     neural network algorithm for online glucose prediction from continu-
     ous glucose monitoring’, Diabetes technology & therapeutics, 12(1),
     81–88, (2010).
[18] Fredrik Ståhl and Rolf Johansson, ‘Diabetes mellitus modeling and
     short-term prediction based on blood glucose measurements’, Mathe-
     matical biosciences, 217(2), 101–117, (2009).
[19] Qingnan Sun, Marko V Jankovic, Lia Bally, and Stavroula G
     Mougiakakou, ‘Predicting blood glucose with an lstm and bi-lstm based
     deep neural network’, in 2018 14th Symposium on Neural Networks and
     Applications (NEUREL), pp. 1–5. IEEE, (2018).
[20] Open Artificial Pancreas System. OpenAPS.
     https://openaps.org/what-is-openaps/, 2015. [Online; accessed 10-Dec-2019].
[21] Open Artificial Pancreas System. OpenAPS research application.
     https://tinyurl.com/oaps-application, 2015. [Online; accessed 10-Dec-2019].
[22] Jinyu Xie and Qian Wang, ‘Benchmark machine learning approaches
     with classical time series approaches on the blood glucose level predic-
     tion challenge.’, in KHD@ IJCAI, pp. 97–102, (2018).
[23] Jun Yang, Lei Li, Yimeng Shi, and Xiaolei Xie, ‘An arima model with
     adaptive orders for predicting blood glucose concentrations and hypo-
     glycemia’, IEEE journal of biomedical and health informatics, 23(3),
     1251–1260, (2018).
[24] Konstantia Zarkogianni, Konstantinos Mitsis, Eleni Litsa, M-T
     Arredondo, G Fico, Alessio Fioravanti, and Konstantina S Nikita,
     ‘Comparative assessment of glucose prediction models for patients with
     type 1 diabetes mellitus applying sensors for glucose and physical ac-
     tivity monitoring’, Medical & biological engineering & computing,
     53(12), 1333–1343, (2015).
[25] Taiyu Zhu, Kezhi Li, Pau Herrero, Jianwei Chen, and Pantelis Geor-
     giou, ‘A deep learning algorithm for personalized blood glucose pre-
     diction.’, in KHD@ IJCAI, pp. 64–78, (2018).