Comparison of Forecasting Algorithms for Type 1 Diabetic Glucose Prediction on 30 and 60-Minute Prediction Horizons

Richard McShinsky and Brandon Marshall
Brigham Young University, USA
email: richard.mcshinsky@byu.net, brandon.marshall@byu.net

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Control of blood glucose (BG) levels is essential for diabetes management, especially for long-term health improvement. Predicting both hypoglycemic events (BG < 70 mg/dl) and hyperglycemic events (BG > 180 mg/dl) is essential in helping diabetics control their long-term health. In this paper we forecast future blood glucose levels and analyze how effectively different models detect hypoglycemic and hyperglycemic events. We do so by comparing Auto-Regressive Integrated Moving-Average, Vector Auto-Regression, Kalman Filter, Unscented Kalman Filter, Ordinary Least Squares, Support Vector Machines, Random Forests, Gradient Boosted Trees, XGBoosted Trees, Adaptive Neuro-Fuzzy Inference System (ANFIS), and Multi-Layer Perceptron in terms of Root Mean Squared Error, Mean Absolute Error, Coefficient of Determination, Matthews Correlation Coefficient, and the Clarke Error Grid.

1 Introduction

Blood glucose prediction has been an ongoing challenge within the medical field due to the near-unpredictable variability of the many underlying factors influencing an individual's glucose levels. There has recently been a strong drive to create an artificial pancreas using artificial intelligence, which requires predicting future blood glucose levels as well as accurately predicting the onset of both hypoglycemic (BG < 70 mg/dl) and hyperglycemic (BG > 180 mg/dl) events [11].

Most predictive models for blood glucose build on a physiological profile that includes a person's insulin, meal absorption, and past blood glucose levels [13]. Machine learning methods that have been applied to this profile to predict future blood glucose levels include Auto-Regressive Integrated Moving-Average (ARIMA, see [3], [4], [13], and [15]), Support Vector Machines and Kernel Regression (SVM, see [3], [12], [13], and [15]), Random Forests (RF, see [8], [12], [13], and [15]), Gradient Boosted Trees (see [8] and [15]), and Artificial Neural Networks (see [1], [2], [4], [6], and [15]).

Comparing papers on the results, accuracy, and effectiveness of these models is nearly impossible because different data sets are used between them. This paper seeks to offer a comparison of as many models as possible on a single data set.

In this paper, we compare the effectiveness of several models, namely ARIMA, Vector Auto-Regression Moving-Average with Exogenous Regressor (VAR), Ordinary Least Squares (OLS), K-Nearest Neighbors (KNN), SVM, RF, Gradient Boosting, XGBoosting, Adaptive Neuro-Fuzzy Inference System (ANFIS), and Multi-Layer Perceptron. Additionally, we use both the Kalman Filter and the Unscented Kalman Filter (UKF) to predict future blood glucose values. The Unscented Kalman Filter was chosen over the Extended Kalman Filter due to its ability to use state-space models to predict nonlinear functions. In comparing the effectiveness of these models we use RMSE, MAE, the Matthews Correlation Coefficient (a commonly used metric for checking hypoglycemic and hyperglycemic events that roughly measures the quality of binary classifications) [4], and the Clarke Error Grid.

2 Data

2.1 OhioT1DM

The data used for this comparison was the OhioT1DM data set, obtained as part of the second Blood Glucose Level Prediction Challenge [5]. The data set contains eight weeks' worth of data for 12 people with type 1 diabetes. All contributors were on insulin pump therapy with continuous glucose monitoring (CGM). All pumps were one of two brands, all life-event data was reported via a custom smartphone app, and all physiological data was provided by a fitness band. The features provided in the data set are: Date, Glucose Level, Finger Stick, Basal (Insulin), Temporary Basal (Insulin), Bolus (Insulin), Meal (Carbohydrate Estimate), Sleep, Work, Stressors, Hypoglycemic Event, Illness, Exercise, Basis Heart Rate, Basis GSR, Basis Skin Temperature, Basis Air Temperature, Basis Steps, Basis Sleep, and Acceleration [5].

The train and test splits were given as part of the second Blood Glucose Level Prediction Challenge (see [5] for more details).
2.2 Preprocessing

The glucose readings arrive in roughly 5-minute increments, while the fitness-band readings arrive every minute and the readings reported by the patient arrive at arbitrary times not aligned with the glucose readings. To combine them into one data frame for predicting glucose, the most important predictor, glucose level, was made the main index. All other values were merged to the closest glucose value within the previous 4 minutes; values outside this tolerance were dropped from the data frame.

Most of the dropped values were due to missing data. There are many gaps where the meter was not recording glucose values: the time between taking the meter off and putting it back on, the hour or more it takes to set the meter up, or a day when the user simply did not wear it. Leaving these gaps often produced large jumps in the training and testing data, and these discontinuities would be a problem when training the models. We could not fill them with interpolation methods, since we cannot look at the future while predicting these values. Instead, we extrapolated values for these gaps using a widening moving average. For the first extrapolated missing value we used the mean of the previous 2 values; for the second, the mean of the previous 4 values; for the tenth, the mean of the previous 20 values, including the ten we had just extrapolated. This continued in five-minute increments until we reached the next actual value in the data frame. The last extrapolated value was then dropped, and the data frame continued as normal until a gap of more than 6 minutes between values was detected, at which point the rolling average again extrapolated the missing values.

The rolling average eventually converges to the average of all the data, but preserves the character of the recent data. For example, if the person had high blood glucose levels that day, the filled data stays high but drifts toward the person's mean when several days are filled across a large gap. This is reasonable because after a few hours, guessing where the person's data will resume is nearly random, and since the actual glucose values are approximately normally distributed, it is better to guess toward the mean of the glucose levels. At the same time, the discontinuities are reduced by maintaining the local rolling mean. In practice, many of the extrapolations ended very close to where the data resumes after the discontinuity.
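The widening-average gap filling described above can be made concrete with a short sketch. This is a minimal illustration, not the exact code from our notebook; the function name `fill_gaps` and the assumption of a `DatetimeIndex` on the series are ours.

```python
import numpy as np
import pandas as pd

def fill_gaps(glucose: pd.Series, step: str = "5min") -> pd.Series:
    """Extrapolate CGM gaps with a widening moving average.

    `glucose` must have a DatetimeIndex. For the k-th consecutive missing
    5-minute step, the fill is the mean of the previous 2*k values
    (previously filled values included), so short gaps track recent data
    while long gaps relax toward the local mean.
    """
    s = glucose.resample(step).mean()     # put readings on a regular grid
    values = s.to_numpy()
    k = 0                                 # position within the current gap
    for i in range(len(values)):
        if np.isnan(values[i]):
            k += 1
            window = values[max(0, i - 2 * k):i]
            if window.size:               # leave any leading NaNs untouched
                values[i] = np.nanmean(window)
        else:
            k = 0
    return pd.Series(values, index=s.index)
```

Applied per patient, this reproduces the behavior above: the first missing step averages the previous 2 values, the tenth averages the previous 20 (including already-filled values), and long gaps drift toward the local mean.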
3 Methods

We compare many methods used for classical and regressive time series analysis. Even though some methods are known not to perform well with blood glucose levels for this type of problem, they give a baseline against which to compare each successive method. In addition to the classical models, we used models described in other papers on glucose prediction, both for comparison and for potentially better parameter choices. Further, we chose some methods, such as VAR and ANFIS, in order to compare methods not seen in the research we found. The following subsections explain why specific methods, parameters, and architectures were chosen.

3.1 Classical Methods

3.1.1 ARIMA

Even though ARIMA itself is a linear combination of a trend component, a seasonal component, and a residual component, we chose this model because of its classical use within time series analysis. Additionally, ARIMA allows us to choose the orders p and q of the AR and MA parts of the model. These hyperparameters were chosen using the order-selection routine in statsmodels (arma_order_select_ic), from which we found that p = 2 and q = 2 gave the lowest error. The data is nearly stationary to start, so a differencing order of 0 was used (larger orders resulted in worse error). The only data features used were the previous p blood glucose levels and the q corresponding error terms.
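A minimal sketch of this order selection and fit follows, using the statsmodels API; the variable name `train_glucose` (a single patient's training series) and the search bounds are our illustrative choices.

```python
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import arma_order_select_ic

# Search AR/MA orders by information criterion; the search described
# above settled on p = 2, q = 2.
sel = arma_order_select_ic(train_glucose, max_ar=4, max_ma=4, ic="bic")
p, q = sel.bic_min_order

# d = 0: the series is treated as (nearly) stationary, so no differencing.
result = ARIMA(train_glucose, order=(p, 0, q)).fit()
forecast = result.forecast(steps=6)   # six 5-minute steps = 30-minute horizon
```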
3.1.2 VAR

VAR is a vectored version of an AR model. This allows more types of inputs to influence the prediction, rather than just the previous p blood glucose values. VAR used the same parameters as the ARIMA model described above.

3.1.3 Unscented Kalman Filter (UKF)

While the standard Kalman Filter works well for linear systems, blood glucose levels are nonlinear in nature. The Kalman Filter can be thought of as propagating a Gaussian Random Variable (GRV) through a linear system [14]. In the nonlinear case, the Extended Kalman Filter (EKF) produces only approximations to $x_k$, $y_k$, and $K_k$ (the state, observation, and gain of the system) [14]; in other words, the EKF propagates a GRV through a first-order linearization of the nonlinear system [14].

The Unscented Kalman Filter also propagates a GRV, but does so through a minimal set of carefully chosen sample points [14]. The unscented transformation is applied to these sample points, which are then propagated through the nonlinear system. Doing so yields approximations that are accurate to the third order of a Taylor series expansion [14].

To summarize, the Unscented Kalman Filter selects carefully chosen points, applies the unscented transformation to them, and then performs the time update and measurement update as is standard in the Kalman Filter [14].
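For concreteness, the sketch below sets up a UKF with the filterpy library. The local-linear-trend state model (glucose plus trend), the noise magnitudes, and the stream name `cgm_readings` are all illustrative assumptions on our part, not the state-space model used in the paper.

```python
import numpy as np
from filterpy.kalman import MerweScaledSigmaPoints, UnscentedKalmanFilter

dt = 5.0  # minutes between CGM readings

def fx(x, dt):
    # State transition: state is [glucose, trend]; glucose drifts by its trend.
    return np.array([x[0] + dt * x[1], x[1]])

def hx(x):
    # Measurement function: only the glucose level is observed.
    return np.array([x[0]])

points = MerweScaledSigmaPoints(n=2, alpha=0.1, beta=2.0, kappa=0.0)
ukf = UnscentedKalmanFilter(dim_x=2, dim_z=1, dt=dt, fx=fx, hx=hx, points=points)
ukf.x = np.array([150.0, 0.0])   # initial glucose (mg/dl) and trend
ukf.P *= 10.0                    # inflate initial state uncertainty
ukf.R = np.array([[25.0]])       # CGM measurement noise (illustrative)
ukf.Q = np.eye(2) * 0.1          # process noise (illustrative)

for z in cgm_readings:           # `cgm_readings`: stream of glucose values
    ukf.predict()
    ukf.update(np.array([z]))
```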
3.2 Regression and Ensemble Methods

Since the OhioT1DM data set is a time series, standard regression methods are not immediately applicable for forecasting. However, we can turn the task into a regression problem by redefining how the data is presented. Instead of each row representing a single time step of the nineteen features, we redefine each row to contain the last six time steps of data (the last 30 minutes of known information). Each row of the reformatted data set thus contains the last six known time steps, with the labels being the future blood glucose values we wish to predict at that time step. Each label consists of the next six or twelve blood glucose values following the current time step, for the 30-minute and 60-minute prediction horizons respectively. In summary, each time step is reformatted into a 6x19 feature space with each label having 6 or 12 values (see the sketch below). With the data reformatted, the following algorithms can be run.
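The sliding-window reformatting can be expressed in a few lines; this is a minimal sketch, with `make_windows` as a hypothetical helper name and arrays assumed to be already aligned on the 5-minute glucose grid.

```python
import numpy as np

def make_windows(features: np.ndarray, glucose: np.ndarray,
                 n_lags: int = 6, horizon: int = 6):
    """Turn an aligned time series into a supervised regression problem.

    features: (T, 19) array of 5-minute rows; glucose: (T,) glucose levels.
    Each sample's input is the flattened last `n_lags` rows (30 minutes);
    its label is the next `horizon` glucose values
    (horizon=6 -> 30 min, horizon=12 -> 60 min).
    """
    X, y = [], []
    for t in range(n_lags, len(features) - horizon + 1):
        X.append(features[t - n_lags:t].ravel())   # 6x19 window -> 114 values
        y.append(glucose[t:t + horizon])
    return np.asarray(X), np.asarray(y)
```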
3.2.1 Ordinary Least Squares

While the data is nonlinear in nature, within a sufficiently small subset of the data (that is, over a sufficiently small time interval) the data may be quasi-linear. As with ODEs, where one can often linearize a nonlinear system locally, we attempt to fit affine functions over a sufficiently small time domain. Ordinary Least Squares (OLS) does exactly this: it fits an affine function (with a constant and an error term) to the data set. In addition to regular OLS, we also run OLS with regularization terms, namely Lasso (L1 regularization), Ridge (L2 regularization), and Elastic Net (L1 and L2 regularization), all with α = 1 for the regularization terms. Lasso regularization gives us the added advantage of feature reduction, allowing us to analyze which lags are most important in determining future blood glucose levels.

3.2.2 Support Vector Machines

We believe Support Vector Machine (SVM) regression may be useful because the kernel can be changed, allowing us to alter our definition of distance with regard to the data. SVM regression fits a hyperplane to the data with an ε-margin; the points that fall on or outside this ε-margin are the support vectors that define the hyperplane used in the regression. Notions of distance to this hyperplane are defined using a kernel. We use an RBF kernel (with a scaling γ value) and a polynomial kernel (with a scaling γ value, a constant term of 0, and a degree of 3) in our regressions. Each SVM had an ε-margin of 0.1. The results for the two SVMs are reported under RBF and Poly respectively.

3.2.3 K-Nearest Neighbors

Since previous patterns in the lags of blood glucose (and other features) may be similar to the current pattern in the lags of features, we believe KNN regression may also be a useful method. KNN uses a voting scheme to form the regression: using a chosen distance metric, it finds the K closest neighbors of the given data point and returns the average of their labels. We use five neighbors and Euclidean distance. The results for this algorithm are reported under KNN.

3.2.4 Random Forest Regression

Random Forest Regression is an ensemble method that combines weak decision-tree regressors to form a strong group regressor, allowing us to create a regressor that branches on the features. It is included here due to its use in other papers on blood glucose prediction (see [8], [12], [13], and [15]). To keep the run time reasonable, a max depth of four was imposed on each forest.

3.2.5 Gradient Boosting

Gradient Boosting is another ensemble method that combines weak decision-tree regressors into a strong group regressor, but it instead optimizes the gradient of the loss function for each successive regressor. As this can perform well with the correct hyperparameters, we include it to see if it can outperform any of the aforementioned algorithms. In addition to regular Gradient Boosted Trees, we also use an optimized version of the algorithm known as Extreme Gradient Boosted Trees (XGB). For Gradient Boosting, a least-squares loss function, a learning rate of 0.1, and 100 estimators were used. For XGB, a grid search was performed to find the optimal hyperparameters. The results for these algorithms are reported under Grad and XGB respectively.

3.3 Neural Networks

Much work has already been done implementing neural networks in many different forms, including CNN, CRNN, DCNN, LSTM, Jump Neural Networks, and Echo State Networks (see [1], [2], [3], [4], [6], [8], and [15]). Much of this work came from the Blood Glucose Level Prediction Challenge (BGLP) in 2018 using the OhioT1DM data set.

3.3.1 ANFIS

ANFIS is a neural network that incorporates fuzzy logic principles. Fuzzy logic is about partial truths: where most neural networks make hard true/false selections, fuzzy logic models uncertainty. Examples include what one considers warm/cold, fast/medium/slow, or high/low. Rather than picking one category outright, a draw from a distribution gives a weighted, random character to the choices. ANFIS is designed to approximate nonlinear functions such as glucose values, and was chosen due to the extremely accurate predictions on chaotic systems reported in [9].

3.3.2 Multi-Layer Perceptron (MLP)

The Multi-Layer Perceptron (MLP) is a fully-connected, feed-forward neural network. It can often capture higher-order terms without those terms having to be engineered by hand, which reduces feature engineering of the data. Our MLP consists of three hidden layers, each with 100 nodes and ReLU activation functions. The output layer for the regression is simply the output of the last affine function. Results are reported under MLP.
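The sketch below shows one way the regression suite of Sections 3.2-3.3 could be assembled with scikit-learn, using the hyperparameters stated above. `X_train`/`y_train` come from the windowing sketch in Section 3.2, and the wrapper choices (e.g. MultiOutputRegressor around the single-output SVR and gradient boosting estimators) are our assumptions, not necessarily the exact setup in our notebook.

```python
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

# One entry per method; SVR and GradientBoosting predict a single target,
# so they are wrapped to emit all 6 (or 12) future glucose values.
models = {
    "Lasso": Lasso(alpha=1.0),                    # multi-output natively
    "RBF":   MultiOutputRegressor(SVR(kernel="rbf", epsilon=0.1)),
    "Poly":  MultiOutputRegressor(
                 SVR(kernel="poly", degree=3, coef0=0.0, epsilon=0.1)),
    "KNN":   KNeighborsRegressor(n_neighbors=5),  # Euclidean by default
    "RF":    RandomForestRegressor(max_depth=4),
    "Grad":  MultiOutputRegressor(
                 GradientBoostingRegressor(loss="squared_error",
                                           learning_rate=0.1,
                                           n_estimators=100)),
    "MLP":   MLPRegressor(hidden_layer_sizes=(100, 100, 100),
                          activation="relu"),
}

for name, model in models.items():
    model.fit(X_train, y_train)       # X, y from the windowing step above
    preds = model.predict(X_test)
```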
4 Metrics

The following metrics were used when evaluating the efficiency and accuracy of the algorithms.

4.1 Root Mean Square Error

The root mean square error is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2},$$
where $\hat{y}_i$ is the predicted value and $y_i$ is the actual value. RMSE has an easily defined gradient and is easy to interpret: taking the square root of the squared errors returns the error to the original function space, so the RMSE value is in the same units as our label. This is the first metric used in evaluating the accuracy of the regression models.

4.2 Mean Absolute Error

The mean absolute error is defined as
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|.$$
This error function is easy to define, is fairly robust to outliers, and is in the same units as our label. However, its gradient is not always easy to define (and may not exist). This is the second metric used in evaluating the accuracy of the regression models.

4.3 Coefficient of Determination

The coefficient of determination is defined as
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\epsilon_i^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2},$$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, $\epsilon_i = y_i - \hat{y}_i$ is the $i$th residual, and $\bar{y}$ is the sample mean. The coefficient of determination measures how much of the variance is explained by the model. Values near 1 indicate nearly all variance is explained by the model, while values near 0 indicate the variance may be caused by other factors. Negative values are possible, and in this paper indicate poor performance by the model.

4.4 Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is defined as
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},$$
where TP, FP, FN, and TN are the true positive, false positive, false negative, and true negative counts respectively [4]. This metric gives a general idea of how well an algorithm predicts glycemic events. Values near 1 show the predictions correlate with the actual glycemic events; values near 0 indicate the algorithm does no better than random guessing; values near -1 indicate negative correlation (the predictions correlate with the opposite of the glycemic event). This metric is commonly used in articles on blood glucose prediction (see [4] for one such example), and so is used here.

4.5 Clarke Error Grid

The Clarke Error Grid plots the actual blood glucose values against the predicted blood glucose values and indicates the clinical consequences of acting on a given prediction. The grid is split into five zones, A through E. Predictions in Zones A and B are generally considered safe and would not result in any negative effects on the patient. Predictions in Zone C would result in unnecessary treatment. Predictions in Zone D indicate a potentially dangerous failure to detect a glycemic event. Predictions in Zone E would confuse treatment of hypoglycemia for hyperglycemia and vice versa (see [1]); points in Zone E are considered extremely dangerous, as treatment based on these results could lead to the patient's death. For this paper, in addition to MCC we use the percentage of points within each zone to evaluate the accuracy of a model's predictions.
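These metrics take only a few lines of Python. The sketch below assumes flattened prediction/label arrays; the event thresholds follow the 70/180 mg/dl definitions above, and `matthews_corrcoef` from scikit-learn is our tooling choice rather than something prescribed by the paper.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))

def r2(y_true, y_pred):
    resid = np.sum((y_true - y_pred) ** 2)
    total = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - resid / total

def event_mcc(y_true, y_pred, threshold=70.0, below=True):
    """MCC for glycemic events. below=True flags hypoglycemia (< 70 mg/dl);
    threshold=180.0 with below=False flags hyperglycemia (> 180 mg/dl)."""
    if below:
        t, p = y_true < threshold, y_pred < threshold
    else:
        t, p = y_true > threshold, y_pred > threshold
    return matthews_corrcoef(t.ravel(), p.ravel())
```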
5 Results

Tables 1 and 2 report the metric scores averaged over the 6 test patients. The metrics (RMSE, MAE, the MCCs for hypoglycemic and hyperglycemic events, and R2) are described above; the method abbreviations are defined in the Methods section.

Table 1. Metric Averages for 30-minute Prediction Horizon

Method   RMSE   MAE    MCC (hypo)  MCC (hyper)  R2
OLS      20.53  14.14  0.34        0.79         0.86
Lasso    20.58  14.22  0.32        0.79         0.85
Ridge    20.52  14.13  0.35        0.79         0.86
Elastic  20.56  14.20  0.31        0.79         0.86
RBF      24.89  16.96  0.14        0.74         0.79
Poly     31.73  22.51  -0.00       0.70         0.66
KNN      24.57  17.07  0.30        0.73         0.79
RF       23.00  16.27  0.16        0.76         0.82
Grad     21.37  14.87  0.17        0.78         0.84
XGB      24.62  17.29  0.34        0.74         0.79
Kalman   24.08  24.08  0.40        0.74         0.78
UKF      29.88  20.65  0.30        0.67         0.69
ARIMA    23.73  16.68  0.12        0.75         0.81
VAR      25.25  17.05  0.36        0.74         0.79
ANFIS    24.56  16.52  0.26        0.76         0.80
MLP      20.85  14.30  0.30        0.78         0.85

Table 2. Metric Averages for 60-minute Prediction Horizon

Method   RMSE   MAE    MCC (hypo)  MCC (hyper)  R2
OLS      33.42  24.65  0.02        0.61         0.62
Lasso    33.41  24.67  0.02        0.61         0.62
Ridge    33.41  24.65  0.02        0.61         0.62
Elastic  33.40  24.67  0.02        0.61         0.62
RBF      36.76  26.53  -0.00       0.55         0.54
Poly     39.16  29.31  -0.00       0.53         0.48
KNN      38.11  28.01  0.15        0.53         0.50
RF       35.20  26.08  0.09        0.58         0.58
Grad     33.98  24.96  0.08        0.58         0.61
XGB      39.78  26.97  0.15        0.53         0.46
Kalman   22.77  15.28  0.41        0.75         0.81
UKF      29.78  20.65  0.30        0.66         0.69
ARIMA    36.39  26.93  0.01        0.56         0.54
VAR      35.06  19.56  0.16        0.70         0.54
ANFIS    36.87  26.53  0.12        0.59         0.56
MLP      35.59  25.81  0.06        0.59         0.57

6 Analysis

To analyze the accuracy of these predictions, we first examine the RMSE and MAE for both the 30-minute and 60-minute prediction horizons (Tables 1 and 2). As a general guideline, we first identify which model we believe performs best across the patients, and then discuss general trends we noticed while analyzing the data.

6.1 30-Minute Prediction

In terms of the metrics defined above, OLS, Lasso, Ridge, and Elastic Net regression perform nearly identically. Since the differences between them are minimal, we consider Lasso the best model for the 30-minute blood glucose predictions: Lasso regression offers a natural form of feature selection, which allows us to analyze which lags are most important for predicting future blood glucose levels. A further analysis of feature relevancy can be found in Section 6.4.

Even though we have identified Lasso regression as the best-performing algorithm among those tested for the 30-minute prediction horizon, this means little if this "best" algorithm still yields subpar results. We therefore analyze Lasso regression in terms of both MCC and the Clarke Error Grid to determine whether its results are sufficiently adequate for blood glucose prediction. To see general trends, we examine the actual and predicted values across time for patients 540 and 584.

Consider the Clarke Error Grids for patients 540 and 584 on the 30-minute prediction horizon (Figures 2 and 3). The closer the points fall to the bottom-left to top-right diagonal, the better the predictions. Visual inspection of these plots raises no immediate concerns; most values fall within Zones A, B, and C. The zone percentages (Table 3) show that Lasso places 96% of predictions in Zones A-B for patient 540 and about 99% for patient 584. The concern is that the remainder of the predictions fall within Zones D-E, indicating predictions that could result in dangerous care if acted on. Considering the high accuracy for each patient, though, we consider these results sufficiently accurate for the 30-minute prediction horizon.

Table 3. Clarke Error Grid percentages (share of Lasso predictions per zone)

                 30 min          60 min
Zone         540      584     540      584
Zones A-B    0.96     0.99    0.935    0.97
Zone C       0.00     0.00    0.001    0.01
Zones D-E    0.04     0.01    0.064    0.02

The MCC for Lasso regression on the 30-minute horizon tends to be about twice as high for hyperglycemic events as for hypoglycemic events. Given that the data has many more values in the hyperglycemic range than in the hypoglycemic range, this reflects the class imbalance more than the algorithm; all the algorithms show this trend. The bias is also visible in the predictions themselves: valleys in the predictions do not reach as low as the valleys in the actual data (see Figure 1). Because of this, the algorithms are less likely to predict hypoglycemic events than hyperglycemic events, a result of the higher number of high blood glucose values in the data.

6.2 60-Minute Prediction

Looking at the RMSE and MAE for the 60-minute prediction horizon, we find the surprising result that the Kalman Filter (not the Unscented Kalman Filter) performs best of all the algorithms. Several explanations are possible. One is that the Kalman Filter dampens its predictions: most of the other algorithms keep predicting upward over the hour if the trend was rising beforehand, while the Kalman Filter mainly shifts the prediction horizon over, so the difference between the last known glucose value and the prediction an hour later is minimal. Since it keeps its results in the typical range of glucose values, it may avoid the poor scores caused by unusually strong spikes in predicted values. Its scores may be the best, yet it may still be a very poor predictor an hour out.

Considering these problems with the Kalman Filter, we analyze the "second" best algorithm. Since the general trends discussed for the 30-minute prediction horizon still hold for the 60-minute horizon (disregarding the Kalman Filter), we conclude that Lasso regression is the next best algorithm. However, comparing the 30-minute and 60-minute horizons raises several concerns about using Lasso regression at 60 minutes.

We noted earlier that Lasso regression tends to underfit with regard to hypoglycemic events. This problem is only exacerbated when the prediction horizon is extended to 60 minutes (see Table 2): the hypoglycemic MCC drops to near 0, indicating that Lasso does no better than random guessing at whether a hypoglycemic event is occurring. This is far from ideal for any diabetic patient. Additionally, at the 60-minute horizon the share of safe predictions degrades by about 2-3% (see Table 3). While 94-97% is still fairly good, given that this reduction translates into 2-3% more dangerous predictions, and that Lasso regression cannot predict hypoglycemic events better than random guessing, we do not consider these predictions sufficiently accurate for the 60-minute prediction horizon. Our recommendation is therefore to use the 30-minute prediction horizon.

6.3 Overall Trends

The biggest trend we notice is that the models tend to underfit with regard to hypoglycemic events; that is, the predicted values do not reach as low as the actual blood glucose values do. This shows in the hypoglycemic MCC for the 30-minute prediction horizon (see Table 1), which averages about 0.3, indicating a general but not strong correlation in predicting hypoglycemic events. Given that the average blood glucose levels on the test data were 159.42, 158.51, 134.92, 143.41, 172.71, and 148.23 mg/dl for patients 540, 544, 552, 567, 584, and 596 respectively, the most likely reason the hypoglycemic MCC is so low is class imbalance within the glucose levels. Since most glucose levels are generally high for these patients, the models overfit toward higher glucose levels and consequently struggle to predict hypoglycemic events. A potential solution is to upsample by "jittering" the smaller class, adding small random perturbations to the existing minority-class examples in order to create more data; see [7] and [10] for examples. A sketch of this idea follows.
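The sketch below illustrates jitter-based upsampling on the windowed data; the function name, the duplication factor, and the noise scale are illustrative choices of ours, not values from [7] or [10].

```python
import numpy as np

def jitter_upsample(X, y, hypo_threshold=70.0, factor=5, scale=2.0, rng=None):
    """Upsample windows whose label contains a hypoglycemic value by
    duplicating them with small Gaussian perturbations ("jittering")."""
    rng = np.random.default_rng(rng)
    hypo = (y < hypo_threshold).any(axis=1)   # windows touching a hypo event
    X_h, y_h = X[hypo], y[hypo]
    X_new = [X, *(X_h + rng.normal(0, scale, X_h.shape) for _ in range(factor))]
    y_new = [y, *(y_h + rng.normal(0, scale, y_h.shape) for _ in range(factor))]
    return np.vstack(X_new), np.vstack(y_new)
```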
6.4 Feature Relevancy

As stated earlier, one important benefit of Lasso regression is its ability to identify the features important to glucose prediction. As seen in Table 4, glucose level, bolus, meal, and exercise are significant in predicting glucose levels (finger sticks are potentially significant, but may be linearly dependent on glucose level). The Weights column is the sum of the six patients' weight scores. The problem with these weights is the huge variability in the number of recorded data points per feature. To normalize for this, we created an Adjusted Weight: each person's weight is divided by that person's number of recorded values, the six results are summed, and the sum is multiplied by 1000 so the values are about the same magnitude as the original weights. The lack of data for exercise is evident here: only 3 of the 6 people had values for exercise, and one of them had only 4 values. That person's Adjusted Weight contribution was about 32, while the other two contributed about 1.5 and 2. More data points in these sparse categories would reduce the variance and more clearly identify which features are important.

Table 4. Lasso Significant Values Totals

Feature           Number Recorded   Weights   Adjusted Weights
glucose level     77563             15.4654   1.2062
basis gsr         39542             0.2272    0.0356
skin temperature  39540             0.2418    0.0295
acceleration      39542             0         0
finger stick      1669              0.54      2.4504
basal             428               0         0
temp basal        208               0         0
bolus             1994              9.4944    23.4776
meal              957               3.5682    31.6974
stressors         2                 0         0
exercise          65                0.2312    36.2337

7 Conclusion

We found that Lasso regression performed best of the algorithms used for both the 30-minute and the 60-minute prediction horizons. While the results were adequate for the 30-minute horizon, they quickly degraded at 60 minutes. In general, the regression algorithms perform fairly well at predicting hyperglycemic events but struggle to predict hypoglycemic events. In our opinion, further research should be done on improving the prediction horizon for blood glucose prediction; specifically, the effect of the volume of data on the prediction horizon should be investigated. If an artificial pancreas is to become a reality, stable prediction horizons beyond 30 minutes are needed.

Furthermore, analyzing the coefficients of the Lasso model shows that glucose level, bolus, meal, and exercise are the most relevant features for forecasting blood glucose levels. However, sparsity in certain features reduces their measured relevancy; future research should therefore handle sparse features in a more robust way.

8 Additional Material

For those wishing to compare against or reproduce the work in this paper, the related code can be found at https://github.com/marshallb95/BloodGlucosePrediction/blob/master/Master.ipynb.
REFERENCES

[1] A. Aliberti, I. Pupillo, S. Terna, E. Macii, S. Di Cataldo, E. Patti, and A. Acquaviva, 'A multi-patient data-driven approach to blood glucose prediction', IEEE Access, 7, 69311-69325, (2019).
[2] J. Chen, K. Li, P. Herrero, T. Zhu, and P. Georgiou. Dilated recurrent neural network for short-time prediction of glucose concentration. Paper presented at the Third International Workshop on Knowledge Discovery in Healthcare Data at the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence, 2018.
[3] S. Fiorini, C. Martini, D. Malpassi, R. Cordera, D. Maggi, A. Verri, and A. Barla. Data-driven strategies for robust forecast of continuous glucose monitoring time-series. Paper presented at the 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2017.
[4] K. Li, J. Daniels, C. Liu, P. Herrero, and P. Georgiou, 'Convolutional recurrent neural networks for glucose prediction', IEEE Journal of Biomedical and Health Informatics, 24, 603-613, (2019).
[5] C. Marling and R. Bunescu. The OhioT1DM dataset for blood glucose level prediction: Update 2020. In The 5th International Workshop on Knowledge Discovery in Healthcare Data, Santiago de Compostela, Spain, June 2020.
[6] J. Martinsson, A. Schliep, B. Eliasson, C. Meijner, S. Persson, and O. Mogren. Automatic blood glucose prediction with confidence using recurrent neural networks. Paper presented at the Third International Workshop on Knowledge Discovery in Healthcare Data at the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence, 2018.
[7] M. Mayo, L. Chepulis, and R. Paul, 'Glycemic-aware metrics and oversampling techniques for predicting blood glucose levels using machine learning', PLoS ONE, 14, e0225613, (2019).
[8] C. Midroni, P. J. Leimbigler, G. Baruah, M. Kolla, A. J. Whitehead, and Y. Fossat. Predicting glycemia in type 1 diabetes patients: Experiments with XGBoost. Paper presented at the Third International Workshop on Knowledge Discovery in Healthcare Data at the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence, 2018.
[9] A. Miranian and M. Abdollahzade, 'Developing a local least-squares support vector machines-based neuro-fuzzy model for nonlinear and chaotic time series prediction', IEEE Transactions on Neural Networks and Learning Systems, 24, 207-218, (2013).
[10] N. Nnamoko and I. Korkontzelos, 'Efficient treatment of outliers and class imbalance for diabetes prediction', Artificial Intelligence in Medicine, 104, 101805, (2020).
[11] S. M. Pappada, M. H. Owais, B. D. Cameron, J. C. Jaume, A. Mavarez-Martinez, R. S. Tripathi, and T. J. Papadimos, 'An artificial neural network-based predictive model to support optimization of inpatient glycemic control', Diabetes Technology & Therapeutics, 22, 1-12, (2020).
[12] I. Rodriguez-Rodriguez, J. V. Rodriguez, I. Chatzigiannakis, and M. A. Zamora, 'On the possibility of predicting glycaemia "on the fly" with constrained IoT devices in type 1 diabetes mellitus patients', Sensors, 19, 4482-4496, (2019).
[13] I. Rodríguez-Rodríguez, I. Chatzigiannakis, J. V. Rodríguez, M. Maranghi, M. Gentili, and M. A. Zamora, 'Utility of big data in predicting short-term blood glucose levels in type 1 diabetes mellitus through machine learning techniques', Sensors, 19, 4538-4557, (2019).
[14] E. A. Wan and R. van der Merwe. The unscented Kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium, 2000.
[15] J. Xie and Q. Wang. Benchmark machine learning approaches with classical time series approaches on the blood glucose level prediction challenge. Paper presented at the Third International Workshop on Knowledge Discovery in Healthcare Data at the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence, 2018.
Figure 1. Patient 540 prediction results for 30 min PH with Lasso regression (actual vs. predicted blood glucose, mg/dl, over time steps).

Figure 2. Patient 540 Clarke Error Grid for 30 min PH with Lasso regression (prediction vs. reference concentration, mg/dl, zones A-E).

Figure 3. Patient 584 Clarke Error Grid for 30 min PH with Lasso regression (prediction vs. reference concentration, mg/dl, zones A-E).