On the Predictability of non-CGM Diabetes Data for Personalized Recommendation

Tu Nguyen                          Markus Rokicki
L3S Research Center                L3S Research Center
30167 Hannover, Germany            30167 Hannover, Germany
tunguyen@l3s.de                    rokicki@l3s.de

Abstract

With continuous glucose monitoring (CGM), data-driven models for blood glucose prediction have been shown to be effective in related work. However, such CGM systems are not always available, e.g., for a patient at home. In this work, we conduct a study on 9 patients and examine the predictability of data-driven (i.e., machine learning) models for patient-level blood glucose prediction, with measurements taken only periodically (i.e., every several hours). To this end, we propose several post-prediction methods that account for the noisy nature of these data and marginally improve the performance of the end system.

1 Introduction

Diabetes mellitus has been a major global problem for a long time; it is reported that there are over 400 million patients worldwide 1. Knowledge of the glucose concentration in the blood is a key aspect of the diagnosis and treatment of diabetes. The use of signal processing techniques on glucose data started a long time ago, when glucose time-series for a given individual could be obtained in lab studies from blood samples drawn at a sufficiently high rate. In particular, related work employed not only linear (e.g., correlation and spectrum analysis, peak detection) but also nonlinear (e.g., approximate entropy) methods to investigate oscillations present in glucose (and insulin) time-series obtained during hospital monitoring by drawing blood samples every 10-15 min for up to 48 h [8]. For long-term (e.g., days or months) studies in these settings, researchers resorted to self-monitoring blood glucose (SMBG) data, i.e., approximately 3 samples per day obtained by the patients themselves using fingerstick glucose meters. The retrospective analysis of SMBG time-series was used by physicians, together with information taken from the patient's diary (e.g., insulin dosage, meal intake, physical exercise) and some glycaemic indexes (typically HbA1c), to assess glucose control and the effectiveness of a particular therapy [8].

With the support of continuous glucose monitoring (CGM) sensors, the development of new strategies for the treatment of diabetes has accelerated in recent years. In particular, CGM sensors can be integrated into 'online' recommender systems that generate alerts when the glucose concentration is predicted to exceed the normal-range thresholds. Recently, many complex data-driven prediction models [4, 7, 3, 5] have been built on CGM data and shown to be effective. These data-driven (machine learning / deep learning) models are data-hungry; hence, their performance on sparse, non-continuous data is still an open question. Moreover, CGM data are still not available for all diabetic patients, for many reasons 2, while a personalized, patient-level model trained on the same patient's own data is essential. In this work, we examine the performance of these machine learning approaches on our real, limited data from a group of diabetic patients. Our contributions are two-fold: (1) we provide a quantitative study of the predictability of machine-learned models on limited and sparse data; (2) we propose a prediction system that is robust to noisy data (based on prediction intervals).

1 https://www.diabetes.co.uk/diabetes-prevalence.html
2 http://time.com/4703099/continuous-glucose-monitor-blood-sugar-diabetes/

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Dataset Overview

The data collection study was conducted from the end of February to the beginning of April 2017 and includes 9 patients who were given specially prepared smartphones. Measurements of carbohydrate consumption, blood glucose levels, and insulin intake were made with the Emperra Esysta system 3. Measurements of physical activities were obtained using the Google Fit app; we use only the step information (number of steps) for our study.

We briefly describe some basic patient information. Half of the patients are female, and ages range from 17 to 66, with a mean age of 41.8 years. Body weight, according to BMI (body mass index), is normal for half of the patients; four are overweight and one is obese. The mean BMI value is 26.9. Only one of the patients suffers from type 2 diabetes, and all are in ICT therapy 4. In terms of time since being diagnosed with diabetes, patients range from inexperienced (2 years) to very experienced (35 years), with a mean of 13.9 years. We anonymize the patients and identify them by IDs (from 8 to 17; we do not have information for patient 9).

Frequency of Measurements

We give an overview of the number of different measurements available for each patient. The study duration varies among the patients, ranging from 18 days (patient 8) to 33 days (patient 14). Likewise, the daily numbers of measurements of carbohydrate intake, blood glucose level and insulin units vary across patients. The median number of carbohydrate log entries varies between 2 per day (patient 10) and 5 per day (patient 14). The median number of blood glucose measurements per day varies between 2 and 7. Similarly, insulin is used on average between 3 and 6 times per day. In terms of physical activity, we count the 10-minute intervals with at least 10 steps tracked by the Google Fit app. This very low threshold for now serves to measure very basic movements and to check the validity of the data. Patients 11 and 14 are the most active, both having a median of more than 50 active intervals per day (corresponding to more than 8 hours of activity). Patient 10, on the other hand, has a surprisingly low median of 0 active 10-minute intervals per day, indicating missing values due to, for instance, not carrying the smartphone at all times.

Measurements per Hour of Day

Figure 3 shows the measurements of blood glucose, carbohydrates and insulin per hour of day for patients 13 and 14. Overall, the distributions of all three kinds of values throughout the day roughly correspond to each other. In particular, for most patients the number of glucose measurements roughly matches or exceeds the number of rapid insulin applications throughout the day. Notable exceptions are patients 14, 15, and 17 (figures excluded). For patient 14, in the evening the numbers of meals and rapid insulin applications match, but exceed the number of blood glucose measurements by far. Patient 17 has more rapid insulin applications than glucose measurements in the morning and particularly in the late evening. For patient 15, rapid insulin again slightly exceeds the number of glucose measurements in the morning. Curiously, the number of glucose measurements matches the number of carbohydrate entries – it is possible the discrepancy is a result of missing (glucose and carbohydrate) measurements. We further show the blood glucose distribution of each patient in Figure 1. The different lengths of the interquartile ranges of these distributions also reflect the difficulty of the prediction problem for different patients.

Figure 1: Blood glucose distribution for each patient.

Figure 2: Blood glucose prediction scenario.
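The per-patient frequency statistics above (e.g., median daily counts of log entries) can be derived directly from the raw logs. The following is an illustrative pandas sketch on toy data, not the authors' code; the column names ("patient", "timestamp") are assumptions:

```python
import pandas as pd

# Toy log of measurement events for two hypothetical patients.
log = pd.DataFrame({
    "patient": [8, 8, 8, 8, 10, 10],
    "timestamp": pd.to_datetime([
        "2017-03-01 08:00", "2017-03-01 19:00",
        "2017-03-02 07:30", "2017-03-02 12:00",
        "2017-03-01 09:00", "2017-03-02 21:00",
    ]),
})

# Count entries per patient per calendar day, then take the per-patient median,
# analogous to the "median number of log entries per day" reported above.
daily = (log.groupby(["patient", log["timestamp"].dt.date])
            .size()
            .groupby("patient")
            .median())
print(daily.to_dict())
```

The same grouping, applied separately to the glucose, carbohydrate and insulin logs, yields the per-measurement-type medians reported in this section.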
3 https://www.emperra.com/en/esysta-product-system/
4 ICT describes a model of insulin therapy for diabetics using two different types of insulin.

Figure 3: Glucose, carbohydrate and insulin values per hour of day for patients 13 and 14. Panels: (a) P13 glucose, (b) P14 glucose, (c) P13 carbohydrates, (d) P14 carbohydrates, (e) P13 insulin, (f) P14 insulin.

3 Prediction

Our first approach to blood glucose prediction is based on a regression form of time series prediction. Given historical blood glucose data, we learn a model that predicts future glucose values based on a representation of the current situation (including the recent past), using information on patient context, recent insulin applications, carbohydrate intake, and physical activity levels.

3.1 Setup

Prediction task

Our prediction task is time series prediction of blood glucose values (in mmol/L) with a prediction horizon of 1 hour. Consequently, we can construct a data instance for each glucose measurement found in the dataset and use all information available up until 1 hour before the measurement to predict the glucose value (cf. Figure 2).

Evaluation Protocol

Performance is evaluated on a per-patient basis. In addition, we average performance over patients to get an overview. For each patient, we consider the first 66% of blood glucose measurements as training data to learn the models and the last 34% as test data to evaluate prediction performance.

Performance Measures

Prediction performance is measured in terms of median absolute error (MdAE), root mean squared error (RMSE) and symmetric mean absolute percentage error (SMAPE). Given ground truth values y_i and predictions \hat{y}_i, with i ∈ [1, n], the median absolute error measures the median error made and is defined as

    MdAE = \operatorname{median}_{i} |\hat{y}_i - y_i|.

Root mean squared error weighs larger errors more heavily and is defined as

    RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}.

Symmetric mean absolute percentage error relates prediction errors to the magnitude of the predicted values. It is defined as

    SMAPE = \frac{100\%}{n} \sum_{i=1}^{n} \frac{|\hat{y}_i - y_i|}{(|y_i| + |\hat{y}_i|)/2}.

Note that this yields a result between 0% and 200%. Further, the measure penalizes a) deviations on low values and b) over-forecasting.

3.2 Algorithms

Simple Baselines

As standard simple baselines, we use the last value observed one hour before the value being predicted (Last) and the arithmetic mean of the glucose values in the training set (AVG).

Context-AVG

As a more advanced baseline, we use a (temporal) context-weighted average of previous glucose values. As our analysis showed differences in glucose values according to the time of day, we weight previous glucose values by temporal proximity, with weights decreasing exponentially in the difference of time of day.

LSTM

The LSTM is a recurrent neural network model that effectively accounts for long-term sequence dependence among the glucose inputs.

RandomForest

The Random Forest Regressor (RF) is a meta-estimator that learns an ensemble of regression trees [2], averaging the output of the individual regression trees to perform the prediction.
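The Context-AVG baseline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the decay rate `alpha` and the use of a circular 24-hour distance are our assumptions, since the paper does not specify them.

```python
import numpy as np

def context_avg(past_hours, past_values, target_hour, alpha=0.5):
    """Weighted average of past glucose values, with weights decaying
    exponentially in the time-of-day difference to the target instant."""
    past_hours = np.asarray(past_hours, dtype=float)
    past_values = np.asarray(past_values, dtype=float)
    # Circular distance on a 24-hour clock (in hours, range 0..12).
    diff = np.abs(past_hours - target_hour) % 24.0
    diff = np.minimum(diff, 24.0 - diff)
    weights = np.exp(-alpha * diff)
    return float(np.sum(weights * past_values) / np.sum(weights))

# Past measurements at 7:00 and 19:00; a prediction for 8:00 leans
# almost entirely on the temporally closer 7:00 value.
print(context_avg([7.0, 19.0], [5.0, 9.0], target_hour=8.0))
```

With `alpha = 0.5`, a measurement 11 hours away contributes roughly e^{-5} times less than one a single hour away, which matches the intuition that glucose levels at the same time of day are most informative.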
We use a standard value of 500 estimators, as well as a minimum leaf size of 4 for the individual trees, to reduce overfitting of the individual models.

ExtraTrees

The Extra-Trees Regressor (ET) is a variation on RandomForest that uses a different base learner: extremely randomized trees [6]. In contrast to regular regression trees, the candidate split values per feature are chosen randomly. We use 300 estimators and a minimum leaf size of 2.

3.3 Overall Results

In this section we report aggregate results, averaged over all patients. Table 1 shows the regression performance averaged over all patients. Performance is based on 42 test instances on average. The simple baselines Last and AVG achieve median errors of 3.3 and 2.5 mmol/L. Weighting previous glucose values by time of day (Context-AVG) improves the average median error to 2.28 mmol/L. The Extra-Trees Regressor achieves the lowest MdAE of 2.16 and similarly slightly outperforms Context-AVG in terms of RMSE and SMAPE. In comparison to predicting the arithmetic mean (AVG), however, RMSE does not improve by much (12.15 vs. 12.96), indicating that the ensemble is not able to predict extreme values well on average. We additionally report the performance of a neural-network-based model, the LSTM, trained for 10 and 100 epochs. The LSTM seems quite stable in MdAE but varies substantially in RMSE and SMAPE. Its performance actually gets much worse after 100 epochs, indicating proneness to overfitting. This implies the instability of such models on this dataset, and thus we do not consider the LSTM results for the model comparisons in Table 1.

Method                         MdAE   RMSE   SMAPE
Last                           3.28   25.71  40.96
AVG                            2.51   12.96  31.42
Context-AVG                    2.28   12.53  29.71
ARIMA                          2.40   13.88  31.61
LSTM (10 iter)                 2.24   10.41  29.02
LSTM (100 iter)                2.76   19.24  35.64
RandomForest                   2.27   12.05  29.98
Extremely (randomized) Trees   2.16   12.15  29.56

Table 1: Overall regression performance averaged over all patients. Best performance per measure is marked in bold (results in italics are not considered for comparison).

4 Prediction Confidence

In this section, we study the confidence of our best-performing tree-based prediction models, RandomForest and ExtraTrees. This helps, to an extent, to answer an important question: when is the system reliable enough to give out predictions? To this end, we study the variability of the predictions and estimate standard errors for the prediction model.

Prediction intervals

When looking at two regression models, the model predictions could be similar while our confidence in them varies: judging from the training data, more spread-out data can warrant lower confidence. Hence, a prediction returning a single value (typically meant to minimize the squared error) likewise does not relay any information about the underlying distribution of the data or the range of response values. We therefore leverage the concept of prediction intervals to compensate for the noisy data and enhance the end model, in the sense that it can refuse to give a prediction at a certain time when the confidence is low.

A prediction interval is an estimate of an interval into which future observations will fall with a given probability. In other words, it quantifies our confidence or certainty in the prediction. Unlike confidence intervals from classical statistics, which concern a population parameter (such as the mean), prediction intervals concern individual predictions [1]. We leverage the confidence interval estimations for Random Forests proposed in [9], which correct for a certain bias in the variability estimation (of individual trees), to conduct the experiments.

4.1 Regression Evaluation

We report here the variability evaluation across all patients for the regression task. Figure 4 shows the error bars using unbiased variance for all patients. We then show in Figure 5 the error bar graphs for patient 8 in an incremental training size setting – meaning that we keep the same actual test set but train on only part of the training data. E.g., 1/4 training data indicates that we 'look back' on only 1/4 of the available past data. The more dots near the diagonal, the more 'accurate' our prediction model; the error bars show the 'confidence' interval. Figure 5(a) indicates high 'confidence' in the predictions with little training data, yet the dots are far away from the diagonal.

Figure 4: Error bar graphs for predicted BG using unbiased variance. Panels (a)-(i): patients 8, 10, 11, 12, 13, 14, 15, 16, 17.

Figure 5: Incremental training size – error bar graphs for predicted BG using unbiased variance for patient 8. Panels: (a) 1/4, (b) 2/4, (c) 3/4, (d) 4/4 of the training data.

4.1.1 When to predict: on the training size.

To answer this question, we set up an evaluation setting with an increasing number of instances, ordered by time. Each training point is evaluated by leave-one-out validation. We show in Figure 6 the results for patient 8. The general conclusion is that the more training data, the better the performance, as witnessed for patients 13, 15 and 17. However, the results for other patients, e.g., patients 8, 11 and 16, show that increasing the training size can also bring more noise and degrade the results. We suspect this is because the learned model is not yet stabilized with the limited number of instances in our experiment. In addition, training size is not the only factor for deciding when to predict. We hence move on to examine two other factors: (1) model stability, via standard deviation, and (2) prediction confidence toward incoming instances.

Figure 6: Leave-one-out cross validation with incremental training size. Panels (a)-(i): patients 8, 10, 11, 12, 13, 14, 15, 16, 17.

4.1.2 When to predict: on model stability.

To answer this question, we measure the stability of the model by the standard deviation of k-fold cross-validation with incremental training size. Figure 10 indicates, on the MAE and RMSE metrics, that the model seems to become more stable with a larger amount of training data. This is a good indicator for the when-to-predict question.

Figure 10: Standard deviation with incremental training size. (a) Patient 15, (b) patient 17.

4.1.3 When to predict: on prediction confidence.

We show in Figures 8 and 9 the confidence distributions at each run of the 5-fold CV for different patients, based on biased and unbiased confidences respectively. The results show that the confidence distributions are rather similar across the different runs, indicating that the temporal order of the instances does not impact the model performance much. Based on these distributions, we move on to tuning the threshold parameter for data filtering using the confidence interval. The idea is to answer the question: "if we filter out low-confidence instances (wide confidence intervals), will the model perform better?" The answer is depicted in Figure 11. For some patients, e.g., patients 10 and 13, the filtering technique substantially enhances the model performance on the MAE and RMSE (not shown) metrics. We observe that the biased confidence measure works somewhat better than the unbiased one across patients. However, for some patients, e.g., patient 8, it does not seem to bring any effect.

We move on to experiment with filtering instances for which we empirically observe a lack of predictable context within the training data: the BG measurements at night. We attempt to filter those out for prediction. Even though this slightly improves results for some patients (cf. Figure 12), overall the filtering attempt does not make a significant difference, indicating that our model learns these cases better than we expected.

Figure 8: Confidence distributions at each run of 5-fold CV for predicted BG using biased variance. Panels (a)-(i): patients 8, 10, 11, 12, 13, 14, 15, 16, 17.

Figure 9: Confidence distributions at each run of 5-fold CV for predicted BG using unbiased variance. Panels (a)-(i): patients 8, 10, 11, 12, 13, 14, 15, 16, 17.

4.1.4 When to predict: combined factors.

Figure 7 shows some highlighted combined filtering techniques. In general, combining the aforementioned factors does improve the model performance. However, the combination is not straightforward; e.g., confidence interval filtering lowers the performance at the starting time, when the model is still unstable (cold start). Hence, there is not enough evidence for us to make a hard decision. More trial-and-error attempts on the fly, or a bigger dataset, can however readily be built on these findings as a foundation.

Figure 7: 5-fold cross validation with incremental training size. (a) Patient 8, (b) patient 13, (c) patient 15, (d) patient 16.

4.2 Overall Results with Filtering Methods

We show in Table 2 the overall results of our models with different filtering approaches for all patients. We use two different filtering approaches: (1) a sanity filter, heuristics that remove noise (e.g., removing wrongly entered measurements, or moments when the last glucose level input lies too far in the past), and (2) a stability filter based on prediction confidence (the standard deviation is not needed when the training size is large enough). The results show that the stability filter (based on biased and bias-corrected confidence) achieves the best performance, without the need for human effort on the sanity filter. The sole stability filter also provides more predictions (24 on average) than the other filtering combinations.

Model                           # predictions   MAE    MdAE   RMSE   SMAPE
rf                              42              2.58   2.27   12.05  29.98
et                              42              2.55   2.16   12.15  29.56
rf + sanity filter              16              2.22   2.01   8.80   28.10
et + sanity filter              16              2.29   2.06   9.01   29.36
rf + sanity + stability filter  15              2.22   1.92   8.71   27.82
rf + stability filter           24              1.92   1.77   7.57   22.65

Table 2: Average performance of different filtering approaches for all patients.

Figure 11: Performance (MAE) with confidence filtering threshold (x-axis) for some patients, when filtering is applied. Panels: patients 8, 10, 13 and 14, each with biased and unbiased ("no bias") variance.

Figure 12: Performance (MAE) with night-time filtering, when filtering is applied; the x-axis is the night time. Panels: (a) patient 8, (b) patient 13, (c) patient 15, (d) patient 16.

5 Conclusion

We studied the predictability of machine-learning models in scenarios of non-continuous blood glucose tracking. Additionally, we studied the stability and robustness of the learned models over time. We show that the Random Forest and Extra-Trees ensemble-based models are the most suitable for this case, as they can account for outliers as well as overfitting problems when the data are limited. Our further study on prediction confidence shows that the model can give reliable predictions after acquiring 25-30 instances.

Acknowledgements. This work was partially funded by the German Federal Ministry of Education and Research (BMBF) under project GlycoRec (16SV7172).

References

[1] Prediction intervals for random forests. http://blog.datadive.net/prediction-intervals-for-random-forests/. Accessed: 2017-09-17.
[2] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[3] I. Contreras, S. Oviedo, M. Vettoretti, R. Visentin, and J. Vehí. Personalized blood glucose prediction: A hybrid approach using grammatical evolution and physiological models. PLoS ONE, 12(11):e0187754, 2017.
[4] M. Eren-Oruklu, A. Cinar, L. Quinn, and D. Smith. Estimation of future glucose concentrations with subject-specific recursive linear models. Diabetes Technology & Therapeutics, 11(4):243–253, 2009.
[5] S. Fiorini, C. Martini, D. Malpassi, R. Cordera, D. Maggi, A. Verri, and A. Barla. Data-driven strategies for robust forecast of continuous glucose monitoring time-series. In Engineering in Medicine and Biology Society (EMBC), 2017 39th Annual International Conference of the IEEE, pages 1680–1683. IEEE, 2017.
[6] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
[7] K. Plis, R. Bunescu, C. Marling, J. Shubrook, and F. Schwartz. A machine learning approach to predicting blood glucose levels for diabetes management. 2014.
[8] G. Sparacino, A. Facchinetti, and C. Cobelli. Smart continuous glucose monitoring sensors: On-line signal processing issues. Sensors, 10(7):6751–6772, 2010.
[9] S. Wager, T. Hastie, and B. Efron. Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. Journal of Machine Learning Research, 15(1):1625–1651, 2014.
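As a closing illustration, the "refuse to predict when confidence is low" mechanism studied in Section 4 can be sketched as follows. This is a simplified stand-in, not the paper's setup: it uses the spread (standard deviation) of the individual trees' predictions as a confidence proxy instead of the bias-corrected jackknife estimator of [9], the data are synthetic, and the abstention threshold is an assumed value that would in practice be tuned per patient as in Section 4.1.3.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for a patient's (features, glucose) training data.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(80, 3))
y = X[:, 0] * 1.5 + rng.normal(scale=0.5, size=80)

# Same hyperparameters as the paper's RandomForest (500 trees, min leaf size 4).
rf = RandomForestRegressor(n_estimators=500, min_samples_leaf=4, random_state=0)
rf.fit(X, y)

X_test = rng.uniform(0, 10, size=(10, 3))

# Collect each tree's prediction; the ensemble mean is the point prediction,
# and the per-tree spread serves as a (simplistic) confidence measure.
per_tree = np.stack([tree.predict(X_test) for tree in rf.estimators_])
pred = per_tree.mean(axis=0)
spread = per_tree.std(axis=0)

THRESHOLD = 1.0  # assumed abstention threshold; tuned per patient in practice
confident = spread < THRESHOLD
for p, s, ok in zip(pred, spread, confident):
    print(f"pred={p:.2f} std={s:.2f} -> {'predict' if ok else 'abstain'}")
```

Filtering test instances where `spread` exceeds the threshold is exactly the shape of the stability filter in Table 2: fewer predictions are emitted, but the emitted ones carry lower error.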