=Paper=
{{Paper
|id=Vol-2675/paper17
|storemode=property
|title=Experiments in Non-Personalized Future Blood Glucose Level Prediction
|pdfUrl=https://ceur-ws.org/Vol-2675/paper17.pdf
|volume=Vol-2675
|authors=Robert Bevan,Frans Coenen
|dblpUrl=https://dblp.org/rec/conf/ecai/BevanC20
}}
==Experiments in Non-Personalized Future Blood Glucose Level Prediction==
Robert Bevan (University of Liverpool, UK; robert.e.bevan@gmail.com) and Frans Coenen (University of Liverpool, UK; coenen@liverpool.ac.uk). Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
In this study we investigate the need for training blood glucose level prediction models at the individual level (i.e. per patient). Specifically, we train various model classes: linear models, feed-forward neural networks, recurrent neural networks, and recurrent neural networks incorporating attention mechanisms, to predict future blood glucose levels using varying time series history lengths and data sources. We also compare methods of handling missing time series data during training. We found that relatively short history lengths provided the best results: a 30 minute history length proved optimal in our experiments. We observed that long short-term memory (LSTM) networks performed better than linear and feed-forward neural networks, and that including an attention mechanism in the LSTM model further improved performance, even when processing sequences with relatively short length. We observed that models trained using all of the available data outperformed those trained at the individual level. We also observed that models trained using all of the available data, except for the data contributed by a given patient, were as effective at predicting that patient's future blood glucose levels as models trained using all of the available data. These models also significantly outperformed models trained using the patient's data only. Finally, we found that including sequences with missing values during training produced models that were more robust to missing values.

===1 Introduction===
Accurate future blood glucose level prediction systems could play an important role in future type-I diabetes condition management practices. Such a system could prove particularly useful in avoiding hypo/hyper-glycemic events. Future blood glucose level prediction is difficult: blood glucose levels are influenced by many variables, including food consumption, physical activity, mental stress, and fatigue. The Blood Glucose Level Prediction Challenge 2020 tasked entrants with building systems to predict future blood glucose levels at 30 minutes and 60 minutes into the future. Challenge participants were given access to the OhioT1DM dataset [8], which comprises 8 weeks' worth of data collected for 12 type-I diabetes patients. The data include periodic blood glucose level readings, administered insulin information, various bio-metric data, and self-reported information regarding meals and exercise.

In the previous iteration of the challenge, several researchers demonstrated both that it is possible to predict future blood glucose levels using previous blood glucose levels only [9], and that past blood glucose levels are the most important features for future blood glucose level prediction [10]. In this study, we aimed to extend this research into future glucose level prediction from historical glucose levels only. Most previous work involved training personalized models designed to predict future blood glucose level data for a single patient [9, 10, 2]. Others used schemes coupling pre-training using adjacent patients' data with a final training phase using the patient of interest's data only [3, 12].

In this work, we investigate the possibility of building a single model that is able to predict future blood glucose levels for all 12 patients in the OhioT1DM data set, and the effectiveness of applying such a model to completely unseen data (i.e. blood glucose series from an unseen patient). We also investigate the impact of history length on future blood glucose level prediction. We experiment with various model types: linear models, feed-forward neural networks, and recurrent neural networks. Furthermore, inspired by advances in leveraging long distance temporal patterns for time series prediction [6], we attempt to build a long short-term memory (LSTM) model that is able to use information from very far in the past (up to 24 hours) by incorporating an attention mechanism. Finally, we compare the effectiveness of two methods for handling missing data during training.
===2 Method===

====2.1 Datasets====
As stated above, one of our primary aims was to build a single model that is able to predict future blood glucose levels for each patient in the data set. To this end, we constructed a combined data set containing data provided by each of the patients. Specifically, we created a data set composed of all of the data points contributed by the six patients included in the previous iteration of the challenge (both training and test sets), as well as the training data points provided by the new cohort of patients. This combined data set was split into training and validation sets: the final 20% of the data points provided by each patient were chosen for the validation set. The test data sets for the 6 new patients were ignored during development to avoid bias in the result. For experiments in building patient-specific models, training and validation sets were constructed using the patient in question's data only (again with an 80/20 split).

====2.2 Data preprocessing====
Prior to model training, the data were standardized according to:

<math>x = \frac{x - \mu_{train}}{\sigma_{train}}</math> (1)

where <math>\mu_{train}</math> and <math>\sigma_{train}</math> are the mean and standard deviation of the training data, respectively. There is a non-negligible amount of data missing from the training set, which needed to be considered when preprocessing the data. We investigated two approaches to handling missing data: discarding any training sequences with one or more missing data points; and replacing missing values with zeros following standardization. It was hypothesized that the second approach may help the system learn to be robust to missing data.
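To make the second missing-data strategy concrete, the following is a minimal NumPy sketch of standardizing with training-set statistics (Eq. 1) and then zero-filling missing readings. The array layout and function name are illustrative assumptions, not the authors' released code.

<pre>
import numpy as np

def standardize_and_zero_fill(series, train_series):
    """Standardize a glucose series using training-set statistics (Eq. 1),
    then replace missing readings (NaN) with zero, i.e. the
    post-standardization mean."""
    mu = np.nanmean(train_series)     # training-set mean, ignoring missing values
    sigma = np.nanstd(train_series)   # training-set standard deviation
    standardized = (series - mu) / sigma
    return np.nan_to_num(standardized, nan=0.0)

# Example: a few CGM readings (mg/dl) with one missing value
train = np.array([120.0, 135.0, np.nan, 150.0, 142.0])
print(standardize_and_zero_fill(train, train))
</pre>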
====2.3 Model training====
We experimented with linear models, feed-forward neural networks, and recurrent neural networks. Each model was trained to minimize the root mean square error (RMSE):

<math>RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}</math> (2)

where <math>\hat{y}_i</math> is the predicted value, <math>y_i</math> is the true value, and <math>n</math> is the total number of points in the evaluation set. Various hyper-parameters were tuned when training the feed-forward and recurrent neural networks; Table 1 provides a summary. During development, each model was trained for 50 epochs using the Adam optimizer (α=0.001, β1=0.9, β2=0.999) [5] and a batch size of 256. The final model was trained using the same optimizer settings and a batch size of 32, for a maximum of 500 epochs with early stopping and a patience value of 30. Each model was trained 5 times in order to estimate the influence of the random initialization and stochastic training process on the result. Model selection and hyper-parameter tuning were performed for the 30 minute prediction horizon task. The best performing model was then trained for 60 minute prediction. Experiments were repeated for blood glucose history lengths of 30 minutes, 1 hour, 2 hours, and 24 hours.

Table 1. Hyper-parameters tuned when training feed-forward, recurrent, and LSTM-with-attention networks. Not all possible combinations were tried: parameters marked with an asterisk were tuned after the optimal number of hidden units was chosen.
{|
! Model !! Hyper-parameters
|-
| Feed-forward || # hidden units ∈ {16, 32, 64, 128, 256, 512, 1024}; # layers* ∈ {1, 2}; activation function ∈ {ReLU}
|-
| Recurrent || # hidden units ∈ {16, 32, 64, 128, 256, 512, 1024}; recurrent cell* ∈ {LSTM, GRU}; # layers* ∈ {1, 2}; output dropout* ∈ {0, 0.1, 0.2, 0.5}
|-
| LSTM + Attention || # α_t hidden units ∈ {4, 8, 16, 32, 64, 128}
|}

====2.4 Improving long distance pattern learning with Attention====
It can be difficult for recurrent neural networks to learn long distance patterns. The LSTM network was introduced to address this problem [4]. Even so, LSTM networks can struggle to learn very long range patterns. Attention mechanisms, initially introduced in the context of neural machine translation, have been shown to improve LSTM networks' capacity for learning very long range patterns [7, 1]. Attention mechanisms have also been applied to time series data, and have proven effective in instances where the data exhibit long range periodicity, for example in electricity consumption prediction [6]. We hypothesised that blood glucose level prediction using a very long history, coupled with an attention mechanism, could lead to improved performance, due to periodic human behaviours (e.g. eating meals at similar times each day, walking to and from work, etc.). In order to test this hypothesis, we chose the best performing LSTM configuration trained with a history length of 24 hours, without attention, and added an attention mechanism as per [7]:

<math>score(h_t, h_i) = h_t^T W h_i</math> (3)

<math>\alpha_{ti} = \frac{\exp(score(h_t, h_i))}{\sum_{j=1}^{t}\exp(score(h_t, h_j))}</math> (4)

<math>c_t = \sum_{i} \alpha_{ti} h_i</math> (5)

<math>a_t = f(c_t, h_t) = \tanh(W_c[c_t; h_t])</math> (6)

where score(h_t, h_i) is an alignment model, c_t is the weighted context vector, and a_t is the attention vector, which is fed into the classification layer (without an attention mechanism, the hidden state h_t is fed into the classification layer). The dimensionality of α_t was chosen using the validation set; see Table 1 for details. We also experimented with attention mechanisms in LSTM networks designed to process shorter sequence lengths: we chose the optimal LSTM model architecture for each history length (30 minutes, 60 minutes, 2 hours), added an attention mechanism, and re-trained the model (again, the optimal α_t dimensionality was chosen using the validation set).
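A minimal PyTorch sketch of the attention variant described by Eqs. (3)-(6) is given below: the final LSTM hidden state attends over all hidden states in the input window, and the resulting attention vector a_t feeds the output layer. The layer sizes, the univariate input, and all names are illustrative assumptions rather than the authors' implementation; the six-step toy window assumes a 30-minute history of 5-minute CGM readings.

<pre>
import torch
import torch.nn as nn

class LuongAttentionLSTM(nn.Module):
    """Single-layer LSTM with a Luong-style 'general' attention head,
    following Eqs. (3)-(6): the final hidden state attends over all hidden
    states in the input window before the output (regression) layer."""
    def __init__(self, hidden_size=128, attn_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)   # Eq. (3)
        self.Wc = nn.Linear(2 * hidden_size, attn_size)            # Eq. (6)
        self.out = nn.Linear(attn_size, 1)                         # prediction layer

    def forward(self, x):                       # x: (batch, time, 1)
        H, _ = self.lstm(x)                     # all hidden states h_1..h_t
        h_t = H[:, -1:, :]                      # final hidden state
        scores = torch.bmm(self.W(H), h_t.transpose(1, 2))          # Eq. (3)
        alpha = torch.softmax(scores, dim=1)                        # Eq. (4)
        c_t = (alpha * H).sum(dim=1)                                # Eq. (5)
        a_t = torch.tanh(self.Wc(torch.cat([c_t, h_t.squeeze(1)], dim=-1)))  # Eq. (6)
        return self.out(a_t)                    # predicted future glucose level

# Example: batch of 8 windows, 6 readings each (30-minute history at 5-minute sampling)
model = LuongAttentionLSTM()
pred = model(torch.randn(8, 6, 1))
print(pred.shape)   # torch.Size([8, 1])
</pre>

Here attn_size plays the role of the α_t dimensionality tuned in Table 1.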
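The training recipe of Section 2.3 (RMSE objective, Adam with the stated settings, early stopping with a patience of 30) could be sketched as below. The toy model, full-batch updates, and random data are placeholders for illustration only, not the authors' pipeline.

<pre>
import torch
import torch.nn as nn

def rmse(pred, target):
    # root mean square error, Eq. (2)
    return torch.sqrt(nn.functional.mse_loss(pred, target))

# Illustrative stand-in model: 6-step history in, one glucose value out
model = nn.Sequential(nn.Flatten(), nn.Linear(6, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

x_train, y_train = torch.randn(512, 6, 1), torch.randn(512, 1)   # toy windows
x_val, y_val = torch.randn(128, 6, 1), torch.randn(128, 1)

best_val, patience, bad_epochs = float("inf"), 30, 0
for epoch in range(500):                       # maximum of 500 epochs
    model.train()
    optimizer.zero_grad()
    loss = rmse(model(x_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val = rmse(model(x_val), y_val).item()
    if val < best_val:
        best_val, bad_epochs = val, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:             # early stopping on validation RMSE
            break
</pre>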
====2.5 Investigating the need for personal data during training====
In order to investigate the need for an individual's data when training a model to predict their future blood glucose levels, we trained 6 different models, each with one patient's training data excluded from the training set, using the optimal LSTM architecture determined in previous experiments. The models were then evaluated using the test data for the patient that was excluded from the training set. We also trained 6 patient-specific models, each trained using the patient's training data only. We again used the optimal architecture determined in previous experiments, but tuned the number of hidden units using the validation set in order to avoid over-fitting due to the significantly reduced size of the training set (compared with the set with which the optimal architecture was chosen). Each model was trained using the early-stopping procedure outlined in 2.3.
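The three training regimes compared here (patient only, all patients, patient excluded) can be outlined as follows; the toy `windows` dictionary and helper name are illustrative stand-ins, not the authors' data pipeline.

<pre>
# Sketch of the three data-source regimes, assuming `windows` maps patient IDs
# to lists of (history, target) training windows. Contents are toy stand-ins.
windows = {540: ["w540_a", "w540_b"], 544: ["w544_a"], 552: ["w552_a"]}

def build_training_sets(windows, held_out_patient):
    """Return the 'patient only', 'all patients', and 'patient excluded' sets."""
    patient_only = list(windows[held_out_patient])
    all_patients = [w for pid in windows for w in windows[pid]]
    patient_excluded = [w for pid in windows if pid != held_out_patient
                        for w in windows[pid]]
    return patient_only, all_patients, patient_excluded

# Each set would then train an identical single-layer LSTM, evaluated on the
# held-out patient's test windows.
print(build_training_sets(windows, 540))
</pre>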
===3 Results and Discussion===
Our evaluation showed that recurrent models performed significantly better than both linear and feed-forward neural network models, for each history length we experimented with (p=0.05, corrected paired t-test [11]). We also found that feed-forward networks generally outperformed linear models, likely due to their ability to model non-linear relationships. The optimal feed-forward network contained 512 hidden units. We found no difference between LSTM and gated recurrent unit (GRU) networks; the remaining evaluations were performed for LSTM networks only, for simplicity. Figure 1 compares the performance of the different model types as a function of history length. For each model class we observed that performance decreased linearly with increasing history length. The LSTM appeared better able to deal with longer history lengths: the performance degradation was less severe than for the other model classes. We found a history length of 30 minutes to be optimal for each model class. The best performing LSTM model contained a single layer with 128 hidden units, and was trained without dropout. The test set results for this model are listed in Table 2.

Figure 1. Comparison of validation set scores for linear, feed-forward, and LSTM neural networks as a function of history length.

Table 2. Root mean square error (RMSE) and mean absolute error (MAE) in mg/dl computed using the test points for each patient, at prediction horizons (PH) of 30 and 60 minutes, for a single layer LSTM with 128 hidden units.
{|
! Patient ID !! RMSE (PH=30) !! RMSE (PH=60) !! MAE (PH=30) !! MAE (PH=60)
|-
| 540 || 21.03 (0.07) || 37.37 (0.09) || 16.64 (0.1) || 30.8 (0.13)
|-
| 544 || 16.14 (0.12) || 28.4 (0.14) || 12.85 (0.11) || 23.57 (0.16)
|-
| 552 || 15.82 (0.06) || 27.6 (0.15) || 12.43 (0.12) || 22.78 (0.16)
|-
| 567 || 20.29 (0.08) || 34.28 (0.18) || 15.9 (0.12) || 28.95 (0.13)
|-
| 584 || 20.39 (0.07) || 32.97 (0.09) || 15.99 (0.03) || 27.04 (0.07)
|-
| 596 || 15.7 (0.03) || 25.99 (0.12) || 12.4 (0.04) || 21.33 (0.13)
|-
| AVG || 18.23 (2.36) || 31.1 (4.05) || 14.37 (1.83) || 25.75 (3.43)
|}

Table 3 compares LSTM models (with the same architecture as above) trained with the following data sources: the individual patient's data only, data from all of the available patients, and data from every other patient (excluding data contributed by the patient in question). We observed that models trained using a large amount of data, but excluding the patient's data, outperformed models trained using the patient's data only (p=0.05). We also found no significant difference in performance between models trained using all of the available training data (i.e. including the patient's data) and those trained excluding the patient's data, highlighting the general nature of the models.

Table 3. Root mean square error (mg/dl) computed using the test points for each patient, with a prediction horizon of 30 minutes, for a single layer LSTM with 128 hidden units trained using different data sets: the individual patient's data only ("Patient only"), data from all patients ("All patients"), and data from all patients except the patient being evaluated ("Patient excluded").
{|
! Patient ID !! Patient only !! All patients !! Patient excluded
|-
| 540 || 21.68 (0.04) || 21.03 (0.07) || 21.16 (0.11)
|-
| 544 || 17.28 (0.1) || 16.14 (0.12) || 16.22 (0.09)
|-
| 552 || 16.87 (0.12) || 15.82 (0.06) || 15.87 (0.09)
|-
| 567 || 21.15 (0.3) || 20.29 (0.08) || 20.5 (0.11)
|-
| 584 || 22.11 (0.13) || 20.39 (0.07) || 20.46 (0.06)
|-
| 596 || 16.16 (0.11) || 15.7 (0.03) || 15.71 (0.02)
|-
| AVG || 19.21 (2.48) || 18.23 (2.36) || 18.32 (2.4)
|}

We found that including sequences with missing values in the training set produced models that were more robust to missing data, as evidenced by the improved RMSE scores listed in Table 4: RMSE scores were significantly improved for each patient using this approach to training (p=0.05).

Table 4. Root mean square error (mg/dl) computed using the test points for each patient, with a prediction horizon of 30 minutes, for a single layer LSTM with 128 hidden units trained using different methods of handling missing data. The "Exclude missing data" column corresponds to a model trained with full sequences only (any sequences with missing values were discarded); the "Include missing data" column corresponds to a model trained with sequences including missing values, where missing values were replaced with zeros following standardization.
{|
! Patient ID !! Exclude missing data !! Include missing data
|-
| 540 || 21.45 (0.06) || 21.03 (0.07)
|-
| 544 || 16.79 (0.06) || 16.14 (0.12)
|-
| 552 || 16.27 (0.13) || 15.82 (0.06)
|-
| 567 || 21.19 (0.1) || 20.29 (0.08)
|-
| 584 || 21.16 (0.06) || 20.39 (0.07)
|-
| 596 || 16.08 (0.07) || 15.7 (0.03)
|-
| AVG || 18.82 (2.45) || 18.23 (2.36)
|}

Incorporating an attention mechanism further improved performance in most instances: we observed significant improvements for history lengths of 30 minutes, 60 minutes, and 2 hours (p=0.05), but not for a history length of 24 hours. Figure 3 compares the regular LSTM and the LSTM with attention as a function of history length. Figure 2 shows partial auto-correlation plots for 4 different patients. Interestingly, two of the patients' blood glucose data (patient 540 and patient 544) do not show any significant long term correlation, whereas the other two (patient 552 and patient 567) both exhibit significant correlation at time lags of approximately 6 and 12 hours. We observed this behaviour in half of the patients. We also observed correlations at even greater time lags, corresponding to multiples of 6 hours. The difference in the patients' partial auto-correlation plots suggests it may be sub-optimal to train an attention mechanism using every patient's data at once, and that training at the patient level may enable the model to learn very long range patterns.

Figure 2. Partial auto-correlation plots for 4 different patients. The patients in the top row exhibit short term patterns only, but those in the bottom row show significant correlations at time lags of approximately 6 and 12 hours.

Figure 3. Comparison of validation set RMSE scores (prediction horizon = 30 minutes) for LSTMs and LSTMs incorporating an attention mechanism, as a function of history length.
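The partial auto-correlation analysis behind Figure 2 can be reproduced in outline with statsmodels; the synthetic series below and the assumption of 5-minute CGM sampling (so a 6-hour lag corresponds to 72 steps) are purely illustrative.

<pre>
import numpy as np
from statsmodels.tsa.stattools import pacf

# Toy glucose-like series with a daily (24-hour, 288-step) cycle plus noise
rng = np.random.default_rng(0)
t = np.arange(4032)                                   # two weeks of 5-minute readings
series = 140 + 25 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 10, t.size)

# Partial auto-correlation coefficients up to roughly a 13-hour lag
coeffs = pacf(series, nlags=160)
print("PACF at 6-hour lag (72 steps):", round(coeffs[72], 3))
print("PACF at 12-hour lag (144 steps):", round(coeffs[144], 3))
</pre>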
Furthermore, while we were able to train an LSTM model with a history length of 30 minutes that generalized across all patients, it may be the case that short range blood glucose patterns are quite general while long range patterns are more personalized, and that tuning the history length per patient could improve prediction performance. All of the results presented in this section can be reproduced using publicly available code (https://github.com/robert-bevan/bglp).

===4 Conclusion===
In this study we showed that it is possible to train a single LSTM model that is able to predict future blood glucose levels for each of the different patients whose data are included in the OhioT1DM data set. We also demonstrated that an individual patient's data is not required during the training process in order for our model to effectively predict that patient's future blood glucose levels. Furthermore, we showed that incorporating an attention mechanism in the LSTM improved performance, and that including sequences with missing values during training produced models that were more robust to missing data.

===References===
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, ‘Neural machine translation by jointly learning to align and translate’, in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, (2015).
[2] Arthur Bertachi, Lyvia Biagi, Iván Contreras, Ningsu Luo, and Josep Vehí, ‘Prediction of blood glucose levels and nocturnal hypoglycemia using physiological models and artificial neural networks’, in Proceedings of the 3rd International Workshop on Knowledge Discovery in Healthcare Data, pp. 85–90.
[3] Jianwei Chen, Kezhi Li, Pau Herrero, Taiyu Zhu, and Pantelis Georgiou, ‘Dilated recurrent neural network for short-time prediction of glucose concentration’, in Proceedings of the 3rd International Workshop on Knowledge Discovery in Healthcare Data, pp. 69–73.
[4] Sepp Hochreiter and Jürgen Schmidhuber, ‘Long short-term memory’, Neural Computation, 9(8), 1735–1780, (1997).
[5] Diederik P. Kingma and Jimmy Ba, ‘Adam: A method for stochastic optimization’, in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, (2015).
[6] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu, ‘Modeling long- and short-term temporal patterns with deep neural networks’, CoRR, abs/1703.07015, (2017).
[7] Thang Luong, Hieu Pham, and Christopher D. Manning, ‘Effective approaches to attention-based neural machine translation’, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 1412–1421, (2015).
[8] Cynthia R. Marling and Razvan C. Bunescu, ‘The OhioT1DM dataset for blood glucose level prediction’, in Proceedings of the 3rd International Workshop on Knowledge Discovery in Healthcare Data, pp. 60–63, (2018).
[9] John Martinsson, Alexander Schliep, Bjorn Eliasson, Christian Meijner, Simon Persson, and Olof Mogren, ‘Automatic blood glucose prediction with confidence using recurrent neural networks’, in Proceedings of the 3rd International Workshop on Knowledge Discovery in Healthcare Data, pp. 64–68.
[10] Cooper Midroni, Peter J. Leimbigler, Gaurav Baruah, Maheedhar Kolla, Alfred J. Whitehead, and Yan Fossat, ‘Predicting glycemia in type 1 diabetes patients: Experiments with XGBoost’, in Proceedings of the 3rd International Workshop on Knowledge Discovery in Healthcare Data, pp. 79–84.
[11] C. Nadeau and Y. Bengio, ‘Inference for the generalization error’, Machine Learning, 52(3), 239–281, (September 2003).
[12] Taiyu Zhu, Kezhi Li, Pau Herrero, Jianwei Chen, and Pantelis Georgiou, ‘A deep learning algorithm for personalized blood glucose prediction’, in Proceedings of the 3rd International Workshop on Knowledge Discovery in Healthcare Data, pp. 74–78.