Blood Glucose Level Prediction as Time-Series Modeling using Sequence-to-Sequence Neural Networks

Ananth Bhimireddy [1], Priyanshu Sinha [2], Bolu Oluwalade [3], Judy Wawira Gichoya [4] and Saptarshi Purkayastha [5]

[1] Indiana University Purdue University Indianapolis, USA, email: anbhimi@iupui.edu
[2] Mentor Graphics India Pvt. Ltd., India, email: priyanshusinha@outlook.com
[3] Indiana University Purdue University Indianapolis, USA, email: boluwala@iupui.edu
[4] Emory University School of Medicine, USA, email: judywawira@emory.edu
[5] Indiana University Purdue University Indianapolis, USA, email: saptpurk@iupui.edu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. The management of blood glucose levels is critical in the care of Type 1 diabetes subjects. In extremes, high or low levels of blood glucose are fatal. To avoid such adverse events, wearable technologies that continuously monitor blood glucose and administer insulin have been developed and adopted. This technology allows subjects to easily track their blood glucose levels, enabling early intervention and preventing the need for hospital visits. The data collected from these sensors is an excellent candidate for the application of machine learning algorithms to learn patterns and predict future values of blood glucose levels. In this study, we developed artificial neural network algorithms based on the OhioT1DM training dataset, which contains data on 12 subjects. The dataset contains features such as subject identifiers, continuous glucose monitoring data obtained at 5-minute intervals, insulin infusion rate, etc. We developed individual models, including LSTM, BiLSTM, Convolutional LSTMs, TCN, and sequence-to-sequence models. We also developed transfer learning models based on the most important features of the data, as identified by a gradient boosting algorithm. These models were evaluated on the OhioT1DM test dataset, which contains data from 6 unique subjects. The model with the lowest RMSE values at the 30- and 60-minute horizons was selected as the best performing model. Our results show that the sequence-to-sequence BiLSTM performed better than the other models. This work demonstrates the potential of artificial neural network algorithms in the management of Type 1 diabetes.

Keywords. Blood glucose prediction, Time-series model, Wearable devices, Transfer learning

1 INTRODUCTION

Diabetes Mellitus is a chronic disease characterized by high blood glucose levels. According to the 2020 CDC National Diabetes Statistics Report, about 34.2 million people, or 10.2 percent of Americans, have diabetes [3]. Diabetes is classified as Type 1, Type 2, and Gestational Diabetes. Insulin is a hormone produced in the pancreas that helps in blood glucose absorption into cells. In Type 1 diabetes (T1DM), the pancreas produces little or no insulin. In contrast, in Type 2 diabetes (T2DM), the pancreas produces a small amount of insulin, or the body is resistant to the effect of insulin. This absence of or resistance to insulin leads to an increase in blood glucose levels, known as hyperglycemia. Symptoms of hyperglycemia include excessive thirst, excessive urination, sweating, etc. Diabetic ketoacidosis is a serious complication of uncontrolled hyperglycemia that can lead to death. On the other hand, elevated insulin levels in the body can cause low levels of blood glucose, a condition known as hypoglycemia. Dizziness, weakness, coma, or eventually death can occur in uncontrolled hypoglycemia. Insulin and glucose control are critical to the management of diabetes, and hence titration of the administered insulin doses is critical in the management of diabetes patients.

Glucose levels vary according to the patient's diet and activities throughout the day. Sensors have been developed to estimate blood glucose levels at various time intervals. These sensors are useful in diabetes management because they provide longitudinal data about subjects' blood glucose and show distinctive patterns throughout the day. These sensors are frequently coupled with insulin pumps that deliver short-acting insulin continuously (basal rate) and a specific insulin quantity after a meal for appropriate glycemic control. Although the sensors and insulin pumps have helped to improve patient care, patients are typically unaware of an impending adverse event of severe hyperglycemia or hypoglycemia. These adverse effects commonly occur when patients are asleep. There is an opportunity for the development of accurate prediction models that use previously collected sensor data to estimate future values of blood glucose levels and prevent the occurrence of adverse events.

In this study, we utilize the OhioT1DM dataset, which contains blood glucose values of twelve T1DM subjects collected at intervals over a total time span of eight weeks [13]. These individuals had an insulin pump with continuous glucose monitoring (CGM), wore a physical activity band, and self-reported life events using a smartphone application. CGM blood glucose data were obtained at 5-minute intervals [12]. We developed multiple models for predicting glucose values at 30 and 60 minutes in the future, using the CGM values and mean Root Mean Square Error (RMSE) as an evaluation metric. The code for this study can be found at https://github.com/iupui-soic/bglp2

2 RELATED WORK

LSTM and RNN models have been used for forecasting in [10], which has since been improved upon. That paper explores short-term load forecasting for individual electric customers and proposes an LSTM-based framework to tackle the issue. Machine learning models such as XGBoost have been used to predict glycemia in Type 1 diabetic patients in [14]; that paper experiments primarily with the XGBoost algorithm to predict blood glucose levels at a 30-minute horizon in the OhioT1DM dataset. Features from pre-trained TimeNets have been used for clinical predictions in [7]. That paper pre-trains a network on a supervised or unsupervised task and then fine-tunes it via transfer learning for a related end-task, to make the most of limited labeled data. It points out that training deep learning models such as RNNs and LSTMs requires large labeled datasets and is computationally expensive.

Deep learning models like Recurrent Neural Networks (RNN) have been used on the OhioT1DM dataset to predict future blood glucose values [16], including in the BGLP Challenge at KDH@IJCAI-ECAI 2018 (http://ceur-ws.org/Vol-2148/). In some cases, these data-driven models use only the CGM values; in others, they use physiological data such as the insulin concentration, the amount of carbohydrate in meals, and physical activities. Chen et al. created a data-driven 3-layer dilated recurrent neural network model with a mean RMSE of 18.9, ranging between 15.2995 and 22.7104 [4]. They concluded that missing data and continuous fluctuations in the data influenced the model's performance. Their model bettered the Convolutional Neural Network (CNN) that gave an average RMSE of 21.726 for six subjects [21]. Bertachi et al. predicted blood glucose levels using Artificial Neural Networks with the inclusion of physiological data [1]. Their results were not significantly dissimilar to those obtained by the data-driven models. All the previous studies demonstrate that a lower RMSE is obtained at the 30-minute prediction horizon when compared to the 60-minute horizon. We postulate that a hybrid approach combining both the data-driven and physiological models could improve on the performance of the individual models, and we incorporate this in our approach.

3 METHODS

3.1 Dataset Description

A detailed description of the dataset has been previously published in the OhioT1DM dataset paper [12]. We used the data provided on 12 subjects for training, and 6 subjects for testing. Furthermore, the parameters basal and temp basal were merged into a single parameter.

We converted both the training and testing datasets from XML to CSV, preserving the time intervals. We did not use interpolation on the datasets, as the rules of the competition prohibit interpolation. We tried using forward and backward filling to fill the null values in the datasets, but these created additional time intervals, which becomes a problem in the testing datasets. Therefore, no re-sampling technique was used in this paper, preserving the time intervals of the samples. Data pre-processing was performed only on the 6 subjects (test and train datasets) whose results are to be predicted. Subject 548 has the highest number of training records (12,150) and Subject 552 has the lowest (9,080). The number of features varied from subject to subject, which causes unevenness when training time-series models, so we added features to subjects as required to ease the process of training and predicting. Missing data is handled in all columns by imputing the null values with zeros (0). We did not alter the glucose value column, as required by the competition.
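To make the pre-processing concrete, the following is a minimal sketch of the imputation and parameter-merging steps described above. The file path and column names are hypothetical stand-ins for the OhioT1DM fields, and the merge rule (a reported temporary basal rate overriding the regular one) is our assumption rather than a detail given in the text.

```python
import pandas as pd

# Hypothetical per-subject CSV produced by the XML-to-CSV conversion.
df = pd.read_csv("subject_552_train.csv")

# Merge basal and temp_basal into a single parameter: where a temporary
# basal rate is reported, assume it overrides the regular basal rate.
df["basal_rate"] = df["temp_basal"].fillna(df["basal"])
df = df.drop(columns=["basal", "temp_basal"])

# Zero-impute nulls in every column except the CGM glucose values,
# which are left untouched as required by the competition rules.
feature_cols = [c for c in df.columns if c != "glucose_level"]
df[feature_cols] = df[feature_cols].fillna(0)
```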
3.2 Description of ML Models

We used the following deep learning models to predict the blood glucose levels of each subject. The data pre-processing and model development are summarized in Figure 1.

Figure 1. Process Architecture

3.2.1 Long Short-Term Memory (LSTM) Networks

LSTMs were originally introduced by Hochreiter and Schmidhuber [8] and later refined and popularized [17] [20]. LSTMs are a special kind of RNN, capable of learning long-term dependencies. This quality helps LSTMs memorize the useful parts of a sequence so that the model learns parameters more efficiently, making them useful for time-series models.

We trained two models using LSTMs, one with 5-minute interval data and another with 30-minute interval data. In each model, we used all the available features at time t to predict the glucose value at time t+1. Before fitting, we scaled the dataset using MinMaxScaler from scikit-learn [15].

The LSTM model was built using the Keras [6] platform. We used 128 LSTM units, followed by a dense layer (150 units), a dropout layer (0.20), a dense layer (100 units), a dropout layer (0.15), a dense layer (50 units), a dense layer (20 units), and a final layer with one unit (for prediction). We used ReLU as the activation function with the Adam optimizer. The loss was calculated as mean squared error (MSE) and later converted into Root Mean Squared Error (RMSE). The model was trained for 200 epochs with a batch size of 32. The results of the model for each subject are provided in the results table.
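A minimal Keras sketch of this architecture is shown below. The placeholder arrays, the feature count, and the single-timestep reshape of the input are our assumptions; the layer sizes, activation, optimizer, loss, epochs, and batch size follow the description above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Placeholder data standing in for one subject's feature matrix and targets.
X_raw = np.random.rand(1000, 6)   # (samples, features) -- hypothetical
y = np.random.rand(1000)          # glucose value at t+1 -- hypothetical

# Scale features to [0, 1] before fitting, as described above.
X = MinMaxScaler().fit_transform(X_raw)
X = X.reshape((-1, 1, X.shape[1]))  # (samples, timesteps=1, features)

model = Sequential([
    LSTM(128, input_shape=(1, X.shape[2])),
    Dense(150, activation="relu"),
    Dropout(0.20),
    Dense(100, activation="relu"),
    Dropout(0.15),
    Dense(50, activation="relu"),
    Dense(20, activation="relu"),
    Dense(1),  # predicted glucose value at t+1
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=200, batch_size=32, verbose=0)

# RMSE is recovered from the MSE loss for reporting.
rmse = np.sqrt(model.evaluate(X, y, verbose=0))
```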
3.2.2 Bi-directional Long Short-Term Memory (BiLSTM) Networks

As our input data is static and the entire sequence is available at once, we implemented a BiLSTM model to observe how processing the sequence from both directions affected the accuracy. The architecture of the BiLSTM model is similar to the LSTM model and is thus also useful for time-series prediction.

The data processing and model parameters for the BiLSTM and LSTM models were the same, with one exception in the model's first layer, where the scaled data was fed into a Bidirectional LSTM with 128 units. The model was trained for 200 epochs with a batch size of 32. The results of the model for each subject are provided in the results table.
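Only the first layer changes relative to the LSTM sketch above; a sketch of that change follows, with the feature count again a placeholder.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense, Dropout

n_features = 6  # placeholder, matching the scaled input above

model = Sequential([
    # The only change from the LSTM model: wrap the first layer in a
    # Bidirectional wrapper so the sequence is processed in both directions.
    Bidirectional(LSTM(128), input_shape=(1, n_features)),
    Dense(150, activation="relu"),
    Dropout(0.20),
    Dense(100, activation="relu"),
    Dropout(0.15),
    Dense(50, activation="relu"),
    Dense(20, activation="relu"),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```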
3.2.3 Temporal Convolutional Networks (TCN)

TCNs were originally introduced by Lea et al. [11] in 2016. TCNs are extremely useful in capturing high-level temporal relationships in sequential data, and the TCN architecture allows capturing long-range spatio-temporal relationships. TCNs help capture the blood glucose levels of subjects who usually have routine lifestyles, as they can capture hierarchical relationships at low, intermediate, and high time scales.

The data processing steps are similar to the LSTM model. The TCN model was built using the Keras platform, but the model is relatively shallower than the LSTM and BiLSTM models. The scaled data was fed into a TCN layer and then connected to a dense layer with one unit for the output. The model used the Adam optimizer and MSE for calculating the loss, which was later converted to RMSE. The model was trained for 10 epochs, and the obtained results are provided in the results table.

3.2.4 Convolutional LSTM

Convolutional LSTMs (ConvLSTM) were introduced by Xingjian Shi et al. [18] in 2015. ConvLSTMs are created by extending the fully connected LSTM to have a convolutional structure in both the input-to-state and state-to-state transitions. ConvLSTM networks capture spatio-temporal correlations better and usually outperform fully connected LSTM networks.

The scaled data was reshaped and fed into a convolution layer with 32 filters of kernel size 1, followed by an LSTM layer with 128 units, a dense layer with 150 units, a dropout layer with a 0.2 dropout rate, a dense layer with 100 units, a dropout layer with a 0.15 rate, a dense layer with 50 units, a dense layer with 16 units, and finally a dense layer with 1 unit for prediction. The ReLU activation function was used in all layers. The model used the Adam optimizer, and the MSE loss was further converted to RMSE for model comparison. The model was trained for 200 epochs with a batch size of 32. The obtained results are provided in the results table.
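The sketch below realizes the convolution-plus-LSTM stack listed above as a Conv1D layer feeding an LSTM, which is one way to read that layer list in Keras (as opposed to Keras's ConvLSTM2D layer); the timestep and feature counts are placeholders.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, LSTM, Dense, Dropout

n_timesteps, n_features = 1, 6  # placeholders for the reshaped scaled input

model = Sequential([
    # Convolution over the input window: 32 filters of kernel size 1.
    Conv1D(filters=32, kernel_size=1, activation="relu",
           input_shape=(n_timesteps, n_features)),
    LSTM(128),
    Dense(150, activation="relu"),
    Dropout(0.20),
    Dense(100, activation="relu"),
    Dropout(0.15),
    Dense(50, activation="relu"),
    Dense(16, activation="relu"),
    Dense(1),  # predicted glucose value
])
model.compile(optimizer="adam", loss="mse")
```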
3.2.5 Description of Sequence-to-Sequence Models

Sequence-to-sequence models were first introduced at Google in 2014 [19]. These models map an input sequence to an output sequence, where the lengths of the input and output may differ. Sequence-to-sequence models consist of three parts:

• Encoder: The encoder consists of a stack of several recurrent (LSTM) units, where each unit takes a single element of the input sequence, extracts information from it, and propagates it to the next unit.

• Encoded Vector: This is the intermediate step and the final hidden state of the encoder. It also acts as the first hidden state of the decoder. This vector encapsulates the information from all input elements and provides it to the decoder to make predictions.

• Decoder: This consists of a stack of recurrent units, where each unit predicts an output at time step t. Each unit accepts the previous hidden state as input and produces an output as well as its own hidden state.

The main advantage of this architecture is that it can map sequences of different lengths to each other. We applied three variants of sequence-to-sequence models, namely sequence-to-sequence LSTM, sequence-to-sequence Bi-LSTM, and sequence-to-sequence CNN-LSTM.

For training the sequence-to-sequence models, we split the data into windows of 60 minutes. This approach is intuitive and helpful, as the blood glucose values can be predicted one hour ahead. It is also helpful while modelling, as the model can be used to predict blood glucose values at specific time points (say, after 10 minutes) or a whole sequence of blood glucose values.

To evaluate these sequence-to-sequence models, we used walk-forward validation: the model predicts the next one hour, and then the actual data for that hour is given to make the prediction for the following hour. See Table 1 for an illustration.

Table 1. Description of sequence-to-sequence input and prediction values

Input                               Prediction
1st 60 minutes data                 2nd 60 minutes data
[1st + 2nd] 60 minutes data         3rd 60 minutes data
[1st + 2nd + 3rd] 60 minutes data   4th 60 minutes data

For our training, we kept our input size (the number of prior observations required to make the next predictions) at 30 minutes of data to predict the next 60 minutes of data. Each sequence-to-sequence model used in our work is described below; a code sketch of the shared encoder-decoder pattern follows the list.

1. Sequence-to-Sequence LSTM: In this model, we used 200 LSTM cells for the encoder and decoder. This layer was followed by 2 dense layers containing 150 and 1 units, wrapped in a TimeDistributed layer. The model was trained for 80 epochs with a batch size of 40. We used the Adam optimizer [9] with a learning rate of 0.01 and MSE as the loss function.

2. Sequence-to-Sequence Bi-LSTM: In this model, we used 100 Bi-LSTM cells (LSTM cells wrapped in a Bidirectional wrapper) for the encoder and decoder. This layer was followed by 2 dense layers containing 150 and 1 units, wrapped in a TimeDistributed layer. The model was trained for 80 epochs with a batch size of 40. As in the sequence-to-sequence LSTM, the Adam optimizer was used with a 0.01 learning rate and mean squared error as the loss function.

3. Sequence-to-Sequence CNN-LSTM: In this model, we used two 1D convolutional layers with 128 and 64 filters, respectively. The convolutional layers were followed by a 1D max-pooling layer and a flatten layer for the encoder. 200 LSTM cells were used for the decoder. This layer was followed by 2 dense layers containing 100 and 1 units, wrapped in a TimeDistributed layer. The model was trained for 80 epochs with a batch size of 40. As in the above models, the Adam optimizer was used with a 0.01 learning rate and mean squared error as the loss function.
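Below is a minimal sketch of the sequence-to-sequence LSTM variant, assuming 5-minute CGM samples so that the 30-minute input window is 6 steps and the 60-minute output window is 12 steps. The RepeatVector bridge is one common Keras realization of the encoder-decoder pattern described above, not necessarily the exact implementation used here; the CGM series is a placeholder.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, RepeatVector, TimeDistributed
from tensorflow.keras.optimizers import Adam

n_in, n_out = 6, 12  # 30 min of 5-min CGM samples in, 60 min out

def make_windows(cgm, n_in=6, n_out=12):
    """Slide over a CGM series to build (input, target) sequence pairs."""
    X, y = [], []
    for i in range(len(cgm) - n_in - n_out + 1):
        X.append(cgm[i:i + n_in])
        y.append(cgm[i + n_in:i + n_in + n_out])
    return np.array(X)[..., None], np.array(y)[..., None]

X, y = make_windows(np.random.rand(500))  # placeholder CGM series

model = Sequential([
    LSTM(200, input_shape=(n_in, 1)),   # encoder: summarizes the input window
    RepeatVector(n_out),                # encoded vector, repeated per output step
    LSTM(200, return_sequences=True),   # decoder: one state per output step
    TimeDistributed(Dense(150, activation="relu")),
    TimeDistributed(Dense(1)),          # one glucose prediction per 5-min step
])
model.compile(optimizer=Adam(learning_rate=0.01), loss="mse")
model.fit(X, y, epochs=80, batch_size=40, verbose=0)
```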
3.2.6 Transfer Learning

For transfer learning, we first found the most relevant and common features among all 12 subjects. The importance of the features was determined using a gradient boosting algorithm. We set the cumulative frequency threshold to 0.99 for feature selection, and the 5 most important common features are: 'finger stick value', 'basal rate value', 'galvanic skin response value', 'skin temperature value', and 'bolus dose value'.

Using only the above-identified important features, we trained our model on a randomly selected subject (567) and subsequently fine-tuned each sequence-to-sequence model on each subject. The final model was used for prediction on the test data. The configuration of each model was the same as that of the corresponding sequence-to-sequence model described above.
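The following sketch illustrates this selection step. The paper does not state which gradient boosting implementation was used, so scikit-learn's GradientBoostingRegressor and the placeholder data are our assumptions; the 0.99 cumulative threshold follows the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

X_train = np.random.rand(500, 8)                    # placeholder feature matrix
y_train = np.random.rand(500)                       # placeholder glucose targets
feature_names = [f"feature_{i}" for i in range(8)]  # placeholder names

# Rank features by gradient-boosting importance and keep the smallest set
# whose cumulative importance reaches the 0.99 threshold.
gbr = GradientBoostingRegressor().fit(X_train, y_train)
order = np.argsort(gbr.feature_importances_)[::-1]
cumulative = np.cumsum(gbr.feature_importances_[order])
k = int(np.searchsorted(cumulative, 0.99)) + 1
selected = [feature_names[i] for i in order[:k]]
print(selected)
```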
4 RESULTS

Figures 2 and 3 show the comparison between the actual and the predicted values obtained from our best performing model, the sequence-to-sequence BiLSTM. From the figures, it is clear that our model closely predicts the values of the test data, following similar peaks and troughs.

Figure 2. Actual and predicted values of subject 552 at the 30-minute horizon

Figure 3. Actual and predicted values of subject 584 at the 30-minute horizon

The RMSE values of the 30-minute horizon predictions of the models are presented in Tables 3, 4, and 5. From the tables, the sequence-to-sequence (Seq2Seq) models performed better than all the other models, with an average RMSE of 23.5 for the Seq2Seq LSTM, 21.8 for the Seq2Seq BiLSTM, and 23.0 for the Seq2Seq CNN-LSTM. Table 2 reports the RMSE and MAE values for the sequence-to-sequence BiLSTM model, which performed better than the individual and transfer learning models.

Table 2. RMSE and MAE results of the sequence-to-sequence BiLSTM model

          RMSE                MAE
Subject   30 Mins   60 Mins   30 Mins   60 Mins
584       29.7      42.6      18.1      30.0
567       20.7      35.1      14.4      24.9
596       18.6      28.3      12.7      19.3
552       18.2      30.0      13.3      22.2
544       19.8      32.9      13.7      23.1
540       24.3      41.4      17.8      31.0
Mean      21.8      35.0      15.0      25.0
SD        4.0       5.4       2.1       4.2

Table 3. RMSE values for individual models at the 30-minute horizon

Subject   LSTM    BiLSTM   TCN     ConvLSTM
584       27.97   29.06    26.55   26.57
567       25.65   27.04    25.71   26.04
596       19.47   20.30    18.95   21.08
552       20.62   20.30    17.14   17.73
544       21.34   22.06    80.67   20.94
540       30.21   31.72    25.94   27.35
Mean      25.0    24.4     34.7    23.2
SD        3.99    4.98     21.13   3.43

Table 4. RMSE values for sequence-to-sequence models at the 30-minute horizon

Subject   Seq2Seq LSTM   Seq2Seq BiLSTM   Seq2Seq CNN-LSTM
567       29.2           20.7             29.0
540       25.0           24.3             23.1
544       18.5           19.8             19.2
596       18.6           18.6             19.0
584       30.7           29.7             31.0
552       19.2           18.1             17.2
Mean      23.5           21.8             23.0
SD        5.0            4.0              5.2

Table 5. RMSE values for sequence-to-sequence transfer learning models at the 30-minute horizon

Subject   Seq2Seq LSTM   Seq2Seq BiLSTM   Seq2Seq CNN-LSTM
567       33.5           30.8             32.6
540       39.7           32.7             44.2
544       23.1           31.5             21.8
596       19.2           18.0             17.9
584       36.8           37.3             38.1
552       21.8           14.3             17.7
Mean      29.0           27.4             28.7
SD        7.9            8.3              10.2

From Table 2, the RMSE varies from 18.2 for subject 552 to 29.7 for subject 584 at 30 minutes, and from 28.3 for subject 596 to 42.6 for subject 584 at 60 minutes. The MAE values lie between 12.7 and 18.1 at 30 minutes and between 19.3 and 31.0 at 60 minutes.

Subject 584 in Figure 3 shows more fluctuations, with higher peaks and lower troughs, than subject 552 in Figure 2. The effect of these variations is reflected in our results, as subject 584 has the highest RMSE values while subject 552 has the lowest. It is evident that the level of variation in the individual subjects contributes significantly to the differences in their RMSE values in our model.

5 CONCLUSION AND FUTURE SCOPE

In this paper, we present the results of applying deep learning models to predict blood glucose values. Potential benefits, such as the prevention of adverse events associated with extreme glucose values, serve as a source of motivation for these efforts. Overall, the sequence-to-sequence models, especially the Bi-LSTM, have the best performance, as these models are best at mapping sequences irrespective of their lengths. Our performance is affected by fluctuations in glucose values and by missing data, as described in previous experiments. Given the overall success of transfer learning, we also evaluated the potential of single-model prediction via a transfer learning approach. The transfer learning approach was inferior to the sequence-to-sequence models.

Compared to the previous papers for the BGLP Challenge, we observed that two papers, [2] and [5], have better results than ours. However, [5] used interpolation as part of their data processing, which is against the rules of the competition, and [2] did not mention details about data processing. Our future work will be to improve the transfer learning model as more common features among all subjects become available, so that we can create a generic model for predicting blood glucose levels. However, the development of a generic model can be challenging because of confounding factors such as variations in sensor types, lifestyles, physiology, and genetics. It is therefore pertinent that these factors be considered in future endeavors.

REFERENCES

[1] Arthur Bertachi, Lyvia Biagi, Iván Contreras, Ningsu Luo, and Josep Vehí, 'Prediction of blood glucose levels and nocturnal hypoglycemia using physiological models and artificial neural networks', in KHD@IJCAI, pp. 85–90, (2018).
[2] Arthur Bertachi, Lyvia Biagi, Iván Contreras, Ningsu Luo, and Josep Vehí, 'Prediction of blood glucose levels and nocturnal hypoglycemia using physiological models and artificial neural networks', in KHD@IJCAI, (2018).
[3] CDC, National Diabetes Statistics Report 2020. Estimates of diabetes and its burden in the United States, 2020.
[4] Jianwei Chen, Kezhi Li, Pau Herrero, Taiyu Zhu, and Pantelis Georgiou, 'Dilated recurrent neural network for short-time prediction of glucose concentration', in KHD@IJCAI, pp. 69–73, (2018).
[5] Jianwei Chen, Kezhi Li, Pau Herrero, Taiyu Zhu, and Pantelis Georgiou, 'Dilated recurrent neural network for short-time prediction of glucose concentration', in KHD@IJCAI, (2018).
[6] François Chollet et al., Keras. https://keras.io, 2015.
[7] Priyanka Gupta, Pankaj Malhotra, Lovekesh Vig, and Gautam Shroff, 'Using features from pre-trained TimeNet for clinical predictions', (07 2018).
[8] Sepp Hochreiter and Jürgen Schmidhuber, 'Long short-term memory', Neural Comput., 9(8), 1735–1780, (November 1997).
[9] Diederik P. Kingma and Jimmy Ba, 'Adam: A method for stochastic optimization', 2014.
[10] W. Kong, Z. Y. Dong, Y. Jia, D. J. Hill, Y. Xu, and Y. Zhang, 'Short-term residential load forecasting based on LSTM recurrent neural network', IEEE Transactions on Smart Grid, 10(1), 841–851, (Jan 2019).
[11] Colin Lea, René Vidal, Austin Reiter, and Gregory D. Hager, 'Temporal convolutional networks: A unified approach to action segmentation', CoRR, abs/1608.08242, (2016).
[12] Cindy Marling and Razvan Bunescu, 'The OhioT1DM dataset for blood glucose level prediction: Update 2020'.
[13] Cindy Marling and Razvan C. Bunescu, 'The OhioT1DM dataset for blood glucose level prediction', in KHD@IJCAI, pp. 60–63, (2018).
[14] Cooper Midroni, Peter Leimbigler, Gaurav Baruah, Maheedhar Kolla, Alfred Whitehead, and Yan Fossat, 'Predicting glycemia in type 1 diabetes patients: Experiments with XGBoost', (07 2018).
[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, 'Scikit-learn: Machine learning in Python', Journal of Machine Learning Research, 12, 2825–2830, (2011).
[16] M. Sangeetha and M. Senthil Kumaran, 'Deep learning-based data imputation on time-variant data using recurrent neural network', Soft Computing, 1–12, (2020).
[17] Alex Sherstinsky, 'Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network', CoRR, abs/1808.03314, (2018).
[18] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo, 'Convolutional LSTM network: A machine learning approach for precipitation nowcasting', CoRR, abs/1506.04214, (2015).
[19] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, 'Sequence to sequence learning with neural networks', in Advances in Neural Information Processing Systems, pp. 3104–3112, (2014).
[20] Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu, 'Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling', in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 3485–3495, Osaka, Japan, (December 2016). The COLING 2016 Organizing Committee.
[21] Taiyu Zhu, Kezhi Li, Pau Herrero, Jianwei Chen, and Pantelis Georgiou, 'A deep learning algorithm for personalized blood glucose prediction', in KHD@IJCAI, pp. 64–78, (2018).