The OhioT1DM Dataset for Blood Glucose Level Prediction

Cindy Marling and Razvan Bunescu
School of Electrical Engineering and Computer Science
Ohio University
Athens, Ohio, 45701, USA
{marling,bunescu}@ohio.edu

Abstract

This paper documents the OhioT1DM Dataset, which was developed to promote and facilitate research in blood glucose level prediction. It contains eight weeks’ worth of continuous glucose monitoring, insulin, physiological sensor, and self-reported life-event data for each of six people with type 1 diabetes. An associated graphical software tool allows researchers to visualize the integrated data. The paper details the contents and format of the dataset and tells interested researchers how to obtain it.

1 Introduction

Blood glucose level (BGL) prediction is a challenging task for AI researchers, with the potential to improve the health and wellbeing of people with diabetes. Knowing in advance when blood glucose is approaching unsafe levels provides time to proactively avoid hypo- and hyper-glycemia and their concomitant complications. The drive to perfect an artificial pancreas [Juvenile Diabetes Research Foundation (JDRF), 2018] has increased the interest in using machine learning (ML) approaches to improve prediction accuracy. Work in this area has been hindered, however, by a lack of real patient data; some researchers have only been able to work on simulated patient data.

To promote and facilitate research in blood glucose level prediction, we have curated the OhioT1DM Dataset and made it publicly available for research purposes. To the best of our knowledge, this is the first publicly available dataset to include continuous glucose monitoring, insulin, physiological sensor, and self-reported life-event data for people with type 1 diabetes.

The OhioT1DM Dataset contains eight weeks’ worth of data for each of six people with type 1 diabetes. All data contributors were between 40 and 60 years of age at the time of the data collection. Two were male, and four were female. All were on insulin pump therapy with continuous glucose monitoring (CGM). They wore Medtronic 530G insulin pumps and used Medtronic Enlite CGM sensors throughout the 8-week data collection period. They reported life-event data via a custom smartphone app and provided physiological data from a Basis Peak fitness band.

The dataset includes: a CGM blood glucose level every 5 minutes; blood glucose levels from periodic self-monitoring of blood glucose (finger sticks); insulin doses, both bolus and basal; self-reported meal times with carbohydrate estimates; self-reported times of exercise, sleep, work, stress, and illness; and 5-minute aggregations of heart rate, galvanic skin response (GSR), skin temperature, air temperature, and step count.

The rest of this paper provides background information, details the data format, describes the OhioT1DM Viewer visualization software, and tells how to obtain the OhioT1DM Dataset and Viewer for research purposes.

2 Background

We have been working on intelligent systems for diabetes management for over a decade [Schwartz et al., 2008; Marling et al., 2009; Marling et al., 2012; Bunescu et al., 2013; Plis et al., 2014; Marling et al., 2016; Mirshekarian et al., 2017]. As part of our work, we have run five clinical research studies involving subjects with type 1 diabetes on insulin pump therapy. Over 50 anonymous subjects have provided blood glucose, insulin, and life-event data so that we could develop software intended to help people with diabetes and their professional health care providers.

Throughout the years, we have received numerous requests to share the data with other researchers. Our most recent study was designed so that de-identified data could be shared with the research community. All data contributors to the OhioT1DM dataset signed informed consent documents allowing us to share their de-identified data with outside researchers. This agreement clearly delineated what types of data could be shared and with whom. The data in the dataset was fully de-identified according to the Safe Harbor method, a standard specified by the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule [Office for Civil Rights, 2012]. To protect the data and ensure that it is used only for research purposes, a Data Use Agreement (DUA) must be executed before a researcher can obtain the data.

3 OhioT1DM Data Format

In the OhioT1DM Dataset, the data contributors are referred to by ID numbers 559, 563, 570, 575, 588 and 591. Numbers 563 and 570 were male, while numbers 559, 575, 588 and 591 were female. For each data contributor, there is one XML file for training and development data and a separate XML file for testing data. This results in a total of 12 XML files, two for each of the six contributors. Table 1 shows the number of training and test examples for each contributor.

Table 1: Number of Training and Test Examples per Contributor

Contributor   Training Examples   Test Examples
559           10796               2514
563           12124               2570
570           10982               2745
575           11866               2590
588           12640               2791
591           10847               2760

Each XML file contains the following data fields:

1. The patient ID number and insulin type. Weight is set to 99 as a placeholder, as actual patient weights are unavailable.
2. Continuous glucose monitoring (CGM) data, recorded every 5 minutes.
3. Blood glucose values obtained through self-monitoring by the patient.
4. The rate at which basal insulin is continuously infused. The basal rate begins at the specified timestamp ts, and it continues until another basal rate is set.
5. A temporary basal insulin rate that supersedes the patient’s normal basal rate. When the value is 0, this indicates that the basal insulin flow has been suspended. At the end of a temp basal, the basal rate goes back to the normal basal rate.
6. Insulin delivered to the patient, typically before a meal or when the patient is hyperglycemic. The most common type of bolus, normal, delivers all insulin at once. Other bolus types can stretch out the insulin dose over the period between ts_begin and ts_end.
7. The self-reported time and type of a meal, plus the patient’s carbohydrate estimate for the meal.
8. The times of self-reported sleep, plus the patient’s subjective assessment of sleep quality: 1 for Poor; 2 for Fair; 3 for Good.
9. Self-reported times of going to and from work. Intensity is the patient’s subjective assessment of physical exertion, on a scale of 1 to 10, with 10 being most physically active.
10. Time of self-reported stress.
11. Time of self-reported hypoglycemic episode. Symptoms are not available, although there is a slot for them in the XML file.
12. Time of self-reported illness.
13. Time and duration, in minutes, of self-reported exercise. Intensity is the patient’s subjective assessment of physical exertion, on a scale of 1 to 10, with 10 being most physically active.
14. Heart rate, aggregated every 5 minutes.
15. Galvanic skin response, also known as skin conductance, aggregated every 5 minutes.
16. Skin temperature, in degrees Fahrenheit, aggregated every 5 minutes.
17. Air temperature, in degrees Fahrenheit, aggregated every 5 minutes.
18. Step count, aggregated every 5 minutes.
19. Times when the Basis band reported that the subject was asleep, along with its estimate of sleep quality.

Note that, in de-identifying the dataset, all dates were shifted by the same random amount of time into the future. The days of the week and the times of day were maintained in the new timeframes. However, the months were shifted, so that it is not possible to consider the effects of seasonality or of holidays.

4 The OhioT1DM Viewer

The OhioT1DM Viewer is a visualization tool that opens an XML file from the OhioT1DM Dataset and graphically displays the integrated data. It aids in developing intuition about the data and also in debugging. For example, if a system makes a poor blood glucose level prediction at a particular point in time, viewing the data at that time might illuminate a cause. For example, the subject might have forgotten to report a meal or might have been feeling ill or stressed.

Figure 1 shows a screenshot from the OhioT1DM Viewer. The data is displayed one day at a time, from midnight to midnight. Controls allow the user to move from day to day and to toggle any type of data off or on for targeted viewing.

The bottom pane shows blood glucose, insulin, and self-reported life-event data. CGM data is displayed as a mostly blue curve, with green points indicating hypoglycemia. Finger sticks are displayed as red dots. Boluses are displayed along the horizontal axis as orange and yellow circles. The basal rate is indicated as a black line. Temporary basal rates appear as red lines. Self-reported sleep is indicated by blue regions. Life-event icons appear at the top of the pane as dots, squares, and triangles. The data in the bottom pane is clickable, so that additional information about any data point can be displayed. For example, clicking on a meal (a square blue icon) displays the timestamp, type of meal, and carbohydrate estimate.

The top pane displays data from the Basis Peak fitness band. Blue regions in the top pane are times that the fitness band reported that the subject was asleep. The step count is indicated by vertical blue lines. The curves show heart rate (red), galvanic skin response (green), skin temperature (gold), and air temperature (cyan).

Figure 1: Screenshot from the OhioT1DM Viewer.

5 Obtaining the Dataset and Viewer

The OhioT1DM Dataset and Viewer are initially available to participants in the Blood Glucose Level Prediction (BGLP) Challenge of the Third International Workshop on Knowledge Discovery in Healthcare Data at IJCAI-ECAI 2018, in Stockholm, Sweden. After the BGLP Challenge, these resources become publicly available to other researchers. To protect the data and ensure that it is used only for research purposes, a Data Use Agreement (DUA) is required. A DUA is a binding document signed by legal signatories of Ohio University and the researcher’s home institution. As of this writing, researchers can request a DUA at https://sites.google.com/view/kdhd-2018/bglp-challenge. Once a DUA is executed, the Dataset and Viewer will be directly released to the researcher.

6 Conclusion

The OhioT1DM Dataset was developed to promote and facilitate research in blood glucose level prediction. Accurate blood glucose level predictions could positively impact the health and well-being of people with diabetes. In addition to their role in the artificial pancreas project, such predictions could also enable other beneficial applications, such as decision support for avoiding impending problems, “what if” analysis to project the effects of different lifestyle choices, and enhanced blood glucose profiles to aid in individualizing diabetes care. It is our hope that sharing this Dataset will help to advance the state of the art in blood glucose level prediction.

Acknowledgments

This work was supported by grant 1R21EB022356 from the National Institutes of Health (NIH). The OhioT1DM Viewer was implemented by Robin Kelby, based on earlier visualization software built by Hannah Quillin and Charlie Murphy. The authors gratefully acknowledge the contributions of Emeritus Professor of Endocrinology Frank Schwartz, MD, a pioneer in building intelligent systems for diabetes management. We would also like to thank our physician collaborators, Aili Guo, MD, and Amber Healy, DO, our research nurses, Cammie Starner and Lynn Petrik, and our past and present graduate and undergraduate research assistants. We are especially grateful to the anonymous individuals with type 1 diabetes who shared their data, enabling the creation of this dataset.

References

[Bunescu et al., 2013] R. Bunescu, N. Struble, C. Marling, J. Shubrook, and F. Schwartz. Blood glucose level prediction using physiological models and support vector regression. In Proceedings of the Twelfth International Conference on Machine Learning and Applications (ICMLA), pages 135–140. IEEE Press, 2013.

[Juvenile Diabetes Research Foundation (JDRF), 2018] Juvenile Diabetes Research Foundation (JDRF). Artificial Pancreas, 2018. Available at http://www.jdrf.org/research/artificial-pancreas/, accessed June, 2018.

[Marling et al., 2009] C. Marling, J. Shubrook, and F. Schwartz. Toward case-based reasoning for diabetes management: A preliminary clinical study and decision support system prototype. Computational Intelligence, 25(3):165–179, 2009.

[Marling et al., 2012] C. Marling, M. Wiley, R. Bunescu, J. Shubrook, and F. Schwartz. Emerging applications for intelligent diabetes management. AI Magazine, 33(2):67–78, 2012.

[Marling et al., 2016] C. Marling, L. Xia, R. Bunescu, and F. Schwartz. Machine learning experiments with noninvasive sensors for hypoglycemia detection. In IJCAI 2016 Workshop on Knowledge Discovery in Healthcare Data, New York, NY, 2016.

[Mirshekarian et al., 2017] S. Mirshekarian, R. Bunescu, C. Marling, and F. Schwartz. Using LSTMs to Learn Physiological Models of Blood Glucose Behavior. In Proceedings of the 39th International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC’17), pages 2887–2891, Jeju Island, Korea, 2017.

[Office for Civil Rights, 2012] Office for Civil Rights. Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) privacy rule, 2012. Available at https://www.hhs.gov/sites/default/files/ocr/privacy/hipaa/understanding/coveredentities/De-identification/hhs_deid_guidance.pdf, accessed June, 2018.

[Plis et al., 2014] K. Plis, R. Bunescu, C. Marling, J. Shubrook, and F. Schwartz. A machine learning approach to predicting blood glucose levels for diabetes management. In Modern Artificial Intelligence for Health Analytics: Papers Presented at the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 35–39. AAAI Press, 2014.

[Schwartz et al., 2008] F. L. Schwartz, J. H. Shubrook, and C. R. Marling. Use of case-based reasoning to enhance intensive management of patients on insulin pump therapy. Journal of Diabetes Science and Technology, 2(4):603–611, 2008.
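
Supplementary sketch. The basal-rate semantics described in Section 3 (a basal rate holds from its timestamp until another basal rate is set, and a temporary basal rate supersedes it and then reverts) can be illustrated with a short Python example. This is a minimal sketch, not code from the dataset or the Viewer. It works on plain tuples rather than on the XML files, and the tuple layouts, the function name effective_basal_rate, and the half-open interval for a temporary basal are all assumptions made for illustration.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical in-memory layouts (assumptions, not the dataset's XML schema):
BasalEvent = Tuple[datetime, float]            # (ts, rate)
TempBasal = Tuple[datetime, datetime, float]   # (begin, end, rate)


def effective_basal_rate(
    t: datetime,
    basal_events: List[BasalEvent],
    temp_basals: List[TempBasal],
) -> Optional[float]:
    """Return the basal rate in effect at time t, or None if none is known."""
    # A temporary basal supersedes the normal rate while it is active.
    # Treating the interval as [begin, end) is an assumption of this sketch.
    for begin, end, rate in temp_basals:
        if begin <= t < end:
            return rate

    # Otherwise the most recent normal basal rate set at or before t applies,
    # because each rate continues until another basal rate is set.
    current = None
    for ts, rate in sorted(basal_events):
        if ts > t:
            break
        current = rate
    return current


if __name__ == "__main__":
    day = datetime(2030, 1, 1)  # arbitrary illustrative date
    basal_events = [(day.replace(hour=0), 0.8), (day.replace(hour=6), 1.0)]
    # A temporary rate of 0 means basal insulin flow is suspended.
    temp_basals = [(day.replace(hour=8), day.replace(hour=9), 0.0)]

    for hour in (5, 7, 8, 9):
        t = day.replace(hour=hour)
        print(t.time(), effective_basal_rate(t, basal_events, temp_basals))
```

Checking temporary rates first keeps the lookup simple: when no temporary basal is active, the function falls through to the normal rate, which matches the statement that the rate goes back to the normal basal rate when a temp basal ends.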