Compositional Data Analysis of Type 1 Diabetes Data Lyvia Biagi1,2 , Arthur Bertachi1,2 , Josep Antoni Martı́n-Fernández1 , Josep Vehı́1,3 1 University of Girona, Spain 2 Federal University of Technology - Paraná (UTFPR), Guarapuava, Brazil 3 Centro de Investigación Biomédica en Red de Diabetes y Enfermedades Metabólicas Asociadas (CIBERDEM), Spain lyviar@utfpr.edu.br, abertachi@utfpr.edu.br, josepantoni.martin@udg.edu, josep.vehi@udg.edu Abstract The integration of CGM and CSII allows not only the visu- alization of glucose data in real time, which permits that the Type 1 Diabetes (T1D) is a chronic disease char- patient take measures in order to pursue glucose control, but acterized by a substantial reduction or no produc- also the acquisition of data, which supports the extraction of tion of insulin by the pancreas. Subjects with T1D information that can help physicians to improve and adapt the need to infuse insulin exogenously to avoid high current insulin therapy. Contreras et al. [2016] developed a blood glucose levels (hyperglycemia). However, if tool for profiling BG dynamics of T1D patients based on dif- insulin is delivered exaggeratedly, subjects may ex- ferent levels of BG. Even though the authors did not consider perience low glucose levels (hypoglycemia). Both any insulin information in advance, they obtained groups of conditions are undesirable, and physicians pre- glucose profiles with different insulin requirements. The pos- scribe individualized insulin therapy to cope with sibility of categorizing daily glycemic profiles according to the disturbances that jeopardize glycemic control, the glycemic control can assist physicians to find different e.g. meals, physical activity, stress, illness. This patterns of days and determine proper therapies to deal with work presents a novel methodology for the individ- day-to-day variations. ual classification of glucose profiles obtained from The analysis of time spent in distinct BG ranges defined by continuous glucose monitoring. Daily glucose pro- different levels during one day can be performed with Com- files were discretized into time spent in distinct positional Data Analysis (CoDA). The analysis of Composi- glucose ranges according to different glucose lev- tional Data (CoDa) deals with vectors in the form of propor- els and formed a composition. Compositional data tions to some whole [Aitchison, 1982]. The time spent in (CoDa) analysis was applied to the data and the different glucose ranges are relative contributions to the 24- discovery of groups of similar compositions was h time budget, and therefore, are compositional data, which possible. Data was expressed in coordinates and have important characteristics that must be considered. The k-means algorithm was applied to the coordinates aim of this paper is to present a proof of concept of the feasi- to classify different patterns of days. This classifi- bility of the usage of CoDA for the categorization of glucose cation allowed the extraction of information of the profiles in patients with T1D. data, which can assist physicians to adjust patients’ treatments. 2 Compositional Data Analysis of Glucose Time Series 1 Introduction Considering a compositional vector x = [x1 , x2 , . . . , xD ], Individuals with type 1 diabetes (T1D) need to inject insulin where D is the number of parts, x1 , x2 , . . . , xD are positive PD to assure blood glucose (BG) control. Insulin must be prop- components and i=1 xi = C, where C is a closure con- erly dosed either by multiple daily injections or continuous stant. The set of real positive vectors closed to a constant subcutaneous insulin infusion (CSII) in order to avoid both C is called simplex (S D ) [Aitchison, 1982]. As data car- hypo- and hyperglycemia. These situations over time can re- ries only relative information, the time spent in each glucose sult in different complications, such as seizures, coma, kid- range could be measured either as percentages of the day or ney, liver and heart damage, and even death [Agiostratidou et minutes, or hours. The relevant information of a composition al., 2017]. is contained in the ratios between its components. Even though present T1D technology combines continu- This work considers a dataset obtained from one T1D pa- ous glucose monitoring (CGM) and CSII, achieving optimal tient who wore for approximately eight weeks the Paradigm glycemic control in T1D patients is still a hurdle due to the Veo system with second generation of the Enlite CGM sensor great intra-patient variability, which is resultant from char- (Medtronic Minimed, Northridge, CA, USA). Glucose data acteristics of individual behavior and a complex metabolic obtained from the CGM was used to validate the proposal. system [Kovatchev and Cobelli, 2016]. Insulin data obtained from the pump was also considered dur- ing the preprocessing of the data and analysis of results. The real space after the expression in coordinates, which allows CGM system used by the patient recorded BG measurements the application of traditional statistical methods [Aitchison, every five minutes, thus, a complete day is supposed to have 1986]. The centred log-ratio (clr) transformation projects S D 288 samples. For several reasons, such as CGM or insulin to the real space RD . It was introduced in Aitchison [1986] pump malfunction during periods of the days, the quantity as the logarithm of the ratio of each part over the geometric of samples per day, starting at 00:00 and ending at 23:55 is mean, and it is defined in (1). The isometric log-ratio (ilr) sometimes non-uniform, due to the missing values. transformation expresses x in terms of its orthonormal log- Data was preprocessed as follows: it was decided to con- ratio coordinates. It was introduced in Egozcue et al. [2003]. sider days as valid only if each six-hour period contained at An ilr vector can be viewed as the coordinates of a compo- least 70% of the possible values for both glucose and insulin sition with respect to an orthonormal basis e1 , e2 , . . . , eD−1 data. After that, valid days starting at 00:00 and ending were on the simplex [Pawlowsky-Glahn et al., 2015], as described selected for analysis. in (2). The 24-h glucose time series is analyzed considering five glucose ranges defined according to the standardized clinical        levels of hypo- and hyperglycemia described in Agiostratidou x1 x2 xD clr(x) = ln , ln , . . . , ln (1) et al. [2017]: g(x) g(x) g(x) • Hypoglycemia Level 1: 54 mg/dL ≤ BG < 70 mg/dL ilr (x) = clr (x) · Φt (2) Level 2: BG < 54 mg/dL where g(x) is the geometric mean of x, Φ is the (D − 1) × • Hyperglycemia D-matrix whose i-th row is the vector clr (ei ), for i = Level 1: 180 mg/dL < BG ≤ 250 mg/dL 1, . . . , D − 1. Level 2: BG > 250 mg/dL 2.3 Compositional biplot Thus, the glucose ranges considered were: <54 mg/dL, 54-70 mg/dL, 70-180 mg/dL, 180-250 A representation of any matrix by means of a 2-rank ap- mg/dL, >250 mg/dL, obtaining the composition proximation is possible with a biplot [Gabriel, 1971]. The x = (G 54, G 54 70, G 70 180, G 180 250, G 250). adaptation of the biplot for use with CoDa is a useful ex- Missing values were assumed to be evenly distributed ploratory tool [Pawlowsky-Glahn et al., 2015]. The qual- between the existing ranges of the day in analysis. The ity of the (D − 1)-dimensional ilr-transformed representa- 24-h glucose profile was split into time spent in the five tion in a two-dimensional graph is expressed by the cum- aforementioned glucose ranges. Time spent in different mulative total variance of the biplot of the clr-transformed ranges are relative contributions to the 24-h glucose profile, data. The discovery of potential clusters of compositions is codependent, and therefore, should be analyzed as CoDa. possible with the clr-biplot [Pawlowsky-Glahn et al., 2015; Palarea-Albaladejo et al., 2012]. 2.1 Treatment of zeros Figure 1 shows the biplot of days of the patient in analysis, The log-ratio methodology, which is the basis of CoDA, must obtained with CoDaPack [Comas-Cufı́ and Thió-Henestrosa, be preceded by a proper treatment of zero values. That is be- 2011]. The origin of the biplot represents the centre of the cause both operations, logarithms and ratios, require non-zero compositional dataset. Five vertices, each one related to the elements in the data matrix. Martı́n-Fernández et al. [2011] clr-coordinate of the part correspondent to the time spent in describe different zero problems in their work. One example each glucose range are connected to the origin through five of a zero problem are rounded zeros, which cannot be ob- rays (in red). Each marker represents a single day. Markers served because their value is is below some detection limit close to a ray of determined clr-coordinate indicate that those (DL). days are characterized by high values in that clr-coordinate, The analysis of the zeros of the dataset was performed i.e. it means that the individual spent relatively high time in [Palarea-Albaladejo and Martı́n-Fernández, 2015]. Given the the glucose range correspondent to that clr-coordinate, com- sampling frequency of the CGM (one sample every 5 min- paring to the geometric average of time spent in all glucose utes), and that the missing data was evenly distributed among ranges. The cummulative total variance retained by the biplot the ranges of the data available for that day, the DL was set to is equal to 92.19%, the high value of the variance retained 5 minutes. means that the biplot provides a good representation of the The imputation of rounded zeros from continuous data can data in the real space. It is possible to infer the existence of be done with the log-ratio Expectation-Maximization (EM) groups. replacement [Palarea-Albaladejo and Martı́n-Fernández, According to Palarea-Albaladejo et al. [2012], it is inap- 2015]. The values imputed by this method incorporate the propriate to apply the ordinary approach of clustering, due to information of the relative covariance structure. the constraints of the simplex space. However, it is possible to use distance based clustering techniques in CoDA, con- 2.2 Representation in Coordinates sidering that the Aitchison distance between compositions is CoDa must not be analyzed through standard multivariate equal to the Euclidean distance between the log-ratio coordi- analyses, which are designed for unconstrained multivari- nates. One way to obtain the log-ratio coordinates is through ate data [Aitchison, 1982]. CoDa can be analyzed in the the SBP [Egozcue and Pawlowsky-Glahn, 2005]. Figure 1: clr-biplot of days of one patient. Cummulative total vari- Figure 2: clr-biplot of days of one patient. Four groups obtained ance retained = 92.19%. after clustering, indicated by different colors and the letters A, B, C and D. 2.4 Cluster Analysis of this group are characterized by high values in the part cor- The log-ratio coordinates were obtained through the SBP of responding to BG between 54-70 mg/dL. Days of group C are the data and k-means algorithm [Hartigan and Wong, 1979] close to the rays of variables G 70 180 and G 180 250. This was applied to the coordinates to check for different patterns suggest that days of group C are characterized by relatively of days. This algorithm is based on squared Euclidean dis- high values in the parts corresponding to BG between 70-180 tance and we considered with 25 random repetitions of the mg/dL and 180-250 mg/dL. Days of group D are character- selection of initial centres. ized by relatively high values in the parts corresponding to K-means was tested for several numbers of groups. We BG >250 mg/dL. analyzed the groups of days obtained per patient in terms of Figure 3 shows the compositional geometric mean barplot, minimums and maximums of parts (times in different glu- which is useful as visualization tool for the analysis of cose ranges) and ratios (coordinates). The choice of number grouped data. The compositional centre of each group of data of groups took into account the representation obtained in the is compared with the centre of the whole dataset. Thus, pos- clr-biplot. The following measurements were also summa- itive bars reflect relatively mean values of a part above the rized per group: average blood glucose (Avg BG), BG varia- overall composition. Analogously, negative bars reflect rela- tion (BGV), number of level 1 and 2 hypoglycemic events per tively mean values of a part below the overall composition. day, where a hypoglycemic event is defined as three or more Days of group A are characterized by relatively high values CGM readings under the referred level, average time of level in the parts corresponding to BG <54 mg/dL and between 54- 1 and 2 hypo- and hyperglycemia per day. The information 70 mg/dL. regarding insulin therapy is also provided: total basal and bo- Days of group B are characterized by relatively high val- lus insulin, expressed in units of insulin (UI), number of bolus ues in the parts corresponding to between 54-70 mg/dL and per day (# Bolus), carbohydrates intake (CHO), expressed in relatively low values in the part corresponding to BG >250 exchanges (ex, where 1 exchange is equivalent to 10 grams mg/dL. Days of group B are characterized by the existence of of CHO), relation between bolus insulin and carbohydrates level 2 hypoglycemic events. (bolus:C) and time of pump suspension (min/day). Days of group C are characterized by relatively low values in the parts corresponding to BG <54 mg/dL, between 54-70 3 Results mg/dL and >250 mg/dL. On days of group C, the individual Figure 2 shows the clr-biplot of the days of one patient. Four spent relatively less time in the extreme glucose ranges (high groups are represented. Days of group A are very close to the and low). ray of variable G 54. This suggest that days of this group are Days of group D are also characterized by relatively low characterized by high values in this variable, i.e., days of this values in the parts corresponding to BG <54 mg/dL and be- group are characterized by relatively high values in the part tween 60-70 mg/dL, however, unlike days of groups C, days corresponding to BG < 54 mg/dL. Days of group B are very of group D are characterized by relatively high values in the close to the ray of variable G 54 70. This suggests that days part corresponding to BG >250 mg/dL. On days of group D, Table 1: Summary of clinically relevant outcomes for each group. A B C D # days per group 3 18 11 16 Avg BG (mg/dl) 150.28 140.40 151.33 181.44 BGV (mg/dl) 40.52 40.50 36.34 49.86 # Level 1 Hypo events 0.67 0.89 0.00 0.00 (events/day) Time Level 1 Hypoglycemia 23.11 33.53 0.00 0.00 (min/day) # Level 2 Hypo events 0.33 0.00 0.00 0.00 (events/day) Time Level 2 Hypoglycemia 14.06 0.00 0.00 0.00 (min/day) Time Level 1 Hyperglycemia 166.44 296.72 363.72 434.32 Figure 3: Compositional geometric mean barplot of groups of days (min/day) of one patient. Time Level 2 Hyperglycemia 107.39 18.61 0.65 199.55 (min/day) the individual spent relatively more time in the highest glu- Basal (UI) 18.50 19.46 19.91 19.95 cose range and less time in low glucose ranges. Bolus (UI) 21.27 20.65 20.22 24.33 Table 1 show some clinically relevant outcomes consider- # Bolus 4.33 7.00 6.64 7.25 ing the classification of the days. As suggested by the biplot CHO (ex) 12.17 14.43 13.23 13.71 of Figure 2, days of group A are those related to the occur- bolus:C (U/ex) 1.93 1.47 1.55 1.81 rence of level 2 hypoglycemic events. Days of this group Time Pump also presented fewer boluses than days of other groups and Suspension 98.33 67.50 49.55 30.63 also the highest time with the insulin delivery suspended by (min/day) the pump, however, the highest bolus:C ratio is presented for days of this group. Days of group B are characterized by the existence of level spent in different glucose ranges are relative contributions to 1 hypoglycemic events, but without the occurrence of level 2 the 24-h period, and therefore, should be analyzed as CoDa. hypoglycemic events. In these days, patients delivered more Although the proposed methodology has been applied in a boluses when compared with group A and also consumed dataset collected from a single T1D patient in a retrospective more CHO. way, this novel approach was able to classify the days into Days of group C present the lowest BGV between all different groups that reflect patient’s behavior. With this groups and there is no incidence of hypoglycemic events nor information, physicians may use this classification as an level 2 hyperglycemic events. However, days of this group analysis tool to adjust insulin therapy for each group of days are also characterized by the occurrence of level 1 hyper- to improve overall glycemic control. The method provides glycemia. tools for personalized analysis of the data, making possible Even achieving the highest Basal, Bolus, # Bolus between the comparison of characteristics of groups of days with all groups, days of group D present the highest Avg BG and in the whole dataset. Even though more effort is required for days of this group, individuals also spent more time in hyper- the categorization of days in real time, the analysis allows glycemia. Days of this group are characterized by relatively extraction of information of different groups of days, and high values in the parts corresponding to BG >250 mg/dL, as consequently, the inference of correction measures that showed both by the biplot of Figure 2 and by the barplot of should be taken for each specific group. Figure 3. It is very likely that patients’s behavior during these days require an adjustment on his/her insulin therapy, such as increasing the total amount of basal or increasing bolus:C Acknowledgments ratio. This project has been partially supported by the Spanish Gov- 4 Conclusion ernment MINECO through Grant DPI-2016-78831-C2-2-R and the National Council of Technological and Scientific De- The aim of this paper is to present the feasibility of the usage velopment, CNPq Brazil through Grants 202050/2015-7 and of CoDa for the analysis of glucose profiles of patients with 207688/2014-1. The authors thank the Spanish Consortium T1D. This methodology was based on the discretization of on Artificial Pancreas and Diabetes Technology for sharing daily glucose profiles considering different ranges. The time their database. References [Palarea-Albaladejo and Martı́n-Fernández, 2015] J Palarea- [Agiostratidou et al., 2017] Gina Agiostratidou, Henry An- Albaladejo and Josep Antoni Martı́n-Fernández. zCom- positions – R package for multivariate imputation of left- halt, Dana Ball, Lawrence Blonde, Evgenia Gourgari, censored data under a compositional approach. Chemo- Karen N. Harriman, Aaron J. Kowalski, Paul Madden, Ali- metrics and Intelligent Laboratory Systems, 143:85–96, cia H. McAuliffe-Fogarty, Molly McElwee-Malloy, Anne 2015. Peters, Sripriya Raman, Kent Reifschneider, Karen Ru- bin, and Stuart A. Weinzimer. Standardizing clinically [Palarea-Albaladejo et al., 2012] Javier Palarea-Albaladejo, meaningful outcome measures beyond hba1c for type 1 Josep Antoni Martı́n-Fernández, and Jesús A. Soto. Deal- diabetes: A consensus report of the american association ing with distances and transformations for fuzzy c-means of clinical endocrinologists, the american association of clustering of compositional data. Journal of Classification, diabetes educators, the american diabetes association, the 29(2):144–169, Jul 2012. end. . . . Diabetes Care, 40(12):1622–1630, 2017. [Pawlowsky-Glahn et al., 2015] Vera Pawlowsky-Glahn, [Aitchison, 1982] J Aitchison. The Statistical Analysis of Juan José Egozcue, and Raimon Tolosana-Delgado. Compositional Data. Journal of the Royal Statistical Modeling and Analysis of Compositional Data. John Society. Series B (Methodological) J. R. Statist. Soc. B, Wiley & Sons, 2015. 44(2):139–177, 1982. [Aitchison, 1986] J. Aitchison. The statistical analysis of compositional data. Monographs on statistics and applied probability. Chapman and Hall, 1986. Reprinted 2003 with additional material by The Blackburn Press. [Comas-Cufı́ and Thió-Henestrosa, 2011] Marc Comas-Cufı́ and Santi Thió-Henestrosa. CoDaPack 2.0: a stand-alone, multi-platform compositional software. In J.J. Egozcue, R. Tolosana-Delgado, and M.I Ortego, editors, CoDa- Work’11: 4th International Workshop on Compositional Data Analysis, Sant Feliu de Guı́xols, 2011. [Contreras et al., 2016] Iván Contreras, Carmen Quirós, Marga Giménez, Ignacio Conget, and Josep Vehi. Pro- filing intra-patient type I diabetes behaviors. Computer Methods and Programs in Biomedicine, 136:131–141, 2016. [Egozcue and Pawlowsky-Glahn, 2005] J J Egozcue and V Pawlowsky-Glahn. Groups of Parts and Their Balances in Compositional Data Analysis. Mathematical Geology, 37(7), 2005. [Egozcue et al., 2003] J. J. Egozcue, V. Pawlowsky-Glahn, G. Mateu-Figueras, and C. Barceló-Vidal. Isometric logra- tio transformations for compositional data analysis. Math- ematical Geology, 35(3):279–300, Apr 2003. [Gabriel, 1971] K R Gabriel. The Biplot Graphic Display of Matrices with Application to Principal Component Analy- sis. Source: Biometrika Biometrika Trust, 58(3):453–467, 1971. [Hartigan and Wong, 1979] J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. JSTOR: Applied Statis- tics, 28(1):100–108, 1979. [Kovatchev and Cobelli, 2016] Boris Kovatchev and Claudio Cobelli. Glucose variability: Timing, risk analysis, and relationship to hypoglycemia in diabetes. Diabetes Care, 39(4):502–510, 2016. [Martı́n-Fernández et al., 2011] Josep Antoni Martı́n- Fernández, Javier Palarea-Albaladejo, and Ricardo An- tonio Olea. Dealing with Zeros. Compositional Data Analysis: Theory and Applications, pages 43–58, 2011.