=Paper=
{{Paper
|id=Vol-2675/paper25
|storemode=property
|title=Analysis of the performance of Genetic Programming on the Blood Glucose Level Prediction Challenge 2020
|pdfUrl=https://ceur-ws.org/Vol-2675/paper25.pdf
|volume=Vol-2675
|authors=David Joedicke,Gabriel Kronberger,José Manuel Colmenar,Stephan Winkler,Jose Manuel Velasco,Sergio Contador,Ignacio Hidalgo
|dblpUrl=https://dblp.org/rec/conf/ecai/JoedickeKCWVCH20
}}
==Analysis of the performance of Genetic Programming on the Blood Glucose Level Prediction Challenge 2020==
David Joedicke¹,⁴, Oscar Garnica², Gabriel Kronberger¹, J. Manuel Colmenar³, Stephan Winkler⁴,¹, J. Manuel Velasco², Sergio Contador²,³, J. Ignacio Hidalgo²

¹ Josef Ressel Center for Symbolic Regression, Upper Austria University of Applied Sciences. Emails: david.joedicke@fh-ooe.at, Gabriel.Kronberger@fh-hagenberg.at, stephan.winkler@fh-ooe.at.
² Universidad Complutense de Madrid. Emails: ogarnica@ucm.es, mvelascc@ucm.es, hidalgo@ucm.es.
³ Rey Juan Carlos University. Email: josemanuel.colmenar@urjc.es.
⁴ Johannes Kepler University Linz.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. In this paper we present results for the Blood Glucose Level Prediction Challenge for the Ohio2020 dataset. We have used four variants of genetic programming to build white-box models for predicting 30 minutes and 60 minutes ahead. The results are compared to classical methods including multi-variate linear regression and random forests, as well as two types of ARIMA models. Notably, we have included future values of bolus and basal in some of the models because we assume that these values can be controlled. Additionally, we have used a convolution filter to smooth the information in the bolus volume feature. We find that overall tree-based GP performs well, better than multi-variate linear regression and random forest, while ARIMA models performed worst on the analyzed data.

1 INTRODUCTION

This paper describes our contribution to the Blood Glucose Level Prediction Challenge (BGLPC) for the Ohio2020 dataset described in [15]. We present a comparison among different algorithmic techniques related to linear regression applied to this glucose prediction problem. We highlight four of them, based on tree-based Genetic Programming (GP) [14]: GP, GP with offspring selection [1] (GP-OS), and a single-objective as well as a multi-objective variant of Grammatical Evolution [16], denoted as GE and MOGE. In addition, we present three approaches based on classical methods. In particular, we consider Random Forest [2], denoted as RF, a multi-variate linear regression, denoted as LR, and two ARIMA models [18], denoted as A-0 and A-1. All the methods are briefly described in the following section, as well as the data pre-processing we have performed. In data pre-processing, several features were derived from existing data and added to the dataset. The experimental results are analyzed in Section 3. We use root mean squared error (RMSE) and mean absolute error (MAE) as metrics to measure the accuracy of our results. Finally, the conclusions are drawn in Section 4.

2 ALGORITHMIC PROPOSAL

2.1 Data pre-processing

Data pre-processing proved to be challenging in this competition, as the exact rules of the competition were rather opaque, especially regarding the usage of future information and the difference between the online and the offline case. The main pitfalls were: (i) the set of features is different for the six data contributors, (ii) different sampling rates for features, (iii) variance in the duration between sampling values (e.g. blood glucose values are usually sampled every five minutes, but not always), and (iv) some missing values are encoded as zeros (e.g. zero values for skin temperature).

In the ARIMA models we only used the glucose level. For all the other models we used the following data pre-processing steps, implemented in a Python script that we applied to both training and testing data.

We used only the set of features which are available for all data contributors, even though we built six individual models. Correspondingly, we only used the following features: glucose level, basal, bolus type, bolus dose, galvanic skin response (gsr), and skin temperature. We used numerical encoding for the categorical variable bolus type. For the skin temperature we removed all zero values. For the basal value we replaced all missing values with zeros.

We incorporated lagged variables for our models (e.g. the glucose level five minutes ago). For this, we extended our dataset with lagged features, using a maximum lag of two hours. So, for each feature we produced 24 (120 min / 5 min) additional features. Hence, we require values at multiples of five minutes, which is not the case in the provided datasets. Therefore, we first prepared an intermediate, larger dataset with one row for every minute (equidistant sampling). In this dataset, we had to fill missing values for glucose level, galvanic skin response, and skin temperature. For the training data we used linear interpolation to fill these gaps; for the test data we used the last known value, since future values should not be used to predict the glucose value. Using the sub-sampled and interpolated dataset we prepared the lagged features, and finally we reduced the number of rows again by keeping only rows where we have a target glucose value (either 30 or 60 minutes ahead).

In our modelling efforts for GP and GP-OS, we assume that the basal value as well as the bolus type and dose can be controlled externally. This assumes an application of the model as part of a model-predictive controller for an insulin pump, whereby the goal is to optimize the automatic administration of insulin. Therefore, we have included "future information" for the blood glucose prediction. The variables that we assume to be controlled and known are: basal, bolus type, and bolus dose. For these variables we included forward-looking features up to the prediction horizon (6 features for 30 minutes and 12 features for 60 minutes).
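The equidistant resampling, gap filling, and lag construction described above can be sketched as follows. This is a minimal illustration using pandas; the function and column names are our own, not taken from the authors' script:

```python
import pandas as pd
import numpy as np

def make_lagged_features(df, features, step=5, max_lag=120, train=True):
    """Resample to a 1-minute grid, fill gaps, and add lagged columns.

    For training data, gaps are linearly interpolated; for test data, only
    the last known value is carried forward, so no future information
    leaks into the features.
    """
    # One row per minute (equidistant sampling).
    grid = df.resample("1min").mean()
    if train:
        grid[features] = grid[features].interpolate(method="linear")
    else:
        grid[features] = grid[features].ffill()
    # One lagged copy per multiple of `step` minutes, up to `max_lag`
    # (24 copies per feature for step=5, max_lag=120).
    for f in features:
        for lag in range(step, max_lag + step, step):
            grid[f"{f}_lag{lag}"] = grid[f].shift(lag)  # 1 row == 1 minute
    return grid

# Tiny example with a gap in the glucose series.
idx = pd.date_range("2020-01-01", periods=30, freq="1min")
raw = pd.DataFrame({"glucose": np.arange(30.0)}, index=idx)
raw.loc[idx[10], "glucose"] = np.nan
out = make_lagged_features(raw, ["glucose"], step=5, max_lag=20)
```

In a real pipeline one would afterwards drop the intermediate rows again, keeping only those with a target value 30 or 60 minutes ahead, as described above.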
Finally, we added features for smoothed bolus dose values using a convolution process. Even though the bolus dose is administered almost instantaneously, its effect is not immediate. Instead, the underlying dynamic uptake process has a longer-lasting, diminishing effect. We used a convolution function (Bateman function) to produce smoothed features for the bolus dose. For this smoothed bolus dose we also prepared lagged features (backwards and forwards) using the same scheme as described above.

2.2 Algorithms

After pre-processing the data as described above, we used machine learning methods to find models that describe future values of glucose after 30 minutes, ĝ_{t+30}, and after 60 minutes, ĝ_{t+60}, as a function of basal value (bv), bolus dose (bd), basis GSR value (gsr), basis skin temperature (sk), bolus type (bt), and glucose level (gl):

ĝ_{t+30/t+60} = f(bv(t−60...t), bd(t−60...t), ...)   (1)
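The Bateman-function smoothing of the bolus dose described in Section 2.1 can be illustrated with a two-exponential kernel. The rate constants below are arbitrary illustration values, not the parameters used by the authors:

```python
import numpy as np

def bateman_kernel(n, ka=0.08, ke=0.02, dt=1.0):
    """Discretized two-exponential (Bateman-type) impulse response,
    normalized to unit area. ka/ke are absorption/elimination rates per
    minute (illustrative values only)."""
    t = np.arange(n) * dt
    k = np.exp(-ke * t) - np.exp(-ka * t)
    return k / k.sum()

# A single 5-unit bolus at minute 30 on a 4-hour, 1-minute grid.
bolus = np.zeros(240)
bolus[30] = 5.0
kernel = bateman_kernel(180)
# Convolution spreads the dose forward in time: the smoothed feature
# has a delayed peak and a slow decay instead of a single spike.
smoothed = np.convolve(bolus, kernel)[: len(bolus)]
```

The total administered dose is preserved by the unit-area kernel; only its distribution over time changes, which matches the intent of the smoothed bolus features.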
We used seven different algorithms to model the function described in Equation (1). Linear Regression (LR) and Random Forest (RF) are well-known methods that are used as benchmarks for our models. Additionally, we used two GP algorithms, two GE algorithms, and two ARIMA models to predict the glucose value. Next, we detail our proposals⁵.

2.2.1 Genetic Programming

Symbolic regression (SR) is a specific method of regression analysis, where the model is represented as a closed-form mathematical expression [14]. A unique characteristic of SR is that the model structure does not have to be pre-specified. Instead, an SR solver (i.e. GP) automatically constructs mathematical expressions from the set of input variables (with their respective allowed time offsets) as well as mathematical operators and functions.

We use genetic programming (GP), an evolutionary technique that iteratively produces solutions for a given optimization problem. GP is specifically designed to find programs that solve given tasks; when applied to SR, these programs are formulas that are based on mathematical operators, variables, and constants. Being an evolutionary algorithm, GP initially creates a random set of formulas and then, over many generations, produces new formulas by means of crossover and mutation operators. The improvement of these formulas is reached by selection operators: in each generation the parents for the new solution candidates are selected, and new individuals can be inserted into the next generation either automatically or only if they are selected by some kind of offspring selection. We used the GP implementation in HeuristicLab⁶ and created models with a maximum size of 100 nodes and ten levels. We used GP in two different variants:

• Standard GP (GP): 1000 individuals, tournament selection as parent selection mechanism, elitism, termination criterion: 1000 generations.
• Offspring selection GP (OSGP): 1000 individuals, random parent selection, strict offspring selection (i.e., individuals are sent to the next generation only if they are better than their parents [1]), elitism, termination criterion: maximum selection pressure of 200 (i.e., the run stops as soon as the number of individuals that have to be created so that 1000 successful ones are found in one generation reaches 200,000).

⁵ Source files are available on request at absys@ucm.es and https://drive.google.com/drive/folders/1TOGvl55iR10aqRFO8GoD2v6TQD4djiCE?usp=sharing
⁶ https://dev.heuristiclab.com
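Strict offspring selection can be illustrated with a toy numeric example. This is our own minimal sketch; HeuristicLab's actual implementation evolves formula trees, not real numbers, and the fitness function below is purely illustrative:

```python
import random

def os_generation(pop, fitness, crossover, mutate, max_pressure=200):
    """One generation with strict offspring selection: a child enters the
    next generation only if it is strictly better (lower fitness) than
    both of its parents; the best individual is kept (elitism).
    Selection pressure grows with the number of children created."""
    next_gen = [min(pop, key=fitness)]            # elitism
    created = 0
    while len(next_gen) < len(pop):
        created += 1
        if created > max_pressure * len(pop):     # maximum selection pressure
            break
        p1, p2 = random.sample(pop, 2)            # random parent selection
        child = mutate(crossover(p1, p2))
        if fitness(child) < min(fitness(p1), fitness(p2)):
            next_gen.append(child)
    return next_gen

# Toy run: evolve real numbers to minimize f(x) = x^2.
random.seed(1)
f = lambda x: x * x
pop = [random.uniform(-10.0, 10.0) for _ in range(50)]
for _ in range(20):
    new = os_generation(pop, f,
                        crossover=lambda a, b: (a + b) / 2.0,
                        mutate=lambda x: x + random.gauss(0.0, 0.5))
    if len(new) < 2:      # selection pressure exhausted
        break
    pop = new
best = min(pop, key=f)
```

The key design point is that effort, not a fixed generation count, terminates the run: once too many children must be created to fill a generation with strictly improving individuals, the algorithm stops.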
2.2.2 Grammatical Evolution

Grammatical Evolution (GE) [16] is a variant of GP which uses chromosomes to encode the information of the individuals (trees). In GE, a grammar is applied to perform the decoding process that generates the trees which, in this case, will be the mathematical expressions that represent prediction models of glucose values. Given that this method uses chromosomes, it allows the application of classical genetic operators such as crossover or mutation directly at the chromosome level, instead of at the tree level, as happens in GP. We evaluate two GE proposals:

• Standard GE: we follow the same implementation and grammars as [12]. The GE approach only considers one objective function, which will be either RMSE or MAE. We present here only the results with RMSE, since they are significantly better with the parameters used.
• Multi-Objective GE (MOGE): we propose a multi-objective implementation of GE where the underlying algorithm is the well-known NSGA-II [9]. The MOGE approach considers RMSE as one of the objective functions and a custom objective function called FCLARKE as the second objective. FCLARKE is based on the Clarke Error Grid (CEG) metric [6] and is defined as shown in Equation (2). In the expression, |E| represents the number of points that belong to zone E of the CEG, which is the most dangerous one for the patient, |D| corresponds to the second most dangerous zone, D, and |C| corresponds to zone C. Zone B was not included in the formula because it represents a not very dangerous zone, and A corresponds to the safe zone. A more detailed explanation of FCLARKE can be found in [7].

FCLARKE = 100 · |E| + 10 · |D| + |C|   (2)

Prediction models with GE use information of the previous 60 minutes, while MOGE models can use data from the previous two hours. Additional configurations will be explored and presented at the workshop. In all the experiments, for both GE and MOGE, we perform 10 runs with 400 individuals over 1000 generations, random initialization of the population (half-ramped), allowing a maximum of 5 wrappings, and using a crossover probability of 0.7 and a mutation probability of 0.1. Executions were run on our Pancreas Model Tool described in [13]. Unlike the GP variants described above, with the two GE variants we only use information of the past and present. We did not use all the generated features, but only those at 15-minute intervals in the past: we use historical data from 120, 105, 90, 75, 60, 45, 30, and 15 minutes ago for MOGE, and from 60, 45, 30, and 15 minutes ago for GE. We only consider the glucose level, basal, bolus type, bolus dose, galvanic skin response, and skin temperature variables. We would like to highlight that recent papers that combine GE with other techniques, such as data augmentation [17], random GE and bagging [11], or clustering [8], achieved better results than the GE configurations studied in this paper. We limit GE in order to follow the instructions of the Challenge.
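Given zone labels for each prediction (from a Clarke Error Grid routine, which we assume is available elsewhere), Equation (2) reduces to a weighted count. The function name below is ours:

```python
def fclarke(zones):
    """FCLARKE = 100*|E| + 10*|D| + |C|, where `zones` holds the CEG
    zone label ('A'..'E') assigned to each prediction. Zones A and B
    carry no penalty."""
    counts = {z: zones.count(z) for z in "CDE"}
    return 100 * counts["E"] + 10 * counts["D"] + counts["C"]

print(fclarke(["A", "B", "C", "D", "D", "E"]))  # → 121
```

The steep weighting makes a single zone-E prediction cost as much as one hundred zone-C predictions, which is what pushes the NSGA-II search away from clinically dangerous models.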
For GP, GP-OS, and GE, predictions are obtained The auto-regressive order ranges in p ∈ [2, 10], the moving average with the models that obtained the lowest RMSE value in the train- q ∈ [2, 10], and the integrative part uses the values d ∈ [0, 1], so ing phase after 10 runs. The remaining 9 models were not evaluated that 9 × 9 × 2 = 192. The basics behind the election of these ranges nor reported. For the MOGE, also 10 run were made in the training is that the model takes into account glucose values up to 10 samples phase, and from all the solutions of the 10 Pareto fronts, we also se- (50 minutes) previous to the current time. The model’s coefficients lected the model with RMSE value, independently of the value of –up to p + q + d coefficients per model– are estimated using max- the FCLARKE . Results represent the values for the predictions of imum likelihood given the univariate glucose time series, gs , on the this selection. GE and MOGE were run just with the configuration complete training dataset for each patient. Once the 192 models have explained on section 2.2.2 and no parameters optimization was per- been estimated, we select two models per patient: the model with formed. the lowest RMSE at 30-minutes horizon and the one with the lowest Regarding GP and GP-OS results on table 1 may differ from those RMSE at 60 minutes. reported in the submitted files. After analyzing the results we no- Regarding on-line models, with each new glucose value in the test- ticed that at the beginning and the end of the data our results are ing dataset, the procedure defines a 4-hour time window using the last fluctuating. A few high and low predicted glucose values influence 48 samples –including the last one–, and it estimates the 192 ARIMA the quality of the results a lot. We decided to remove those unnatural models over the time window using maximum likelihood. 
Again, the values by more likely results (lower boundary: 40, upper boundary: 192 models are created by sweeping the three ARIMA parameters, as 400). This procedure is only included in the results of this paper, not stated above. Next, we select the best model. Unlike off-line models, in the submitted files. now we cannot use future glucose values to select the model that will #P RF GP GP-OS LR GE MOGE A-0 A-1 provide the lowest RMSE in the future. Hence, we select the current 60 minutes - RMSE best model based on the history of the best models up to the cur- 540 44,06 37,13 39,97 38,87 41.16 40.94 47,26 57,40 544 28,08 28,45 28,77 28,40 33.46 29.64 35,61 45,63 rent sample. We have evaluated four different criteria to choose the 552 27,24 26,08 25,91 28,90 31.04 29.85 27,18 34,39 567 37,76 35,99 35,82 36,19 39.68 37.82 47,53 51,16 best model for 30-minutes predictions and six criteria for 60-minute 584 38,11 37,84 34,63 37,12 38.17 37.84 41,05 48,03 predictions. 596 29,58 27,56 27,12 27,77 30.31 28.65 33,33 42,36 Avg. 34,14 32,18 32,04 32,88 35.64 34.12 38,66 46,49 60 minutes - MAE • We select the values of (p, q, d) of the model with the lowest ab- #P RF GP GP-OS LR GE MOGE A-0 A-1 540 31,62 27,83 30,33 29,65 32.01 31.76 31,71 39,35 solute error 30 minutes ago to create the current model for 30- 544 20,37 20,13 20,35 21,17 26.77 22.50 23,38 29,96 552 20,47 19,78 19,51 22,42 23.56 23.08 15,28 19,56 minutes and 60-minutes predictions. Note that given the current 567 27,65 26,06 25,87 27,14 30.26 28.50 30,60 35,11 glucose value, we know the model with the lowest error 30 min- 584 29,18 27,45 26,09 27,74 29.08 28.82 26,24 32,83 596 21,70 20,26 20,07 20,89 22.82 21.27 21,44 21,44 utes ago. Avg. 
25,17 23,58 23,70 24,83 27.42 25.99 24,77 29,71 • We select the values of (p, q, d) of the model with the lowest ab- 30 minutes - RMSE #P RF GP GP-OS LR GE MOGE A-0 A-1 solute error 60 minutes ago in the prediction of the current glucose 540 27,00 21,67 22,26 22,00 23.10 22.04 31,09 41,39 544 17,96 17,83 17,46 17,54 19.20 17.62 21,49 31,82 to create the current model for a 60-minutes prediction. 552 17,45 17,50 20,84 19,42 17.29 16.61 16,59 22,66 • We select the values of (p, q, d) of the off-line model for 30- 567 584 25,61 25,69 22,16 24,83 23,03 25,81 23,56 27,08 23.31 22.87 22.17 22.21 29,66 27,01 35,59 36,96 minutes and 60-minutes predictions. 596 19,90 16,76 16,85 17,68 18.58 16.96 21,23 21,23 Avg. 22,27 20,13 21,04 21,22 20.73 19.60 24,51 31,61 • We define an “ensemble” ARIMA averaging the value of p and q 30 minutes - MAE for the six best models 30, 35, 40, . . . , and 55 minutes ago. We #P RF GP GP-OS LR GE MOGE A-0 A-1 540 19,19 15,82 15,89 16,22 16.26 16.36 20,17 26,88 use the rounded averaged values of p and q to create the current 544 12,60 12,10 12,06 12,44 13.80 12.97 13,92 19,51 552 13,30 13,13 15,38 14,55 12.33 12.44 9,41 12,30 model for 30-minutes and 60-minutes predictions. 567 17,12 14,69 15,80 16,18 16.41 14.97 18,87 23,17 • Similar approach than the previous item, but we average the pa- 584 17,71 16,63 17,72 17,54 17.00 16.64 17,06 23,28 596 14,23 11,91 12,03 12,84 13.36 12.10 13,57 13,57 rameters of the six best models 60, 65, . . ., and 85 minutes ago. Avg. 15,69 14,05 14,81 14,96 14.86 14.25 15,50 19,79 We use the rounded averaged values of p and q to create the cur- rent model for a 60-minutes prediction. Table 1. Quality of the models created for 30 / 60 minutes predictions. For • We select the model with the lowest Akaike Information Criterion each modeling method we give error metrics (RSME, MAE) for 30 / 60 (AIC) value to estimate both, 30-minutes and 60-minutes predic- minutes predictions. tions. 
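The boundary post-processing described above (replacing implausible predicted glucose values) amounts to a simple clamp; the function name is ours:

```python
def clamp_predictions(preds, lo=40.0, hi=400.0):
    """Replace physiologically implausible glucose predictions (mg/dl)
    with the nearest boundary value, as applied to the GP and GP-OS
    results in this paper."""
    return [min(max(p, lo), hi) for p in preds]

print(clamp_predictions([12.0, 180.0, 950.0]))  # → [40.0, 180.0, 400.0]
```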
Table 2 shows the percentage of predictions in the zones of the Clarke Error Grid [6] for both time horizons. The first thing that can be said is that, in terms of the CEG, the 30-minute horizon is not very hard to predict: most of the algorithms achieved excellent results, with less than 3% of the predictions in the dangerous zones. For a prediction horizon of 60 minutes, all the machine learning techniques obtained less than 5% of dangerous predictions, and the GP approaches seem to be the best option. However, a deeper analysis of statistical significance is required.

Table 2. Average percentage of predictions in the zones of the Clarke Error Grid [6] for both time horizons. Results are ordered by higher %A, then higher %B, lower %E, lower %D, and lower %C.

30 minutes
Algorithm  %A     %B     %C    %D    %E
GP         88.03  10.43  0.25  1.45  0.02
GP-OS      86.40  11.97  0.17  1.54  0.02
LR         86.03  12.23  0.25  1.65  0.02
RF         85.25  12.57  0.45  2.10  0.00
MOGE       87.52  11.29  0.00  1.19  0.00
GE         86.46  12.69  0.03  0.82  0.00
A-0        84.93  13.90  0.33  0.83  0.07
A-1        77.91  19.66  1.60  0.64  0.20

60 minutes
Algorithm  %A     %B     %C    %D    %E
GP         69.90  26.56  0.26  3.32  0.03
GP-OS      69.90  26.73  0.21  3.13  0.05
LR         67.35  28.76  0.35  3.60  0.03
RF         67.57  28.73  0.30  3.61  0.00
MOGE       64.27  31.15  0.29  4.31  0.00
GE         60.82  34.66  0.29  4.24  0.00
A-0        62.25  36.08  1.66  1.66  0.36
A-1        54.37  39.53  4.50  0.97  0.63

First, we depict in Figure 1 a graphical ranking (in terms of RMSE) of all the algorithms for each patient and for the 30- and 60-minute prediction horizons. Each algorithm is represented by its acronym and a different color; the closer the position to the id of the patient, the better, i.e. the lower the RMSE on the test files. GP is the best for all the patients at 30 minutes and for 4 out of 6 at 60 minutes.

Figure 1. A graphical view of the ranking of each algorithm for each patient dataset. Clearly the GP approach is the best as a general rule in terms of RMSE.

Looking for statistical significance, the first plots we created are density plots, using a kernel density estimation (KDE) of the distribution of the samples. The objective is to check whether the data meets the conditions for a parametric test, which is not the case. Figure 2 shows that the data is not distributed according to a Gaussian distribution, nor is the variance the same for all the algorithms. The data distribution is multi-modal and a non-parametric test is necessary. All the plots were obtained with [4].

Figure 2. Density plots of the distribution of the RMSE results for all the algorithms for 30 minutes (left). The distributions are clearly multi-modal and a non-parametric test is recommended. Similar plots were obtained for 60 minutes (right) and for MAE.

We use the graphical representation of the Nemenyi test [10], which compares all the algorithms pairwise. This non-parametric test is based on the absolute difference of the average rankings of the predictors. For a significance level α = 0.05, the test determines the critical difference (CD); if the difference between the average rankings of two algorithms is greater than the CD, the null hypothesis that the algorithms have the same performance is rejected. Figure 3 shows the graphical comparison, where statistical differences are demonstrated to be significant.

Figure 3. Nemenyi test for all the algorithms and RMSE (30 min left, 60 min right), using the graphical representation of [10].

Finally, we follow the Bayesian model of [3, 5], based on the Plackett-Luce distribution over rankings, to analyse multiple algorithms on multiple problems. Figure 4 shows that GP and MOGE have the highest probability of being the best for 30 minutes; however, there is no clear evidence for 60 minutes.

Figure 4. Bayesian model of [5] to analyse the algorithms on the set of patients and RMSE. The figure represents the probability of being the best and its standard deviation (30 min left, 60 min right).

4 CONCLUSION

The competition proved to be a very good test-bed for the modelling approaches, as it is concerned with real-world data. The large amount of training data proved to be challenging; for instance, the ARIMA training process took several days to complete.

The decision made by the organizers of the competition to disallow usage of all future data is, in our point of view, not ideal. If we want to use prediction models for optimal blood glucose control, it is necessary to assume that we can control the bolus and basal over the forecasting horizon. Of course, a large amount of uncertainty remains because of unknown events in the forecasting horizon, such as meals and higher activity or stress levels.

It would be interesting to try to improve the models by using all the available data for each data contributor. We only used the intersection of features available in all data sets, which limits the potential for specialization of models to individuals.

ACKNOWLEDGMENTS

This work has been partially funded with the support of the Christian Doppler Research Association within the Josef Ressel Centre for Symbolic Regression. This work has also been partially supported by the Spanish Ministerio de Ciencia, Innovación y Universidades (MCIU/AEI/FEDER, UE) under grant ref. PGC2018-095322-B-C22, and by Comunidad de Madrid y Fondos Estructurales de la Unión Europea with grant ref. P2018/TCS-4566. The UCM group is supported by Spanish Ministerio de Economía y Competitividad grant RTI2018-095180-B-I00, Fundación Eugenio Rodríguez Pascual, Comunidad de Madrid grants B2017/BMD3773 (GenObIA-CM) and Y2018/NMT-4668 (Micro-Stress - MAP-CM), and Structural Funds of the European Union.

REFERENCES

[1] Michael Affenzeller and Stefan Wagner, 'Offspring selection: A new self-adaptive selection scheme for genetic algorithms', in Adaptive and Natural Computing Algorithms, 218–221, Springer, (2005).
[2] Leo Breiman, 'Random forests', Machine Learning, 45, 5–32, (2001).
[3] Borja Calvo, Josu Ceberio, and Jose A. Lozano, 'Bayesian inference for algorithm ranking analysis', in Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 324–325, (2018).
[4] Borja Calvo and Guzmán Santafé Rodrigo, 'scmamp: Statistical comparison of multiple algorithms in multiple problems', The R Journal, 8(1), (2016).
[5] Borja Calvo, Ofer M. Shir, Josu Ceberio, Carola Doerr, Hao Wang, Thomas Bäck, and Jose A. Lozano, 'Bayesian performance analysis for black-box optimization benchmarking', in Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 1789–1797, (2019).
[6] W.L. Clarke, D. Cox, L.A. Gonder-Frederick, W. Carter, and S.L. Pohl, 'Evaluating clinical accuracy of systems for self-monitoring of blood glucose', Diabetes Care, 10(5), 622–628, (September 1987).
[7] Sergio Contador, J. Manuel Colmenar, Oscar Garnica, and J. Ignacio Hidalgo, 'Short and medium term blood glucose prediction using multi-objective grammatical evolution', in International Conference on the Applications of Evolutionary Computation (Part of EvoStar), pp. 494–509, Springer, (2020).
[8] Sergio Contador, J. Ignacio Hidalgo, Oscar Garnica, J. Manuel Velasco, and Juan Lanchares, 'Can clustering improve glucose forecasting with genetic programming models?', in Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 1829–1836, (2019).
[9] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan, 'A fast and elitist multiobjective genetic algorithm: NSGA-II', IEEE Transactions on Evolutionary Computation, 6(2), 182–197, (2002).
[10] Janez Demšar, 'Statistical comparisons of classifiers over multiple data sets', Journal of Machine Learning Research, 7, 1–30, (2006).
[11] J. Ignacio Hidalgo, Marta Botella, J. Manuel Velasco, Oscar Garnica, Carlos Cervigón, Remedios Martínez, Aranzazu Aramendi, Esther Maqueda, and Juan Lanchares, 'Glucose forecasting combining Markov chain based enrichment of data, random grammatical evolution and bagging', Applied Soft Computing, 88, 105923, (2020).
[12] J. Ignacio Hidalgo, J. Manuel Colmenar, Gabriel Kronberger, Stephan M. Winkler, Oscar Garnica, and Juan Lanchares, 'Data based prediction of blood glucose concentrations using evolutionary methods', Journal of Medical Systems, 41(9), 142, (2017).
[13] J. Ignacio Hidalgo, J. Manuel Colmenar, J. Manuel Velasco, Gabriel Kronberger, Stephan M. Winkler, Oscar Garnica, and Juan Lanchares, 'Identification of models for glucose blood values in diabetics by grammatical evolution', in Handbook of Grammatical Evolution, 367–393, Springer, (2018).
[14] John R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, MA, USA, 1992.
[15] Cindy Marling and Razvan Bunescu, 'The OhioT1DM dataset for blood glucose level prediction: Update 2020', (2020).
[16] Michael O'Neill and Conor Ryan, 'Grammatical evolution', IEEE Transactions on Evolutionary Computation, 5(4), 349–358, (2001).
[17] Jose Manuel Velasco, Oscar Garnica, Juan Lanchares, Marta Botella, and J. Ignacio Hidalgo, 'Combining data augmentation, EDAs and grammatical evolution for blood glucose forecasting', Memetic Computing, 10(3), 267–277, (2018).
[18] G. Peter Zhang, 'Time series forecasting using a hybrid ARIMA and neural network model', Neurocomputing, 50, 159–175, (2003).