Using Wearable and Environmental Data to Improve the Prediction of Amyotrophic Lateral Sclerosis and Multiple Sclerosis Progression: an Explorative Study
Notebook for the iDPP Lab on Intelligent Disease Progression Prediction at CLEF 2024

Elena Marinello1, Alessandro Guazzo1, Enrico Longato1, Erica Tavazzi1, Isotta Trescato1, Martina Vettoretti1 and Barbara Di Camillo1,2,*

1 Department of Information Engineering, University of Padova, Padova, Italy
2 Department of Comparative Biomedicine and Food Science, University of Padova, Padova, Italy

Abstract
Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases with a severe impact on patients’ lives. Both diseases create significant psychological and economic burdens because acute phases alternate with more stable ones, requiring both hospital and home care. One possible solution is to use sensor data to develop predictive models that can assist clinicians in making treatment and therapeutic decisions. In the context of the iDPP@CLEF 2024 challenge, this work aims to develop and compare different machine-learning approaches for predicting the Amyotrophic Lateral Sclerosis Functional Rating Scale-Revised (ALSFRS-R) scores in ALS patients, and relapses in MS patients, using wearable and environmental data, respectively. Specifically, the analysis focuses on the impact of these data and seeks to determine whether their incorporation enhances predictive performance. The results showed an improvement in the models’ performance for both diseases when sensor data were considered. In particular, for ALS the cross-validation Root Mean Square Error (RMSE) range across the twelve predicted ALSFRS-R scores improved from [0.463, 0.733] to [0.286, 0.582] when incorporating the wearable data. Similarly, for MS, the inclusion of environmental data improved the prediction of relapse, with the cross-validation RMSE decreasing from 72.992 to 69.564.
Keywords
Amyotrophic Lateral Sclerosis, Multiple Sclerosis, Logistic Regression, Ridge Regression, Random Forest, Wearable Data, Environmental Data

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
Email: elena.marinello@unipd.it (E. Marinello); guazzoales@dei.unipd.it (A. Guazzo); enrico.longato@unipd.it (E. Longato); erica.tavazzi@unipd.it (E. Tavazzi); isotta.trescato@phd.unipd.it (I. Trescato); martina.vettoretti@unipd.it (M. Vettoretti); barbara.dicamillo@unipd.it (B. D. Camillo)
ORCID: 0009-0007-2445-5762 (E. Marinello); 0000-0001-5155-2567 (A. Guazzo); 0000-0001-5940-645X (E. Longato); 0000-0001-6188-6413 (E. Tavazzi); 0000-0003-0625-993X (I. Trescato); 0000-0002-5020-1818 (M. Vettoretti); 0000-0001-8415-4688 (B. D. Camillo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction
Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic neurodegenerative diseases. ALS affects the motor neurons, causing progressive degeneration of nerve cells in the spinal cord and brain, leading to an average life expectancy of three to five years [1]. ALS symptoms are primarily related to weakness in the upper and lower limbs, or to slurred speech and difficulty in swallowing [2]. MS, on the other hand, affects the myelinated axons in the central nervous system, causing damage to both the myelin and the axons to varying degrees. The progression of MS is highly variable and unpredictable, with the most common phenotype being relapsing-remitting: a progression pattern characterized by periods of exacerbation of the symptoms, called relapses, alternating with more stable periods [3].

Given the heterogeneous and unpredictable nature of these diseases, patients end up alternating periods in the hospital and at home, while dealing with the uncertainty of how long each acute or stable phase will last [4]. This can represent a psychological and economic burden for both patients and caregivers. Clinicians, for their part, would welcome tools that can assist them throughout all stages of patient treatment by offering personalized therapeutic recommendations and identifying when urgent interventions are necessary. Predictive tools can be valuable for anticipating the progression of ALS disability and the occurrence of relapses in MS.

In the context of the iDPP@CLEF 2024 challenge, participants were asked to predict the progression of the ALS patients’ disability status using prospective data, and to predict the occurrence of relapses for MS patients by exploiting environmental and MS-specific retrospective data [5, 6]. The Challenge consisted of three tasks, described in the following sections: Sections 1.1 and 1.2 refer to Task 1 and Task 2, respectively, while Section 1.3 refers to Task 3.

1.1. Task 1: ALS Disability Score from Wearable Data
Task 1 focused on using data collected through wearable devices to predict the patient’s disability status measured by the twelve scores of the revised ALS functional rating scale (ALSFRS-R) [7]. These ALSFRS-R scores were assigned by medical doctors during routine visits scheduled every three months. The goal of this task was to determine whether the ALSFRS-R scores assigned by clinical experts could be reliably predicted from wearable data.

1.2. Task 2: ALS Patient Self-assessment Score from Wearable Data
Similarly to Task 1, Task 2 consisted of using data collected through wearable devices to predict the patient’s disability status, measured by the ALSFRS-R scores. In this case, the scores were self-assessed by patients via a self-evaluation questionnaire delivered through an app once a month. The goal was to determine whether the ALSFRS-R scores obtained from self-assessment questionnaires could be reliably predicted from wearable data.

1.3.
Task 3: Relapse from EDSS Sub-scores and Environmental Data
Task 3 considered the prediction of an MS relapse using environmental data and Expanded Disability Status Scale (EDSS) sub-scores [8]. The goal of this task was to explore whether exposure to different pollutants can be considered a useful variable in predicting the occurrence of relapses in MS patients.

To address the proposed problems, a broad set of predictive models based on different methodological approaches was trained using different subsets of the variables provided by the challenge organizers. This study aimed to evaluate whether considering wearable data to predict ALS disability and environmental data to predict MS relapses leads to better performance with respect to models that only consider disease-specific variables collected during routine visits. To ensure consistency, all models were trained using a common framework including feature selection (via backward elimination) and hyperparameter optimization (via random search). The results suggest that collecting data from wearable devices can improve the prediction of ALS disability status. However, patients must be properly trained to use the sensors correctly. Similarly, environmental data can be beneficial for predicting the progression of MS by identifying the occurrence of relapses, focusing mainly on sensor data recorded a few days before the relapse.

The paper is organized as follows: Section 2 introduces related works and the main methodological approaches implemented until now to address ALS and MS progression prediction. Section 3 describes the methodologies employed in this study in terms of data processing and the machine-learning techniques used. Section 4 discusses the obtained results and, finally, Section 5 summarizes the key take-home messages of this work.

2. Related Work
Different approaches have been proposed in the literature to predict the prognosis of ALS and MS patients.
For both diseases, prediction tasks frequently employ a variety of machine-learning methodologies, with classification and regression being the most common approaches. The choice between these methods typically depends on the specific research question and the chosen outcome [9].

Regarding ALS prognosis, most studies aimed to estimate changes in the ALSFRS-R over time [10, 11, 12, 13, 14]. Several studies classified patients by disease progression rates (e.g., Slow/Fast, Low/High) [15, 16, 17], while others developed models to predict when a patient will need Non-Invasive Ventilation (NIV) support within a given time window [18, 19, 20]. Relevant biomarkers for prediction include BMI, Forced Vital Capacity (FVC), age at onset, and disease duration, as well as descriptors of longitudinal data (e.g., slope, minimum, maximum, mean, standard deviation) [10]. Magnetic resonance imaging (MRI) has also shown a significant impact on prediction, alongside these clinical variables [21]. Regression models include the Random Forest (RF) regressor and generalized boosting models [18, 22]. Recently, graphical modeling techniques such as Dynamic Bayesian Networks (DBN) have also been employed to model ALS disease progression [23]. Classification models include the Support Vector Machine (SVM) [16] and the RF classifier [15].

On the other hand, most of the models related to MS prognosis considered as outcomes the occurrence of relapses [24, 25] and the evolution over time of the EDSS [26, 27, 28, 29]. The models most commonly used for classification were Logistic Regression (LR) and SVM [30], while for regression, the most popular technique was Linear Regression [31]. Demographic (including age and sex), clinical, MRI (such as T2 lesion volume or number and brain atrophy), cerebrospinal fluid, and electrophysiology variables were retained as predictors in the models studied in the literature [31].
In general, the use of wearable data for ALS and of environmental data for MS in the models proposed in the literature is limited [32, 33]. Typically, studies focus on defining a baseline, where data are collected, and then developing a model based on this baseline to provide predictions for future outcomes [34, 35]. The main limitation of this approach is that it does not thoroughly exploit the dynamic aspect of the disease described by the full temporal evolution of data sequences, in contrast to what is investigated within the scope of the Challenge.

3. Methodology
The same data processing was applied to Tasks 1 and 2, which involve ALS data, whereas the processing for Task 3, which considers MS data, was slightly different. A single model-training framework was then used for all methodological approaches across the three tasks. The following sections describe: the data processing steps needed to obtain the final set of input variables for Tasks 1 and 2 (Section 3.1.1) and Task 3 (Section 3.1.2); the training framework used to develop the models (Section 3.2); the considered subsets of input variables (Section 3.3); and the submitted runs (Section 3.4).

3.1. Data Processing
3.1.1. ALS Data Processing (Tasks 1 and 2)
The structure of the datasets provided for Tasks 1 and 2 was identical. The main difference between the data provided for these tasks lay in how ALSFRS-R scores were collected. For Task 1, ALSFRS-R scores were assigned by clinicians during routine visits performed approximately every three months. For Task 2, instead, ALSFRS-R scores were self-assigned by the patients via a questionnaire delivered periodically (about once a month) through the BRAINTEASER app. Since this difference does not affect the structure of the data, the same processing pipeline was adopted for both tasks.

Six static variables evaluated at the first visit were available, namely: sex, diagnostic delay, age at diagnosis, FVC, weight, and BMI.
The only processing performed on these static variables concerned the sex variable, which was mapped to a boolean variable equal to 0 for male patients and 1 for female patients.

All ALSFRS-R measurements collected for each patient were made available to participants, although the goal of Tasks 1 and 2 was only to predict the ALSFRS-R subscores following the first visit (Task 1) or the self-assessment scores (Task 2). Hence, all available information was exploited to obtain a richer and more robust dataset. Specifically, each pair of consecutive ALSFRS-R subscores was considered as an independent entry characterized by the same static information of the patient it belonged to. The first set of ALSFRS-R subscores of each pair was used as input variables named start_Q*, where * represents the ALSFRS-R question number and ranges from 1 to 12. The second set of ALSFRS-R subscores of each pair was used as the target variables, named end_Q*. The final training sample comprised 131 entries (from 52 unique patients) for the first task and 163 entries (from 52 unique patients) for the second task.

For each patient, 90 variables collected multiple times through wearable sensors were available. The processing of these variables consisted of extracting first-order descriptors (such as mean, first and last recorded values, and minimum and maximum values) from all values recorded within a time window starting at the date of the start ALSFRS-R score of the considered entry and ending at the date of the end ALSFRS-R score of the same entry. The window length, expressed in days, was also included in the set of possible predictors. Moreover, the slopes of change of the following variables were also considered: total_calories, total_steps, spo2_av, heart_rate_mean, heart_rate_baseline.
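As a minimal sketch of the window-based features described above, the following code extracts five first-order descriptors, the window length, and a slope for one sensor variable. It assumes each variable is stored as a list of (date, value) pairs and that the slope is the coefficient of an ordinary least-squares line fitted against the number of days elapsed since the window start; the storage format, the function names, and the time unit are illustrative assumptions, not details taken from the challenge data.

```python
from datetime import date
from statistics import mean


def window_descriptors(records, start, end):
    """First-order descriptors of one sensor variable within [start, end].

    `records` is a list of (date, value) pairs (hypothetical storage format).
    """
    inside = sorted((d, v) for d, v in records if start <= d <= end)
    if not inside:
        return None  # nothing recorded: left missing for later imputation
    values = [v for _, v in inside]
    return {
        "mean": mean(values),
        "first": values[0],
        "last": values[-1],
        "min": min(values),
        "max": max(values),
    }


def slope_of_change(records, start, end):
    """Slope of a least-squares line of value against days since `start`."""
    inside = sorted((d, v) for d, v in records if start <= d <= end)
    if len(inside) < 2:
        return None
    xs = [(d - start).days for d, _ in inside]
    ys = [v for _, v in inside]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return None  # all values recorded on the same day
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data for a single entry; the last record falls outside the window.
steps = [
    (date(2023, 1, 1), 1000),
    (date(2023, 1, 2), 900),
    (date(2023, 1, 3), 800),
    (date(2023, 2, 1), 50),
]
start, end = date(2023, 1, 1), date(2023, 1, 3)
features = window_descriptors(steps, start, end)
features["slope"] = slope_of_change(steps, start, end)
features["window_days"] = (end - start).days
print(features)
```

Returning `None` when a window is empty keeps the missing value explicit, so that it can be handled by a later filtering or imputation step rather than being silently replaced.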
The slope of change was obtained as the angular coefficient of a linear fit of all recorded values for each variable within the considered time window. To build the training set, it was instrumental to consider ALSFRS-R pairs collected after the first visit, in order to obtain a robust and rich set of variables extracted from wearable data. The richness and quality of such data tend to improve over time as the patient learns how to properly use, and becomes more familiar with, the device provided at the first visit.

After this first processing step, 487 variables were available for each entry in the dataset: one variable for the unique patient identifier, one variable for the window length expressed in days, 12 variables for the start ALSFRS-R scores, 12 target variables for the ALSFRS-R scores to be used as outcomes, 90 × 5 = 450 variables for the first-order descriptors of the 90 wearable sensor variables, 5 variables for the considered slopes of change, and the 6 static variables. From this full set of variables, those with more than 50% missing values and those that were almost constant (auto-correlation coefficient > 0.9) were removed. Finally, collinear variables were removed by iteratively excluding those with a correlation coefficient > 0.9. After this step, 131 out of 487 variables were considered for Task 1, and 134 out of 487 variables were considered for Task 2.

Then, normalization was performed to avoid introducing bias related to the different dynamic ranges of the variables and to promote consistency between the scales of the coefficients estimated during model training. Specifically, min-max scaling was used; the normalization parameters were derived from the training set only and then applied to the test set. Finally, the imputation of missing values in the processed input variables was performed using the mice R package [36].
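The collinearity filter and the train-only normalization described above can be sketched as follows. This is an illustration under stated assumptions: the source does not specify which variable of a correlated pair is excluded, so the greedy rule used here (keep a variable only if it is not too correlated with those already kept) is an assumption, as are the function names, the toy values, and the absence of missing values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def drop_collinear(columns, threshold=0.9):
    """Keep a column only if |r| <= threshold with every column kept so far."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


def fit_min_max(train_values):
    """Estimate the scaling parameters on the training data only."""
    return min(train_values), max(train_values)


def apply_min_max(values, params):
    """Scale values with previously fitted parameters (no clipping)."""
    low, high = params
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


# Toy training columns (hypothetical values).
train = {
    "total_steps_mean": [1000, 2000, 3000, 4000],
    "total_steps_max": [1100, 2100, 3050, 4200],
    "spo2_av_mean": [97, 95, 96, 94],
}
kept = drop_collinear(train)

# Parameters come from the training column and are reused for the test data.
params = fit_min_max(train["total_steps_mean"])
scaled_test = apply_min_max([2500, 5000], params)
print(list(kept), scaled_test)
```

Because the parameters are fitted on the training set alone, a test value outside the training range maps outside [0, 1], which is the expected consequence of not letting test data influence the scaling.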
As with normalization, the imputation parameters were estimated on the whole training set and applied to the test set.

3.1.2. MS Data Processing (Task 3)
The processing concept for Task 3 was similar to the one proposed for Tasks 1 and 2 but had to be adapted to the different structure of the data available for this task. Fifteen static variables evaluated at the first visit were available: five related to demographic information, five related to MS diagnosis, and five related to symptoms. The sex variable was mapped to a boolean variable equal to 0 if the patient was male and 1 if female. The variable centre was mapped to a boolean variable equal to 0 if the patient was followed at the clinic in Pavia and 1 if at the clinic in Turin. The variable residence classification consisted of three possible levels: cities, towns, and rural area. This variable was mapped to two dummy variables, residence_city and residence_rural_area, with towns as the implicit reference level. The variable ethnicity was excluded as almost all patients were Caucasian. Two variables related to diagnosis criteria were excluded as almost all patients were diagnosed according to the same criterion. After these steps, 12 static variables remained.

Multiple EDSS recordings were also available from the baseline date to the date of the first relapse. Hence, first-order descriptors were also extracted for the EDSS value, considering all measurements within this time window.

For each patient, a set of 20 environmental measurements related to pollutant levels and meteorological indicators was available. Such measurements were available both before and after the baseline. Hence, similarly to what was done for wearable sensor data in Tasks 1 and 2, a set of first-order descriptors (such as mean, first and last recorded values, and minimum and maximum values) was extracted for each variable considering all values recorded within two time windows.
The first time window started at the date of the first available environmental measure and ended at the baseline date. The second time window, instead, started at the baseline date and ended at the first recorded relapse date.

After this first processing step, 219 variables were available for each patient in the dataset: one variable for the unique patient identifier, one target variable for the relapse week to be used as the outcome, 20 × 5 = 100 variables for the first-order descriptors of the 20 environmental variables measured before the baseline, 20 × 5 = 100 variables for the first-order descriptors of the 20 environmental variables measured after the baseline, 5 variables for the first-order descriptors of the EDSS measurements, and the 12 static variables. From this full set of variables, those with more than 50% missing values and those that were almost constant (auto-correlation coefficient > 0.9) were removed. Finally, collinear variables were removed by iteratively excluding those with a correlation coefficient > 0.9. After this step, 69 out of 219 variables were considered for Task 3. Two patients were also excluded as almost all their variables were missing; hence, 197 unique patients were considered to train the MS models. As for the previous tasks, normalization was performed via min-max scaling and the imputation of missing values in the processed input variables was performed using the mice R package.

3.2. Model Training and Evaluation
In Tasks 1 and 2, the prediction targets were the 12 ALSFRS-R scores assigned by the clinician and by the patients themselves, respectively. Each score had to be predicted independently and was an integer in the range [0, 4]. Intuitively, this problem can be cast as a multiclass classification with five classes. However, it can also be framed as a regression problem by rounding the model output to the nearest integer.
In Task 3, instead, the goal was to predict the week of the first relapse occurrence after the baseline and, as the weeks are not within a finite range, this can only be approached as a regression problem. However, as the challenge submission rules require an integer value also for the predicted relapse week, the output of regression models developed for Task 3 was rounded to the nearest integer as well.

The core of the model training framework was the Backward Feature Selection technique [37], and model performance was evaluated through the Root Mean Squared Error (RMSE) [38]. The process started with all the features, which were then iteratively removed one at a time. At each iteration, for every feature combination, hyperparameter tuning was performed via random search over a given hyperparameter grid [39], using 5-fold cross-validation (CV). The subset of features that resulted in the lowest RMSE was then chosen to train a final model. Its hyperparameters were optimized again using 5-fold CV and random search within the same hyperparameter space. Ultimately, this optimized model was tested on an independent test set, and the results were submitted to the challenge organizers for performance evaluation.

This model training framework was designed to be flexible, allowing its application across the three different tasks with a variety of methodological approaches. The approaches considered in this study included both linear models (LR and ridge regression) and non-linear models (RF). For each of these models, different sets of hyperparameters were tested. For LR, a single hyperparameter needs optimization: the strength of the regularization applied to the model, C. Similarly, for ridge regression, the only hyperparameter that needs optimization is the strength of the L2 regularization, α. Both C and α were randomly sampled, 250 values each, from a log-uniform distribution with support $[10^{-4}, 10^{4}]$.
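The combination of backward elimination and random search described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `cv_rmse` is a hypothetical callable standing in for "train the model with this feature subset and regularization strength and return its 5-fold CV RMSE", the greedy rule for choosing which feature to drop at each step is an assumption, and the toy scoring function exists only to make the example runnable.

```python
import math
import random


def sample_log_uniform(rng, low_exp=-4, high_exp=4):
    """Draw one value from a log-uniform distribution on [10^low_exp, 10^high_exp]."""
    return 10 ** rng.uniform(low_exp, high_exp)


def random_search(cv_rmse, features, n_draws, rng):
    """Return (best RMSE, best strength) over n_draws random candidates."""
    best_score, best_strength = float("inf"), None
    for _ in range(n_draws):
        strength = sample_log_uniform(rng)
        score = cv_rmse(features, strength)
        if score < best_score:
            best_score, best_strength = score, strength
    return best_score, best_strength


def backward_elimination(cv_rmse, all_features, n_draws=250, seed=0):
    """Greedy backward elimination; returns the best subset seen overall."""
    rng = random.Random(seed)
    current = list(all_features)
    best_score, best_strength = random_search(cv_rmse, current, n_draws, rng)
    best_subset = list(current)
    while len(current) > 1:
        # Tune every subset obtained by dropping one remaining feature.
        candidates = []
        for feature in current:
            subset = [f for f in current if f != feature]
            score, strength = random_search(cv_rmse, subset, n_draws, rng)
            candidates.append((score, strength, subset))
        score, strength, current = min(candidates, key=lambda c: c[0])
        if score < best_score:
            best_score, best_strength, best_subset = score, strength, list(current)
    return best_subset, best_strength, best_score


def toy_cv_rmse(features, strength):
    """Stand-in for a real CV loop: two features help, the others add noise."""
    useful = {"start_Q1", "heart_rate_mean"}
    missing = sum(1.0 for f in useful if f not in features)
    noise = 0.1 * sum(1 for f in features if f not in useful)
    return 0.3 + missing + noise + 0.01 * abs(math.log10(strength))


subset, strength, score = backward_elimination(
    toy_cv_rmse,
    ["start_Q1", "heart_rate_mean", "total_steps", "window_days"],
    n_draws=20,
)
print(subset, strength, score)
```

Sampling the exponent uniformly and then exponentiating gives every order of magnitude between 10^-4 and 10^4 the same chance of being explored, which is the usual motivation for a log-uniform prior on a regularization strength.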
Finally, the RF’s hyperparameter space consisted of two hyperparameters: the number of trees in each RF, which was uniformly sampled in the interval [50, 500], and the maximum depth of each tree, which was uniformly sampled in the interval [1, 100]. By default, the square root of the total number of features was evaluated at each node for splitting.

3.3. Considered Subsets of Input Variables
To evaluate whether considering wearable data to predict ALS disability and environmental data to predict MS relapses led to better performance with respect to models that only consider disease-specific variables collected during routine visits, different sets of variables were considered as input for the predictive models.

For Tasks 1 and 2, the target ALSFRS-R value (e.g., end_Q1, see Section 3.1.1) was first predicted by simply holding the corresponding initial ALSFRS-R evaluation (e.g., start_Q1, see Section 3.1.1). The idea behind this approach is to provide a baseline reference point that does not involve any particular prediction model. Then, to provide a slightly more complex benchmark approach, an LR model was trained using only the 12 initial ALSFRS-R scores (i.e., all start_Q*) as possible input variables. The idea behind this second set of considered features is to assess whether considering scores from other ALSFRS-R questions leads to a more accurate prediction of the target ALSFRS-R score with respect to the one obtained by simply holding its initial value. Finally, different models were trained using all available variables (i.e., static, ALSFRS-R, and wearable data) to evaluate whether models that also include data collected through wearable devices led to better performance with respect to those developed using only the initial ALSFRS-R scores. The models included LR, ridge regression, RF regressor, and RF classifier.

Similarly, for Task 3, first, a ridge model was trained considering as possible input only static and EDSS variables.
Then, a ridge and an RF regressor were trained after including environmental-derived variables in the pool of possible predictors. The idea behind this approach was to check whether including environmental data could improve the first-relapse-week prediction with respect to models that only consider data collected at the first visit and EDSS evaluations.

3.4. Description of Submitted Runs
The following runs were submitted for Tasks 1 and 2:
• Logistic regression (logistic): LR model with multiclass outcome. All available variables were considered in the pool of possible predictors. Each question was predicted with its independent model trained specifically for that question.
• Logistic regression considering only ALSFRS-R scores (logistic_ALSFRS): LR model with multiclass outcome. Only start_Q* variables were considered in the pool of possible predictors. Each question was predicted with its independent model trained specifically for that question.
• Random Forest classifier (rf): RF classifier with multiclass outcome. All available variables were considered in the pool of possible predictors. Each question was predicted with its independent model trained specifically for that question.
• Ridge regression (ridge): Ridge regression model. All available variables were considered in the pool of possible predictors. Each question was predicted with its independent model trained specifically for that question.
• Random Forest regressor (rf_reg): RF regressor model. All available variables were considered in the pool of possible predictors. Each question was predicted with its independent model trained specifically for that question.
• hold: Each question was predicted by holding its starting value (i.e., considering the start_Q* variables as predicted end_Q* targets).
• average: Each predicted score was obtained as the average output of the LR, RF classifier, ridge, and RF regressor models rounded to the nearest integer (i.e., column-wise average rounded to the nearest integer of logistic, rf, ridge, and rf_reg runs).
• optrun: Each question was predicted with the best-performing model for that question (i.e., the one highlighted in bold in Tables 1 and 3).

The following runs were submitted for Task 3:
• Ridge regression (ridge): Ridge regression model. All available variables were considered in the pool of possible predictors.
• Ridge regression without considering environmental data (ridge_noenv): Ridge regression model. Environmental variables were excluded from the pool of possible predictors.
• Random Forest regressor (rf_reg): RF regressor model. All available variables were considered in the pool of possible predictors.
• average: Each predicted first relapse week was obtained as the average output of the ridge and RF regressor models rounded to the nearest integer (i.e., column-wise average rounded to the nearest integer of ridge and rf_reg runs).

4. Results
The results for the three tasks are reported in the sections below. For Tasks 1 and 2, the results for ALSFRS-R scores prediction are reported in Section 4.1 and Section 4.2, respectively. For Task 3, the results for the week of the occurrence of the first relapse are reported in Section 4.3.

4.1. Task 1 Results
Table 1 presents the CV results for Task 1. Each column represents one of the predicted ALSFRS-R scores (Q1 - Q12), while the rows indicate the considered models. Each cell displays the average CV RMSE. RMSE values highlighted in bold represent the lowest value of each column, thus indicating the best-performing model for each predicted question.
The ridge model was the best-performing one for six out of twelve scores (Q1, Q4, Q6, Q7, Q11, Q12), with RMSE values ranging between 0.228 for Q1 and 0.570 for Q7. The LR model, when considering all available variables, also showed reliable performance, achieving the best prediction for four out of twelve scores (Q2, Q3, Q9, Q10). Its RMSE values ranged from 0.286 for Q2 to 0.582 for Q9. The RF regressor yielded the best predictions for Q5 and Q8, with RMSE scores of 0.508 and 0.479, respectively. Finally, the hold approach and the RF classifier were the worst-performing among all the models. Additionally, the LR model using only the ALSFRS-R scores did not perform well, suggesting that performance improved when wearable data were added. In general, adding first all the ALSFRS-R scores, and subsequently all the other sensor variables, increased the performance in the cross-validation phase, leading to lower RMSE values.

Table 2 shows the results of the Task 1 submitted runs as evaluated by the challenge organizers. The name of the submitted run is reported in the first column of Table 2. Columns two and three of Table 2 show the two metrics used by the organizers to evaluate participants’ submitted runs on the independent test set: RMSE and Mean Absolute Error (MAE), respectively.

Results observed in CV were not confirmed on the test set, with the best-performing model being the hold method (RMSE = 0.491, MAE = 0.202) and the LR using all available variables yielding the worst result (RMSE = 0.830, MAE = 0.511). One possible explanation is that the training set is more robust than the test set, since it includes data from later visits, while the test set only contains data from the initial visits. Therefore, these results are likely due to insufficient data collection during the initial visits, when patients either have not started using the wearable devices or are still becoming familiar with how to use them.
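The two evaluation metrics, together with the hold and average runs from Section 3.4, can be illustrated with a short sketch. The scores below are made-up toy values, the function names are illustrative, and the way the organizers aggregate errors across questions and patients is not specified here, so the metrics are simply computed over a flat list. Ties at .5 are rounded up, which is an assumption (Python's built-in `round` rounds halves to the nearest even integer instead).

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error over paired lists of scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean Absolute Error over paired lists of scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def round_half_up(x):
    """Round to the nearest integer, sending exact halves upwards."""
    return math.floor(x + 0.5)


def average_run(*model_predictions):
    """Element-wise mean of several models' predictions, rounded to an integer."""
    return [round_half_up(sum(col) / len(col)) for col in zip(*model_predictions)]


# Toy example: four entries for a single ALSFRS-R question.
start_scores = [4, 3, 2, 4]  # start_Q* values
end_scores = [4, 2, 2, 3]    # end_Q* targets

hold_pred = list(start_scores)  # hold run: predict the starting value
print(rmse(end_scores, hold_pred), mae(end_scores, hold_pred))

combined = average_run([4, 3, 2, 4], [4, 2, 2, 3], [3, 2, 2, 3])
print(combined)
```

Since the hold run simply repeats the starting value, its error is exactly the amount by which the scores changed between the two assessments, which makes it a natural yardstick for the other runs.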
Table 1
CV RMSE values for methods considered in Task 1. Each column represents an ALSFRS-R question. The minimum values of each column are highlighted in bold.

Model            Q1     Q2     Q3     Q4     Q5     Q6     Q7     Q8     Q9     Q10    Q11    Q12
LR               0.395  0.286  0.288  0.526  0.522  0.606  0.604  0.568  0.582  0.243  0.393  0.504
LR ALSFRS only   0.463  0.456  0.438  0.568  0.583  0.700  0.733  0.640  0.805  0.507  0.470  0.548
Ridge            0.228  0.351  0.302  0.495  0.533  0.493  0.570  0.532  0.676  0.318  0.389  0.439
RF classifier    0.416  0.379  0.386  0.591  0.564  0.580  0.636  0.589  0.619  0.461  0.471  0.557
RF regressor     0.443  0.370  0.397  0.512  0.508  0.561  0.621  0.479  0.648  0.455  0.457  0.511
Hold             0.531  0.479  0.471  0.665  0.710  0.796  0.762  0.654  0.856  0.553  0.546  0.630

Table 2
Runs RMSE and MAE values for methods considered in Task 1. The minimum values are highlighted in bold.

Runs             RMSE   MAE
logistic         0.830  0.511
logistic_ALSFRS  0.636  0.341
ridge            0.687  0.392
rf               0.650  0.361
rf_reg           0.636  0.373
hold             0.491  0.202
average          0.596  0.333
optrun           0.707  0.412

4.2. Task 2 Results
Table 3 presents the CV results for Task 2. Each column represents one of the predicted ALSFRS-R scores (Q1 - Q12), while the rows indicate the considered models. Each cell displays the average CV RMSE. RMSE values highlighted in bold represent the lowest value of each column, thus indicating the best-performing model for each predicted question.

In this task, the LR model, when considering all available variables, achieved the best results for seven out of twelve scores (Q2, Q3, Q5, Q7, Q10, Q11, Q12), with RMSE values ranging between 0.139 for Q3 and 0.595 for Q10. The ridge regression also showed good performance compared to other models, achieving the best prediction for four out of twelve scores (Q1, Q4, Q6, Q9). Its RMSE values ranged from 0.292 for Q1 to 0.449 for Q6. The RF regressor yielded the best prediction only for Q8, with an RMSE of 0.372. Finally, the hold approach and the RF classifier performed the worst among all the models.
In general, adding first all the ALSFRS-R scores, and then all the other sensor variables, led to a performance increase in the CV phase, resulting in lower RMSE values.

Table 4 shows the results of the Task 2 submitted runs as evaluated by the challenge organizers. The name of the submitted run is reported in the first column of Table 4. Columns two and three of Table 4 show the two metrics used by the organizers to evaluate participants’ submitted runs on the independent test set: RMSE and MAE, respectively.

Results observed in CV were not confirmed on the test set for this second task either, with the best-performing model being once again the hold method (RMSE = 0.577, MAE = 0.287) and the LR with wearable data available yielding the worst results on the test set (RMSE = 0.993, MAE = 0.659). These results are in line with those observed in Task 1. Additionally, the scores assigned during this period are based on self-evaluation, which may further impact the accuracy of the data.

Overall, in this task, the RMSE values were lower than those obtained in Task 1, especially for the hold method. This difference may be attributed to the fact that clinicians are able to better assign ALSFRS-R scores during visits, resulting in greater variability, which leads to a more challenging prediction task. Patients, instead, are typically more conservative and tend to assign similar scores between questionnaires. This leads to less variability, which makes the prediction task slightly easier, especially for the hold method.

Table 3
CV RMSE values for methods considered in Task 2. Each column represents an ALSFRS-R question. The minimum values of each column are highlighted in bold.
| Model | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | Q11 | Q12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LR | 0.360 | **0.453** | **0.139** | 0.466 | **0.525** | 0.497 | **0.412** | 0.447 | 0.452 | **0.595** | **0.414** | **0.263** |
| LR ALSFRS Only | 0.399 | 0.551 | 0.291 | 0.513 | 0.621 | 0.620 | 0.565 | 0.454 | 0.606 | 0.738 | 0.553 | 0.309 |
| Ridge | **0.292** | 0.500 | 0.219 | **0.437** | 0.535 | **0.449** | 0.437 | 0.378 | **0.381** | 0.701 | 0.702 | 0.358 |
| RF classifier | 0.357 | 0.490 | 0.223 | 0.481 | 0.565 | 0.531 | 0.465 | 0.412 | 0.418 | 0.688 | 0.769 | 0.347 |
| RF regressor | 0.367 | 0.513 | 0.280 | 0.497 | **0.525** | 0.567 | 0.488 | **0.372** | 0.492 | 0.638 | 0.680 | 0.345 |
| Hold | 0.384 | 0.701 | 0.313 | 0.514 | 0.631 | 0.597 | 0.612 | 0.525 | 0.602 | 0.821 | 0.917 | 0.450 |

Table 4
Runs RMSE and MAE values for methods considered in Task 2. The minimum values are highlighted in bold.

| Runs | RMSE | MAE |
|---|---|---|
| logistic | 0.993 | 0.659 |
| logistic_ALSFRS | 0.854 | 0.500 |
| ridge | 0.850 | 0.545 |
| rf | 0.778 | 0.515 |
| rf_reg | 0.818 | 0.515 |
| hold | **0.577** | **0.287** |
| average | 0.783 | 0.493 |
| optrun | 0.962 | 0.606 |

4.3. Task 3 Results

Table 5 reports the CV results for Task 3. Each row shows a considered model and its corresponding CV RMSE. The RMSE value highlighted in bold represents the lowest score, indicating the best-performing approach. The ridge regression with environmental data performed best, with an RMSE of 69.564. However, on the independent test set, the best performance was achieved without including environmental variables, as shown in Table 6.

In Task 3, the RMSE is very high, indicating low precision in predicting the relapse week. Incorporating environmental data improved the results in CV but worsened them on the test set. This discrepancy is likely due to the long sequences of missing data that needed to be imputed, as there are long intervals between visits in both the MS training and test sets.

Table 5
CV RMSE values for methods considered in Task 3. The minimum value is highlighted in bold.
| Model | RMSE |
|---|---|
| Ridge | **69.564** |
| Ridge without environmental data | 72.992 |
| RF regressor | 74.972 |
| Average | 82.702 |

Table 6
Runs RMSE and MAE values for methods considered in Task 3. The minimum values are highlighted in bold.

| Runs | RMSE | MAE |
|---|---|---|
| ridge | 89.83 | 68.59 |
| ridge_no_env | **78.62** | **61.37** |
| rf_reg | 79.73 | 66.63 |
| average | 79.25 | 65.80 |

5. Conclusions and Future Work

This study addressed the three tasks proposed within the iDPP@CLEF 2024 challenge, while also evaluating whether the inclusion of sensor and environmental data helps improve the prediction of ALS and MS progression. In Task 1 and Task 2, the goal was to predict the ALSFRS-R scores assigned, respectively, by clinicians and by the patients themselves. Task 3 instead consisted of predicting the week of the first relapse for MS patients.

A flexible training workflow was developed to evaluate different methodological approaches and different subsets of input variables within a common, robust framework. For Task 1 and Task 2, both classification and regression approaches were explored, namely LR, ridge regression, RF regressor, and RF classifier. In Task 3, only regression models were considered due to the different nature of this task, namely ridge regression and RF regressor.

In the first two tasks, classification approaches were better able to capture the variability of the ALSFRS-R scores among the five classes, whereas the regression approaches tended to frequently predict the mean value within the range [0-4]. Moreover, in these tasks, the best CV results were achieved by the ridge regression and LR when including variables derived from wearable devices. On the contrary, when evaluating the models on the independent test sets, the best results were obtained by the hold method. The robustness of the results during CV can be attributed to the nature of the training set, which includes data from all visits.
This results in a richer, more complete, and more robust dataset, characterized by a more refined wearable data collection process than the test set, which included only the first couple of visits, when patients are still getting familiar with the data collection process and the BRAINTEASER app. Hence, the test data were noisier and sparser.

In Task 3, the best CV results were achieved by the ridge model incorporating the environmental data, whereas on the independent test set the best performance was obtained by the ridge model without the environmental data. The weak result for this task could be attributed to the variable creation process, which was designed for the first two tasks and applied directly to the third without being optimized for it. One possible solution, given the long periods between visits, could be to consider dynamic variables instead of computing first-order descriptors, and consequently to employ models that account for these dynamic data.

In conclusion, the developed models performed well within the iDPP@CLEF 2024 challenge, while raising important considerations that go beyond the competition itself. In fact, the results of Tasks 1 and 2 suggest that collecting wearable data can be a viable path to improving the prediction of ALS disability status. However, a key condition for benefiting from the inclusion of these data is that patients must be properly informed, trained, and followed in order to obtain rich and high-quality data over long periods of time. Otherwise, it might be more effective to rely on data that are commonly collected during routine visits of ALS patients.
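To make the distinction drawn above for Task 3 between first-order descriptors and dynamic variables more concrete, the following sketch summarises a sensor series recorded between two visits in both ways. It is only an illustration under stated assumptions: the daily readings are invented, the choice of mean, standard deviation, minimum, and maximum as descriptors and the 7-day window are hypothetical, and none of the names come from the original pipeline.

```python
from statistics import mean, pstdev


def first_order_descriptors(series):
    """Collapse a whole between-visit series into a few static summary values."""
    return {
        "mean": mean(series),
        "std": pstdev(series),  # population standard deviation
        "min": min(series),
        "max": max(series),
    }


def weekly_sequence(series, window=7):
    """Keep the temporal structure: one mean per consecutive block of `window` days."""
    return [mean(series[i:i + window]) for i in range(0, len(series), window)]


# Invented daily readings (e.g., a pollutant concentration) over 21 days.
daily = [10.0] * 7 + [20.0] * 7 + [30.0] * 7

static_features = first_order_descriptors(daily)
dynamic_features = weekly_sequence(daily)

print(static_features)
print(dynamic_features)
```

In this toy case the static summary reports a single mean of 20, while the weekly sequence retains the upward trend (10, 20, 30) that a model designed for sequential inputs could exploit.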
On the other hand, regarding MS, since the given environmental data and observations were also measured after the relapse that needed to be predicted, it would be more effective to focus only on the environmental pollutants measured a few days before the relapse, as also confirmed by the literature [40].

References

[1] M. C. Kiernan, S. Vucic, B. C. Cheah, M. R. Turner, A. Eisen, O. Hardiman, J. R. Burrell, M. C. Zoing, Amyotrophic lateral sclerosis, The Lancet 377 (2011) 942–955.
[2] L. P. Rowland, N. A. Shneider, Amyotrophic lateral sclerosis, New England Journal of Medicine 344 (2001) 1688–1700.
[3] M. Goldenberg, Multiple sclerosis review, P & T: a peer-reviewed journal for formulary management 37 (2012) 175–184.
[4] M.-H. Soriani, C. Desnuelle, Care management in amyotrophic lateral sclerosis, Revue Neurologique 173 (2017) 288–299.
[5] G. Birolo, P. Bosoni, G. Faggioli, H. Aidos, R. Bergamaschi, P. Cavalla, A. Chiò, A. Dagliati, M. de Carvalho, G. M. D. Nunzio, P. Fariselli, J. M. G. Dominguez, M. Gromicho, A. Guazzo, E. Longato, S. C. Madeira, U. Manera, S. Marchesin, L. Menotti, G. Silvello, E. Tavazzi, E. Tavazzi, I. Trescato, M. Vettoretti, B. D. Camillo, N. Ferro, Overview of iDPP@CLEF 2024: The intelligent disease progression prediction challenge, in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024.
[6] G. Birolo, P. Bosoni, G. Faggioli, H. Aidos, R. Bergamaschi, P. Cavalla, A. Chiò, A. Dagliati, M. de Carvalho, G. M. D. Nunzio, P. Fariselli, J. M. G. Dominguez, M. Gromicho, A. Guazzo, E. Longato, S. C. Madeira, U. Manera, S. Marchesin, L. Menotti, G. Silvello, E. Tavazzi, E. Tavazzi, I. Trescato, M. Vettoretti, B. D. Camillo, N. Ferro, Intelligent disease progression prediction: Overview of iDPP@CLEF 2024, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 15th International Conference of the CLEF Association, CLEF 2024, Grenoble, France, September 9th to 12th, 2024, Lecture Notes in Computer Science, Springer, 2024.
[7] J. M. Cedarbaum, N. Stambler, E. Malta, C. Fuller, D. Hilt, B. Thurmond, A. Nakanishi, The ALSFRS-R: a revised ALS functional rating scale that incorporates assessments of respiratory function, Journal of the Neurological Sciences 169 (1999) 13–21.
[8] S. Twork, S. Wiesmeth, M. Spindler, M. Wirtz, S. Schipper, D. Pöhlau, J. Klewer, J. Kugler, Disability status and quality of life in multiple sclerosis: non-linearity of the expanded disability status scale (EDSS), Health and Quality of Life Outcomes 8 (2010).
[9] E. Tavazzi, E. Longato, M. Vettoretti, H. Aidos, I. Trescato, C. Roversi, A. S. Martins, E. N. Castanho, R. Branco, D. F. Soares, A. Guazzo, G. Birolo, D. Pala, P. Bosoni, A. Chiò, U. Manera, M. de Carvalho, B. Miranda, M. Gromicho, I. Alves, R. Bellazzi, A. Dagliati, P. Fariselli, S. C. Madeira, B. Di Camillo, Artificial intelligence and statistical methods for stratification and prediction of progression in amyotrophic lateral sclerosis: A systematic review, Artificial Intelligence in Medicine 142 (2023) 102588.
[10] F. Papaiz, M. E. T. Dourado, R. A. d. M. Valentim, A. H. F. de Morais, J. P. Arrais, Machine learning solutions applied to amyotrophic lateral sclerosis prognosis: A review, Frontiers in Computer Science 4 (2022).
[11] T. Hothorn, H. H. Jung, RandomForest4Life: A random forest for predicting ALS disease progression, Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration 15 (2014) 444–452. PMID: 25141076.
[12] K. D. Ko, T. El-Ghazawi, D. Kim, H. Morizono, Predicting the severity of motor neuron disease progression using electronic health record data with a cloud computing big data approach, in: 2014 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, 2014, pp. 1–6.
[13] D. Halbersberg, B. Lerner, Temporal modeling of deterioration patterns and clustering for disease prediction of ALS patients, in: 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), 2019, pp. 62–68.
[14] A. A. Taylor, C. Fournier, M. Polak, L. Wang, N. Zach, M. Keymer, J. D. Glass, D. L. Ennist, The Pooled Resource Open-Access ALS Clinical Trials Consortium, Predicting disease progression in amyotrophic lateral sclerosis, Annals of Clinical and Translational Neurology 3 (2016) 866–875.
[15] R. Kueffner, et al., Stratification of amyotrophic lateral sclerosis patients: a crowdsourcing approach, Scientific Reports 9 (2019).
[16] A. Greco, M. R. Chiesa, I. Da Prato, A. M. Romanelli, C. Dolciotti, G. Cavallini, S. M. Masciandaro, E. P. Scilingo, R. Del Carratore, P. Bongioanni, Using blood data for the differential diagnosis and prognosis of motor neuron diseases: a new dataset for machine learning applications, Scientific Reports 11 (2021).
[17] R. Gomeni, M. Fava, Amyotrophic lateral sclerosis disease progression model, Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration 15 (2014) 119–129. PMID: 24070404.
[18] S. Pires, M. Gromicho, S. Pinto, M. Carvalho, S. C. Madeira, Predicting non-invasive ventilation in ALS patients using stratified disease progression groups, in: 2018 IEEE International Conference on Data Mining Workshops (ICDMW), 2018, pp. 748–757.
[19] A. S. Martins, M. Gromicho, S. Pinto, M. de Carvalho, S. C. Madeira, Learning prognostic models using disease progression patterns: Predicting the need for non-invasive ventilation in amyotrophic lateral sclerosis, IEEE/ACM Transactions on Computational Biology and Bioinformatics 19 (2022) 2572–2583.
[20] S. Pires, M. Gromicho, S. Pinto, M. de Carvalho, S. C. Madeira, Patient stratification using clinical and patient profiles: Targeting personalized prognostic prediction in ALS, in: I. Rojas, O. Valenzuela, F. Rojas, L. J. Herrera, F. Ortuño (Eds.), Bioinformatics and Biomedical Engineering, Springer International Publishing, Cham, 2020, pp. 529–541.
[21] H. K. van der Burgh, R. Schmidt, H.-J. Westeneng, M. A. de Reus, L. H. van den Berg, M. P. van den Heuvel, Deep learning predictions of survival based on MRI in amyotrophic lateral sclerosis, NeuroImage: Clinical 13 (2017) 361–369.
[22] B. Hadad, B. Lerner, Domain adaptation from clinical trials data to the tertiary care clinic – application to ALS, in: 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2020.
[23] E. Tavazzi, et al., Predicting functional impairment trajectories in amyotrophic lateral sclerosis: a probabilistic, multifactorial model of disease progression, Journal of Neurology 269 (2022) 3858–3878.
[24] K. Chalkou, E. Steyerberg, M. Egger, A. Manca, F. Pellegrini, G. Salanti, A two-stage prediction model for heterogeneous effects of treatments, Statistics in Medicine 40 (2021) 4362–4375.
[25] Y. Ahuja, N. Kim, L. Liang, T. Cai, K. Dahal, T. Seyok, C. Lin, S. Finan, K. Liao, G. Savova, T. Chitnis, T. Cai, Z. Xia, Leveraging electronic health records data to predict multiple sclerosis disease activity, Annals of Clinical and Translational Neurology 8 (2021) 800–810.
[26] R. Schlaeger, M. D’Souza, C. Schindler, L. Grize, S. Dellas, E. Radue, L. Kappos, P. Fuhr, Prediction of long-term disability in multiple sclerosis, Multiple Sclerosis Journal 18 (2012) 31–38.
[27] M. Filippi, P. Preziosa, M. Copetti, G. Riccitelli, M. A. Horsfield, V. Martinelli, G. Comi, M. A. Rocca, Gray matter damage predicts the accumulation of disability 13 years later in MS, Neurology 81 (2013) 1759–1767.
[28] V. Popescu, F. Agosta, H. E. Hulst, I. C. Sluimer, D. L. Knol, M. P. Sormani, C. Enzinger, S. Ropele, J. Alonso, J. Sastre-Garriga, A. Rovira, X. Montalban, B. Bodini, O. Ciccarelli, Z. Khaleeli, D. T. Chard, L. Matthews, J. Palace, A. Giorgio, N. De Stefano, P. Eisele, A. Gass, C. H. Polman, B. M. J. Uitdehaag, M. J. Messina, G. Comi, M. Filippi, F. Barkhof, H. Vrenken, MAGNIMS Study Group, Brain atrophy and lesion load predict long term disability in multiple sclerosis, J. Neurol. Neurosurg. Psychiatry 84 (2013) 1082–1091.
[29] R. Schlaeger, M. D’Souza, C. Schindler, L. Grize, S. Dellas, E. W. Radue, L. Kappos, P. Fuhr, Prediction of long-term disability in multiple sclerosis, Mult. Scler. 18 (2012) 31–38.
[30] Y. Zhao, B. C. Healy, D. Rotstein, C. R. G. Guttmann, R. Bakshi, H. L. Weiner, C. E. Brodley, T. Chitnis, Exploration of machine learning techniques in predicting multiple sclerosis disease course, PLOS ONE 12 (2017) e0174866.
[31] F. S. Brown, S. A. Glasmacher, P. K. A. Kearns, N. MacDougall, D. Hunt, P. Connick, S. Chandran, Systematic review of prediction models in relapsing remitting multiple sclerosis, PLOS ONE 15 (2020) e0233575.
[32] S. A. Johnson, M. Karas, K. M. Burke, M. Straczkiewicz, Z. A. Scheier, A. P. Clark, S. Iwasaki, A. Lahav, A. S. Iyer, J.-P. Onnela, J. D. Berry, Wearable device and smartphone data quantify ALS progression and may provide novel outcome measures, npj Digital Medicine 6 (2023).
[33] V. Fuh-Ngwa, Y. Zhou, J. C. Charlesworth, A.-L. Ponsonby, S. Simpson-Yap, J. Lechner-Scott, B. V. Taylor, A. I. Group, Developing a clinical–environmental–genotypic prognostic index for relapsing-onset multiple sclerosis and clinically isolated syndrome, Brain Communications 3 (2021) fcab288.
[34] I. Trescato, A. Guazzo, E. Longato, E. Hazizaj, C. Roversi, E. Tavazzi, M. Vettoretti, B. Di Camillo, Baseline machine learning approaches to predict amyotrophic lateral sclerosis disease progression: Notebook for the iDPP lab on intelligent disease progression prediction at CLEF 2022, 2022.
[35] A. Guazzo, I. Trescato, E. Longato, E. Tavazzi, M. Vettoretti, B. Camillo, Baseline machine learning approaches to predict multiple sclerosis disease progression, in: CLEF, 2023.
[36] S. Van Buuren, K. Groothuis-Oudshoorn, mice: Multivariate imputation by chained equations in R, Journal of Statistical Software 45 (2011) 1–67.
[37] I. M. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[38] T. O. Hodson, Root-mean-square error (RMSE) or mean absolute error (MAE): when to use them or not, Geoscientific Model Development 15 (2022) 5481–5487.
[39] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, J. Mach. Learn. Res. 13 (2012) 281–305.
[40] J. Roux, D. Bard, E. Le Pabic, C. Segala, J. Reis, J. C. Ongagna, J. de Sèze, E. Leray, Air pollution by particulate matter PM10 may trigger multiple sclerosis relapses, Environ Res 156 (2017) 404–410.