Overview of iDPP@CLEF 2024: The Intelligent Disease Progression Prediction Challenge

Giovanni Birolo1,†, Pietro Bosoni2,†, Guglielmo Faggioli3,†, Helena Aidos4, Roberto Bergamaschi2, Paola Cavalla1,5, Adriano Chiò1, Arianna Dagliati2, Mamede de Carvalho4, Giorgio Maria Di Nunzio3, Piero Fariselli1, Jose Manuel García Dominguez6, Marta Gromicho4, Alessandro Guazzo3, Enrico Longato3, Sara C. Madeira4, Umberto Manera1, Stefano Marchesin3, Laura Menotti3, Gianmaria Silvello3, Eleonora Tavazzi7, Erica Tavazzi3, Isotta Trescato3, Martina Vettoretti3, Barbara Di Camillo3, and Nicola Ferro3

1 University of Turin, Italy
2 University of Pavia, Italy
3 University of Padua, Italy
4 University of Lisbon, Lisbon, Portugal
5 "Città della Salute e della Scienza", Turin, Italy
6 Gregorio Marañon Hospital in Madrid, Spain
7 IRCCS Foundation C. Mondino in Pavia, Italy

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
† These authors contributed equally.

Abstract
Multiple Sclerosis (MS) and Amyotrophic Lateral Sclerosis (ALS) are neurodegenerative diseases characterized by progressive or fluctuating impairments in motor, sensory, visual, and cognitive functions. Patients with these diseases endure significant physical, psychological, and economic burdens due to hospitalizations and home care, while grappling with uncertainty about their conditions. AI tools hold promise for aiding patients and clinicians by identifying the need for intervention and suggesting personalized therapies throughout disease progression. The objective of iDPP@CLEF is to develop AI-based approaches to describe the progression of these diseases. The ultimate goal is to enable patient stratification and predict disease progression, thereby assisting clinicians in providing timely care. iDPP@CLEF 2024 continues the work of the previous editions, iDPP@CLEF 2022 and 2023. The 2022 edition focused on predicting ALS progression and utilizing explainable AI. The 2023 edition expanded on this by including environmental data and introduced a new task for predicting MS progression. This edition extends the MS dataset with environmental data and introduces two new ALS tasks aimed at predicting disease progression using data from wearable devices. This marks the first iDPP edition to utilize prospective data directly collected from patients involved in the BRAINTEASER project.

1. Introduction
Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are two severe and impactful diseases that cause progressive neurological impairment in individuals living with them. The progression of these diseases is typically heterogeneous, resulting in significant variability in aspects such as treatment, outcomes, quality of life, and overall patient needs. This variability presents challenges not only for patients but also for clinicians and caregivers. For example, patients with ALS often need specific treatments, such as Non-Invasive Ventilation (NIV) or Percutaneous Endoscopic Gastrostomy (PEG), at certain stages of their disease progression. Similarly, MS patients may experience debilitating relapses that severely impact their quality of life. Therefore, it would be highly beneficial to anticipate the needs of individuals affected by these diseases to provide them with the most timely and effective care. However, the heterogeneous nature of these conditions
makes it challenging to develop effective prognostic tools that work equally well for every patient. This underscores the importance of creating automatic tools to assist clinicians in decision-making throughout disease progression, facilitating personalized therapeutic choices. In particular, developing new automatic predictive approaches based on AI requires a proper framework for designing and evaluating different tasks, such as:
• Stratifying patients according to their phenotype throughout disease evolution.
• Predicting disease progression in a probabilistic, time-dependent manner.
• Providing a better and more explainable understanding of the mechanisms underlying MS and ALS.

A key aspect is that these approaches should rely on shared resources that enable proper benchmarking and comparable, reproducible experimentation. In fact, only by properly measuring and comparing the effectiveness of the various developed tools can we understand how to improve them. The Intelligent Disease Progression Prediction at CLEF (iDPP@CLEF) Lab aims to provide an evaluation infrastructure for developing such AI algorithms. iDPP proposes to go beyond the current state of the art by systematically addressing issues related to applying AI in clinical practice for ALS and MS. In addition to defining risk scores based on the probability of short- or long-term events, iDPP@CLEF also focuses on providing clinicians with structured and understandable data.

iDPP@CLEF 2024 is the final iteration of an evaluation cycle that began in 2022, comprising three challenges aimed at fostering reproducible and comparable evaluation of AI-based approaches for predicting the progression of ALS and MS. The first edition, iDPP@CLEF 2022, focused exclusively on ALS, challenging participants to predict the probability that patients would need specific medical treatments based on their medical history. The second edition, iDPP@CLEF 2023, not only built upon iDPP@CLEF 2022 by extending its dataset with environmental data to determine the impact of the environment on patient needs, but also introduced a new task to predict the risk for patients living with MS to undergo deterioration. This final edition, iDPP@CLEF 2024, further extends the 2023 dataset by including environmental data for MS patients, to measure the impact of pollution and external environmental factors on MS progression. Additionally, two new tasks have been introduced, both aimed at predicting the progression of ALS, as measured by the ALSFRS-R scale, from the patient's clinical history and data obtained from wearable devices and sensors.

The paper is organized as follows: Section 2 presents related challenges; Section 3 describes the tasks of this edition; Section 4 discusses the developed datasets; Section 5 explains the setup of the Lab and introduces the participants; Section 6 introduces the evaluation measures adopted to score the runs; Section 7 analyzes the experimental results for the different tasks; finally, Section 8 draws some conclusions and outlines future work. This is an extended version of the condensed overview of the iDPP@CLEF 2024 Lab [1].

2. Related Challenges
There have been no other Labs on this or similar topics within CLEF before the start of iDPP@CLEF: iDPP@CLEF 2022 and 2023 were the first two iterations of the Lab, and the current one is the third.
While no major challenges regarding MS, besides iDPP@CLEF 2023, have been carried out yet, more interest has been shown toward ALS. In particular, three major challenges were organized on this topic: the DREAM 7 ALS Prediction challenge (https://dreamchallenges.org/dream-7-phil-bowen-als-prediction-prize4life/) in 2012, the DREAM ALS Stratification challenge (https://dx.doi.org/10.7303/syn2873386) in 2015, and a Kaggle challenge (https://www.kaggle.com/alsgroup/end-als) in 2021. The DREAM 7 ALS Prediction challenge consisted of using 3 months of ALS clinical trial information (months 0–3) to predict the future progression of the disease (months 3–12), expressed as the slope of change in the ALS Functional Rating Scale - Revised (ALSFRS-R) [2]. Later on, the DREAM ALS Stratification challenge [3] required participants to stratify ALS patients into subgroups based on their characteristics, to better understand patient profiles and provide personalized ALS treatments. Finally, the Kaggle challenge employed clinical and genomic data to obtain a better understanding of the mechanisms underlying ALS and to determine why some people with ALS tend to have a faster progression of the disease compared to others.

At present, most of the datasets used to evaluate AI algorithms for MS are closed and proprietary. In this sense, iDPP@CLEF paved the way for reproducible and effectively open science in the research domain of AI for predicting the progression of MS.

2.1. iDPP@CLEF 2022
iDPP@CLEF 2022 (https://brainteaser.health/open-evaluation-challenges/idpp-2022/) [4, 5] was the first edition of the Lab and concerned exclusively ALS disease progression prediction. Being the pilot Lab, a large share of the effort was devoted to understanding the challenges and limitations of shared evaluation campaigns when it comes to AI applied in the medical domain. iDPP@CLEF 2022 was organized into three tasks:
• Pilot Task 1 - Ranking Risk of Impairment: the first task of iDPP@CLEF 2022 focused on ranking patients based on the risk of impairment, defined as the need for specific medical treatments, such as NIV, PEG, or death. Participants were given information on the motor functioning of the patients over time, measured according to the ALSFRS-R scale [2], and were asked to rank patients based on the time-to-event risk of experiencing impairment in each specific domain.
• Pilot Task 2 - Predicting Time of Impairment: it refined Task 1 by asking participants to predict when specific impairments would occur (i.e., in the correct time window). In this regard, the task focused on assessing model calibration, in terms of the ability of the proposed algorithms to estimate the probability of an event close to the true probability within a specified time window.
• Position Paper Task 3 - Explainability of Artificial Intelligence (AI) algorithms: the task focused on the evaluation and discussion of AI-based explainable frameworks for intelligent disease progression prediction, able to explain the multivariate nature of the data and the model predictions.

One of the major outputs of iDPP@CLEF 2022 was the release of three datasets containing data for the prediction of specific events related to ALS. These datasets consist of fully anonymized retrospective data about 2,250 real patients, recruited from two medical institutions in Turin, Italy, and Lisbon, Portugal. They contain static data about patients (e.g., age, onset date, gender) and event data (i.e., 18,512 ALSFRS-R questionnaires and 4,015 spirometries).
Six groups participated in iDPP@CLEF 2022 and submitted a total of 120 runs.

2.2. iDPP@CLEF 2023
Similarly to iDPP@CLEF 2022, iDPP@CLEF 2023 (https://brainteaser.dei.unipd.it/challenges/idpp2023/) [6, 7] was organized into three tasks, focusing on either ALS or MS. In more detail, Tasks 1 and 2 of iDPP@CLEF 2023 concerned MS, while Task 3 built upon iDPP@CLEF 2022 and extended the ALS tasks of the previous iteration of the Lab. To summarize the iDPP@CLEF 2023 tasks:
• Task 1: Predicting Risk of Disease Worsening (MS). This task focused on predicting the probability that, given the history of the patient, they would undergo a worsening, according to two different definitions of worsening.
• Task 2: Predicting Cumulative Probability of Worsening (MS). The second task had a similar objective to Task 1, with the major difference that, instead of predicting the risk at an absolute level, participants were required to predict the cumulative probability of worsening over 10 years.
• Task 3: Position Papers on the Impact of Exposition to Pollutants (ALS). The third task extended the first task of iDPP@CLEF 2022 and concerned the ranking of patients based on the risk of impairment. The major difference to iDPP@CLEF 2022 was that participants were given environmental data, to determine whether such data was a good predictor of the risk of impairment.

iDPP@CLEF 2023 extended the iDPP@CLEF 2022 datasets with two datasets for MS. In particular, such datasets contained static data about patients, MS-related details (e.g., the EDSS score, results of MRIs, evoked potential measures), and a label indicating whether the patient underwent a worsening, based on the worsening definitions of Tasks 1 and 2. Ten teams submitted a total of 163 runs at the end of iDPP@CLEF 2023.

3. Tasks
In the remainder of this section, we describe each task in more detail.

3.1. Task 1: Predicting ALSFRS-R Score from Sensor Data (ALS)
Task 1 focuses on predicting the twelve scores of the ALSFRS-R (ALS Functional Rating Scale - Revised), assigned by medical doctors roughly every three months, from the sensor data collected via the app. The ALSFRS-R is a somewhat "subjective" evaluation usually performed by a medical doctor, and this task helps answer a currently open question in the research community, i.e., whether it could be derived from objective factors. Participants were given the ALSFRS-R questionnaire at the first visit, with the scores for each question, together with the time (number of days from diagnosis) at which the questionnaire was taken. Participants were also given the time of the second visit (number of days from diagnosis), together with all the sensor data up to the time of the second visit. Participants had to predict the values of the ALSFRS-R sub-scores at the second visit.

3.2. Task 2: Predicting Patient Self-assessment Score from Sensor Data (ALS)
The second task concerning ALS focuses on predicting the self-assessment scores assigned by patients from the sensor data collected via the app. Self-assessment scores correspond to each of the ALSFRS-R scores but, while the latter are assigned by medical doctors during visits, these scores are assigned via auto-evaluation by the patients themselves using the provided app.
If the self-assessments performed by patients, which occur more frequently than the assessments performed by medical doctors every three months or so, can be reliably predicted from sensor and app data, one can imagine a proactive application that monitors the sensor data and alerts the patient when an assessment is needed. Participants were given the first set of self-assessed scores, together with the time (number of days from diagnosis) at which the questionnaire was taken. Participants were also given the time of the second auto-evaluation (number of days from diagnosis), together with all the sensor data up to the time of the second auto-evaluation. Participants had to predict the values of the self-assessed scores at the second auto-evaluation, which happens one or two months after the first one.

3.3. Task 3: Predicting Relapses from EDSS Sub-scores and Environmental Data (MS)
The third task focuses on predicting a relapse using environmental data and EDSS (Expanded Disability Status Scale) sub-scores. This task allows us to assess whether exposure to different pollutants is a useful variable in predicting a relapse. Participants were asked to predict the week of the first relapse after the baseline, considering environmental data with a weekly granularity, given the status of the patient at the baseline, which is the first visit available in the considered time span (after January 1, 2013). For each patient, the date of the baseline is week 0, and all the other weeks are relative to it. Participants were given all the environmental data about a patient, including observations recorded after the relapse to be predicted. All the patients are guaranteed to experience at least one relapse after the baseline.

4. Dataset
For iDPP@CLEF 2024, we release three datasets: two completely new datasets for ALS and an extension of the iDPP@CLEF 2023 dataset concerning MS. In more detail, the two new ALS datasets comprise a common training part with 52 training patients, whose ALSFRS-R scores were both annotated by the clinicians and self-assessed. Concerning the test sets, 21 and 11 patients were included for Task 1 and Task 2, respectively. Regarding MS, the part of the dataset concerning static variables and MS-related information is the same as the one used for iDPP@CLEF 2023; the major improvement regards the environmental data that have been added to the dataset.

4.1. Tasks 1 and 2: ALS Dataset with Clinical or Self-assessed ALSFRS-R
The datasets for Task 1 and Task 2 were collected from ALS-diagnosed patients recruited during the BRAINTEASER project from three centers in Lisbon, Madrid, and Turin. At recruitment, patients were given a commercial fitness tracker (the Garmin VivoActive 4 smartwatch), and data from its sensors was collected during a follow-up period with a median duration of 270 days. Patients were encouraged to wear the watch as much as they were comfortable with, ideally all the time, both while awake and sleeping. Each day of data for each patient was summarized into a vector of 90 statistics related to heart rate and beat-to-beat interval, respiration rate, and nocturnal pulse oximetry. Sensor data was not available every day for each patient. During the same period, disease progression was assessed by their clinician using the ALSFRS-R questionnaire (roughly every three months, following standard clinical practice). Patients also used the same questionnaire to self-assess their progression through a smartphone app developed specifically by the BRAINTEASER project. They were prompted for the assessment once per month, though the actual frequency varied and depended on patient compliance.
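To make the summarization step concrete, the following sketch shows how raw within-day sensor streams could be reduced to one feature row per patient and day. It is purely illustrative: the column and signal names are hypothetical, only a handful of the 90 statistics are computed, and this is not the pipeline actually used to build the datasets.

```python
import pandas as pd

def daily_summary(samples: pd.DataFrame) -> pd.DataFrame:
    """Reduce raw within-day sensor readings to one row of summary statistics
    per patient and day. `samples` is assumed to hold one row per reading,
    with columns patient_id, date, heart_rate, bbi, respiration_rate, spo2."""
    rows = []
    for (patient, day), readings in samples.groupby(["patient_id", "date"]):
        row = {"patient_id": patient, "date": day}
        for signal in ("heart_rate", "bbi", "respiration_rate", "spo2"):
            values = readings[signal].dropna()
            # A real pipeline would compute many more statistics per signal.
            row[f"{signal}_mean"] = values.mean()
            row[f"{signal}_std"] = values.std()
            row[f"{signal}_min"] = values.min()
            row[f"{signal}_max"] = values.max()
        rows.append(row)
    return pd.DataFrame(rows)
```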
4.1.1. Creation of the datasets
Patients with insufficient data were excluded from the challenge dataset. Specifically, this included those with less than three months of follow-up data, those with more than 50% of sensor data missing, and those without at least two clinical or self-assessed ALSFRS-R evaluations. After applying these criteria, a dataset of 83 patients was obtained, with a median of 254 days of sensor data per patient. These patients and their data were then divided into a training group (common to both Tasks 1 and 2) and two task-specific test groups.

4.1.2. Split into training and test
The patients were split into three groups:
• training: patients with at least two clinical and two self-assessed ALSFRS-R evaluations;
• test-ct: patients with at least two clinical but without two self-assessed ALSFRS-R evaluations;
• test-app: patients with at least two self-assessed but without two clinical ALSFRS-R evaluations.

The training set thus included 52 patients, with a median of 3.5 clinical and 5 self-assessed ALSFRS-R evaluations (189 and 301 in total, respectively). The test-ct set (the test set for Task 1) included 21 patients, whose first clinical ALSFRS-R evaluations were included as features, while the second evaluations were the prediction target. The test-app set (the test set for Task 2) included 11 patients and was built in the same way using the self-assessed ALSFRS-R evaluations. The full available sensor data for all patients was included in both the training and test datasets, while only the clinical (resp. self-assessed) ALSFRS-R evaluations were included for Task 1 (resp. Task 2). A comparative description of the datasets is shown in Table 1.

Table 1: Comparison between training and test populations for Tasks 1 and 2. Continuous variables are presented as median (interquartile range); categorical variables as count (percentage on available data), for each level. "Sensor adherence" is the ratio of days with available sensor data during the whole sensor follow-up.

Variable | Level | Task 1/2 train | Task 1 test | Task 2 test
Sex | Female | 11 (21.15%) | 9 (42.86%) | 4 (36.36%)
Sex | Male | 41 (78.85%) | 12 (57.14%) | 7 (63.64%)
Diagnostic delay (months) | median (IQR) | 0.8 (0.4-1.3) | 0.9 (0.4-1.8) | 1.0 (0.4-1.6)
Age at diagnosis | median (IQR) | 56 (49-64) | 62 (57-66) | 60 (52-66)
FVC | median (IQR) | 85 (79-95) | 84 (79-98) | 92 (79-113)
Weight | median (IQR) | 75 (64-81) | 67 (60-71) | 65 (60-70)
BMI | median (IQR) | 25 (23-27) | 24 (22-26) | 22 (21-25)
ALSFRS-R CT (count) | median (IQR) | 3.5 (2.0-5.0) | - | -
ALSFRS-R APP (count) | median (IQR) | 5.0 (3.0-8.0) | - | -
Sensor follow-up (months) | median (IQR) | 9.8 (5.2-13.6) | 8.9 (5.3-14.2) | 5.9 (5.5-8.3)
Sensor adherence | median (IQR) | 98% (89%-100%) | 98% (85%-100%) | 100% (99%-100%)
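The grouping rule can be stated compactly in code. The following is a minimal sketch of the logic just described; the function and its inputs are hypothetical, since the released datasets already come with the split materialized.

```python
def assign_split(n_clinical: int, n_self_assessed: int) -> str | None:
    """Assign a patient to a group based on the counts of their clinical and
    self-assessed ALSFRS-R evaluations."""
    if n_clinical >= 2 and n_self_assessed >= 2:
        return "training"   # common training set for Tasks 1 and 2
    if n_clinical >= 2:
        return "test-ct"    # Task 1 test set
    if n_self_assessed >= 2:
        return "test-app"   # Task 2 test set
    return None             # excluded during dataset creation (Section 4.1.1)
```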
4.2. Task 3: MS Dataset
The dataset used for Task 3 in iDPP@CLEF 2024 is structured similarly to those from iDPP@CLEF 2023, though some features (e.g., evoked potentials, MRIs) were not included, and certain records have been filtered based on the purpose of the task.

4.2.1. Updates over iDPP@CLEF 2023
In the 2024 dataset, EDSS data before January 1, 2013 (aligned with the start of environmental data collection) were filtered out, and patients without EDSS follow-ups were removed. Additionally, patients who did not experience a relapse after their first non-filtered EDSS follow-up (i.e., the baseline for each patient) were excluded.

The dataset has been expanded to incorporate environmental data, which includes information on patients' exposure to various air pollutants identified as significant public health risks in the latest World Health Organization (WHO) global air quality guidelines [8]: particulate matter (PM), encompassing both PM2.5 (particles with an aerodynamic diameter of 2.5 micrometers or less) and PM10 (particles with an aerodynamic diameter of 10 micrometers or less), as well as ozone (O3), nitrogen dioxide (NO2), sulfur dioxide (SO2), and carbon monoxide (CO), plus several weather factors (including wind speed, relative humidity, sea level pressure, global radiation, precipitation, and average, minimum, and maximum temperatures).

Air pollutant data from public monitoring stations were collected daily from the European Air Quality Portal using the DiscoMap tool (https://discomap.eea.europa.eu/Index). The geographical coordinates (longitude and latitude) of each monitoring station were matched to specific postcodes, identifying the nearest station to each patient's residence postcode. Weather data, instead, were gathered daily from the European Climate Assessment and Dataset station network, which provides access to the E-OBS dataset, a daily gridded land-only observational dataset over Europe (https://www.ecad.eu/download/ensembles/download.php). Each grid cell was matched with the nearest monitoring station using the Euclidean distance between geographical coordinates. This approach ensured that air pollution and weather data were aligned with the same spatial and temporal granularity. Daily environmental measurements were aggregated into weekly averages from each patient's baseline. As additional features, the number of days per week spent over the respective WHO recommended air quality guideline level for short-term (24-hour) exposure was computed for each air pollutant [8].

Finally, a subset of 380 MS patients from the Turin and Pavia research centers was selected for Task 3 in iDPP@CLEF 2024, compared to 550 patients for Task 1 and 638 for Task 2 in iDPP@CLEF 2023. The resulting MS dataset (described at https://brainteaser.dei.unipd.it/challenges/idpp2024/assets/other/ms/ms-variables-description.txt) includes static variables with demographic and clinical information, EDSS scores with corresponding Functional System (FS) sub-scores, environmental measurements, and the outcome time, representing the week of the first relapse occurrence after the baseline for each patient. EDSS follow-ups are reported between the baseline and the outcome time, while environmental measurements span from January 1, 2013, to December 30, 2023. It is important to note that environmental data may have gaps due to availability. When considering only environmental data preceding the outcome time, the median number of weeks available for each patient is 59, with an interquartile range of 103.25 weeks. The distributions of air pollutant concentrations (measured in micrograms per cubic meter), averaged across patients over these weeks, are depicted in the boxplots of Figure 1, where the red stars indicate the WHO recommended air quality guideline levels for 24-hour exposure [8].

Figure 1: Boxplots of weekly average air pollutant concentrations across patients. Red stars represent the World Health Organization (WHO) recommended air quality guideline levels for 24-hour exposure.
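As an illustration of the weekly aggregation and exceedance features just described, the following pandas sketch operates on a hypothetical daily table with columns patient_id, baseline, date, and one column per pollutant. The guideline values below are indicative only; the authoritative 24-hour levels are those published in [8].

```python
import pandas as pd

# Indicative WHO 2021 24-hour guideline levels (micrograms per cubic meter);
# see [8] for the authoritative values. Column names are assumptions.
WHO_24H = {"pm2_5": 15.0, "pm10": 45.0, "no2": 25.0, "so2": 40.0, "o3": 100.0}

def weekly_environmental_features(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily pollutant measurements into weekly averages and counts
    of days above the WHO 24-hour guideline, with weeks counted from each
    patient's baseline visit."""
    df = daily.copy()
    df["week"] = (df["date"] - df["baseline"]).dt.days // 7  # week 0 = baseline week
    aggregations = {}
    for pollutant, limit in WHO_24H.items():
        aggregations[f"{pollutant}_weekly_avg"] = (pollutant, "mean")
        aggregations[f"{pollutant}_days_over"] = (
            pollutant, lambda s, lim=limit: int((s > lim).sum()))
    return df.groupby(["patient_id", "week"]).agg(**aggregations).reset_index()
```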
4.2.2. Split into training and test
The dataset was split into a training set (70%) and a test set (30%), with subjects stratified by outcome time to ensure an even distribution across both sets. The distributions of static data, including demographic and clinical information, and of the EDSS were verified to be similar in the training and test sets. Additionally, since environmental exposure is considered, the distribution of patients from the two clinical centres and of their residence classification (Cities, Rural Areas, and Towns) was checked to be balanced. Statistical tests, including the Kruskal-Wallis test for continuous variables and the Chi-squared test for categorical and ordinal variables, were performed to assess the appropriateness of the stratification. Special attention was given to sparsely observed levels in categorical variables, to ensure that rare levels appeared only in the training set, if at all. Table 2 provides a comparison of variable distributions between the training and test sets, confirming that the split meets best-practice quality standards.

Table 2: Comparison between training and test populations for the MS task. Continuous variables are presented as median (interquartile range); categorical variables as count (percentage on available data), for each level.

Variable | Level | Training | Test
Sex | Female | 148 (74.37%) | 54 (66.67%)
Sex | Male | 51 (25.63%) | 27 (33.33%)
Ethnicity | Caucasian | 181 (90.96%) | 77 (95.06%)
Ethnicity | Hispanic | 2 (1.00%) | -
Ethnicity | Black African | 2 (1.00%) | -
Ethnicity | NA | 14 (7.04%) | 4 (4.94%)
Residence classification | Cities | 53 (26.63%) | 20 (24.69%)
Residence classification | Rural Area | 52 (26.13%) | 22 (27.16%)
Residence classification | Towns | 94 (47.24%) | 39 (48.15%)
Centre | Pavia | 129 (64.82%) | 58 (71.61%)
Centre | Turin | 70 (35.18%) | 23 (28.39%)
Occurrence of MS in pediatric age | FALSE | 176 (88.44%) | 77 (95.06%)
Occurrence of MS in pediatric age | TRUE | 23 (11.56%) | 4 (4.94%)
Age at onset | median (IQR) | 28 (22-36) | 30 (24-34)
Age at baseline | median (IQR) | 38 (31-47) | 38 (33-47)
Diagnostic delay | median (IQR) | 12 (4-47) | 12 (3-28)
Spinal cord symptom | FALSE | 143 (71.86%) | 54 (66.67%)
Spinal cord symptom | TRUE | 56 (28.14%) | 27 (33.33%)
Brainstem symptom | FALSE | 146 (73.37%) | 57 (70.37%)
Brainstem symptom | TRUE | 53 (26.63%) | 24 (29.63%)
Eye symptom | FALSE | 148 (74.37%) | 59 (72.84%)
Eye symptom | TRUE | 51 (25.63%) | 22 (27.16%)
Supratentorial symptom | FALSE | 140 (70.35%) | 50 (61.73%)
Supratentorial symptom | TRUE | 59 (29.65%) | 31 (38.27%)
Other symptoms | FALSE | 197 (99.00%) | 80 (98.77%)
Other symptoms | Sensory | 1 (0.50%) | 1 (1.23%)
Other symptoms | Epilepsy | 1 (0.50%) | -
EDSS | median (IQR) | 2.0 (1.5-3.0) | 2.0 (1.5-3.5)
EDSS | NA | 3 (0.36%) | 0 (0.00%)
Outcome time | median (IQR) | 59 (24-122) | 53 (25-130)
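A minimal sketch of such distribution checks, assuming the training and test sets are available as pandas DataFrames and that the variable lists are supplied by the caller, could look as follows; this is illustrative, not the code used by the organizers.

```python
import pandas as pd
from scipy.stats import chi2_contingency, kruskal

def split_check_pvalues(train: pd.DataFrame, test: pd.DataFrame,
                        continuous: list[str],
                        categorical: list[str]) -> dict[str, float]:
    """Kruskal-Wallis p-values for continuous variables and Chi-squared
    p-values for categorical ones; large p-values indicate that the two
    splits have similar distributions."""
    pvalues = {}
    for var in continuous:
        _, p = kruskal(train[var].dropna(), test[var].dropna())
        pvalues[var] = p
    for var in categorical:
        values = pd.concat([train[var], test[var]], ignore_index=True)
        split = pd.Series(["train"] * len(train) + ["test"] * len(test))
        _, p, _, _ = chi2_contingency(pd.crosstab(values, split))
        pvalues[var] = p
    return pvalues
```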
5. Lab Setup and Participation
In the remainder of this section, we detail the guidelines the participants had to comply with to submit their runs and the submissions received by iDPP@CLEF.

5.1. Guidelines
Participating teams had to satisfy the following guidelines:
• Runs should be submitted in the textual format described below;
• Each group can submit a maximum of 30 runs for each of Tasks 1, 2, and 3.

5.1.1. Task 1 Run Format
Runs should be submitted as a text file (.txt) with the following format:

10061925618906738677 1 2 3 4 1 2 3 4 1 2 3 4 upd_T1_myDesc
10160033396142711519 1 2 3 4 1 2 3 4 1 2 3 4 upd_T1_myDesc
10287479530859953248 1 2 3 4 1 2 3 4 1 2 3 4 upd_T1_myDesc
12398828804459792214 1 2 3 4 1 2 3 4 1 2 3 4 upd_T1_myDesc
10038199677222038201 1 2 3 4 1 2 3 4 1 2 3 4 upd_T1_myDesc
...

where:
• Columns are separated by a white space;
• The first column is the patient ID, a hashed version of the original patient ID (to be treated just as a string);
• Columns 2 to 13 represent the predicted ALSFRS-R sub-scores. Each column corresponds to an ALSFRS-R question (e.g., column 2 to Q1, column 3 to Q2, and so on). Each value is expected to be an integer in the range [0, 4];
• The last column is the run identifier, according to the format described below. It must uniquely identify the participating team and the submitted run.

It is important to include all the columns and to have a white-space delimiter between the columns. No specific ordering is expected among patients (rows) in the submission file.

5.1.2. Task 2 Run Format
Runs should be submitted as a text file (.txt) with the following format:

10061925618906738677 1 2 3 4 1 2 3 4 1 2 3 4 upd_T2_myDesc
10160033396142711519 1 2 3 4 1 2 3 4 1 2 3 4 upd_T2_myDesc
10287479530859953248 1 2 3 4 1 2 3 4 1 2 3 4 upd_T2_myDesc
12398828804459792214 1 2 3 4 1 2 3 4 1 2 3 4 upd_T2_myDesc
10038199677222038201 1 2 3 4 1 2 3 4 1 2 3 4 upd_T2_myDesc
...

where:
• Columns are separated by a white space;
• The first column is the patient ID, a hashed version of the original patient ID (to be treated just as a string);
• Columns 2 to 13 represent the predicted self-assessed sub-scores. Each column corresponds to an ALSFRS-R question (e.g., column 2 to Q1, column 3 to Q2, and so on). Each value is expected to be an integer in the range [0, 4];
• The last column is the run identifier, according to the format described below. It must uniquely identify the participating team and the submitted run.

It is important to include all the columns and to have a white-space delimiter between the columns. No specific ordering is expected among patients (rows) in the submission file.

5.1.3. Task 3 Run Format
Runs should be submitted as a text file (.txt) with the following format:

10061925618906738677 10 upd_T3_myDesc
10160033396142711519 47 upd_T3_myDesc
10287479530859953248 13 upd_T3_myDesc
12398828804459792214 1 upd_T3_myDesc
10038199677222038201 9 upd_T3_myDesc
...

where:
• Columns are separated by a white space;
• The first column is the patient ID, a hashed version of the original patient ID (to be treated just as a string);
• The second column is the predicted week at which the first relapse after the baseline happens. The value is expected to be an integer starting from 1;
• The third column is the run identifier, according to the format described below. It must uniquely identify the participating team and the submitted run.

It is important to include all the columns and to have a white-space delimiter between the columns. No specific ordering is expected among patients (rows) in the submission file.

5.1.4. Submission Upload
Runs should be uploaded to the repository provided by the organizers. Following the repository structure discussed above, for example, a run submitted for the first task should be included in submission/task1. Runs should be uploaded using the following naming convention for their identifiers: <teamname>_T<1|2|3>_<freefield>, where:
• teamname is the name of the participating team;
• T<1|2|3> is the identifier of the task the run is submitted to, e.g., T1 for Task 1;
• freefield is a free field that participants can use as they prefer to further distinguish among their runs. Please keep it short and informative.

For example, a complete run identifier may look like upd_T1_myDesc, where:
• upd is the University of Padua team;
• T1 means that the run is submitted for Task 1;
• myDesc provides a brief description of the run.

The name of the text file containing the run must be the identifier of the run followed by the .txt extension, e.g., upd_T1_myDesc.txt in the above example.
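As a convenience, a Task 1 or Task 2 submission can be checked locally with a small script along the following lines. This is an unofficial sketch: the exact character sets allowed in run identifiers are an assumption, and Task 3 files would need the analogous three-column check.

```python
import re
import sys

# <teamname>_T<1|2|3>_<freefield>; lowercase "t" also appears in practice.
RUN_ID = re.compile(r"^[A-Za-z0-9.\-]+_[Tt][123]_[\w\-]+$")

def validate_run_file(path: str) -> list[str]:
    """Check a Task 1/2 run file: 14 whitespace-separated columns per row,
    twelve integer sub-scores in [0, 4], and a well-formed run identifier."""
    errors = []
    with open(path) as run_file:
        for row, line in enumerate(run_file, start=1):
            cols = line.split()
            if len(cols) != 14:
                errors.append(f"row {row}: expected 14 columns, found {len(cols)}")
                continue
            for col, value in enumerate(cols[1:13], start=2):
                if not (value.isdigit() and 0 <= int(value) <= 4):
                    errors.append(f"row {row}, column {col}: {value!r} is not an integer in [0, 4]")
            if not RUN_ID.match(cols[13]):
                errors.append(f"row {row}: malformed run identifier {cols[13]!r}")
    return errors

if __name__ == "__main__":
    problems = validate_run_file(sys.argv[1])
    print("\n".join(problems) if problems else "OK")
```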
5.2. Participants
A total of 28 teams registered for iDPP@CLEF 2024, out of which eight teams were able to submit at least one run in at least one task. Table 3 reports the details of the teams that managed to submit at least one run, while Table 4 outlines which tasks each team participated in and how many runs they submitted. In total, 97 runs were submitted to iDPP@CLEF 2024. The task with the highest participation was Task 1, with 59 runs and six participating teams. Task 2 followed, with 31 runs submitted by five different teams. Finally, only two teams participated in Task 3, with a total of 7 runs submitted. The most prolific participant was UNIPD, with a total of 20 runs.

Table 3: Teams participating in iDPP@CLEF 2024.

Team Name | Affiliation | Country | Repository | Paper
BIT.UA | IEETA/DETI, LASI, University of Aveiro | Portugal | https://bitbucket.org/brainteaser-health/idpp2024-bitua | Silva and Oliveira [9]
CompBiomedUniTO | University of Torino | Italy | https://bitbucket.org/brainteaser-health/idpp2024-compbiomedunito | Barducci et al. [10]
FCOOL | LASIGE, Faculty of Sciences, University of Lisbon | Portugal | https://bitbucket.org/brainteaser-health/idpp2024-fcool | Martins et al. [11]
iDPPExplorers | Georgia Institute of Technology, Atlanta, GA | United States | https://bitbucket.org/brainteaser-health/idpp2024-idppexplorers | Mehta et al. [12]
Mandatory | University of Bucharest | Romania | https://bitbucket.org/brainteaser-health/idpp2024-mandatory | -
Stefagroup | University of Pavia, BMI lab "Mario Stefanelli" | Italy | https://bitbucket.org/brainteaser-health/idpp2024-stefagroup | Bosoni et al. [13]
UBCS | University of Botswana | Botswana | https://bitbucket.org/brainteaser-health/idpp2024-ubcs | Okere et al. [14]
UNIPD | University of Padova | Italy | https://bitbucket.org/brainteaser-health/idpp2024-unipd | Marinello et al. [15]

Table 4: Number of runs submitted by each participating team in iDPP@CLEF 2024.

Team | Task 1 (ALS) | Task 2 (ALS) | Task 3 (MS) | Total
BIT.UA | 7 | 7 | - | 14
CompBiomedUniTO | 1 | 1 | - | 2
FCOOL | 9 | 9 | - | 18
iDPPExplorers | 15 | - | - | 15
Mandatory | 19 | - | - | 19
Stefagroup | - | - | 3 | 3
UBCS | - | 6 | - | 6
UNIPD | 8 | 8 | 4 | 20
Total | 59 | 31 | 7 | 97

6. Evaluation Measures
In both Tasks 1 and 2, the prediction targets were the future scores of the ALSFRS-R evaluation, which are integers in the [0, 4] range. Since the scores are discrete, we could have framed the predictive task as a classification problem. However, we opted for a regression problem to be able to penalize larger errors more (e.g., with a target score of 3, predicting 1 should be worse than predicting 2). Task 3, where the target was the week of the relapse, was also framed quite naturally as a regression task for similar reasons. Thus, we evaluated all tasks using the same two state-of-the-art evaluation measures for assessing the performance of regression models: the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE). The formulas for RMSE and MAE are shown in Equation 1 and Equation 2, respectively, where n represents the number of observations, y_i is the actual value of the dependent variable for the i-th observation, and \hat{y}_i is the predicted value of the dependent variable for the i-th observation.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (1)

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|    (2)

Both metrics describe the performance of a model in an interpretable manner, since their units are the same as those of the target variable (e.g., weeks); together, they provide a comprehensive evaluation of the three prediction tasks, with smaller values indicating better results. The RMSE measures how much, on average, the model's predictions deviate from the actual values; this statistical index ranges from 0 to ∞. By squaring the errors before averaging them, RMSE gives higher weight to large errors. MAE represents the average absolute difference between actual and predicted values. Unlike RMSE, MAE treats all errors equally, regardless of their magnitude: it therefore provides a clear representation of the average error and is less sensitive to outliers, but does not emphasize large errors as much as RMSE.
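Both measures, together with the bootstrap procedure later used to attach standard deviations to them in Tables 5 and 6, can be written compactly as follows. This is a sketch consistent with Equations 1 and 2, not the official scoring code.

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root Mean Square Error (Equation 1)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Error (Equation 2)."""
    return float(np.mean(np.abs(y_true - y_pred)))

def bootstrap_std(metric, y_true, y_pred, n_boot=1000, seed=0) -> float:
    """Standard deviation of a metric over n_boot resamples of the test set."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        values.append(metric(y_true[idx], y_pred[idx]))
    return float(np.std(values))
```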
7. Results
For each task, we report the analysis of the performance of the runs submitted by the Lab's participants, according to the measures described in Section 6.

7.1. Task 1: Predicting ALSFRS-R Score from Sensor Data (ALS)
Clinicians monitor ALS progression through frequent visits, typically every two to three months, to promptly detect any worsening of symptoms. Consequently, ALSFRS-R scores usually remain fairly stable between these appointments, making the most recent score a reliable predictor for the next assessment. While some deterioration in at least one score is not uncommon, using the last observed value as the prediction is both simple and effective, as most scores will not change. This approach, which we will call "naive" since it does not use sensor data, is particularly useful for the bulbar and respiratory scores, which show more stability in the challenge dataset and for which sensor data might not be as effective in detecting eventual changes. The distribution of ALSFRS-R scores and the amount of worsening between consecutive visits in the training set are shown in Figure 2.

Figure 2: Distribution and average ratio of worsening between two consecutive ALSFRS-R evaluations in the training set, for each score.
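This naive strategy is a Last Observation Carried Forward (LOCF) predictor and fits in a few lines; the input layout below (one list of per-visit score vectors per patient) is a hypothetical illustration.

```python
import numpy as np

def naive_prediction(history: dict[str, list[np.ndarray]]) -> dict[str, np.ndarray]:
    """Last Observation Carried Forward: for each patient, predict that the
    twelve ALSFRS-R sub-scores at the next evaluation equal the most recently
    observed ones, ignoring the sensor data entirely."""
    return {patient: visits[-1].copy() for patient, visits in history.items()}
```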
Four teams (iDPPExplorers, Mandatory, FCOOL, and UNIPD) employed this strategy in one of their runs for Task 1, achieving the lowest errors with both metrics (0.20 MAE and 0.49 RMSE) and securing joint first place. The full error scores and rankings for all submitted runs are reported in Table 5. Note that other runs, which also utilize sensor data, demonstrate performance very close to the first place. Due to the small size of the test set, error estimates exhibit large standard deviations, making it impossible to assert significant differences among the top scores. The rankings are obtained considering the average performance over all twelve ALSFRS-R scores and show that the naive predictors propagating the last observed score are globally optimal. However, this is not the case for each single ALSFRS-R score, where other runs often have lower errors, as can be seen in Figure 3. Again, given the small size of the test sets, these differences in performance are not statistically significant. However, it is also reasonable that the data collected by the sensors can be more helpful in prediction for some scores than for others: this is especially evident for Q9 and Q11 in Task 1 and for Q4 and Q12 in Task 2.

Figure 3: Minimum MAE reached by naive runs versus non-naive runs for each score in Tasks 1 and 2. Naive runs are those that use the last observed values as their prediction.

7.2. Task 2: Predicting Patient Self-assessment Score from Sensor Data (ALS)
Task 2 is very similar to Task 1, with several teams employing the same methods as they did for Task 1. However, in Task 2, the ALSFRS-R assessments by patients are less regular in timing and less consistent in scoring compared to assessments by clinicians, although they are generally more closely spaced. The predict-the-last-scores approach remains the top performer, albeit with slightly higher errors (0.29 MAE and 0.58 RMSE), placing the UNIPD and FCOOL teams in joint first place again. Full results are reported in Table 6.

7.3. Task 3: Predicting Relapses from EDSS Sub-scores and Environmental Data (MS)
Table 7 displays the RMSE and MAE scores for all submissions made for Task 3, with consistent ranking positions across both metrics. Additionally, the scatter plot in Figure 4 offers a visual representation of the performance of all submitted runs, where the x-axis denotes actual values and the y-axis represents predicted values; ideally, perfect predictions would result in points aligning along a straight line with a slope of 1. The top-performing strategy is the UNIPV_t3_rf run by the Stefagroup (UNIPV) team [13], which employs a Random Forest (RF) model after thorough preprocessing stages. Regarding the adoption of environmental features, it is notable that all submissions from the Stefagroup team incorporate environmental variables for relapse prediction. In contrast, the UNIPD team offers methods both with and without the inclusion of environmental variables, achieving their best results with the UNIPD_t3_ridge_noenv run, which excludes environmental variables [15].

7.4. Approaches
In this section, we provide a short summary of the approaches adopted by participants in iDPP@CLEF. There are two separate sub-sections: one for Tasks 1 and 2, focused on ALS progression prediction, and one for Task 3, which concerns MS relapse prediction using environmental data.

7.4.1. Tasks 1 and 2
Silva and Oliveira [9] (Team BIT.UA) focus on Tasks 1 and 2. Their proposed approaches employ machine learning techniques that rely on RF ensembles. They observed that the most effective solutions are based on temporal analysis, with the maximization strategy being the top-performing approach. Additionally, they emphasize the importance of proper handling of missing data. The authors noted inconsistent performance across the two tasks.
Specifically, their approaches tended to be more effective on Task 1, while performance on Task 2 was less satisfactory. Silva and Oliveira attribute this behavior to the variability of the underlying data: Task 1 data, produced by clinicians, was more stable, whereas Task 2 data, produced directly by patients, appeared to be less stable.

Table 5: Task 1 results. For both MAE and RMSE, results are reported as the average error across all twelve ALSFRS-R scores, plus or minus the average standard deviation (computed by bootstrapping the test set one thousand times), with the respective rankings.

team | run | MAE | RMSE
fcool | locf | 0.20±0.20 (#1) | 0.49±0.20 (#1)
idppexplorers | naive | 0.20±0.22 (#1) | 0.49±0.22 (#1)
unipd | hold | 0.20±0.21 (#1) | 0.49±0.21 (#1)
mandatory | d1 | 0.20±0.19 (#1) | 0.49±0.19 (#1)
idppexplorers | EN | 0.22±0.17 (#2) | 0.50±0.17 (#2)
CBMUnito | RF-MonoWindow | 0.23±0.19 (#3) | 0.52±0.19 (#3)
bitua | ensemble-max | 0.25±0.18 (#4) | 0.54±0.18 (#4)
bitua | temporalAnalysis | 0.29±0.24 (#5) | 0.61±0.24 (#6)
unipd | average | 0.33±0.18 (#6) | 0.60±0.18 (#5)
unipd | logistic-ALSFRS | 0.34±0.21 (#7) | 0.64±0.21 (#8)
fcool | RFClassifier | 0.35±0.22 (#8) | 0.68±0.22 (#15)
unipd | rf | 0.36±0.22 (#9) | 0.65±0.22 (#11)
idppexplorers | voting | 0.37±0.15 (#10) | 0.65±0.15 (#10)
bitua | moremetrics | 0.37±0.23 (#10) | 0.68±0.23 (#16)
mandatory | 12hist14 | 0.37±0.19 (#11) | 0.65±0.19 (#9)
unipd | rf-reg | 0.37±0.19 (#12) | 0.64±0.19 (#7)
mandatory | 1hist09 | 0.38±0.31 (#13) | 0.72±0.31 (#30)
bitua | median | 0.38±0.23 (#14) | 0.70±0.23 (#20)
fcool | 2nd-best-both-metrics | 0.39±0.26 (#15) | 0.71±0.26 (#25)
bitua | mean | 0.39±0.26 (#15) | 0.71±0.26 (#21)
mandatory | 1hist05 | 0.39±0.20 (#16) | 0.66±0.20 (#12)
unipd | ridge | 0.39±0.20 (#17) | 0.69±0.20 (#17)
idppexplorers | gb | 0.40±0.18 (#18) | 0.69±0.18 (#18)
mandatory | 1hist04 | 0.40±0.26 (#18) | 0.66±0.26 (#13)
mandatory | 12hist10 | 0.41±0.23 (#19) | 0.67±0.23 (#14)
unipd | optrun | 0.41±0.19 (#20) | 0.71±0.19 (#22)
idppexplorers | svm | 0.41±0.23 (#20) | 0.75±0.23 (#33)
fcool | best-both-metrics | 0.41±0.22 (#20) | 0.71±0.22 (#26)
mandatory | 12hist13 | 0.42±0.24 (#21) | 0.72±0.24 (#28)
bitua | ensemble-avg | 0.42±0.23 (#22) | 0.71±0.23 (#24)
idppexplorers | lr | 0.42±0.20 (#23) | 0.73±0.20 (#32)
mandatory | 1hist03 | 0.42±0.24 (#24) | 0.69±0.24 (#19)
mandatory | 12hist11 | 0.43±0.28 (#25) | 0.72±0.28 (#27)
fcool | 3rd-best-both-metrics | 0.43±0.26 (#25) | 0.78±0.26 (#39)
mandatory | d0 | 0.44±0.14 (#26) | 0.72±0.14 (#29)
mandatory | 1hist08 | 0.44±0.26 (#27) | 0.71±0.26 (#23)
idppexplorers | et | 0.44±0.24 (#27) | 0.78±0.24 (#36)
idppexplorers | dt | 0.44±0.22 (#28) | 0.72±0.22 (#31)
idppexplorers | knn | 0.46±0.19 (#29) | 0.77±0.19 (#35)
bitua | ensemble-min | 0.47±0.30 (#30) | 0.80±0.30 (#40)
idppexplorers | bestModels | 0.47±0.24 (#31) | 0.81±0.24 (#42)
idppexplorers | lstm | 0.48±0.27 (#32) | 0.82±0.27 (#43)
mandatory | 1hist07 | 0.48±0.21 (#33) | 0.75±0.21 (#34)
mandatory | 1hist02 | 0.48±0.32 (#34) | 0.78±0.32 (#37)
idppexplorers | nn | 0.49±0.24 (#35) | 0.80±0.24 (#41)
mandatory | 1hist06 | 0.49±0.29 (#36) | 0.78±0.29 (#38)
idppexplorers | rf | 0.51±0.29 (#37) | 0.86±0.29 (#47)
fcool | LogisticRegression | 0.51±0.28 (#38) | 0.84±0.28 (#46)
idppexplorers | bagging | 0.51±0.35 (#39) | 0.89±0.35 (#49)
unipd | logistic | 0.51±0.27 (#40) | 0.83±0.27 (#45)
fcool | SVC | 0.54±0.34 (#41) | 0.89±0.34 (#48)
fcool | XGBClassifier | 0.57±0.15 (#42) | 0.83±0.15 (#44)
fcool | majority-class | 0.66±0.52 (#43) | 1.09±0.52 (#50)

Barducci et al. [10] (Team CompBiomedUniTO) tested different approaches to preselect the sensor features to be fed to an RF classifier. The first solution exploits the mono-window approach, which keeps only sensor data recorded within seven days before the considered questionnaire.
The other approach instead considers two windows: the first window is the same as before, while the second window considers sensor data recorded when the previously available questionnaire occurred. The second approach aims to provide the model with more information about the changes over time. However, the irregularity of sensor data penalizes the two-window approach: indeed, 20 out of 54 patients did not have two 7-day periods with a minimum of three days of sensor data. As a result, only the model using the mono-window approach was submitted. In general, the results vary significantly depending on the questionnaire and show better performance on the first task. The lower error in Task 1 may be due to the questionnaire being completed by clinical staff, whose responses are typically more reliable and objective compared to the subjective opinions provided by patients. To address this issue, data augmentation is proposed as a possible solution to increase the number of questionnaires in the training set; in this way, deep learning models could be tested to improve predictions and leverage longer sensor data sequences.

Table 6: Task 2 results. For both MAE and RMSE, results are reported as the average error across all twelve ALSFRS-R scores, plus or minus the average standard deviation (computed by bootstrapping the test set one thousand times), with the respective rankings.

team | run | MAE | RMSE
fcool | locf | 0.29±0.15 (#1) | 0.58±0.15 (#1)
unipd | hold | 0.29±0.15 (#1) | 0.58±0.15 (#1)
CBMUnito | RF-MonoWindow | 0.31±0.16 (#2) | 0.60±0.16 (#2)
bitua | ensemble-max | 0.33±0.14 (#3) | 0.61±0.14 (#3)
bitua | moremetrics | 0.37±0.17 (#4) | 0.65±0.17 (#4)
bitua | mean | 0.39±0.18 (#5) | 0.71±0.18 (#8)
bitua | median | 0.40±0.21 (#6) | 0.69±0.21 (#5)
fcool | 2nd-best-both-metrics | 0.41±0.15 (#7) | 0.71±0.15 (#6)
bitua | ensemble-avg | 0.42±0.22 (#8) | 0.71±0.22 (#7)
bitua | idpp2024-bitua | 0.43±0.24 (#9) | 0.72±0.24 (#9)
unipd | average | 0.49±0.20 (#10) | 0.78±0.20 (#12)
fcool | 3rd-best-both-metrics | 0.50±0.13 (#11) | 0.78±0.13 (#10)
unipd | logistic-ALSFRS | 0.50±0.19 (#11) | 0.85±0.19 (#18)
bitua | ensemble-min | 0.50±0.24 (#12) | 0.82±0.24 (#14)
unipd | rf | 0.52±0.20 (#13) | 0.78±0.20 (#11)
unipd | rf-reg | 0.52±0.12 (#14) | 0.82±0.12 (#13)
fcool | best-both-metrics | 0.53±0.20 (#15) | 0.84±0.20 (#15)
fcool | RFClassifier | 0.53±0.24 (#16) | 0.85±0.24 (#17)
unipd | ridge | 0.55±0.27 (#17) | 0.85±0.27 (#16)
fcool | LogisticRegression | 0.57±0.21 (#18) | 0.89±0.21 (#19)
fcool | XGBClassifier | 0.59±0.17 (#19) | 0.93±0.17 (#20)
unipd | optrun | 0.61±0.27 (#20) | 0.96±0.27 (#21)
unipd | logistic | 0.66±0.29 (#21) | 0.99±0.29 (#22)
fcool | SVC | 0.67±0.19 (#22) | 1.01±0.19 (#23)
ubcs | features100 | 0.82±0.43 (#23) | 1.20±0.43 (#26)
ubcs | featuresall | 0.89±0.41 (#24) | 1.25±0.41 (#27)
ubcs | features10 | 0.94±0.49 (#25) | 1.33±0.49 (#28)
ubcs | features25 | 0.96±0.21 (#26) | 1.14±0.21 (#24)
ubcs | features20 | 1.02±0.24 (#27) | 1.18±0.24 (#25)
fcool | majority-class | 1.03±0.44 (#28) | 1.47±0.44 (#29)
ubcs | features50 | 1.11±0.51 (#29) | 1.51±0.51 (#30)

Table 7: MAE and RMSE results (with the respective rankings) for all the submitted runs for Task 3.

Team | Run | MAE | RMSE
Stefagroup | UNIPV_t3_rf | 22.49 (#1) | 41.52 (#1)
Stefagroup | UNIPV_t3_lmer_first | 28.05 (#2) | 48.07 (#2)
Stefagroup | UNIPV_t3_lmer_last | 47.74 (#3) | 72.51 (#3)
UNIPD | UNIPD_t3_ridge_noenv | 61.37 (#4) | 78.62 (#4)
UNIPD | UNIPD_t3_average | 65.80 (#5) | 79.26 (#5)
UNIPD | UNIPD_t3_rf_reg | 66.63 (#6) | 79.74 (#6)
UNIPD | UNIPD_t3_ridge | 68.59 (#7) | 89.84 (#7)

Figure 4: Actual versus predicted values for each run submitted for Task 3.

Martins et al. [11] (Team FCOOL) proposed a methodology consisting of independent multi-class models, each predicting a distinct ALSFRS-R question.
The authors tested four classification models: Logistic Regression, RF, XGBoost, and Support Vector Machine. To manage sensor data, they first derived static features from the longitudinal data via summarization techniques, and then reduced the feature set using three methods: top-k selection across questions, top-k selection by question, and biclustering. In both tasks, RF achieved the top performance among the considered models, but failed to outperform the Last Observation Carried Forward (LOCF) baseline, except for a few individual questions. Moreover, no consensus was found about the best feature selection or extraction approach: top-k selection by question was the best approach in Task 1, while biclustering was the best in Task 2.

Mehta et al. [12] (Team iDPPExplorers) submitted runs only for Task 1, but analyze the approaches for Task 2 in their working notes paper. Their work focuses on handling the temporal aspect of the sensor data, by studying how to compress it via statistical methods that provide interpretability. Among the set of approaches tested in their work, Mehta et al. observe that the optimal performance is achieved by both a naive baseline and ElasticNet regression. Nevertheless, the authors also observe that, despite the similar performance, the ElasticNet model is more robust and allows a better understanding of the contribution of the various features. While they did not take part in Task 2, they observed that the proposed approach is able to achieve better results on the self-assessed data provided by the patients. Finally, their conclusive remark hints that, while this preliminary analysis did not highlight any major benefit of using sensor data, a larger dataset with a more diverse set of patients might lead to different conclusions.

In Tasks 1 and 2, Marinello et al. [15] (Team UNIPD) developed a broad set of predictive models based on different methodological approaches using different subsets of the provided variables. The aim of their study was to evaluate whether considering wearable data to predict ALS disability leads to better performance with respect to models that only consider disease-specific variables collected during routine visits. They observe that collecting data from wearable devices can improve the prediction of ALS disability status. However, patients must be properly trained to use the sensors correctly, in order to acquire high-quality data leading to meaningful datasets. Otherwise, if the quality of the acquired wearable data is poor, predicting the next-visit ALSFRS-R score by simply holding the current one seems to be the better approach. This is especially true when predicting scores that are self-assigned by patients (Task 2), who seem to be more stable and conservative than their clinicians in the disability evaluation process over time.

Okere et al. [14] (Team UBCS) explore different deep-learning techniques to process the data, especially to handle missing values. In particular, the authors exploit auto-encoders and multiple imputation techniques to handle missing values, and use an RF algorithm to select relevant features. Subsequently, four deep neural networks, namely a Multi-Layer Perceptron (MLP), a Feed Forward Neural Network (FFNN), a Recurrent Neural Network (RNN), and a Long Short-Term Memory (LSTM) network, were trained to perform the two tasks. Experimental results revealed that ensemble predictive models, such as the XGBoost algorithm, show better performance than deep learning models.
The authors link the low performance of the models to the small size of the training data.

7.4.2. Task 3
Bosoni et al. [13] (Team Stefagroup) used Topological Data Analysis to compute personal exposure patterns and then employed two predictive approaches. The former relied on applying Linear Regression, RF, and XGBoost to the last follow-up data; the latter used Mixed-Effects modeling on longitudinal data from the first to the last follow-up. The results showed that incorporating environmental variables provides statistically significant information for predicting relapses. This outcome underlined the need for better methods to compute personal pollution exposure patterns, thereby enhancing the precision of MS progression predictions.

In Task 3, Marinello et al. [15] (Team UNIPD) developed a broad set of predictive models based on different methodological approaches using different subsets of the provided variables. The aim of their study was to evaluate whether considering environmental data to predict MS relapses leads to better performance with respect to models that only consider disease-specific variables collected during routine visits. They observe that environmental data can be beneficial for predicting the occurrence of MS relapses; however, better solutions should be explored to refine the data collection and variable extraction process, in order to obtain more precise and focused predictions.

8. Conclusions and Future Work
iDPP@CLEF 2024 is the third and last iteration of the iDPP@CLEF evaluation campaign. The focus of this evaluation campaign was on developing AI models capable of preemptively estimating the risk that patients affected by ALS and MS will need medical support, and of describing the progression of their disease, to foster patient stratification and aid clinicians in providing due care in the most effective and rapid way. iDPP@CLEF 2024 operated in continuity with iDPP@CLEF 2022 and iDPP@CLEF 2023, expanding previously proposed tasks but also identifying novel ones. In particular, iDPP@CLEF 2024 was organized into three tasks. The first two tasks focused on predicting the ALSFRS-R scores of patients affected by ALS, using data collected via wearable devices and sensors. This makes iDPP@CLEF 2024 the first edition to make use of prospective data collected from patients currently involved in the BRAINTEASER project. The third task built upon the results of iDPP@CLEF 2023 by focusing on the prediction of the disease progression of patients affected by MS; in more detail, it focused on predicting when an MS patient will experience a relapse. As an improvement over the previous iDPP edition, this year participants were also provided with environmental data that could be used to improve the AI models.

In terms of participation, 28 teams registered for the Lab, suggesting overall interest in the topic from the research community, and 8 teams were able to submit their results, for a total of 97 submitted runs. The task that received the most interest was the first, with 59 submissions alone.

While this cycle concludes the iDPP@CLEF evaluation campaign, we envision several possible research paths for which iDPP@CLEF paved the way. First of all, novel and more effective AI approaches can be developed in the future, using iDPP@CLEF data as training and evaluation sets.
Secondly, iDPP@CLEF has identified several guidelines and good practices that can be adapted to devise novel shared tasks and evaluation campaigns in the future, whether concerning ALS and MS, other neurological diseases, or the medical domain at large.

Acknowledgments
The work reported in this paper has been partially supported by the BRAINTEASER project (https://brainteaser.health/, contract n. GA101017598), as a part of the European Union's Horizon 2020 research and innovation programme.

References
1. G. Birolo, P. Bosoni, G. Faggioli, H. Aidos, R. Bergamaschi, P. Cavalla, A. Chiò, A. Dagliati, M. de Carvalho, G. Di Nunzio, P. Fariselli, J. García Dominguez, M. Gromicho, A. Guazzo, E. Longato, S. Madeira, U. Manera, S. Marchesin, L. Menotti, G. Silvello, E. Tavazzi, E. Tavazzi, I. Trescato, M. Vettoretti, B. Di Camillo, N. Ferro, Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2024, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 15th International Conference of the CLEF Association, CLEF 2024, Grenoble, France, September 9-12, 2024, Proceedings, 2024.
2. J. M. Cedarbaum, N. Stambler, E. Malta, C. Fuller, D. Hilt, B. Thurmond, A. Nakanishi, The ALSFRS-R: a revised ALS functional rating scale that incorporates assessments of respiratory function, Journal of the Neurological Sciences 169 (1999) 13–21.
3. R. Küffner, N. Zach, R. Norel, J. Hawe, D. Schoenfeld, L. Wang, G. Li, L. Fang, L. Mackey, O. Hardiman, M. Cudkowicz, A. Sherman, G. Ertaylan, M. Grosse-Wentrup, T. Hothorn, J. van Ligtenberg, J. H. Macke, T. Meyer, B. Schölkopf, L. Tran, R. Vaughan, G. Stolovitzky, M. L. Leitner, Crowdsourced analysis of clinical trial data to predict amyotrophic lateral sclerosis progression, Nature Biotechnology 33 (2015) 51–57.
4. A. Guazzo, I. Trescato, E. Longato, E. Hazizaj, D. Dosso, G. Faggioli, G. M. Di Nunzio, G. Silvello, M. Vettoretti, E. Tavazzi, C. Roversi, P. Fariselli, S. C. Madeira, M. de Carvalho, M. Gromicho, A. Chiò, U. Manera, A. Dagliati, G. Birolo, H. Aidos, B. Di Camillo, N. Ferro, Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2022, in: A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), Lecture Notes in Computer Science (LNCS) 13390, Springer, Heidelberg, Germany, 2022, pp. 395–422.
5. A. Guazzo, I. Trescato, E. Longato, E. Hazizaj, D. Dosso, G. Faggioli, G. M. Di Nunzio, G. Silvello, M. Vettoretti, E. Tavazzi, C. Roversi, P. Fariselli, S. C. Madeira, M. de Carvalho, M. Gromicho, A. Chiò, U. Manera, A. Dagliati, G. Birolo, H. Aidos, B. Di Camillo, N. Ferro, Overview of iDPP@CLEF 2022: The Intelligent Disease Progression Prediction Challenge, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), CLEF 2022 Working Notes, CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-3180/, 2022, pp. 1130–1210.
6. G. Faggioli, A. Guazzo, S. Marchesin, L. Menotti, I. Trescato, H. Aidos, R. Bergamaschi, G. Birolo, P. Cavalla, A. Chiò, A. Dagliati, M. de Carvalho, G. M. Di Nunzio, P. Fariselli, J. M. García Dominguez, M. Gromicho, E. Longato, S. C. Madeira, U. Manera, G. Silvello, E. Tavazzi, E. Tavazzi, M. Vettoretti, B. Di Camillo, N. Ferro, Overview of iDPP@CLEF 2023: The Intelligent Disease Progression Prediction Challenge, in: M. Aliannejadi, G. Faggioli, N. Ferro,
M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 1123–1164. URL: https://ceur-ws.org/Vol-3497/paper-095.pdf.
7. G. Faggioli, A. Guazzo, S. Marchesin, L. Menotti, I. Trescato, H. Aidos, R. Bergamaschi, G. Birolo, P. Cavalla, A. Chiò, A. Dagliati, M. de Carvalho, G. M. Di Nunzio, P. Fariselli, J. M. García Dominguez, M. Gromicho, E. Longato, S. C. Madeira, U. Manera, G. Silvello, E. Tavazzi, E. Tavazzi, M. Vettoretti, B. Di Camillo, N. Ferro, Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2023, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 14th International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, September 18-21, 2023, Proceedings, volume 14163 of Lecture Notes in Computer Science, Springer, 2023, pp. 343–369. URL: https://doi.org/10.1007/978-3-031-42448-9_24. doi:10.1007/978-3-031-42448-9_24.
8. World Health Organization, WHO global air quality guidelines: Particulate matter (PM2.5 and PM10), ozone, nitrogen dioxide, sulfur dioxide and carbon monoxide, World Health Organization, Geneva, 2021.
9. J. Silva, J. Oliveira, BIT.UA at iDPP: Predictive analytics on ALS disease progression using sensor data with machine learning, in: CLEF 2024 Working Notes, 2024.
10. G. Barducci, F. Sartori, G. Birolo, T. Sanavia, P. Fariselli, ALSFRS-R score prediction for amyotrophic lateral sclerosis, in: CLEF 2024 Working Notes, 2024.
11. A. Martins, D. Amaral, E. Castanho, D. Soares, R. Branco, S. Madeira, H. Aidos, Predicting the functional rating scale and self-assessment status of ALS patients with sensor data, in: CLEF 2024 Working Notes, 2024.
12. R. Mehta, A. Pramov, S. Verma, Machine learning for ALSFRS-R score prediction: Making sense of the sensor data, in: CLEF 2024 Working Notes, 2024.
13. P. Bosoni, M. Vazifehdan, D. Pala, E. Tavazzi, R. Bergamaschi, R. Bellazzi, A. Dagliati, Predicting multiple sclerosis relapses using patient exposure trajectories, in: CLEF 2024 Working Notes, 2024.
14. C. Okere, E. Thuma, G. Mosweunyane, UBCS at iDPP: Predicting patient self-assessment score from sensor data using machine learning algorithms, in: CLEF 2024 Working Notes, 2024.
15. E. Marinello, A. Guazzo, E. Longato, E. Tavazzi, I. Trescato, M. Vettoretti, B. Di Camillo, Using wearable and environmental data to improve the prediction of amyotrophic lateral sclerosis and multiple sclerosis progression: an explorative study, in: CLEF 2024 Working Notes, 2024.