ALSFRS-R Score Prediction for Amyotrophic Lateral Sclerosis Notebook for the iDPP Lab on Intelligent Disease Progression Prediction at CLEF 2024

Guido Barducci (guido.barducci@unito.it), Dept. of Medical Sciences, Computational Biomedicine Unit, University of Turin, Turin, Italy

Flavio Sartori (flavio.sartori@unito.it), Dept. of Medical Sciences, Computational Biomedicine Unit, University of Turin, Turin, Italy

Giovanni Birolo (giovanni.birolo@unito.it), Dept. of Medical Sciences, Computational Biomedicine Unit, University of Turin, Turin, Italy

Tiziana Sanavia (tiziana.sanavia@unito.it), Dept. of Medical Sciences, Computational Biomedicine Unit, University of Turin, Turin, Italy

Piero Fariselli (piero.fariselli@unito.it), Dept. of Medical Sciences, Computational Biomedicine Unit, University of Turin, Turin, Italy

ISSN 1613-0073

Keywords: Machine Learning, ALS, ALSFRS-R

ORCID: 0009-0005-1052-8495 (G. Barducci), 0009-0004-3833-6551 (F. Sartori), 0000-0003-0160-9312 (G. Birolo), 0000-0003-3288-0631 (T. Sanavia), 0000-0003-1811-4762 (P. Fariselli)

Amyotrophic lateral sclerosis (ALS) is a progressive neurodegenerative disorder that results in the gradual deterioration of motor abilities, leading to difficulties in breathing, speaking, and swallowing, and ultimately to death, typically within a few years. The symptoms of ALS can vary significantly from one individual to another, affecting various bodily functions and areas. To assess this wide range of symptoms, the Amyotrophic Lateral Sclerosis Functional Rating Scale-Revised (ALSFRS-R) is used. Predicting the ALSFRS-R score is clinically relevant for personalizing patient monitoring. To address this need, the Intelligent Disease Progression Prediction (iDPP) challenge was organized, tasking participants with developing novel methods to predict these scores using non-invasive sensor data that monitor some individual characteristics. The competition included two tasks that differed only in the way the ALSFRS-R questionnaires were completed: either by medical staff (task 1) or by the patient (task 2). Given the limited number of patients in the training set, we used a relatively simple model, Random Forest, and preselected sensor features by retaining those most correlated with the outcome to be predicted. We selected the model with the lowest MAE estimated by cross-validation on the challenge training set. On the test set, our method attained an average Mean Absolute Error (MAE) of 0.234 and 0.311 and a Root Mean Square Error (RMSE) of 0.519 and 0.601 for tasks 1 and 2, respectively. Although the error may appear very low, this is because questionnaire values tend to remain constant from one visit to another, which makes prediction easier.

Introduction

Amyotrophic lateral sclerosis (ALS), also known as Lou Gehrig's disease, is a rapidly progressive and ultimately fatal neurological disease that affects the neurons controlling voluntary muscles in the arms, legs, and face. The yearly incidence of ALS is around 1 to 2.6 cases per 100,000 individuals, while the prevalence is approximately 6 cases per 100,000. ALS belongs to a group of motor neuron disorders and typically results in death. Previous studies report survival rates of approximately 48% and 24% at 3 and 5 years, respectively, with around 4% of patients surviving beyond 10 years. However, population-based studies show lower 5-year survival rates, ranging from 4% to 30% [1] [2].

The symptoms of this disease can vary greatly from case to case and can affect different functions and areas of the body. To describe this wide range of symptoms, the Amyotrophic Lateral Sclerosis Functional Rating Scale-Revised (ALSFRS-R) is employed. It consists of a 12-item inventory, with each item rated on a 0-4 scale by patients and/or caregivers, resulting in a maximum score of 48 points. The ALSFRS-R assesses patients' levels of self-sufficiency in areas including feeding, grooming, ambulation, and communication [1] [3].

Given the variability of this disease, the monitoring schedule should depend on its characteristics, such as the progression rate. Currently, there is no system capable of predicting the course of the disease, which makes it very challenging to personalize patient visits based on disease progression. The goal of this paper is to use information collected through the sensors of a commercial fitness smartwatch, past ALSFRS-R scores, and static features (such as age and sex) to predict future ALSFRS-R scores. The available questionnaire data come from two different sources: questionnaires can be filled out by a doctor or by the patient through a dedicated smartphone application. Therefore, the challenge has been divided into two tasks with the same goal but using data from different sources. The two sources differ in the time intervals between one questionnaire and the next, and in the judgment of who fills them out (clinician or patient), which may lead to different scores despite similar symptoms. To solve these tasks, classical machine learning models were used instead of deep learning, given the small number of patients in the training dataset. For more details we refer the reader to the challenge overview papers [4,5].

The paper is organized as follows: Section 2, Related Work, reports some papers addressing topics similar to this work; Section 3, Methodology, outlines the entire procedure that led to the predictions for the two tasks; Section 4, Experimental Setup, details the procedures used; Section 5, Results, presents the obtained results along with performance metrics; and Section 6, Conclusions and Future Work, reviews the essential steps of the paper and proposes alternative methodologies that could be useful for improving predictions.

Related Work

The quest to identify prognostic factors and build predictive models for amyotrophic lateral sclerosis (ALS) progression has been a longstanding challenge, but one of paramount importance. ALS exhibits significant variability in its progression and outcomes, posing obstacles to making accurate predictions. Many methodologies have undergone rigorous testing using data from the PRO-ACT database. While this repository may not perfectly capture the full spectrum of ALS patients in the population, it stands as the largest publicly accessible dataset amalgamating ALS clinical trials [6] [7] [8].

Wearable devices have been effectively used to study individuals with ALS, demonstrating a link between disease progression and the behavior and function patterns measured by digital wearables [9] [10] [11] [12]. These measurements include total activity volume, active versus sedentary time, and time spent at home. Additionally, wearable devices are increasingly used to investigate physical activity in populations with cardiovascular disease, multiple sclerosis, arthritis, and other conditions. Although studies have been conducted to predict quantities related to the ALSFRS, such as the score itself and its slope, to date there are no predictors that leverage smartwatch data to predict the ALSFRS score [13] [8]. Hence, it is important to investigate such data types extensively to ascertain whether they can improve predictions.

Methodology

Three different types of data were used to predict ALSFRS-R scores: sensor data from Garmin VivoActive 4 smartwatches, static feature data, and ALSFRS-R questionnaire data. The sensor data consist of 90 different features per day and are characterized by a large number of missing values (on many days no features were recorded, so the sensor vector is absent for those days), due to both the data collection device and patient behavior. The static data are baseline characteristics recorded at a specific time; they include sex, diagnostic delay, age at diagnosis, forced vital capacity (FVC), weight, and body mass index (BMI). Finally, the ALSFRS-R data are of the same type as those to be predicted but were collected at a previous time.

To leverage these challenging data, two approaches were explored: the Mono Window approach and the Double Window approach. The Mono Window approach is the simpler of the two: for each prediction, only the sensor data recorded within the 7 days prior to the questionnaire to be predicted are used (these can be aggregated in various ways, such as by taking the mean or the median). The Double Window approach considers two sensor data windows instead of one: the first adjacent to the questionnaire to be predicted, and the second adjacent to the previous available questionnaire. The idea behind this second method is to give the model more information about how the recorded parameters change over time. Despite the large amount of sensor data (13,946 feature vectors) and the fact that the second method seems more natural for this type of data, it is heavily penalized by the irregularity of the sensor recordings: two 7-day windows with at least 3 days of sensor data each were not available for 20 out of 54 patients. For this reason, we had to adopt the Mono Window approach. For each patient, all sensor data outside the 7 days preceding the ALSFRS-R values to be predicted were disregarded. The sensor data within each time window were averaged over time to obtain one feature vector of length 90 per window, which is less affected by daily variability. These sensor feature vectors were then concatenated with the static features and with the ALSFRS-R scores from the questionnaire preceding the one to be predicted, yielding the final feature vectors.
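The Mono Window construction described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the dictionary-based data layout, the `min_days` parameter, and the choice to exclude the visit day itself from the window are all assumptions made for illustration.

```python
from datetime import date, timedelta


def mono_window_features(sensor_days, visit_date, window_days=7, min_days=1):
    """Average the daily sensor vectors recorded in the days before a visit.

    sensor_days maps a date to one daily feature vector (list of floats);
    days without recordings are simply absent from the mapping.
    Assumption: the window covers the window_days days strictly before
    visit_date. Returns None if fewer than min_days days are available.
    """
    start = visit_date - timedelta(days=window_days)
    window = [vec for day, vec in sensor_days.items() if start <= day < visit_date]
    if len(window) < min_days:
        return None
    n_features = len(window[0])
    return [sum(vec[i] for vec in window) / len(window) for i in range(n_features)]


def build_feature_vector(sensor_days, visit_date, static_features, previous_scores):
    """Concatenate window-averaged sensors, static features and previous scores."""
    window_mean = mono_window_features(sensor_days, visit_date)
    if window_mean is None:
        return None
    return window_mean + list(static_features) + list(previous_scores)


# Toy example with 2 sensor features instead of 90 (hypothetical values).
sensor_days = {
    date(2023, 12, 1): [99.0, 9999.0],  # outside the window, ignored
    date(2024, 1, 1): [60.0, 1000.0],
    date(2024, 1, 3): [70.0, 3000.0],
}
visit = date(2024, 1, 5)
window_mean = mono_window_features(sensor_days, visit)
features = build_feature_vector(sensor_days, visit, [1.0, 62.0], [4, 3])
print(window_mean)
print(features)
```

In this toy example only the two recordings from January fall inside the window, so the averaged sensor vector is built from those two days and then extended with the static features and the previous questionnaire scores.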

The outcomes to predict are the values of the ALSFRS-R questionnaires that follow the last ones available for training. Because questionnaire values tend to remain constant between visits (see Figure 1), we used the score of the previous questionnaire as the prediction baseline and fitted the model on the residuals, that is, the differences between the score to be predicted and the previous one. The final prediction is obtained by adding the predicted residual to the value of the previous questionnaire (footnote 1). A Random Forest classifier was chosen as the model; unlike deep learning models, it does not require large datasets for training, making it suitable for this task. Before being used to fit the model, the data were preprocessed by scaling and feature selection, as explained in Section 4.
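As a rough illustration of the residual formulation, the sketch below builds residual targets from consecutive scores, discards residual values that occur too rarely (the paper uses a minimum of 8 occurrences), and reconstructs a score from a predicted residual. The helper names and the toy data are hypothetical, and the classifier itself is omitted.

```python
from collections import Counter


def residual_targets(previous_scores, next_scores):
    """Residual = score to be predicted minus the previous score."""
    return [nxt - prev for prev, nxt in zip(previous_scores, next_scores)]


def drop_rare_residuals(rows, residuals, min_count=8):
    """Keep only samples whose residual value occurs at least min_count times."""
    counts = Counter(residuals)
    kept = [(row, res) for row, res in zip(rows, residuals) if counts[res] >= min_count]
    return [row for row, _ in kept], [res for _, res in kept]


def reconstruct_score(previous_score, predicted_residual):
    """Final prediction = previous questionnaire value + predicted residual."""
    return previous_score + predicted_residual


# Toy example for a single ALSFRS-R item (hypothetical values).
previous = [4, 4, 3, 2, 4]
following = [4, 3, 3, 2, 2]
residuals = residual_targets(previous, following)

# A small min_count is used here only because the toy sample is tiny.
rows = list(range(len(previous)))
kept_rows, kept_residuals = drop_rare_residuals(rows, residuals, min_count=2)
print(residuals, kept_rows, kept_residuals)
print(reconstruct_score(3, -1))
```

Framing the target this way turns the problem into predicting a small set of change classes, which is consistent with using a classifier rather than a regressor.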

Experimental Setup

The training dataset for this challenge comprises data from 52 different patients for both tasks. These data can be categorized into three types: static data, ALSFRS-R data, and data from a smartwatch sensor. Analyzing the correlation matrices reveals two important facts. First, the ALSFRS-R items can be divided into several correlated groups depending on the area affected by the disease (see Figure 2). Second, the sensor data contain strongly correlated features, which are grouped into distinct blocks, as shown in Figure 3. For the cardiac features, two main blocks can be identified: the first, smaller block pertains to Heart Rate Variability (HRV), while the second block encompasses other cardiac characteristics related to the RR interval. For the respiratory features, two blocks are associated with respiration and blood oxygenation. Lastly, another block of correlated features pertains to the patient's steps. The grouping of features into these blocks is expected, as they represent different statistics describing the same physiological processes.


The two tasks differ only in who fills out the questionnaires and how often. This makes the two tasks slightly different for two main reasons. First, the data from the first task should reasonably be more objective, as a clinician fills out the questionnaires rather than the individual patient. Second, the data from the second task are collected through an app and are therefore recorded more frequently and at more irregular intervals, as depicted in Figure 4.

Regarding data preprocessing, the features were first scaled using a Min-Max scaler and imputed with the mean or the mode, depending on whether they were continuous or categorical. For the sensor data, an additional step was applied: the correlation between each sensor feature and the questionnaire item to be predicted was calculated, and only the features with a correlation above a certain threshold were retained. This approach was chosen because the number of samples available to train the model is low compared to the total number of features. After preprocessing, the data were used to train a Random Forest classifier. To determine the optimal hyperparameters of the model (footnote 2), along with the correlation threshold used to select the sensor features (footnote 3), the following cross-validation strategy was employed. The training set of the challenge was divided into an inner training set and a validation set (80-20%). Hyperparameter optimization was conducted via cross-validation on the inner training set using a grid search (footnote 4), and the chosen hyperparameters for each of the tasks are reported in Table 1 and Table 2.
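The preprocessing steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the correlation is taken to be Pearson's coefficient in absolute value (the paper does not specify either), imputation is applied before scaling for simplicity, and all function and feature names are hypothetical.

```python
import math


def impute_mean(column):
    """Replace missing values (None) in a continuous column with the mean."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max_scale(column):
    """Rescale a column to [0, 1]; constant columns are mapped to 0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_by_correlation(columns, target, threshold):
    """Keep features whose absolute correlation with the target is >= threshold."""
    return [name for name, col in columns.items()
            if abs(pearson(col, target)) >= threshold]


# Toy sensor columns (hypothetical names and values) and one target item.
raw = {
    "steps": [1000, None, 3000, 4000],
    "hr_mean": [70, 70, 70, 70],
    "spo2": [98, 97, 96, 95],
}
target = [4, 3, 2, 1]
processed = {name: min_max_scale(impute_mean(col)) for name, col in raw.items()}
selected = select_by_correlation(processed, target, threshold=0.2)
selected_all = select_by_correlation(processed, target, threshold=0.0)
print(selected, selected_all)
```

With a threshold of 0.0 every feature is retained, which matches the tables of selected hyperparameters where several items use a threshold of 0.0, that is, no sensor feature is removed.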

Results

The challenge was divided into two tasks, Task 1 and Task 2, each involving the construction of a model to predict the values of ALSFRS-R questionnaires completed either by a clinician or by the patient using a dedicated smartphone application. Although two families of methods were tested, Mono Window and Double Window, the low number of patients in the training set resulted in markedly lower performance for the second. Therefore, we only submitted results from the first. These were obtained using a Random Forest classifier, which was trained after determining the optimal hyperparameters through five-fold cross-validation; the average metrics for each ALSFRS-R question over the 5 validation folds are reported in Table 3 for the first task and in Table 4 for the second one.
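For reference, the per-question metrics reported in the tables can be computed as in the following sketch, which evaluates MAE and RMSE on each validation fold and then reports their mean and standard deviation across folds. The function names and toy folds are hypothetical, and the use of the population standard deviation is an assumption, since the paper does not state which estimator was used.

```python
import math
from statistics import mean, pstdev


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def summarize_folds(folds):
    """Mean and standard deviation of MAE and RMSE over validation folds.

    folds is a list of (y_true, y_pred) pairs, one per fold.
    Assumption: population standard deviation (pstdev).
    """
    maes = [mae(t, p) for t, p in folds]
    rmses = [rmse(t, p) for t, p in folds]
    return {
        "mae_mean": mean(maes), "mae_std": pstdev(maes),
        "rmse_mean": mean(rmses), "rmse_std": pstdev(rmses),
    }


# Two toy folds for one ALSFRS-R item (hypothetical values).
folds = [
    ([4, 3, 2, 4], [4, 3, 3, 4]),  # one prediction off by one point
    ([1, 2, 2, 0], [1, 2, 2, 0]),  # all predictions correct
]
summary = summarize_folds(folds)
print(summary)
```

Because a single one-point miss in four samples already gives an MAE of 0.25, values of this order are compatible with a model that predicts "no change" most of the time.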

Conclusions and Future Work

The models developed to predict the ALSFRS-R questionnaire values were constrained by the small size of the training dataset. Despite experimenting with models using various time intervals, the only ones that proved useful for prediction were those relying solely on a window of sensor data adjacent to the questionnaire to be predicted, without leveraging information from more distant times. This limitation stems from the challenging nature of the data, which contain a large number of missing values. Among the models tested, the one with the best performance, which was subsequently used for submission, was based on Random Forest, preceded by a feature selection step to reduce the number of sensor features. The performance obtained varies considerably across questionnaire items. The Mean Absolute Error (MAE) calculated on the test set is 0.23 and 0.31 for Task 1 and Task 2, respectively, while the Root Mean Squared Error (RMSE) is 0.52 and 0.60, respectively. The lower error is observed in the first task, which could be attributed to the fact that questionnaires compiled by clinical staff tend to be more reliable and objective than the subjective opinion of the patient. These seemingly promising results are unfortunately largely explained by the fact that ALSFRS-R questionnaire values mostly remain constant from one visit to another, which makes it very easy to achieve high prediction performance.

To address this issue, one potential approach is to use data augmentation to increase the number of questionnaires in the training set. To improve predictions, deep learning methods that leverage much longer sequences of sensor data (such as recurrent neural networks) could also be tested. However, these models require significantly more data for training, which represents the biggest obstacle for this task.

Figure 1: Frequencies of residual values in the training ALSFRS-R data for tasks 1 and 2. The most frequent value is 0 for each task; consequently, the majority of questionnaire values remain unchanged compared to the previous ones.

Figure 2: Correlation matrix of ALSFRS-R values in the training data for both task 1 and task 2. The matrices reveal that the scores can be clustered into distinct groups. This indicates significant correlations among certain ALSFRS-R items, suggesting underlying patterns or relationships within the data.

Figure 3: Sensors correlation matrix.

Figure 4: Histogram of the difference in days between consecutive questionnaires for task 1 and task 2. For task 2, the questionnaires are filled out at more irregular intervals and with greater frequency compared to task 1.

Table 1: Hyperparameters selected for Random Forest in Task 1, along with the correlation threshold used for feature selection. Sensor features with a correlation to the outcome lower than the specified threshold were removed.

Q     depth   max features   correlation threshold
Q1    3       sqrt           0.2
Q2    2       sqrt           0.0
Q3    3       log2           0.0
Q4    8       sqrt           0.25
Q5    7       log2           0.15
Q6    6       sqrt           0.2
Q7    8       log2           0.1
Q8    9       sqrt           0.15
Q9    6       sqrt           0.15
Q10   2       log2           0.0
Q11   2       log2           0.0
Q12   3       log2           0.25

Table 2: Hyperparameters selected for Random Forest in Task 2, along with the correlation threshold used for feature selection. Sensor features with a correlation to the outcome lower than the specified threshold were removed.

Q     depth   max features   correlation threshold
Q1    4       sqrt           0.25
Q2    5       log2           0.0
Q3    3       sqrt           0.0
Q4    4       log2           0.0
Q5    7       sqrt           0.05
Q6    9       log2           0.0
Q7    7       sqrt           0.0
Q8    9       log2           0.25
Q9    8       sqrt           0.25
Q10   8       log2           0.25
Q11   3       sqrt           0.0
Q12   4       sqrt           0.15

Table 3: Performance metrics on the Task 1 validation set, calculated as the mean across the five folds for each question (standard deviation in parentheses).

Q     MAE (std)        RMSE (std)
Q1    0.155 (0.080)    0.385 (0.0971)
Q2    0.091 (0.019)    0.298 (0.0317)
Q3    0.159 (0.028)    0.397 (0.034)
Q4    0.266 (0.106)    0.508 (0.107)
Q5    0.337 (0.112)    0.575 (0.096)
Q6    0.370 (0.045)    0.658 (0.063)
Q7    0.280 (0.054)    0.528 (0.051)
Q8    0.229 (0.071)    0.473 (0.080)
Q9    0.319 (0.076)    0.640 (0.098)
Q10   0.165 (0.024)    0.397 (0.030)
Q11   0.000 (0.000)    0.000 (0.000)
Q12   0.128 (0.045)    0.500 (0.089)

Table 4: Performance metrics on the Task 2 validation set, calculated as the mean across the five folds for each question (standard deviation in parentheses).

Q     MAE (std)        RMSE (std)
Q3    0.083 (0.016)    0.286 (0.029)
Q4    0.197 (0.045)    0.442 (0.050)
Q5    0.360 (0.059)    0.657 (0.040)
Q6    0.324 (0.070)    0.572 (0.062)
Q7    0.320 (0.056)    0.563 (0.051)
Q8    0.229 (0.066)    0.474 (0.071)
Q9    0.226 (0.049)    0.473 (0.055)

The rows for Q1, Q2, Q10, Q11, and Q12 are not fully legible in the source; the only surviving fragment for Q1 and Q2 reads "0.063 [...] 0.372 (0.050) 0.586 (0.062)".

1 The samples in the training set corresponding to a residual value occurring fewer than 8 times have been discarded; indeed, they are too few to be recognized by the model.

2 The Random Forest hyperparameters tested were maximum depth and max features, in the ranges [2, 9] and ['sqrt', 'log2', None], respectively. The number of trees was fixed (the value is missing in the source).

3 The thresholds tested were [0, 0.05, 0.1, 0.15, 0.2, 0.25].

4 During the fold creation process, care was taken to obtain stratified folds for outcomes to ensure partitions with similar percentages for each outcome category.

References

[1] L. C. Wijesekera, P. Nigel Leigh, Amyotrophic lateral sclerosis, Orphanet Journal of Rare Diseases 4 (2009).

[2] E. O. Talbott, A. M. Malek, D. Lacomis, The epidemiology of amyotrophic lateral sclerosis, Handbook of Clinical Neurology 138 (2016).

[3] J. M. Cedarbaum, N. Stambler, E. Malta, C. Fuller, D. Hilt, B. Thurmond, A. Nakanishi, BDNF ALS Study Group, The ALSFRS-R: a revised ALS functional rating scale that incorporates assessments of respiratory function, Journal of the Neurological Sciences 169 (1999).

[4] G. Birolo, P. Bosoni, G. Faggioli, H. Aidos, R. Bergamaschi, P. Cavalla, A. Chiò, A. Dagliati, M. Carvalho, G. M. Di Nunzio, P. Fariselli, J. M. García Dominguez, M. Gromicho, A. Guazzo, E. Longato, S. C. Madeira, U. Manera, S. Marchesin, L. Menotti, G. Silvello, E. Tavazzi, E. Tavazzi, I. Trescato, M. Vettoretti, B. Di Camillo, N. Ferro, Overview of iDPP@CLEF 2024: The intelligent disease progression prediction challenge, in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, September 9th to 12th, 2024, CEUR Workshop Proceedings, 2024.

[5] G. Birolo, P. Bosoni, G. Faggioli, H. Aidos, R. Bergamaschi, P. Cavalla, A. Chiò, A. Dagliati, M. Carvalho, G. M. Di Nunzio, P. Fariselli, J. M. García Dominguez, M. Gromicho, A. Guazzo, E. Longato, S. C. Madeira, U. Manera, S. Marchesin, L. Menotti, G. Silvello, E. Tavazzi, E. Tavazzi, I. Trescato, M. Vettoretti, B. Di Camillo, N. Ferro, Intelligent disease progression prediction: Overview of iDPP@CLEF 2024, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 15th International Conference of the CLEF Association, CLEF 2024, Grenoble, France, September 9th to 12th, 2024, Lecture Notes in Computer Science, Springer, 2024.

[6] R. Kueffner, N. Zach, M. Bronfeld, R. Norel, N. Atassi, V. Balagurusamy, B. Di Camillo, A. Chio, M. Cudkowicz, D. Dillenberger, Stratification of amyotrophic lateral sclerosis patients: a crowdsourcing approach, Scientific Reports 9 (2019) 690.

[7] M. Tang, C. Gao, S. A. Goutman, A. Kalinin, B. Mukherjee, Y. Guan, I. D. Dinov, Model-based and model-free techniques for amyotrophic lateral sclerosis diagnostic prediction and patient clustering, Neuroinformatics 17 (2019).

[8] C. Pancotti, G. Birolo, C. Rollo, T. Sanavia, B. Di Camillo, U. Manera, A. Chiò, P. Fariselli, Deep learning methods to predict amyotrophic lateral sclerosis disease progression, Scientific Reports 12 (2022) 13738.

[9] M. Straczkiewicz, M. Karas, S. A. Johnson, K. M. Burke, Z. Scheier, T. B. Royse, N. Calcagno, A. Clark, A. Iyer, J. D. Berry, Upper limb movements as digital biomarkers in people with ALS, EBioMedicine 101 (2024).

[10] L. Garcia-Gancedo, M. L. Kelly, A. Lavrov, J. Parr, R. Hart, R. Marsden, M. R. Turner, K. Talbot, T. Chiwera, C. E. Shaw, Objectively monitoring amyotrophic lateral sclerosis patient symptoms during clinical trials with sensors: observational study, JMIR mHealth and uHealth 7 (2019) e13433.

[11] V. Ezeugwu, R. E. Klaren, E. A. Hubbard, P. T. Manns, R. W. Motl, Mobility disability and the pattern of accelerometer-derived sedentary and physical activity behaviors in people with multiple sclerosis, Preventive Medicine Reports 2 (2015).

[12] R. P. Van Eijk, J. N. Bakers, T. M. Bunte, A. J. De Fockert, M. J. Eijkemans, L. H. Van Den Berg, Accelerometry for remote monitoring of physical activity in amyotrophic lateral sclerosis: a longitudinal cohort study, Journal of Neurology 266 (2019).

[13] A. G. Karanevich, J. M. Statland, B. J. Gajewski, J. He, Using an onset-anchored Bayesian hierarchical model to improve predictions for amyotrophic lateral sclerosis disease progression, BMC Medical Research Methodology 18 (2018).