
10.1016/j.patrec.2015.04.009

Air Pollution Profiling through Patient Stratification: Study of ALS Staging Systems Usefulness in Facilitating Data-driven Disease Subtyping and Discovery of Hazardous Ambient Air Pollutants

Mohamed Chiheb Karray

Independent Researcher, Sfax, Tunisia

2017


Amyotrophic Lateral Sclerosis (ALS) is a fatal neurodegenerative disease characterized by heterogeneous disease progression and a short lifespan after symptom onset. This longstanding heterogeneity problem still drives the research community to identify robust ALS subtypes through different approaches. Unsupervised machine learning techniques are among the sought-after computational tools for discovering reliable subtypes. Reproducible studies employing automated computational tools herald a promising future for precision medicine, where discovering groups of similar patients is of paramount importance for clinical trials, personalized treatment plans and early disease diagnosis. However, the current state of the art in machine learning for ALS has led many researchers in the community to raise the problem of low-quality, hard-to-validate studies. At the same time, an accumulating set of findings associates short- and long-term exposure to air pollution with an increased risk of developing ALS or of disease worsening. In our study, we turn to an overlooked ALS prognosis tool to stratify patients using clustering techniques. A subsequent data profiling process allowed us to identify interesting air pollution profiles that corroborate findings about some air pollutants being a risk factor for disease onset or disease worsening. We have also studied how feeding environmental data to machine learning-based classification and survival analysis models affects their performance. These models were assessed on the task of predicting the occurrence of ALS-related events that mark an endpoint of the disease or the reaching of a serious disease stage.

1. Introduction

months to 18 months in general [2] but could be even longer, reaching 27 months in some cases [3, 4]. This incurable disease is characterized by a high degree of heterogeneity, which makes it hard to diagnose [5]. The clinical heterogeneity of ALS is noticeable in terms of age at disease onset, site of onset, disease progression rate and the manifestation of behavioural and cognitive changes [2]. The duration of the diagnostic delay is another heterogeneity factor of the disease [6]. The variability of disease patterns is a long-debated problem in ALS research, and it hinders the quest for cures that could stop, or at least slow, the progression of the disease for more than a few months. This multi-factorial variability is considered to be the “greatest challenge” for ALS researchers [5]. Finding reliable ALS subtypes by means of clustering tools is one way to help researchers design more personalized treatment plans and more precise diagnosis and prognosis tools.

Through our participation in the iDPP 2023 competition, and knowing that ALS patient stratification is among the goals of the "Brainteaser" project, we have attempted to design a reliable clustering process whose outcome is reported in the greatest possible detail, so that clinicians can analyze it and suggest future improvements in terms of the expected characteristics of clusters. Aware of the importance and recency of studies on the relationship between air pollutants and ALS, we expected that clustering patients would help us study the potential added value of environmental data for disease progression-related tasks such as the ones targeted by the iDPP 2023 competition [7, 8].

The rest of the paper is organized as follows: Section 2 introduces related works; Section 3 describes our approach; Section 4 discusses our main findings; finally, Section 5 draws some conclusions and outlooks for future work.

2. Related Work

2.1. ALS and exposure to air pollutants

Exposure to airborne particles has been pinpointed as a risk factor for neurodegenerative diseases, and ALS is no exception. Ambient air pollution is among the plausible causes of oxidative stress and neuroinflammation, which could lead to neurodegeneration [9, 10]. According to many studies, ambient particulate matter (PM) is among the main culprits behind neurotoxicity [10]. The smaller the particles, the more hazardous they become [9]. We can distinguish between PM10 or coarse particles (diameter < 10 µm), PM2.5 or fine particles (diameter < 2.5 µm), PM1 or submicron particles (diameter < 1 µm) and UFPM/PM0.1 or ultrafine particles (diameter < 100 nm) [9, 10]. Particulate matter in general is a combination of chemical and biological particles that differs by region, season, weather pattern and local pollution source [9, 11]. In the ALS literature, the few studies [12, 13, 14, 11, 15] that approached the effect of air pollutants on disease aggravation have focused, within the PM category, on coarse and fine PM, alongside other air pollutants such as ozone (O3) and nitrogen dioxide (NO2). These studies have mainly examined the impact of long-term exposure and, to a lesser extent, that of short-term exposure. All of them conducted association studies between levels of exposure to one or more air pollutants and the aggravation of ALS. This aggravation was sometimes measured through “surrogate” indicators such as hospital admissions with ALS as the primary admission reason [13, 11].

The studies investigating the relationship between pollutants and ALS aggravation (or risk) reached contrasting conclusions on whether a given pollutant is associated with ALS risk. Among all the studied air pollutants, nitrogen dioxide seems to be the ALS risk factor on which many studies agree [16, 13]. For PM10 and O3, consensus on their association with an increased risk of ALS is far from being reached. As for PM2.5, the covered studies mostly agree on its hazardous impact on ALS aggravation or risk of occurrence. The subtlety about PM2.5 is that its composition, and not its concentration level within a region, is what makes it harmful and thus more likely to coincide with people in that region being affected by ALS [16, 11].

2.2. Clustering for ALS patient stratification

Unsupervised machine learning techniques such as clustering are helpful in finding groups of similar patients according to their characteristics (demographic, clinical, laboratory, genetic, imaging-derived, ...). The advantage offered by such techniques is that they are data-driven. They spare clinicians the burden of ‘manually’ dissecting patients, usually described by a large number of features, into separate groups using criteria derived from the literature or inspired by their own medical experience. For diseases such as ALS, characterized by high heterogeneity, clustering techniques could help uncover clusters or ‘profiles’ of patients in a studied population that might be indiscernible to a human expert (clinician). However, the improper use of data-driven techniques and the adoption of their outcomes, when applied to commonly used datasets in a given field of study, could have catastrophic consequences. In the ALS community, in addition to the scarce use of such techniques, the outcomes of the few studies relying on clustering for patient stratification have never been quantitatively examined.

During the last few years, some researchers have voiced their discontent with the improper use of clustering techniques for ALS patient stratification [1, 17]. For example, the very recent review of ML-related stratification and prediction models for ALS [17] concluded that data-driven stratification techniques for ALS lack validation as well as reproducibility. The authors of [1] contested the design of clustering studies as a whole.

In fact, to ensure that clustering results are reliable, several criteria should be satisfied. The most important of them is the choice of a similarity metric that reflects domain expertise about what makes patients close to each other in terms of data characteristics [18]. The second consequential step is to ensure the absence of the distance concentration phenomenon with respect to the chosen metric [19, 20], that is, to assess whether the data representation of patients, paired with the chosen similarity metric, would inherently lead to distant, separable and non-interleaving groups. Testing for the absence of distance concentration for the chosen metric is a statistical guarantee of data clusterability [21], especially for high-dimensional data. The no less important step of choosing the clustering algorithm itself is decisive for the outcomes, since the whole progression/diagnosis process will be built on them. In a context where data tend to form clusters of different shapes and densities, with no unanimous agreement on the number of clusters to obtain, utmost care should be given to the choice of the clustering algorithm. Finally, the internal validation of the clustering outcome, generally an index that assesses the quality of the obtained clusters, also has to encode the reality of the domain expertise. Internal clustering validation techniques generally tend to say whether a chosen or produced number of clusters is correct compared to other partitions (stratifications) with different numbers of clusters. Such ‘correctness’ differs from one validation method to another and depends on the way the validation method defines ‘dense’ and ‘separate’ clusters. Another overlooked problem in clustering validation is the mismatch between the chosen validation metrics and the applied clustering algorithm. Even in the ML community working on clustering validation, neglecting one or more of the above-mentioned steps is rampant. The problem itself has not drawn much attention within the ML community since the inception of the second spring of Neural Networks (a.k.a. Deep Learning and Differentiable Programming) a decade ago.
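As a rough illustration of the distance concentration check described above, the following sketch computes two common summary statistics of a pairwise distance distribution: the relative contrast and the coefficient of variation. Both shrink towards zero when distances concentrate. The function names, the uniform toy data and the choice of statistics are assumptions made for illustration; they are not taken from the cited works.

```python
import math
import random
from itertools import combinations


def pairwise_distances(points, dist):
    """Distances over all unordered pairs of points."""
    return [dist(a, b) for a, b in combinations(points, 2)]


def relative_contrast(distances):
    """(max - min) / min; values close to 0 suggest concentrated distances."""
    d_min, d_max = min(distances), max(distances)
    if d_min == 0:
        raise ValueError("duplicate points: relative contrast is undefined")
    return (d_max - d_min) / d_min


def coefficient_of_variation(distances):
    """Standard deviation of the distances divided by their mean."""
    mean = sum(distances) / len(distances)
    var = sum((d - mean) ** 2 for d in distances) / len(distances)
    return math.sqrt(var) / mean


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def uniform_cloud(n_points, n_dims, seed=0):
    """Toy data (assumption): points drawn uniformly in the unit hypercube."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]


# The spread of distances shrinks as dimensionality grows.
for dims in (2, 20, 200):
    d = pairwise_distances(uniform_cloud(60, dims), euclidean)
    print(dims, round(relative_contrast(d), 2), round(coefficient_of_variation(d), 3))
```

In practice the same statistics would be computed on the patients' distance matrix under the chosen metric, and a histogram of the distances inspected alongside them.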

Even though problems pertaining to clustering validation are not fully solved or adequately addressed, the last thorough review of internal clustering validation techniques dates back to 2013 [22]. As for external stratification validation, a better performance of a disease progression model could be thought of as a surrogate for a better stratification of patients. However, in a brief preliminary study, we have shown that a better performance of a disease progression model predicting the ALSFRS-R slope does not entail a better clustering quality [23]. In the same study, we also managed to slightly improve the disease progression model by applying the same Random Forests model to clusters of patients validated by an ensemble of internal clustering validation metrics. In the 2015 ALS patient stratification challenge, for the same disease progression task, the winning model treated patients as a single group and outperformed many submissions that used varied numbers of clusters and different clustering approaches [24]. To the best of our knowledge, the quality of the clusters obtained by the different participating groups was never assessed.

In [25], the authors used K-Means as a clustering algorithm and the Euclidean distance as a similarity metric (the default choice, even if it is not mentioned). For ALS patient stratification, the approach followed by that study raises several problems. First, the Euclidean distance is not adapted to mixed-type data (where categorical and continuous features are both present). Second, K-Means, which can be seen as a special case of Gaussian Mixture Models, assumes that clusters are normally distributed around a mean, so that clusters tend to be circular (in 2D) and to contain similar numbers of patients (the K-Means uniform effect [26]). Third, simply executing the clustering algorithm for different numbers of clusters (ranging from 2 to 15) and concluding that a given number is the most adequate one according to the value of the K-Means cost function is fallacious. K-Means is not capable of returning the whole dataset as a single cluster if the data are not clusterable; any requested number of clusters will therefore produce a clustering. Finally, the t-SNE projection used in the study to demonstrate the quality of the stratification clearly contradicts the claim that the clustering method is adequate for the data. In fact, the provided projection shows that the clustering algorithm divided the dataset into balanced sets of patients, even splitting a group that should have been kept as a single cluster. Similar examples are widespread in the data-driven ALS patient stratification literature.
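To make the point about mixed-type data concrete, here is a minimal sketch of a Gower-style dissimilarity, the kind of metric that handles categorical and continuous features together. The feature names are hypothetical, ordinal features are treated as nominal for simplicity, and skipping missing values is an assumption of this sketch rather than a description of any particular study.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower-style dissimilarity between two records with mixed feature types.

    a, b: dicts mapping feature name -> value (None marks a missing value).
    numeric_ranges: dict mapping each numeric feature to its observed range
    (max - min over the dataset); every other feature is treated as categorical.
    """
    total, used = 0.0, 0
    for name in a:
        x, y = a[name], b[name]
        if x is None or y is None:
            continue  # assumption: features missing in either record are skipped
        if name in numeric_ranges:
            value_range = numeric_ranges[name]
            total += abs(x - y) / value_range if value_range > 0 else 0.0
        else:
            total += 0.0 if x == y else 1.0
        used += 1
    if used == 0:
        raise ValueError("no feature observed in both records")
    return total / used


# Hypothetical patients described by one continuous and one categorical feature.
patient_1 = {"diagnosis_delay_months": 6.0, "initial_affected_zones": "bulbar"}
patient_2 = {"diagnosis_delay_months": 18.0, "initial_affected_zones": "bulbar"}
ranges = {"diagnosis_delay_months": 24.0}

print(gower_distance(patient_1, patient_2, ranges))
```

Each feature contributes a value in [0, 1], so no single continuous feature dominates the distance the way it can with an unscaled Euclidean metric.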

Using a neural network (a feed-forward neural network or an autoencoder) with a differentiable K-Means loss function would have the advantage of learning a new, lower-dimensional data representation, but would not remedy the inherent assumptions of the K-Means loss function. In short, patient stratification using clustering techniques within the ALS research community is plagued by many wrong design choices.

3. Methodology

3.1. Feature Engineering

3.1.1. Features obtained from static data

The iDPP dataset contains a high number of variables whose missingness rates make them unexploitable for data analysis (more than 65% of values missing). In our study, we deemed that some features, even if highly missing among the studied population, could be informative if grouped with features sharing the same semantics. Hence, eight features related to different recorded diseases were grouped under a single categorical feature dubbed ‘Disease_History’. For the created feature, we found more than 75 different combinations of diseases. Another related feature counting the number of diseases for each patient was also created (‘Number_of_Diseases’). This second feature was thought to be helpful in profiling groups of patients regardless of the details of their disease history. Along the same line of thought, features encoding the presence of ALS-characteristic gene mutations were transformed, since they were fragmented over two centers (Turin and Lisbon). We also considered the possibility of finding regional clusters of ALS, especially when augmenting the static and visits data with environmental data. Therefore, a feature called ‘Nationality’ was created using the presence of values in variables with the suffixes Turin and Lisbon.

Keeping in mind that our objective is to better profile different groups of patients showing similar disease progression patterns, we divided the information carried by ‘onset_limb_type’ into separate groups of anatomical locations. The groups were Upper/Lower, Right/Left and Proximal/Distal. Encoding the anatomical groups made us discover a set of instances where the anatomical description of the affected limb was present while ‘onset_limb’ was set to False. The diagnosis delay, that is, the difference between the diagnosis date and the disease onset date, was also included in the feature set extracted from static data. As for features related to height and weight, we used them to compute two features, ‘BMI_Before_Onset’ and ‘BMI_at_Onset’. The remaining features whose missingness rate exceeds 65% were deleted. To make the set of static features exploitable by ML algorithms, we simply replaced missing values with the same constant value.

3.1.2. Features obtained from visits data

ALS staging system outcomes are believed to serve as a "potential input" to patient stratification methods according to [1]. Consequently, in order to uncover the heterogeneity of patients belonging to the studied population, we relied on different staging systems for ALS. The MiToS [27], Fine till 9 [28] and D50 [29, 30] staging systems were employed. The King’s staging system requires information about the introduction of gastrostomy for ALS patients. Usually, this information is known if answers to item 5 of the ALSFRS-R questionnaire are present with two suffixes (A and B) [31]. In our case, this type of information, namely PEG (Percutaneous Endoscopic Gastrostomy), was used as an event to be predicted. The absence of this information made the use of King’s staging system impossible in our case.

For the three employed staging systems, a common piece of information can be obtained: the set of stages reached by each patient. MiToS and Fine till 9 provide a more dynamic description of disease progression for every new visit, whereas the D50 staging system, being based on a parametrized exponential disease progression model, provides a staging that could be optimistic, pessimistic or even misleading (when visits data are scarce and the corresponding ALSFRS-R scores do not span at least two different ranges). However, with enough data adhering to the requirements of the model, we can get a standardized view of the disease progression of patients, even with different progression dynamics. The second set of details that could differentiate patients is the set of affected domains or functional groups at each visit. This type of information is only provided by the MiToS and Fine till 9 staging systems. The other reason for using two staging systems besides D50 is the difference in their “sensitivity” to different phases of ALS progression. At an early phase of disease progression, MiToS tends to be less sensitive to changes occurring in the patient, while Fine till 9 was reported to be more sensitive to changes taking place during the same early phase [28, 32]. Differences between both systems could help us identify fine-grained groups of patients with heterogeneous progression patterns.

Features extracted from visits data were categorical (nominal and ordinal) and continuous. The information carried by the engineered features falls under one of the following categories: affected regions, reached disease stage, duration of each stage, disease progression severity and frequency of visits. For MiToS and Fine till 9, we distinguish two types of affected zones for a patient. MiToS looks for affected functional domains “that involve loss of autonomy” [33], while Fine till 9 helps identify affected functional groups (bulbar, fine motor, gross motor and respiratory). For each patient, after running the MiToS and Fine till 9 staging algorithms, we kept the affected zones for the initial and last visits. Each sequence of affected zones constitutes a unique category, so that a pair of initial and final sets of affected zones can differentiate patients from each other. Staging systems also enable the quantification of the reached disease stage for each newly fed ALS visit (answers to the ALSFRS-R questionnaire and the total ALSFRS-R score). With such information available, the sequence of reached stages for each patient can be generated. The obtained sequences can reveal disease reversals and cases where the disease progression is stable or has stabilized. The sequences were simplified so that a patient remaining at the same stage over successive visits has a single stage taken into account for those visits. Coupling the inferred sequence with visit times, the duration of each stage for each patient is computed. Once durations were computed, we normalized them with respect to the overall disease duration. The caveat in determining the normalized disease stage duration is the absence of a disease endpoint in the training data (death or tracheostomy) [34]. To circumvent the absence of a unique disease endpoint for patients (in the training set), we assumed that the endpoint is the prediction horizon chosen for the competition. Consequently, the stage reached by a patient during the last recorded visit is supposed to last until the prediction horizon. Our assumption follows the approach adopted by the competition organizers to compute the ALSFRS-R slope. The slope was computed by supposing that a ‘virtual visit’ took place at the date of disease onset, so that the perfect ALSFRS-R score remained unchanged until the occurrence of the first ‘real’ ALS visit. The principle of assuming no change in disease state in the absence of real data remains questionable, especially for a disease characterized by high heterogeneity and a short lifespan after onset. This issue is briefly discussed in the results section.
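The simplification of stage sequences and the normalization of stage durations described above can be sketched as follows. This is a minimal illustration under stated assumptions: visits are given as (day, stage) pairs, the observation window runs from the first visit to the prediction horizon, and all names are hypothetical rather than taken from our implementation.

```python
def collapse_stages(visits):
    """Merge consecutive visits that share the same stage.

    visits: list of (day, stage) pairs sorted by day.
    Returns a list of (start_day, stage) runs; reversals are preserved.
    """
    runs = []
    for day, stage in visits:
        if not runs or runs[-1][1] != stage:
            runs.append((day, stage))
    return runs


def normalized_stage_durations(visits, horizon_day):
    """Fraction of the observation window spent in each successive stage.

    Assumption: the stage seen at the last visit lasts until horizon_day,
    and the window starts at the first recorded visit.
    """
    runs = collapse_stages(sorted(visits))
    if not runs:
        raise ValueError("at least one visit is required")
    total = horizon_day - runs[0][0]
    if total <= 0:
        raise ValueError("the horizon must lie after the first visit")
    durations = []
    for i, (start, stage) in enumerate(runs):
        end = runs[i + 1][0] if i + 1 < len(runs) else horizon_day
        durations.append((stage, (end - start) / total))
    return durations


# Toy patient: stable at stage 0, progression to stage 2, then a reversal to stage 1.
toy_visits = [(0, 0), (90, 0), (180, 1), (270, 2), (360, 1)]
print(normalized_stage_durations(toy_visits, horizon_day=720))
```

Note how the last stage absorbs half of the window in this toy case, which illustrates why the 'no change until the horizon' assumption can dominate the resulting features.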

For the third staging system used in our study, D50, we kept two types of information: the date at which the patient is expected to lose 50% of functional abilities (or D50) and the final standardized stage reached by the patient according to rD50 values (visit dates transformed using the D50 parameters inferred for a given patient). The final standardized stage is determined using experimental intervals from the literature. According to [35], a patient could be in one of 3 stages: early stable phase of progression (0<rD50<0.25), early progressive phase of progression (0.25<=rD50<0.50) and late progressive and stable progression phase (0.5<=rD50).
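The mapping from an rD50 value to one of the three phases listed above is simple enough to write down directly. The function name and the returned labels are illustrative assumptions; including rD50 = 0 in the first phase is also an assumption, since the interval quoted from the literature is open at zero.

```python
def d50_phase(rd50):
    """Map an rD50 value to the phase intervals quoted above from [35]."""
    if rd50 < 0:
        raise ValueError("rD50 is expected to be non-negative")
    if rd50 < 0.25:
        return "early stable"       # assumption: rD50 == 0 is placed here
    if rd50 < 0.50:
        return "early progressive"
    return "late progressive and stable"


for value in (0.1, 0.25, 0.49, 0.5, 1.2):
    print(value, d50_phase(value))
```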

To represent the disease progression speed, we computed the early and late ALSFRS-R slopes, where the early slope is the rate of ALSFRS-R points lost per month between disease onset and the date of the first visit, and the late slope is the same rate computed between the first visit and the last one. Both quantities were negative (in contrast to the slope variable present in the static data of patients). Categorizing the speed based on the computed slopes was intended as a next step; however, the literature was not conclusive on clear-cut thresholds for mapping slopes to progression categories over given time periods. The other issue with slopes in our case is the high variability of the time elapsed before the first ALS visit.

3.1.3. Features extracted from environmental time-series

Before building a static representation of the different environmental time series, we carried out two main exploratory data analysis procedures: counting, for each particle, the patients with environmental data, and visualizing the time series along with the ALSFRS-R curves. We noticed that not all patients have environmental data. Even for the particles whose data are most frequently present, only 26% of patients belonging to datasets A, B and C have such data. Those particles are the atmospheric/weather-related ones. The presence of pollutant data is highly variable, ranging from 3% to 21% of patients. Looking for patients having data on all particles at once was fruitless: none of the patients belonging to the training set had time series for all 15 environmental particles covered by the iDPP dataset. The objective of the competition is to investigate whether exposure to pollutants improves the performance of models predicting the risk of medical events. As a result, we focused our preliminary analyses on the presence of pollutant data among patients as well as on their patterns with respect to disease progression. In addition to studying the presence of each pollutant's data independently of other pollutants, we explored the number of patients having records of two pollutants out of seven, which led us to remove the least frequent ones. For example, only two patients in the training set (all datasets included) had data on 66 and 2. The most frequent joint occurrences are (PM10, O3) with 234 patients and (PM2.5, O3) with 188 patients, out of 467 patients.
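Counting how many patients have records for each pair of pollutants, as described above, amounts to a simple co-occurrence tally. The sketch below assumes a hypothetical mapping from patient identifiers to the set of pollutants with at least one record; the identifiers and toy values are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_presence_counts(patient_pollutants):
    """Count, for every pair of pollutants, the patients having records of both.

    patient_pollutants: dict mapping patient id -> set of pollutant names.
    """
    counts = Counter()
    for pollutants in patient_pollutants.values():
        # Sorting makes each pair appear under a single canonical key.
        for pair in combinations(sorted(pollutants), 2):
            counts[pair] += 1
    return counts


# Toy data (hypothetical patients).
toy_presence = {
    "p1": {"PM10", "O3"},
    "p2": {"PM10", "O3", "PM2.5"},
    "p3": {"PM2.5"},
}
for pair, n in pair_presence_counts(toy_presence).most_common():
    print(pair, n)
```

Pairs with very low counts are natural candidates for removal, since features built on them would be missing for nearly every patient.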

Visualizing the time series of the most frequent pollutants, namely PM10, PM2.5 and O3 (for patients having records of one of the three pollutants), along with the ALSFRS-R curve revealed patterns such as the absence of time-series data for a long period starting from the date of the first ALS visit, or ALSFRS-R records existing only before or after any information on the pollutant (the two curves share no common time period). In other words, in addition to the irregularity of ALSFRS-R entries, pollutant time-series data are not always usable for studying the correlation between high exposure and disease progression, let alone the causality between exposure to one or more pollutants and disease worsening. This visualization procedure was mainly performed on patients with fast disease progression patterns.

In order to avoid a high rate of missing data in the features representing environmental data, we had to restrict ourselves to the most frequently present pollutants, namely PM10, O3 and PM2.5. The European Air Quality Index (EAQI) chart provided us with different levels of exposure to each of the three particles. Those levels were used to devise two features for each pollutant: ratio_medium and ratio_poor. Ratio_medium represents the ratio of days during which a patient was exposed to medium concentrations of a pollutant over the total number of days of exposure. Ratio_poor stands for the sum of the ratios of exposure to high, very high and extremely high concentrations of a pollutant. In addition to this pair of features, we added the number of exposure days and basic aggregate statistics of the pollutant time series, such as the Min, Max and Mean concentrations, based on the ‘measurement’ feature present in the competition dataset. The presence of different combinations of the most frequent pollutants among patients was also included in the form of Boolean features encoding the joint existence of environmental data (presence of (PM10, PM2.5, O3), presence of (PM10, PM2.5), presence of (PM10, 2, O3), ...).

3.2. Identification of patient clusters and environmental profiles

3.2.1. Patient Stratification

To extract clusters of patients, we relied only on features extracted from visits data, with the exception of the diagnosis delay obtained from static data. Since the data to be clustered were a mix of categorical and continuous attributes, we chose the Gower distance as a similarity metric. We checked for the absence of the distance concentration phenomenon using the distribution of distances. Then we applied t-SNE for dimensionality reduction and data visualization. The patients' data, represented by the obtained t-SNE embedding, were clustered using the HDBSCAN algorithm [36]. The algorithm was chosen for its capacity to extract clusters with varying densities (and shapes) without requiring a pre-specified number of clusters. The stratification process was followed by a detailed profiling of each obtained cluster (Tables A.1, A.2 and A.3).

3.2.2. Pollution profiles of stratified patients

Note that the features extracted from environmental time series, like the static features other than the diagnosis delay, were not used for the clustering of patients. Having quantified exposure rates according to the pollution levels specified by the EAQI, we used them to create pollution profiles. A pollution profile is characterized by a collection of features that provide an overview of air pollution exposure for a given set of patients. In our case, the profile is given by the medium and poor exposure ratios for 3 different pollutants: PM10, PM2.5 and O3. This exposure is measured by the ratio of the number of days of exposure to medium and to higher concentrations of a particle over the whole exposure duration for the same particle. These ratios are carried by the features <Particle_Name>_ratio_Poor and <Particle_Name>_ratio_Medium.
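The per-pollutant exposure features described above can be sketched as follows. The two thresholds are hypothetical placeholders passed as parameters; they are not the EAQI band limits, which should be taken from the official chart for each pollutant. The function name and output keys are likewise illustrative.

```python
def exposure_features(daily_concentrations, medium_threshold, poor_threshold):
    """Summarize one pollutant's time series for one patient.

    ratio_medium: share of days with medium_threshold <= value < poor_threshold.
    ratio_poor: share of days with value >= poor_threshold (high and above).
    Both thresholds are placeholders (assumption), not actual EAQI limits.
    """
    n_days = len(daily_concentrations)
    if n_days == 0:
        raise ValueError("no exposure data for this pollutant")
    if not medium_threshold < poor_threshold:
        raise ValueError("medium_threshold must be below poor_threshold")
    medium_days = sum(
        1 for c in daily_concentrations if medium_threshold <= c < poor_threshold
    )
    poor_days = sum(1 for c in daily_concentrations if c >= poor_threshold)
    return {
        "ratio_medium": medium_days / n_days,
        "ratio_poor": poor_days / n_days,
        "n_days": n_days,
        "min": min(daily_concentrations),
        "max": max(daily_concentrations),
        "mean": sum(daily_concentrations) / n_days,
    }


# Toy daily measurements with illustrative thresholds.
toy_series = [5, 12, 30, 60, 8, 45, 22, 3]
print(exposure_features(toy_series, medium_threshold=20, poor_threshold=40))
```

Because both ratios are normalized by the number of exposure days, patients with short and long environmental records remain comparable, at the cost of hiding how long each record actually is; keeping n_days as a separate feature partly compensates for this.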

To measure the similarity between clusters in terms of exposure to a certain pollution level of a pollutant, we estimated the Probability Density Functions of the pollution ratios of each particle within each cluster of patients. Then, we employed the Jensen-Shannon divergence to obtain a similarity matrix for each particle and each level of exposure across all the clusters. The next step was to group patient clusters given the computed similarity matrices. Given that the overall number of clustering operations is equal to the number of particles times the number of exposure levels (Medium and Poor), a co-occurrence matrix was computed. It represents which patient clusters had a similar exposure ‘footprint’ with respect to the different pollutants and exposure levels. That co-occurrence matrix serves as the input to another clustering process that yields the sought pollution profiles. For all the clustering operations, HDBSCAN was used as the clustering algorithm. This process is inspired by ensemble clustering.

3.3. Risk Prediction Models

Risk prediction for the 3 provided datasets was cast both as a survival analysis task and as a classification task. For the classification task, we relied on an ensemble of classifiers whose number, weights, algorithms and parameters were determined by an Automated Machine Learning method [37]. The ensemble was validated through a 5-fold cross-validation process maximizing the accuracy score of the ensemble classifier. Two separate ensemble classifiers were retained, each trained on a different data representation (data with environmental features and data without environmental features). The time budget of the automatic search process was set to be the same for both ensembles. As for the survival analysis model, we relied on Random Survival Forests. The hyperparameter search process looked for the optimal set by maximizing the C-index of the trained model. The same model was applied to data with and without environmental features.
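For readers unfamiliar with the C-index used as the tuning objective above, here is a minimal sketch of Harrell's concordance index for right-censored data. It is a plain quadratic-time illustration with hypothetical names; pairs with tied event times are ignored for simplicity, which is an assumption of this sketch and not necessarily how the library used in our experiments handles them.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times: observed time for each patient (event or censoring time).
    events: 1 if the event was observed, 0 if the patient was censored.
    risk_scores: model output, where a higher score means a higher predicted risk.
    A pair (i, j) is comparable when patient i had the event strictly before
    time j; tied times are skipped (simplifying assumption).
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: the third patient is censored.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
print(concordance_index(toy_times, toy_events, [0.9, 0.7, 0.4, 0.1]))
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and values below 0.5 indicate systematically inverted risk scores.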

For both scenarios, the hyperparameter search process was executed on a single dataset: dataset A for the survival analysis case and dataset B for the classification case. Models with the retained set of hyperparameters were employed to predict outcomes for the rest of the datasets. For the classification case, the ensemble was trained once on dataset B, and predictions for tasks A and C were based on the ‘knowledge’ acquired from the data of task B. In other words, for the classification scenario we performed a basic transfer of the ‘knowledge’ learnt on task B to make predictions for the two remaining tasks.

We also made some initial attempts to train our models on the obtained clusters of patients, but the preliminary results were not encouraging, due in part to the small size of some clusters (especially for patients with environmental data). We also paid attention to the pronounced class imbalance in the different tasks. To address it, we tried a two-step data augmentation scenario. The first step consisted in undersampling the majority class. The next step was to generate synthetic data for both classes using deep learning-based methods (2 variants of Variational Autoencoders [38]). The original and synthetic data were merged, and the models were trained on the blended ‘new’ dataset. Again, the preliminary results showed a decrease in the performance of the survival and classification models compared to using only the original data in the training process. Little to no feature selection was performed during our experiments.

4. Results

4.1. Patient Stratification Results

The proposed clustering procedure yielded 15 clusters and a cluster of outliers. In the t-SNE projection of the data, where clusters are distinguished by different colors (Figure 1), we can see 4 main groups of clusters. Further details about the general characteristics of each cluster are provided in Tables A.1, A.2 and A.3 of Appendix A.

Clusters on the left side of Figure 1 contain patients with a single visit but with different characteristics. Clusters at the center of the figure (Clusters 11, 12, 13 and 14) show the most noticeably rapid disease progression patterns.

The blue cluster (labeled "Cluster 1") at the uppermost right side of the figure groups patients with a very slow to constant disease progression rate (in terms of early and late ALSFRS-R slopes). In fact, it contains the highest number of patients with constant disease progression after the first ALS visit (both for the whole study population and for the population of patients with environmental data). Out of 174 patients (from the whole study population) with a late ALSFRS-R slope equal to zero, 62 belong to "Cluster 1", that is, 35.6% of the patients with constant disease progression after the first ALS visit. The cluster with the second highest number of such patients is "Cluster 3", situated below "Cluster 1" in Figure 1. It contains 16% of the patients with constant disease progression. "Cluster 7" and "Cluster 6" come next, with 15.5% and 12.6% respectively. One of the most prominent differences separating "Cluster 1" from Clusters 3, 6 and 7 is the last Fine till 9 stage reached. In the first cluster, 100% of patients stayed at stage 0 of this staging system, whereas all patients in Clusters 3, 6 and 7, without exception, reached stage 1 of Fine till 9.

Coming to the clusters visible at the exact center of the figure, "Cluster 11" is the most interesting one in terms of speed of disease progression, with an ALSFRS-R slope going from -0.81 to -1.89 and a D50 stage of 2 for all patients. Patients of "Cluster 13" fare worse, with a mean ALSFRS-R slope decreasing from -0.75 to -2.2. Interestingly, according to our results, the fast progressors of "Cluster 13" have a pollution profile very similar to that of the slow progressors of "Cluster 6". Details about both pollution profiles are provided in the next section.

"Cluster 12" also shows notable properties in terms of progression speed, as its mean ALSFRS-R slope goes from -1.4 to -2.58. More than half of the patients of this cluster reached an advanced Fine till 9 stage (stage 4) and the rest reached stage 3. In 32.9% of the cases, patients belonging to this cluster had lost 3 functional groups, and 8.2% of them had their first ALS visit with 4 functional groups already lost. This clearly implies that those patients reached stages 1 and 2 long before their first visit. Consequently, if we extract the time spent at each stage for such patients, we wrongly obtain a mean standardized duration of 65% at stage 0 and near 5% for stages 1 and 2. These obviously misleading durations result from assuming that patients had a perfect ALSFRS-R score from disease onset until their first ALSFRS-R visit. Even if such an absence of data would probably not occur in clinical trials, methods for estimating approximate dates of transition from stage 0 to stages 1 and 2 in such cases would make both stratification and disease progression models more reliable. The D50 progression model cannot fill this gap, for various reasons. The most important ones include the subjectivity of the ALSFRS-R scale, the constraining need to provide the model with data points from different ALSFRS-R ranges, and its tendency toward a nonlinear decrease, which does not account for reversals (if they really happen) or for improvements in the functionality of the different regions covered by the ALSFRS-R scale.

Coming now to ALS reversals, we detected 47 cases among the 1787 patients (2.6%) in the pool of training data for the different tasks of the competition. Over 21% of the reversal cases belonged to "Cluster 6", 19% to "Cluster 2" and over 12% to "Cluster 1". The MiToS and Fine till 9 staging systems indicated reversals in 6 and 16 cases respectively. In one case, Fine till 9 returned a reversal from stage 3 to stage 0 (patient "0x73a002c9518c344906b2929c4a299cf9"). According to the visits data, this patient's ALSFRS-R score increased from 35 to 46 in less than 4 months. As for MiToS, we found only one noticeable case, a patient who went back from stage 2 to stage 1 (patient "0xabfbe3b27f02e837e604375dfddb6978"). The last Fine till 9 stage of the same patient was 4.

4.2. Environmental profiles of stratified patients

Applying the method discussed in the methodology section yielded 3 clusters of pollution profiles and 1 cluster of outliers. Patients' "Cluster 10" was among the outliers. In fact, the computed Jensen-Shannon dissimilarity matrices showed that the pollution profile of "Cluster 10" was the most distant from the rest of the profiles. It is characterized by the lowest exposure rates to PM10 and PM2.5. The profiles of Clusters 4 and 8 were different from the rest of the profiles but less distant than that of "Cluster 10". The Cluster 4 pollution profile is quite specific in terms of the mean rate of exposure to high concentrations of PM2.5. As a matter of fact, it is the cluster where patients were exposed to high concentrations of PM2.5 particles for the longest period of time (almost 30% of 245 days on average). Patients of "Cluster 12" had been exposed to high PM2.5 concentrations in a way very similar to those of "Cluster 4". Even though the pollution profile of "Cluster 12" had more similarities with ‘clusterable’ pollution profiles than the rest of the outliers, it was also labelled as an outlier. Yet, it was almost equal to ‘high profile’ Clusters 11 and 13 in its rate of exposure to medium concentrations of PM2.5.
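To make the notion of a Jensen-Shannon dissimilarity matrix concrete, the following sketch computes pairwise Jensen-Shannon divergences between discrete exposure profiles. It is a minimal illustration under stated assumptions: the profile values, the cluster names `cluster_x`, `cluster_y`, `cluster_z` and the helper names `js_divergence` and `js_matrix` are hypothetical, and each profile is assumed to be a distribution over Low, Medium and High concentration levels of a single pollutant.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Inputs are normalized to sum to 1; the result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    sp, sq = sum(p), sum(q)
    if sp <= 0 or sq <= 0:
        raise ValueError("distributions must have positive mass")
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a == 0 contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def js_matrix(profiles):
    """Pairwise divergences between named profiles, keyed by (name, name)."""
    names = list(profiles)
    return {(a, b): js_divergence(profiles[a], profiles[b])
            for a in names for b in names}


# Hypothetical profiles: share of days at Low / Medium / High concentration.
profiles = {
    "cluster_x": [0.60, 0.30, 0.10],
    "cluster_y": [0.55, 0.35, 0.10],
    "cluster_z": [0.20, 0.50, 0.30],
}
D = js_matrix(profiles)
for pair, value in sorted(D.items()):
    print(pair, round(value, 4))
```

The divergence is symmetric and equals zero for identical profiles, which makes it usable as input to a clustering algorithm that accepts a precomputed dissimilarity matrix. Taking its square root gives the Jensen-Shannon distance, which is what `scipy.spatial.distance.jensenshannon` returns; which of the two variants was used is not specified here.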

No two pollution profiles among the 15 clusters were perfectly similar for the probed pollutants PM10, PM2.5 and O3. However, we observed interesting groupings of the pollution profiles of Clusters 1 and 14 on the one hand and Clusters 6 and 13 on the other. Clusters 1 and 14 shared similar exposure profiles for PM10 (Medium and High concentration levels) as well as for High PM2.5 concentration levels. Both clusters had the same distribution of exposure levels to medium PM10 concentrations. Clusters 6 and 13 were characterized by similar exposure rates to Medium and High PM10 concentrations as well as similar exposure to Medium O3 concentrations. They differed in their exposure to high and medium PM2.5 concentrations.

4.3. Disease risk prediction results

As explained in the methods section, the outcome of the patient clustering was not exploited by the models used to generate our submitted results. In the present section, we describe and comment on different aspects of the performance of our models for the different tasks.

Figure 2 depicts the C-index scores obtained by the different models submitted to the competition. On the x-axis, ‘CE’ stands for Classifier Ensemble and ‘SuRF’ stands for Survival Random Forests. Each model type is followed by the task name. Blue bars stand for models using static and visits data only, and orange bars represent models that additionally use the environmental data.
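For readers unfamiliar with the C-index, the sketch below shows how Harrell's concordance index can be computed from observed times, event indicators and predicted risk scores. It is a minimal illustration in plain Python, assuming that a higher risk score should correspond to an earlier event; the function name `concordance_index` and the toy values are hypothetical, and the evaluation code of the competition may handle ties and censoring differently.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs that are ordered correctly.

    A pair (i, j) is comparable when subject i has an observed event and a
    shorter time than subject j. The pair is concordant when subject i also
    has the higher predicted risk. Ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier one in a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: times in months, 1 = event observed, 0 = censored.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]

# 4 of the 5 comparable pairs are concordant, so the C-index is 0.8.
c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of patients by risk.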

We now analyze the performance of the submitted models from a classification perspective. For task A, as Figure 3 shows, the introduction of environmental features does not seem to benefit either model. On the other hand, the classifier did better than the survival analysis model both in terms of AUROC over time and in terms of the confidence of its predictions in the long run. As we go further in time, the confidence interval of the survival model widens, while that of the ensemble classifier keeps the same width throughout the months following the prediction horizon.
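The idea of an AUROC evaluated at successive prediction horizons can be sketched as follows. This is a simplified illustration, not the evaluation procedure of the competition: at each horizon, a patient counts as positive if the event was observed by then and as negative if they were still followed beyond it, while patients censored before the horizon are simply dropped (no inverse-probability-of-censoring weighting is applied). The names `auroc` and `auroc_at_horizon` and the toy data are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_at_horizon(times, events, scores, horizon):
    """AUROC for the binary question: did the event occur by `horizon`?"""
    labels, kept = [], []
    for t, e, s in zip(times, events, scores):
        if e and t <= horizon:
            labels.append(1)   # event observed before the horizon
            kept.append(s)
        elif t > horizon:
            labels.append(0)   # still event-free after the horizon
            kept.append(s)
        # otherwise: censored before the horizon, status unknown, dropped
    return auroc(labels, kept)


# Toy example: times in months, 1 = event observed, 0 = censored.
times = [6, 14, 20, 30, 40, 10]
events = [1, 1, 1, 0, 1, 0]
scores = [0.9, 0.7, 0.3, 0.4, 0.2, 0.8]

for horizon in (12, 24):
    print(horizon, auroc_at_horizon(times, events, scores, horizon))
```

Because the set of positives and negatives changes with the horizon, the same risk scores can yield different AUROC values over time, which is what the curves in the figures of this section report.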

For task B, Figure 4 clearly shows that the ensemble classifier reaches its best AUROC at 36 months. It exceeds its performance in tasks A and C, for an obvious reason: the model was trained and ‘hyper-parametrized’ on task B data. Notably, the confidence interval of the ensemble classifiers is at its widest at 12 months. When environmental features are introduced, the survival analysis models tend to markedly outperform the ensemble classifiers over the period from 12 to 30 months. After that, the ensemble classifier reclaims the lead.

Finally, considering Figure 5, which depicts the AUROC for task C, the first noticeable phenomenon is how unconfident and inaccurate the ensemble classifier was when predicting the occurrence of death for a given patient. From that observation, we can assume that the knowledge transferred from task B was mainly related to the occurrence of events other than death. The co-existence of another event alongside the ‘death’ event mitigates the propensity of the classifier to perform badly. The introduction of environmental data to the ensemble classifier goes hand in hand with a deterioration in performance. This phenomenon was consistently present under both ways of assessing the classifier's performance (C-index and AUROC).

All in all, introducing the environmental features into the survival analysis model made it perform better for all the tasks, sometimes by a significant margin over its counterpart trained on static and visits data. The opposite held for the ensemble classifier model. However, from an AUROC point of view, the ensemble classifier model managed to beat the survival model in the majority of cases.

Quantitatively speaking, the ensemble classifier trained on data without environmental features and pre-trained on subtask 3-B data obtained the best AUROC results at prediction horizons beyond 18 months. This holds for the 3 subtasks: 3-A, 3-B and 3-C. Hence, we can conclude that exploiting knowledge gained from one subtask is beneficial for the other subtasks when a large proportion of patients is shared between similar prediction tasks. The survival analysis model gets its best results for prediction horizons less than or equal to 18 months. Even though it was hyper-parametrized on subtask 3-A data, the model did not beat the pre-trained ensemble classifiers applied to data for the same task, except in 1 case out of 20. Introducing environmental data to the prediction models proved useful in only a few cases, and the survival model was the main beneficiary. However, the positive impact of introducing environmental data does not last beyond the 18-month prediction horizon in the best case. These conclusions can be observed in the results reported in the following table (Table 2).

5. Conclusions and Future Work

In this paper, which summarizes our participation in the iDPP 2023 competition, we approached the question of the impact of environmental data on ALS risk prediction models from a Machine Learning perspective. Our goal can be summarized in the following question: what are the air pollution patterns that characterize groups of ALS patients with fast progression? To answer it, we began by stratifying patients based on the disease progression patterns captured by features extracted by applying staging systems to visits data. The stratification was performed by a clustering algorithm that does not need a preset number of clusters. The obtained groups of patients were then profiled to determine their common characteristics: clinical, demographic and environmental. Afterwards, a second clustering procedure was carried out to detect clusters of patients with similar exposure concentrations to 3 different air pollutants. The obvious next step was to perform risk prediction on each cluster separately and combine the predictions. However, due to time limitations, this step was not reached and predictions were made on the whole population.

Even though we have presented some guidelines for conducting a reliable clustering-based study, our study has some caveats pertaining to the design of the clustering procedure. The lack of available implementations of many necessary tools (clustering algorithms, similarity metrics for mixed-type data, statistical tests, etc.) hampered our effort to design a robust clustering model. The irregularity of the environmental time series made the feature engineering process difficult, especially as ALSFRS-R visits were too variable from one cluster to another. Additionally, the D50 model was not reliable enough to provide us with enough data points that truly reflect the disease progression of fast-progressing patients. These factors made it quite difficult to dive deeper into the causal relationship between air pollutants and disease progression, which would require advanced methods for disease progression modelling and for the imputation of missing time-series data. Despite all the obstacles encountered, providing a multimodal real-world dataset for ALS to the research community would make collaborative and reproducible data analysis studies possible, which would certainly help the community answer pending questions about ALS faster and with more certainty.

For our next steps, we intend to improve our patient stratification process by looking for more homogeneous groups. Even though ( 2) is believed to be a risk factor for ALS, the lack of time-series data on this pollutant for the whole training set prevented us from including it in our air pollution profiling step. We may study pollution patterns for ( 2) using only the patient groups of interest where data are available for more than 50% of the group. Our future work would also include improving the risk prediction models by studying the possibility of applying one or more models to different clusters and comparing their overall results with those of models applied to the whole dataset.

Acknowledgments

We would like to thank Guglielmo Faggioli for his valuable assistance and informative answers to our requests throughout the competition.

density estimates, in: Advances in Knowledge Discovery and Data Mining, Lecture notes in computer science, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 160–172.

[37] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, F. Hutter, Efficient and Robust Automated Machine Learning, in: Advances in Neural Information Processing Systems, volume 28, 2015.

[38] Z. Qian, B.-C. Cebere, M. van der Schaar, Synthcity: facilitating innovative use cases of synthetic data in different data modalities (2023).

[Appendix A, Tables A.1 to A.3: general characteristics of each cluster. Table contents not recoverable.]

[1] Grollemund, P.-F. Pradat, Querin, Delbot, Le Chat, J.-F. Pradat-Peyre, Bede, Machine learning in amyotrophic lateral sclerosis: Achievements, pitfalls, and future directions, Front. Neurosci. 13 (2019) 135.

[2] Bendotti, Bonetto, E. Pupillo, Logroscino, Al-Chalabi, Lunetta, Riva, G. Mora, G. Lauria, J. H. Weishaupt, Agosta, Malaspina, Basso, Greensmith, L. Van Den Bosch, Ratti, Corbo, Hardiman, Chiò, Silani, E. Beghi, Focus on the heterogeneity of amyotrophic lateral sclerosis, Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration 21 (2020) 485–495. doi:10.1080/21678421.2020.1779298.

[3] M. D. Armstrong, G. Hansen, K. L. Schellenberg, Rural Residence and Diagnostic Delay for Amyotrophic Lateral Sclerosis in Saskatchewan, Canadian Journal of Neurological Sciences 47 (2020) 538–542. doi:10.1017/cjn.2020.38.

[4] H. R. Rashed, M. A. Tork, Diagnostic delay among ALS patients: Egyptian study, Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration 21 (2020) 416–419. doi:10.1080/21678421.2020.1763401.

[5] M. A. van Es, O. Hardiman, A. Chio, A. Al-Chalabi, R. J. Pasterkamp, J. H. Veldink, L. H. van den Berg, Amyotrophic lateral sclerosis, The Lancet 390 (2017) 2084–2098. doi:10.1016/S0140-6736(17)31287-4.

[6] Karam, J. D. Berry, Heterogeneity, urgency, generalizability, and enrollment: The HUGE balance in ALS trials, Neurology 92 (2019) 215–216. doi:10.1212/WNL.0000000000006837.

[7] Faggioli, Guazzo, Marchesin, Menotti, I. Trescato, Aidos, Bergamaschi, G. Birolo, Cavalla, Chiò, Dagliati, M. de Carvalho, G. M. Di Nunzio, Fariselli, J. M. García Dominguez, Gromicho, Longato, S. C. Madeira, U. Manera, Silvello, Tavazzi, E. Tavazzi, Vettoretti, B. Di Camillo, Ferro, Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2023, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023), Lecture Notes in Computer Science (LNCS), Springer, Heidelberg, Germany, 2023.

[8] Faggioli, Guazzo, Marchesin, Menotti, I. Trescato, Aidos, Bergamaschi, G. Birolo, Cavalla, Chiò, Dagliati, M. de Carvalho, G. M. Di Nunzio, Fariselli, J. M. García Dominguez, Gromicho, Longato, S. C. Madeira, U. Manera, Silvello, Tavazzi, E. Tavazzi, Vettoretti, B. Di Camillo, Ferro, Overview of iDPP@CLEF 2023: The Intelligent Disease Progression Prediction Challenge, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), CLEF 2023 Working Notes, CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, 2023.

[9] Jankowska-Kieltyka, Roman, I. Nalepa, The air we breathe: Air pollution as a prevalent proinflammatory stimulus contributing to neurodegeneration, Front. Cell. Neurosci. 15 (2021) 647643.

[10] L. G. Costa, Traffic-related air pollution and neurodegenerative diseases: Epidemiological and experimental evidence, and potential underlying mechanisms, in: Advances in