<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.patrec.2015.04.009</article-id>
      <title-group>
        <article-title>Air Pollution Profiling through Patient Stratification: Study of ALS Staging Systems Usefulness in Facilitating Data-driven Disease Subtyping and Discovery of Hazardous Ambient Air Pollutants</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohamed Chiheb Karray</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Independent Researcher</institution>
          ,
          <addr-line>Sfax</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>1973</volume>
      <fpage>1</fpage>
      <lpage>46</lpage>
      <abstract>
        <p>Amyotrophic Lateral Sclerosis (ALS) is a fatal neurodegenerative disease characterized by heterogeneity in disease progression and a short lifespan after symptom onset. The longstanding heterogeneity problem is still driving the research community to identify robust ALS disease subtypes through different approaches. Unsupervised machine learning techniques are among the sought-after computational tools to help discover reliable subtypes. The reproducibility of studies employing automated computational tools heralds a promising future for precision medicine, where the discovery of groups of similar patients is of paramount importance for clinical trials, personalized treatment plans and early disease diagnosis. However, the current state of the art of using machine learning techniques for ALS has led many researchers in the community to raise the problem of low-quality, hard-to-validate studies. At the same time, an accumulating set of findings associates short- and long-term exposure to air pollution with an increased risk of developing ALS or of ALS worsening. In our study, we turn to an overlooked ALS prognosis tool to stratify patients using clustering techniques. A subsequent data profiling process allowed us to identify interesting air pollution profiles that corroborate findings about some air pollutants being a risk factor for disease onset or disease worsening. We have also studied how feeding environmental data to machine learning-based classification and survival analysis models affects the performance of such models. The task on which these models were assessed is the prediction of ALS-related events that mark an endpoint of the disease or the reaching of a serious disease stage.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        months to 18 months in general [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] but could be even longer reaching 27 months in some cases
[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. The incurable disease is characterized by a high degree of heterogeneity, making it hard
to diagnose [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The clinical heterogeneity of ALS is noticeable in terms of age at disease onset,
site of onset, disease progression rates as well as the manifestation of behavioural and cognitive
changes [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Duration of diagnostic delay is another heterogeneity factor of the disease [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The
variability of disease patterns is a long-debated problem in ALS research that hinders
the quest for cures that could stop, or at least slow, the progression of the disease for more than
a few months. This multi-factorial variability is considered to be the “greatest challenge” for ALS
researchers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Tackling the problem of finding reliable ALS subtypes by means of clustering
tools is a way to help researchers design more personalized treatment plans and more precise
diagnosis and prognosis tools.
      </p>
      <p>
        Through our participation in the iDPP 2023 competition, and knowing that ALS patient
stratification is among the goals of the "Brainteaser" project, we have attempted to design a
reliable clustering process whose outcome is reported in as much detail as possible,
so that clinicians can analyze it and suggest future improvements regarding the
expected characteristics of clusters. Being aware of the importance and the recency of studies on the
relationship between air pollutants and ALS, we thought that clustering patients would
help us study the potential added value of environmental data for disease progression-related
tasks such as the ones targeted by the iDPP 2023 competition [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ].
      </p>
      <p>
        The paper is organized as follows: Section 2 introduces related work;
Section 3 describes our approach; Section 4 discusses our main findings; finally, Section 5 draws
conclusions and outlines future work.
      </p>
      <p>2. Related Work</p>
      <p>2.1. ALS and exposure to air pollutants</p>
      <p>
Exposure to airborne particles has been identified as a risk factor for neurodegenerative
diseases, and ALS is no exception. Ambient air pollution is among the plausible
causes of oxidative stress and neuroinflammation, which could lead to neurodegeneration [
        <xref ref-type="bibr" rid="ref10 ref9">9,
10</xref>
        ]. According to many studies, ambient particulate matter (PM) is among the main culprits
behind neurotoxicity [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The smaller the particles, the more harmful they become [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We can
distinguish between PM10 or coarse particles (with diameter &lt; 10 µm), PM2.5 or fine particles
(with diameter &lt; 2.5 µm), PM1 or submicron particles (with diameter &lt; 1 µm) and UFPM/PM0.1 or
ultrafine particles (with diameter &lt; 100 nm) [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. Particulate Matter in general is a combination
of chemical and biological particles that differs by region, season, weather pattern and local
pollution source [
        <xref ref-type="bibr" rid="ref9">9, 11</xref>
        ]. In the ALS literature, the few studies [12, 13, 14, 11, 15] that examined
the effect of air pollutants on disease aggravation have focused, within the PM category, on coarse
and fine PM, alongside other air pollutants such as Ozone (O3) and Nitrogen
Dioxide (NO2). These studies have mainly examined the impact of long-term exposure and,
to a lesser extent, that of short-term exposure. All of them are
association studies between levels of exposure to one or more air pollutants and the aggravation
of ALS. This aggravation was sometimes measured through “surrogate” indicators such as
hospital admission with ALS as the primary admission reason [13, 11]. The conclusions of studies
investigating the relationship between pollutants and ALS aggravation (or risk) diverge
on whether a given pollutant is associated with ALS risk. Of all the
studied air pollutants, Nitrogen Dioxide seems to be a risk factor for ALS on which many studies agree
[16, 13]. For PM10 and O3, no consensus has been reached on whether they
increase the risk of ALS. As for PM2.5, the covered studies mostly agree on its hazardous
impact on ALS aggravation or risk of occurrence. The subtlety about PM2.5 is that its
composition, and not its concentration level within a region, is what makes it harmful and thus
more likely to coincide with people in a region being affected by ALS [16, 11].
      </p>
      <p>2.2. Clustering for ALS patient stratification</p>
      <p>
Unsupervised Machine Learning techniques such as clustering are helpful in finding groups
of similar patients according to their characteristics (demographic, clinical, lab data, genetic,
imaging-derived, etc.). The advantage offered by such techniques is that they are data-driven.
They spare clinicians the burden of ‘manually’ dividing patients,
usually characterized by a large number of features, into separate groups using
criteria derived from the literature or inspired by their own medical experience. For diseases
such as ALS, characterized by high heterogeneity, clustering techniques could be
of great help in uncovering clusters or ‘profiles’ of patients within a studied population that
might be indiscernible to a human expert (clinician). However, the improper use of data-driven
techniques and the adoption of their outcomes, if applied to datasets commonly used in a given
field of study, could have catastrophic consequences. In the ALS community, in addition to the
scarce use of such techniques, the outcomes of the few studies relying on clustering for patient
stratification have never been quantitatively examined.
      </p>
      <p>
        During the last few years, some researchers have been voicing their discontent about the
improper use of clustering techniques for ALS patient stratification [
        <xref ref-type="bibr" rid="ref1">1, 17</xref>
        ]. For example, a
recent review of ML-based stratification and prediction models for ALS [17] concluded
that data-driven stratification techniques for ALS lack validation as well as reproducibility. In
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the design of clustering studies as a whole is contested.
      </p>
      <p>In fact, many criteria should be satisfied to ensure that clustering results are reliable. The
most important of them is the choice of a similarity metric that reflects domain expertise about
what makes patients close to each other in terms of data characteristics [18]. The
second consequential step is to ensure the absence of the distance concentration phenomenon
with respect to the chosen metric [19, 20], that is, to assess whether the data representation of
patients, paired with the chosen similarity metric, can inherently lead to distant, separable and
non-interleaving groups. Testing for the absence of distance concentration
under the chosen metric is a statistical guarantee of data clusterability [21], especially for
high-dimensional data. The no less important step of choosing the clustering algorithm itself
is decisive for the outcomes to be obtained, since the whole progression/diagnosis process
will be built on such outcomes. In a context where data tend to form clusters of different
shapes and densities, with no unanimous agreement on the possible number of clusters
to obtain, utmost care should be given to the choice of a clustering algorithm. Finally, the
internal validation of the clustering outcome, which is generally an index that assesses the
quality of the obtained clusters, also has to encode the reality of the domain expertise. Internal
clustering validation techniques generally tend to say whether a chosen/produced number
of clusters is correct compared to other partitions (stratifications) with different numbers of
clusters. Such ‘correctness’ differs from one validation method to another and depends on the
way the validation method defines ‘dense’ and ‘separate’ clusters. Another overlooked problem
in clustering validation is the mismatch between the chosen validation metrics and the applied clustering
algorithm. Even in the ML community working on clustering validation, neglecting one or more of the
above-mentioned steps is rampant. The problem itself has not drawn much attention within
the ML community since the start of the second spring of Neural Networks (a.k.a. Deep
Learning and Differentiable Programming) a decade ago.</p>
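      <p>As an illustration of the distance concentration check mentioned above, the following minimal Python sketch compares the relative contrast of pairwise distances in low and high dimensions. It is a toy example under stated assumptions and not the procedure used in this study: the data are synthetic uniform samples, the metric is Euclidean rather than a domain-specific metric, and the names relative_contrast and uniform_sample are hypothetical helpers introduced only for this sketch.</p>
      <preformat><![CDATA[
```python
import math
import random


def relative_contrast(points):
    """Average over all points of (d_max - d_min) / d_min to the other points.

    Values near zero indicate that nearest and farthest neighbours are
    almost equally distant, i.e. distances have concentrated.
    """
    contrasts = []
    for i, p in enumerate(points):
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        d_min, d_max = min(dists), max(dists)
        if d_min > 0:  # skip exact duplicates to avoid division by zero
            contrasts.append((d_max - d_min) / d_min)
    return sum(contrasts) / len(contrasts)


def uniform_sample(n_points, n_dims, seed=0):
    """Synthetic data: n_points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]


# Assumption: sample sizes and dimensionalities are arbitrary toy values.
low_dim = relative_contrast(uniform_sample(200, 2))
high_dim = relative_contrast(uniform_sample(200, 500))
print(f"relative contrast in 2 dimensions:   {low_dim:.2f}")
print(f"relative contrast in 500 dimensions: {high_dim:.2f}")
```
]]></preformat>
      <p>A relative contrast close to zero suggests that nearest and farthest neighbours are almost equally distant under the chosen metric, in which case any partition returned by a clustering algorithm should be treated with caution.</p>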
      <p>Even though problems pertaining to clustering validation are not fully solved or adequately
addressed, the last thorough review of internal clustering validation techniques dates back to
2013 [22]. As for external stratification validation, a better performance of a disease progression
model could be thought of as a surrogate for a better stratification of patients. However, in
a brief preliminary study, we have shown that a better performance of a disease progression
model predicting the ALSFRS-R slope does not entail a better clustering quality [23]. In the same
study, we also managed to slightly improve the disease progression model by applying
the same Random Forests model to clusters of patients validated by an ensemble of internal
clustering validation metrics. In the 2015 ALS patient stratification challenge, for the same
task of disease progression, the winning model treated patients as a single group and outperformed
many submissions that used various numbers of clusters and different clustering
approaches [24]. To the best of our knowledge, the quality of the clusters obtained by the different
participating groups was never assessed.</p>
      <p>In [25], the authors used K-Means as a clustering algorithm and the Euclidean distance as a
similarity metric (the default choice, even if it is not mentioned). For ALS patient stratification,
the approach followed by the study raises several problems. First, the Euclidean distance
is not suited to mixed-type data (where categorical and continuous data are present). Second,
K-Means (which can be seen as a restricted instance of Gaussian Mixture Models) assumes that each cluster is normally
distributed around a mean, so that clusters tend to form circular regions centred on their means (in 2D) with a
similar number of patients in each cluster (the K-Means uniform effect [26]). Third,
simply executing the clustering algorithm for different cluster numbers (ranging from 2 to 15)
and concluding that a given cluster number is the most adequate one according to the value of the K-Means
cost function is fallacious. K-Means as an algorithm is not
capable of returning the whole dataset as a single cluster if the data are not clusterable. Thus,
any given cluster number will provide the user with a clustering. Finally, the t-SNE
projection used in the study to demonstrate the quality of the stratification clearly contradicts the
claim that the clustering method is adequate for the data. In fact, the provided projection
shows that the clustering algorithm divided the dataset into balanced sets of patients,
even in the case of a cluster that should have been kept as is without further division. Similar
examples are widespread in the data-driven ALS patient stratification literature.</p>
      <p>Using a neural network (a feed-forward neural network or an autoencoder) with a
differentiable K-Means loss function would have the advantage of learning a new data representation
(with a lower dimensionality) but would not remedy the inherent assumptions of the K-Means
loss function. In short, patient stratification using clustering techniques within the
ALS research community is plagued by many flawed design choices.
      </p>
      <p>3. Methodology</p>
      <p>3.1. Feature Engineering</p>
      <p>3.1.1. Features obtained from static data</p>
      <p>
The iDPP dataset contains a large number of variables whose missingness rates make
them unexploitable for data analysis (percentage of missing values &gt;65%). In our study, we
deemed that some features, even if highly missing among the studied population, could
be informative if grouped with semantically related features. Hence, eight features
related to different recorded diseases were grouped under a single categorical feature dubbed
‘Disease_History’. For the created feature, we found more than 75 different combinations
of diseases. Another related feature counting the number of diseases for each patient was also
created (‘Number_of_Diseases’). This second feature was thought to be helpful in profiling
groups of patients regardless of the details of their disease history. Along the same
lines, features encoding the presence of ALS-characteristic gene mutations were transformed,
since they were fragmented over two centers (Turin and Lisbon). We also considered the
possibility of finding regional clusters of ALS, especially when augmenting the static and visits
data with environmental data. Therefore, a feature called ‘Nationality’ was created using the
presence of values in variables with the suffixes Turin and Lisbon.</p>
      <p>
        Bearing in mind that our objective is to better profile different groups of patients showing
similar disease progression patterns, we divided the information carried by ‘onset_limb_type’
into separate groups of anatomical locations. The groups were Upper/Lower, Right/Left and
Proximal/Distal. Encoding the anatomical groups led us to discover a set
of instances where the anatomical description of the affected limb was present while ‘onset_limb’
was set to False. The diagnosis delay, defined as the difference between the diagnosis date and the disease
onset date, was also included in our feature set extracted from static data. As for features related
to height and weight, we used them to compute two features, ‘BMI_Before_Onset’ and
‘BMI_at_Onset’. The remaining features whose missingness rate exceeds 65% were deleted. To
make the set of static data features exploitable by ML algorithms, we
simply replaced missing values with the same constant value.
      </p>
      <p>3.1.2. Features obtained from visits data</p>
      <p>
ALS staging system outcomes are believed to serve as a "potential input" to patient stratification
methods according to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Consequently, in order to uncover the heterogeneity of patients
belonging to the studied population, we considered relying on different staging systems for ALS.
The MiToS [27], Fine till 9 [28] and D50 [29, 30] staging systems were employed. The King’s staging
system requires information about the introduction of gastrostomy for ALS
patients. Usually, this information is known if answers to item 5 of the ALSFRS-R questionnaire
are present with two suffixes (A and B) [31]. In our case, this type of information, namely
PEG (Percutaneous Endoscopic Gastrostomy), was used as an event to be predicted. The absence of
this information made the use of the King’s staging system impossible in our case.
      </p>
      <p>For the three employed staging systems, a common piece of information can be
obtained, namely the set of stages reached by each patient. MiToS and Fine till 9 can provide
us with a more dynamic description of the disease progression at every new visit, whereas the
D50 staging system, being based on a parametrized exponential disease progression model,
provides a possible staging that could be optimistic, pessimistic or even misleading (when
visits data are scarce and their corresponding ALSFRS-R scores do not fall into at least two different
ranges). However, with enough data meeting the requirements of the model, we can
get a standardized view of the disease progression of patients, even with different disease progression
dynamics. The second set of details that could differentiate patients is the set of affected domains
or functional groups at each visit. This type of information is only provided by the
MiToS and Fine till 9 staging systems. The other reason for using two staging systems other
than D50 is the difference between them regarding their “sensitivity”
to different phases of ALS progression. At an early phase of disease progression, MiToS
tends to be less sensitive to changes occurring in the patient, while Fine till 9 was reported to
be more sensitive to changes taking place at the same early phase [28, 32].
Differences between both systems could help us identify fine-grained groups of patients with
heterogeneous progression patterns.</p>
      <p>Features extracted from visits data were categorical (nominal and ordinal) and continuous.
The information carried by the engineered features falls under one of the following categories:
affected regions, reached disease stage, duration of each stage, disease progression severity and
frequency of visits. For MiToS and Fine till 9, we distinguish two types of affected zones for a
patient. MiToS looks for affected functional domains “that involve loss of autonomy” [33], while
Fine till 9 helps identify affected functional groups (bulbar, fine motor, gross motor and
respiratory). For each patient, after running the MiToS and Fine till 9 staging algorithms, we
kept the affected zones for the initial and last visits. Each sequence of affected zones constitutes
a unique category, so that a pair of initial and final sets of affected zones can
differentiate patients from each other. Staging systems also enable the quantification of the
reached disease stage for each newly fed ALS visit record (answers to the ALSFRS-R questionnaire
and the total ALSFRS-R score). With such information available, the sequence of
reached stages for each patient can be generated. The obtained sequences of stages can reveal
disease reversals and cases where the disease progression is stable or has stabilized. The sequences
were simplified so that a patient whose stage is stable over successive visits has a single
stage taken into account for those visits. Coupling the inferred sequence with visit times, the
duration of each stage for each patient is computed. Once the durations were computed, we
normalized them with respect to the overall disease duration. The caveat in determining the
normalized disease stage duration is the absence of a disease endpoint in the training data (death or
tracheostomy) [34]. To circumvent the absence of a unique disease endpoint for patients (in the
training set), we assumed that the endpoint is the prediction horizon chosen for the competition.
Consequently, the stage reached by a patient during the last recorded visit is assumed to last
until the prediction horizon. Our assumption follows the approach adopted by the
competition organizers to compute the ALSFRS-R slope. The slope was computed by supposing
that a ‘virtual visit’ took place at the date of disease onset and thus the perfect ALSFRS-R score
remained unchanged until the occurrence of the first ‘real’ ALS visit. The principle of assuming
the absence of any change in disease state in the absence of real data remains questionable,
especially for a disease characterized by high heterogeneity and a short lifespan after onset. This
issue is briefly discussed in the results section.</p>
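      <p>The simplification of stage sequences and the normalization of stage durations described above can be sketched as follows. This is a minimal illustration under assumptions that are not stated in the text: visits are given as (day, stage) pairs sorted by day, time is counted from the first recorded visit, and the function names collapse_stages and normalized_stage_durations as well as the example values are hypothetical.</p>
      <preformat><![CDATA[
```python
def collapse_stages(visits):
    """Keep only the first visit of each run of identical consecutive stages.

    visits: list of (day, stage) tuples sorted by day.
    """
    collapsed = []
    for day, stage in visits:
        if not collapsed or collapsed[-1][1] != stage:
            collapsed.append((day, stage))
    return collapsed


def normalized_stage_durations(visits, horizon_day):
    """Return (stage, fraction of follow-up spent in that stage) pairs.

    Assumption taken from the text: the stage observed at the last
    recorded visit is supposed to last until the prediction horizon.
    """
    runs = collapse_stages(visits)
    total = horizon_day - runs[0][0]
    if total <= 0:
        raise ValueError("horizon must fall after the first visit")
    durations = []
    for idx, (start, stage) in enumerate(runs):
        end = runs[idx + 1][0] if idx + 1 < len(runs) else horizon_day
        durations.append((stage, (end - start) / total))
    return durations


# Hypothetical patient: stage 1, then stage 2, then a reversal to stage 1.
visits = [(0, 1), (90, 1), (180, 2), (270, 2), (400, 1)]
durations = normalized_stage_durations(visits, horizon_day=720)
for stage, fraction in durations:
    print(f"stage {stage}: {fraction:.3f} of the follow-up period")
```
]]></preformat>
      <p>Because repeated stages are collapsed only when they are consecutive, a disease reversal such as the return to stage 1 in the example remains visible in the resulting sequence.</p>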
      <p>For the third staging system used in our study, D50, we kept two types of information:
the date at which the patient is expected to lose 50% of his functional abilities (or D50) and the
final standardized stage reached by the patient according to rD50 values (visit dates transformed
using the D50 parameters inferred for a given patient). The final standardized stage is determined
using experimental intervals from the literature. According to [35], a patient could be in one of 3
stages: early stable phase of progression (0&lt;rD50&lt;0.25), early progressive phase of progression
(0.25&lt;=rD50&lt;0.50) and late progressive and stable progression phase (0.5&lt;=rD50).</p>
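      <p>The mapping from an rD50 value to the three phases listed above can be written as a small lookup. The sketch below only encodes the intervals quoted in the text; the function name rd50_phase is hypothetical, and the computation of rD50 itself from the fitted D50 parameters is deliberately left out.</p>
      <preformat><![CDATA[
```python
def rd50_phase(rd50):
    """Map an rD50 value to one of the three phases quoted in the text.

    Intervals: 0 < rD50 < 0.25, 0.25 <= rD50 < 0.50, and 0.5 <= rD50.
    """
    if rd50 <= 0:
        # The quoted intervals start strictly above zero.
        raise ValueError("rD50 must be strictly positive")
    if rd50 < 0.25:
        return "early stable"
    if rd50 < 0.50:
        return "early progressive"
    return "late progressive and stable"


# Hypothetical rD50 values, one per phase plus a boundary case.
for value in (0.10, 0.25, 0.40, 0.80):
    print(value, "->", rd50_phase(value))
```
]]></preformat>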
      <p>To represent the disease progression speed, we computed the early and late ALSFRS-R
slopes, where the early slope is the rate of ALSFRS-R points lost per month between the disease
onset and the date of the first visit, and the late slope is the same rate computed between the first
visit and the last one. Both quantities were negative (in contrast to the slope variable present
in the static data of patients). Categorizing the speed based on the computed slopes was intended
as a next step; however, the literature was not conclusive on clear-cut
thresholds to map slopes to progression categories given certain time periods. The other issue
with slopes in our case is the high variability of the time before the first ALS visit.
      </p>
      <p>3.1.3. Features extracted from environmental time-series</p>
      <p>
The static representation of the different time series of environmental particles was preceded by two
main exploratory data analysis procedures: counting the presence of environmental data for each
particle across patients, and visualizing time series along with ALSFRS-R curves. We
noticed that not all patients have environmental data. For the particles whose data are the
most present among patients, only 26% of patients belonging to datasets A, B and C have
such data. Those particles are the atmospheric/weather-related particles. The
presence of pollutants data is highly variable, ranging from 3% to 21%. Looking for patients
having data related to all particles at once was fruitless: none of the patients belonging to the
training set had time-series records for all 15 environmental particles covered by
the iDPP dataset. The objective of the competition is to investigate the impact of exposure
to pollutants on the performance of models predicting the risk of medical events. As a
result, we focused our preliminary analyses on studying the presence of pollutants data among
patients as well as their patterns with respect to disease progression. In addition to studying
the presence of each pollutant's data independently of other pollutants, exploring the number
of patients having records of two pollutants out of seven showed the need to remove the
least frequent ones. For example, only two patients in the training set (all datasets included)
had data on 66 and  2. The most frequent joint occurrences are (PM10, O3) with 234
patients and (PM2.5, O3) with 188 patients out of 467 patients.</p>
      <p>Visualizing the time series of the most frequent pollutants, namely PM10, PM2.5 and O3 (for
patients having records of one of the three pollutants), along with the ALSFRS-R curve revealed
patterns such as the absence of time-series data for a long period starting from
the date of the first ALS visit, or ALSFRS-R records existing only before or after the period covered by
the pollutant data (the two curves sharing no common time period). In other words,
in addition to the irregularity of ALSFRS-R entries, pollutant time-series data are
not always usable to study the correlation between high exposure and disease progression, let alone
to study causality between exposure to one or more pollutants and disease worsening. This
visualization procedure was mainly performed on patients with fast disease progression
patterns.</p>
      <p>In order to avoid a high rate of missing data in the features representing environmental
data, we restricted ourselves to the most frequently present pollutants, namely PM10, O3 and PM2.5.
The European Air Quality Index (EAQI) chart provided us with different levels of exposure to
each of the three pollutants. Those levels were used to devise two features for each pollutant:
ratio_medium and ratio_poor. Ratio_medium represents the ratio of days during which a patient
was exposed to medium concentrations of a pollutant over the total number of days of exposure.
Ratio_poor stands for the sum of the ratios of exposure to high, very high and extremely high
concentrations of a pollutant. In addition to these two features, we added
the number of exposure days and basic aggregate statistics about pollutant time series, such as the
Min, Max and Mean concentrations of pollutants based on the ‘measurement’ feature present
in the competition dataset. The presence of different combinations of the most frequent pollutants
among the set of patients was also included in the form of Boolean features encoding the
joint existence of environmental data (presence of (PM10, PM2.5, O3), presence of (PM10,
PM2.5), presence of (PM10,  2, O3) . . . ).
      </p>
      <p>3.2. Identification of patient clusters and environmental profiles</p>
      <p>3.2.1. Patient Stratification</p>
      <p>
To extract clusters of patients, we relied only on features extracted from visits
data, with the exception of the diagnosis delay obtained from static data. Since the data to be
clustered were a mix of categorical and continuous attributes, we chose the Gower distance as the
similarity metric. We checked for the absence of the distance concentration phenomenon
using the distribution of distances. Then we applied t-SNE for dimensionality reduction and
data visualization. Patients’ data represented by the obtained t-SNE embedding were clustered
using the HDBSCAN algorithm [36]. The algorithm was chosen for its capacity to extract clusters
with varying densities (and shapes) without the need to provide it with a pre-specified number
of clusters. The stratification process was followed by a detailed profiling of each
obtained cluster (Tables A.1, A.2 and A.3).
      </p>
      <p>3.2.2. Pollution profiles of stratified patients</p>
      <p>
Note that features extracted from static data and environmental
time-series were not used for the clustering of patients. Having quantified exposure rates according
to the pollution levels specified by the EAQI, we used them to create pollution profiles. A
pollution profile is characterized by a collection of features that provide an overview of air
pollution exposure for a given set of patients. In our case, that profile is given by the rates of
exposure to medium and poor levels of 3 different pollutants: PM10, PM2.5 and O3. This exposure is measured
by the ratio of the number of days of exposure to medium and higher concentrations of a particle
over the whole exposure duration to the same particle. These ratios are carried by the
features &lt;Particle_Name&gt;_ratio_Poor and &lt;Particle_Name&gt;_ratio_Medium.</p>
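      <p>The exposure features defined in Section 3.1.3 and reused here for pollution profiling (the two ratios, the number of exposure days and the Min, Max and Mean statistics) can be sketched for one patient and one pollutant as shown below. The thresholds in the example are placeholders chosen for illustration and are not the EAQI band limits; the function name exposure_features and its parameters medium_low and poor_low are likewise hypothetical.</p>
      <preformat><![CDATA[
```python
from statistics import mean


def exposure_features(daily_values, medium_low, poor_low):
    """Summarise one pollutant time series for one patient.

    daily_values: one concentration per day of recorded exposure.
    medium_low:   lower bound of the 'medium' band (placeholder threshold).
    poor_low:     lower bound of the 'poor' band, which here groups all
                  higher bands together, as ratio_poor does in the text.
    """
    if not daily_values:
        raise ValueError("no exposure days recorded")
    if medium_low >= poor_low:
        raise ValueError("medium_low must be below poor_low")
    n_days = len(daily_values)
    n_medium = sum(1 for v in daily_values if medium_low <= v < poor_low)
    n_poor = sum(1 for v in daily_values if v >= poor_low)
    return {
        "ratio_medium": n_medium / n_days,
        "ratio_poor": n_poor / n_days,
        "n_exposure_days": n_days,
        "min": min(daily_values),
        "max": max(daily_values),
        "mean": mean(daily_values),
    }


# Hypothetical five-day series with made-up thresholds (not EAQI values).
features = exposure_features([10, 45, 60, 120, 30], medium_low=40, poor_low=100)
print(features)
```
]]></preformat>
      <p>Computing the ratios over the days with recorded exposure, rather than over the whole follow-up, keeps the features comparable between patients whose environmental records cover periods of different lengths.</p>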
      <p>To measure the similarity between clusters in terms of exposure to a certain pollution level
of a pollutant, we estimated the Probability Density Functions of the pollution ratios of each
particle within each cluster of patients. Then, we employed the Jensen-Shannon divergence
to obtain a similarity matrix for each particle and each level of exposure across all clusters. The
next step was to group patient clusters given the computed similarity matrices. Given that the
overall number of clustering operations is equal to the number of particles times the number of
exposure levels (Medium and High), a co-occurrence matrix was computed. It represents
which patient clusters had a similar exposure ‘footprint’ with respect to different pollutants and
exposure levels. That co-occurrence matrix serves as a new input to another clustering
process to obtain the sought pollution profiles. For all the clustering operations, HDBSCAN
was used as the clustering algorithm. The process described above is inspired by
ensemble clustering.
      </p>
      <p>3.3. Risk Prediction Models</p>
      <p>
Risk prediction for the 3 provided datasets was cast both as a survival analysis task and as a
classification task. For the classification task, we relied on an ensemble of classifiers whose number,
weights, algorithms and parameters were determined by an Automated Machine
Learning method [37]. The ensemble was validated through a 5-fold cross-validation process
that maximized the accuracy score of the ensemble classifier. Two separate ensemble classifiers
were retained, each trained on a different data representation (data with
environmental features and data without environmental features). The time allotted to the automatic
search process was the same for both ensembles. For the survival analysis
model, we relied on Survival Random Forests. The hyperparameter search looked
for the set that maximizes the C-index of the trained model. The same
model was applied to data with and without environmental features.</p>
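      <p>For readers unfamiliar with the C-index used as the search objective, a minimal sketch of Harrell's concordance index is given below. It is an illustration under simplifying assumptions (pairs with tied observed times are skipped) and not the evaluation code used for the submitted models; the function name and argument layout are hypothetical.</p>
      <preformat>
```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data.

    times: observed time for each patient (event or censoring time).
    events: 1 if the event was observed, 0 if the patient was censored.
    risk_scores: a higher score means the model expects an earlier event.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            # Patient i had the event strictly before patient j's observed time.
            if times[j] > times[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```
</preformat>
      <p>Under this definition, a value of 1.0 corresponds to a perfect ranking of patients by risk, while 0.5 is what a constant or random risk score would obtain.</p>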
      <p>For both scenarios, the hyperparameter search was executed on a single dataset:
dataset A for the survival analysis case and dataset B for the classification case. Models with
the retained set of hyperparameters were then used to predict outcomes for the remaining
datasets. For the classification case, the ensemble was trained once on dataset B, and predictions
for tasks A and C were based on the ‘knowledge’ acquired from the data of task B. In other words,
for the classification scenario we performed a basic transfer of learnt ‘knowledge’ from
task B to make predictions for the two remaining tasks.</p>
      <p>We also made some initial attempts to train our models on the obtained clusters of patients,
but the preliminary results were not encouraging, due in part to the small size of some clusters
(especially for patients with environmental data). We also paid attention to
the substantial class imbalance across the different tasks. To address it,
we tried a two-step data augmentation scenario. The first step consisted of
undersampling the majority class. The next step was to generate synthetic data for both classes using
deep learning-based methods (2 variants of Variational Autoencoders [38]). The original and synthetic
data were merged, and the obtained models were applied to the blended ‘new’
dataset. Again, the preliminary results showed a decrease in the performance of the survival and
classification models compared to using the original data alone in the training process. Little
to no feature selection was performed during our experiments.</p>
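      <p>The first step of the augmentation scenario, undersampling the majority class, can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name, the ratio parameter and the fixed seed are assumptions made for the example, and the synthetic data generation step with Variational Autoencoders is not covered.</p>
      <preformat>
```python
import random
from collections import Counter


def undersample_majority(rows, labels, ratio=1.0, seed=0):
    """Randomly drop majority-class rows until majority is about ratio * minority."""
    counts = Counter(labels)
    majority, _ = counts.most_common(1)[0]
    minority_size = min(counts.values())
    # Never try to keep more majority rows than actually exist.
    keep_majority = min(counts[majority], int(ratio * minority_size))
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept = set(rng.sample(majority_idx, keep_majority))
    # Preserve the original row order among the selected samples.
    selected = [i for i, y in enumerate(labels) if y != majority or i in kept]
    return [rows[i] for i in selected], [labels[i] for i in selected]
```
</preformat>
      <p>With ratio set to 1.0 the two classes end up with the same number of samples; a larger ratio keeps more of the majority class.</p>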
    </sec>
    <sec id="sec-2">
      <title>4. Results</title>
      <p>4.1. Patient Stratification Results
The proposed clustering procedure yielded 15 clusters and a cluster of outliers. From the
t-SNE projection of the data, where clusters are distinguished by different colors (Figure 1), we
can see that there are 4 main groups of clusters. Further details about the general characteristics
of each cluster are provided in Tables A.1, A.2 and A.3 of Appendix A.</p>
      <p>Clusters on the left side of Figure 1 contain patients with a single visit but
with different characteristics. Clusters at the center of the figure (Clusters 11, 12, 13 and 14) are
the clusters with the most noticeably rapid disease progression patterns.</p>
      <p>The blue cluster (labeled "Cluster 1") at the uppermost right side of the figure is a cluster of
patients with a very slow to constant disease progression rate (in terms of early and late ALSFRS-R
slopes). In fact, it contains the highest number of patients with constant disease progression after
the first ALS visit (both for the whole study population and for the population of patients with
environmental data). Out of 174 patients (belonging to the whole study population) with late
ALSFRS-R slopes equal to zero, 62 belong to "Cluster 1", which corresponds to 35%
of the patients with constant disease progression after the first ALS visit. The cluster with the
second highest number of patients with constant progression is "Cluster 3", situated below
"Cluster 1" in Figure 1. It contains 16% of the patients with constant disease progression. The groups
labeled "Cluster 7" and "Cluster 6" come next with 15.5% and 12.6% of the patients with
constant progression, respectively. Among the most prominent differences separating "Cluster 1"
from clusters 3, 6 and 7 is the last Fine till 9 stage reached. In the first cluster, 100% of patients
stayed at stage 0 of that staging system, whereas all patients in
clusters 3, 6 and 7, without exception, reached stage 1 of the Fine till 9.</p>
      <p>Turning to the clusters visible at the exact center of the figure, "Cluster 11" is the most interesting
one in terms of speed of disease progression, with an ALSFRS-R slope going from -0.81 to -1.89
and a D50 stage of 2 for all the patients. Patients of "Cluster 13" fare worse than those of
"Cluster 11", with a mean ALSFRS-R slope decreasing from -0.75 to -2.2. Interestingly enough, the
fast progressors of "Cluster 13" have a pollution profile very similar to that of the slow progressors
of "Cluster 6" according to our results. Details about both pollution profiles are provided in
the next section.</p>
      <p>"Cluster 12" is characterized by interesting properties in terms of progression speed, as its
mean ALSFRS-R slope goes from -1.4 to -2.58. More than half of the patients of this cluster
reached an advanced Fine till 9 stage (stage 4) and the rest of them reached stage 3. In 32.9% of
the cases, patients belonging to this cluster had lost 3 functional groups, and 8.2% of them
had their first ALS visit with 4 lost functional groups. This clearly entails that those patients
had reached stages 1 and 2 long before their first visit. Consequently, if we extract the
time spent at each stage for such patients, we wrongly obtain a mean standardized duration of 65%
at stage 0 and near 5% for stages 1 and 2. Such obviously misleading durations result from
assuming that patients had a perfect ALSFRS-R score from disease onset until their
first ALSFRS-R visit. Even if such an absence of data would probably not occur in clinical trials,
coming up with methods to estimate approximate dates of transition from stage 0 to stages 1 and 2
in such cases would make both stratification and disease progression models more reliable. The D50
progression model is not capable of filling the gap for various reasons. The most important among them
are the subjectivity of the ALSFRS-R scale, the constraining
need to provide the model with data points from different ALSFRS-R ranges, and its tendency
toward a nonlinear decrease, which does not account for cases of reversals (if they really
occur) or for improvement of functionality in the different regions covered by the ALSFRS-R
scale.</p>
      <p>Turning now to ALS reversals, we detected 47 cases among the 1787 patients (2.6%)
in the training data pool for the different tasks of the competition.
Over 21% of the reversal cases belonged to "Cluster 6", 19% were part of "Cluster 2"
and over 12% belonged to "Cluster 1". The MiToS and Fine till 9 staging systems
indicated reversals in 6 and 16 cases respectively. In one case, Fine till 9
returned a reversal from stage 3 to stage 0 (patient "0x73a002c9518c344906b2929c4a299cf9").
According to the visits data, the ALSFRS-R score of this patient increased from 35 to 46
in less than 4 months. As for MiToS, we found only one noticeable case, a
patient who went back from stage 2 to stage 1 (patient "0xabfbe3b27f02e837e604375dfddb6978").
For the same patient, the last Fine till 9 stage was 4.
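As a hedged illustration of how stage reversals could be flagged from a sequence of visits, the sketch below assumes that a reversal is any decrease in stage between two consecutive visits. The exact criterion used in the study is not stated in this section, and the function names are hypothetical.
<preformat>
```python
def find_stage_reversals(stages):
    """Return (visit_index, previous_stage, new_stage) for every drop in stage."""
    reversals = []
    for k in range(1, len(stages)):
        if stages[k - 1] > stages[k]:
            reversals.append((k, stages[k - 1], stages[k]))
    return reversals


def largest_reversal(stages):
    """Largest single drop in stage between consecutive visits (0 if none)."""
    drops = (prev - new for _, prev, new in find_stage_reversals(stages))
    return max(drops, default=0)
```
</preformat>
For a chronologically ordered list of stages such as [0, 1, 3, 0, 1], the first function reports a single reversal at the fourth visit, from stage 3 back to stage 0.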
</p>
      <p>4.2. Environmental profiles of stratified patients
Applying the method described in the methodology section yielded 3 clusters of
pollution profiles and 1 cluster of outliers. Patients’ "Cluster 10" was among the outliers. In fact,
the computed Jensen-Shannon dissimilarity matrices showed that the pollution profile
of "Cluster 10" was the most distant from the rest of the profiles. It is characterized by the lowest
exposure rates to PM10 and PM2.5. The profiles of Clusters 4 and 8 also differed from the rest of
the profiles but were less distant than that of "Cluster 10". The pollution profile of Cluster 4 is quite specific
in terms of the mean rate of exposure to high concentrations of PM2.5. As a matter of fact,
it is the cluster whose patients were exposed to high concentrations of PM2.5 particles
for the longest period of time (almost 30% of 245 days on average). Patients of "Cluster 12" had
been exposed to high PM2.5 concentrations in a way very similar to those of "Cluster 4". Even though
the pollution profile of "Cluster 12" had more similarities with ‘clusterable’ pollution profiles
than the rest of the outliers, it was also labelled as an outlier. Yet, it was almost equal to the ‘high
profile’ Clusters 11 and 13 in the rate of exposure to medium concentrations of PM2.5.</p>
      <p>None of the pollution profiles of the 15 clusters were perfectly similar across the probed
pollutants PM10, PM2.5 and O3. However, we obtained interesting groupings of the pollution
profiles of Clusters 1 and 14 on the one hand and of Clusters 6 and 13 on the other. Clusters 1
and 14 shared similar exposure profiles for PM10 (Medium
and High concentration levels) as well as for PM2.5 at the High concentration level. Both clusters had
the same distribution of exposure levels to medium PM10 concentrations. Clusters 6 and
13 were characterized by similar exposure rates to Medium and High PM10 concentrations
as well as similar exposure to Medium O3 concentrations. They were dissimilar regarding their
exposure to high and medium PM2.5 concentrations.
</p>
      <p>4.3. Disease risk prediction results
As explained in the methods section, the outcome of the patient clustering was not
exploited by the models used to generate our submitted results. In the present section,
we describe and comment on different aspects of the performance of our models for the
different tasks.</p>
      <p>Figure 2 depicts the C-index scores obtained by the different models submitted to the competition.
On the x-axis, ‘CE’ stands for Classifier Ensemble and ‘SuRF’ stands for Survival Random Forests. Each
model type is followed by the task name. Blue bars stand for models using static and visits
data only, and orange bars represent models that also use the environmental data.</p>
      <p>We now analyze the performance of the submitted models from a classification perspective. For
task A, as we can see from Figure 3, the introduction of environmental features does not seem
to benefit either model. On the other hand, the classifier did better than the survival analysis
model both in terms of AUROC over time and in terms of the confidence of its predictions in the
long run. As we go further in time, the confidence interval of the survival model widens while
that of the ensemble classifier keeps the same width throughout the months following the prediction
horizon.</p>
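      <p>To make the AUROC over time comparison concrete, the sketch below computes AUROC as the fraction of positive/negative pairs ranked correctly and evaluates it at several prediction horizons. It is a simplified illustration, not the competition's evaluation code: censoring is ignored (every patient is assumed to be followed past each horizon), and the function names and inputs are hypothetical.</p>
      <preformat>
```python
def auroc(labels, scores):
    """AUROC as the probability that a positive case outranks a negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tied scores count as half a correct ranking
    return wins / (len(positives) * len(negatives))


def auroc_at_horizons(event_times, scores, horizons):
    """AUROC per horizon; a patient is positive if the event occurred by then.

    Simplification: censoring is ignored, so every patient is assumed to be
    observed beyond each horizon.
    """
    return {
        h: auroc([1 if h >= t else 0 for t in event_times], scores)
        for h in horizons
    }
```
</preformat>
      <p>A horizon at which all patients share the same label raises an error in this sketch, since AUROC is undefined in that case.</p>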
      <p>For task B, we can clearly see in Figure 4 that the ensemble classifier achieves its best
AUROC at 36 months. It exceeds its performance in tasks A and C, and the
reason is obvious: the model was trained and ‘hyper-parametrized’ on task B data. Noticeably,
the confidence interval of the ensemble classifiers is at its widest at 12 months. The
survival analysis models tend to markedly outperform the ensemble classifiers over the period
from 12 to 30 months when environmental features are introduced. Then, the ensemble
classifier reclaims the lead.</p>
      <p>Finally, considering Figure 5, which depicts AUROC for task C, the first noticeable phenomenon
is how unconfident and inaccurate the ensemble classifier was when trying to predict
the occurrence of death for a given patient. From that observation, we can assume that the
knowledge transferred from task B was mainly related to the occurrence of events other
than death. The co-existence of another event alongside the ‘death’ event mitigates the propensity
of the classifier to perform badly. The introduction of environmental data to the ensemble
classifier goes hand in hand with performance deterioration. This phenomenon was consistently
present when assessing the classifier performance in two different ways (C-index and
AUROC).</p>
      <p>All in all, we can conclude that introducing the environmental features to the survival analysis
model made it perform better for all the tasks, sometimes by a significant margin with respect
to its counterpart trained on static and visits data. The opposite was the case for the ensemble
classifier model. However, from an AUROC point of view, the ensemble classifier model managed
to beat the survival model in the majority of cases.</p>
      <p>Quantitatively speaking, the ensemble classifier trained on data without environmental
features and pre-trained on subtask 3-B data obtained the best AUROC results at
prediction horizons beyond 18 months. This holds true for the 3 subtasks: 3-A, 3-B and
3-C. Hence, we can conclude that exploiting knowledge gained from one subtask is beneficial
for the other subtasks when a large proportion of patients is shared between similar
prediction tasks. The survival analysis model gets its best results for prediction horizons less
than or equal to 18 months. Even though it was hyper-parametrized on subtask 3-A data, the
model did not beat the pre-trained ensemble classifiers applied to data for the same task except in
1 case out of 20. Introducing environmental data to the prediction models proved useful only
in a few cases, and the survival model was the main beneficiary. However, the positive impact of
introducing environmental data does not last beyond the 18-month prediction horizon in the
best case. These conclusions can be observed in the results reported in
Table 2.
</p>
      <p>5. Conclusions and Future Work
In this paper, which summarizes our participation in the iDPP 2023 competition, we approached
the issue of the impact of environmental data on ALS risk prediction models from a Machine
Learning perspective. Our goal can be summarized in the following question: what are the air
pollution patterns that characterize groups of ALS patients with fast progression? To answer
this question, we began by stratifying patients based on the disease progression patterns
provided by features extracted by applying staging systems to visits data. The stratification
was performed by a clustering algorithm that does not need a preset number of clusters.
The obtained groups of patients were then profiled to determine their common characteristics:
clinical, demographic and environmental. Afterwards, a second clustering procedure was
carried out to detect clusters of patients with similar exposure concentrations to 3 different air
pollutants. The obvious next step was to perform risk prediction on each cluster separately
and combine the predictions. However, due to time limitations, this step was not reached and
predictions were made on the whole population.</p>
      <p>Even though we have presented some guidelines for conducting a reliable study based on
clustering, our study has some caveats pertaining to the design of the clustering procedure. The lack of
available implementations of many necessary tools (clustering algorithms, similarity metrics for
mixed-type data, statistical tests ...) hampered our effort to design a robust clustering model. The
irregularity of the environmental time series made the feature engineering process difficult, especially
since ALSFRS-R visits were too variable from one cluster to another. Additionally, the D50 model
was not reliable enough to provide us with enough data points that really reflect the disease
progression of fast progressing patients. Such factors made it quite difficult to dive deeper into the causal
relationship between air pollutants and disease progression, which would require
advanced methods for disease progression modelling and for imputation of missing time-series
data. Despite all the encountered obstacles, providing a multimodal real-world dataset for ALS
to the research community would make collaborative and reproducible data analysis studies
possible, which would certainly help the community reach answers to pending questions about
ALS in a faster and more certain way.</p>
      <p>For our next steps, we intend to work on improving our patient stratification process by
looking for more homogeneous groups. Even though ( 2) is believed to be a risk factor
for ALS, the lack of time-series data for the whole training set pertaining to the air particle
prevented us from including it into our air pollution profiling step. We may work on studying
pollution patterns for ( 2) using only patients groups of interest where data are available for
more than 50% of the group. Our future work will also include improving
the risk prediction models by studying the possibility of applying one or more models to
different clusters and comparing their overall results with those of models applied to the whole dataset.
</p>
      <p>Acknowledgments
We would like to thank Guglielmo Faggioli for his valuable assistance and informative answers
to our requests throughout the competition.
</p>
      <p>density estimates, in: Advances in Knowledge Discovery and Data Mining, Lecture notes
in computer science, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 160–172.
[37] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, F. Hutter, Efficient and
Robust Automated Machine Learning, in: Advances in Neural Information Processing
Systems, volume 28, 2015.
[38] Z. Qian, B.-C. Cebere, M. van der Schaar, Synthcity: facilitating innovative use cases of
synthetic data in different data modalities (2023).
</p>
      <p>Appendix A. Tables of the general characteristics of each cluster (Tables A.1, A.2 and A.3, as cited in Section 4.1); the table contents are not recoverable here.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Grollemund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-F.</given-names>
            <surname>Pradat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Querin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Delbot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Le Chat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Pradat-Peyre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bede</surname>
          </string-name>
          ,
          <article-title>Machine learning in amyotrophic lateral sclerosis: Achievements, pitfalls, and future directions</article-title>
          ,
          <source>Front. Neurosci</source>
          .
          <volume>13</volume>
          (
          <year>2019</year>
          )
          <fpage>135</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bendotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bonetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pupillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Logroscino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Chalabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lunetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Riva</surname>
          </string-name>
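          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lauria</surname>
          </string-name>
          ,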
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Weishaupt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Agosta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Malaspina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Basso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Greensmith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Van Den Bosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ratti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Corbo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hardiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chiò</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Silani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Beghi</surname>
          </string-name>
          ,
          <article-title>Focus on the heterogeneity of amyotrophic lateral sclerosis</article-title>
          ,
          <source>Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>485</fpage>
          -
          <lpage>495</lpage>
          . doi:10.1080/21678421.2020.1779298.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Armstrong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. L.</given-names>
            <surname>Schellenberg</surname>
          </string-name>
          ,
          <article-title>Rural Residence and Diagnostic Delay for Amyotrophic Lateral Sclerosis in Saskatchewan</article-title>
          ,
          <source>Canadian Journal of Neurological Sciences</source>
          <volume>47</volume>
          (
          <year>2020</year>
          )
          <fpage>538</fpage>
          -
          <lpage>542</lpage>
          . doi:10.1017/cjn.2020.38.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Rashed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Tork</surname>
          </string-name>
          ,
          <article-title>Diagnostic delay among ALS patients: Egyptian study</article-title>
          ,
          <source>Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>416</fpage>
          -
          <lpage>419</lpage>
          . doi:10.1080/21678421.2020.1763401.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>van Es</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hardiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Chalabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Pasterkamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Veldink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>van den Berg</surname>
          </string-name>
          ,
          <article-title>Amyotrophic lateral sclerosis</article-title>
          ,
          <source>The Lancet</source>
          <volume>390</volume>
          (
          <year>2017</year>
          )
          <fpage>2084</fpage>
          -
          <lpage>2098</lpage>
          . doi:10.1016/S0140-6736(17)31287-4.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Karam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Berry</surname>
          </string-name>
          ,
          <article-title>Heterogeneity, urgency, generalizability, and enrollment: The HUGE balance in ALS trials</article-title>
          ,
          <source>Neurology</source>
          <volume>92</volume>
          (
          <year>2019</year>
          )
          <fpage>215</fpage>
          -
          <lpage>216</lpage>
          . doi:10.1212/WNL.0000000000006837.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guazzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Menotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Trescato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Aidos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bergamaschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Birolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cavalla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chiò</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dagliati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Carvalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fariselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>García Dominguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gromicho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Longato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Madeira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Manera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tavazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tavazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vettoretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Di Camillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <article-title>Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2023</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Arampatzis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tsikrika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vrochidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giachanou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aliannejadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023), Lecture Notes in Computer Science (LNCS)</source>
          , Springer, Heidelberg, Germany,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guazzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Menotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Trescato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Aidos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bergamaschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Birolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cavalla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chiò</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dagliati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Carvalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fariselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>García Dominguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gromicho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Longato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Madeira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Manera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tavazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tavazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vettoretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Di Camillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <article-title>Overview of iDPP@CLEF 2023: The Intelligent Disease Progression Prediction Challenge</article-title>
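          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Aliannejadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          (Eds.),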
          <source>CLEF 2023 Working Notes, CEUR Workshop Proceedings (CEUR-WS.org)</source>
          , ISSN
          <issn>1613-0073</issn>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jankowska-Kieltyka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Nalepa</surname>
          </string-name>
          ,
          <article-title>The air we breathe: Air pollution as a prevalent proinflammatory stimulus contributing to neurodegeneration</article-title>
          ,
          <source>Front. Cell. Neurosci.</source>
          <volume>15</volume>
          (
          <year>2021</year>
          )
          <fpage>647643</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L. G.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <article-title>Traffic-related air pollution and neurodegenerative diseases: Epidemiological and experimental evidence, and potential underlying mechanisms</article-title>
          , in: Advances in
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>