Predicting the Risk of & Time to Impairment for ALS
patients
Report for the Lab on Intelligent Disease Progression Prediction at CLEF 2022

Aidan Mannion1,2 , Thierry Chevalier3 , Didier Schwab1 and Lorraine Goeuriot1
1
  Laboratoire d’Informatique de Grenoble, Université Grenoble Alpes, CNRS, 38058 Grenoble, France
2
  EPOS SAS, 2-4 Boulevard Des Îles, 92130 Issy-les-Moulineaux, France
3
  UFR de Médecine Université Grenoble Alpes, Domaine de la Merci, 38700 La Tronche, France


                                         Abstract
                                         This report details our participation in the Intelligent Disease Progression Prediction (iDPP) track at
                                         the Conference & Labs of the Evaluation Forum (CLEF) 2022. This task focuses on the progression of
                                         Amyotrophic Lateral Sclerosis (ALS), a progressive neurodegenerative disease that affects nerve cells in
                                         the brain and spinal cord. The goal of this work is to use patient demographic data & certain medical
                                         history details along with collections of records of responses to an ALS diagnostic questionnaire to
                                         calculate risk scores corresponding to the likelihood that a patient will suffer an adverse event, and to
                                         predict the time window in which that event will occur. We present an approach based on ensemble
                                         learning, in which gradient-boosted regression trees are used to separately predict risk scores and
                                         estimate survival times. By normalising & thresholding the risk scores, we generate event predictions
                                         which are combined with the time-to-event predictions to produce time-interval predictions. While
                                         some aspects of the results seem encouraging, especially given the amount of training data available, it
                                         is clear that more sophisticated and specialised solutions are required in order for techniques like these
                                         to become a reliable part of clinical decision-making.

                                         Keywords
                                         Clinical Data Science, Survival Analysis, Disease Progression Prediction




1. Introduction
The classification & ranking of patients according to their risk of adverse events and the
estimation of when those adverse events are likely to occur are some of the most tractable &
useful applications of machine learning to healthcare. Being able to efficiently and robustly
predict when certain patients are likely to need urgent clinical intervention has the potential
to greatly enhance the quality of care provided by healthcare professionals, from the point of
view of the allocation of time & resources for consultation and for the informed construction of
treatment plans.
   This paper describes a proposed approach to automated prediction of the progression of
Amyotrophic Lateral Sclerosis (ALS) as part of the iDPP evaluation campaign at CLEF 2022
[1, 2]. The paper is organized as follows: Section 2 introduces related works; Section 3 describes
CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
aidan.mannion@univ-grenoble-alpes.fr (A. Mannion); thy.chevalier@free.fr (T. Chevalier);
didier.schwab@imag.fr (D. Schwab); lorraine.goeuriot@imag.fr (L. Goeuriot)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
our approach; Section 4 explains our experimental setup; Section 5 summarises & discusses
the evaluation of the results; finally, Section 6 draws some conclusions and outlooks for future
work.
  The iDPP track is divided into two tasks:

    • Task 1: Ranking Risk of Impairment for ALS - estimation of the risk of early occurrence
      of an adverse event and ranking subjects based on these risk scores
    • Task 2: Predicting Time of Impairment for ALS - prediction of the time interval in which
      the adverse event is most likely to occur.

Three kinds of adverse event are considered: Non-Invasive Ventilation (NIV), Percutaneous
Endoscopic Gastrostomy (PEG), and death. Both of the above tasks are thus divided into three
subtasks, each with a different set of target variables:
   1. Task a: NIV or death,
   2. Task b: PEG or death,
   3. Task c: death
Separate datasets are provided for each of these three variants. The datasets contain general
patient information and some clinical history, as well as a series of observations of the progres-
sion of their disease over a six-month period, in the form of their responses to the ALSFRS-R
(Amyotrophic Lateral Sclerosis Functional Rating Scale - Revised) [3], a standard method of
evaluating the state of an ALS patient by how their ability to breathe, move, and perform tasks
requiring muscle co-ordination is affected. Further, for each of the six tasks described, we
calculate risk scores and survival times at both the beginning and the end of this six-month
observation period (denoted by M0 and M6 respectively).
   Data collected from clinical trials or disease progression studies tends to present many unique
difficulties that inhibit the utility of standard statistical learning techniques. Most notably,
clinical studies that focus on modelling the time to a particular event or patient outcome of
interest almost always contain censored observations, i.e. patients that leave the study without having
experienced the event. Standard practice in machine learning would be to ignore or discard
samples for which the target variable was not observed, but given the nature of clinical studies
this is an impractical waste of data. The event of interest may or may not have occurred after
the time of censoring, and thus it is necessary to use models that can deal with the missingness
of a certain proportion of the outcome variables and use the information that is available for
the censored patients.
   The general class of techniques that deal with such data is known as survival analysis, which
can be used to describe or compare the survival times of groups, and estimate the effect of
quantitative or categorical predictor variables on survival (“survival” here being used as a
general term to mean time to an event).
   The survival model paradigm involves the estimation of the risk of the occurrence of an event
using a cumulative hazard function composed of two elements:
   1. an underlying risk function known as the baseline hazard function, which depends only
      on time and represents the way in which the risk of the event of interest changes at some
      baseline or default values for the input variables, and
   2. the effect parameters, which are used to take the explanatory variables into account when
      calculating the risk score.
  One of the most widely used and effective models for survival analysis is the Cox proportional
hazards model [4]. Proportional hazards refers to the assumption that the relationship between
the predictive covariates and the output of the hazard function can be effectively modelled
as multiplicative. In the Cox model, the baseline hazard function 𝐻0 (𝑡) is estimated using
Breslow’s estimator [5], and the cumulative hazard function can thus be expressed as

                                  𝐻(𝑡|𝑥) = exp(𝑓𝜃 (𝑥))𝐻0 (𝑡)

where 𝑥 is an input vector of predictive variables and 𝑓𝜃 (·) is some parameterised multiplicative
function of this input which is learned from the data.
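As a toy illustration of this decomposition (a sketch, not the authors' implementation: the coefficients in 𝛽 and the step increments of 𝐻0 below are invented), the covariate effect simply scales a Breslow-style step-function baseline:

```python
import numpy as np

# Hypothetical effect parameters for a linear f_theta(x) = x . beta
beta = np.array([0.5, -0.2])

def H0(t, event_times=(1.0, 2.0, 4.0), increments=(0.1, 0.3, 0.2)):
    """Breslow-style baseline cumulative hazard: a step function that
    jumps by a fixed increment at each observed event time (toy values)."""
    return sum(inc for et, inc in zip(event_times, increments) if et <= t)

def cumulative_hazard(t, x):
    # H(t | x) = exp(f_theta(x)) * H0(t)
    return np.exp(x @ beta) * H0(t)

x_patient = np.array([1.2, 0.4])
print(cumulative_hazard(3.0, x_patient))
```

The multiplicative structure means the covariates rescale the whole baseline curve rather than reshaping it, which is precisely the proportional-hazards assumption.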
   The strategy implemented in this work is to apply Cox’s proportional hazards model to
the task of ranking the risk of impairment, using the gradient boosting [6] learning strategy
to compute the parameters of the function 𝑓𝜃 (·). Boosting methods in general combine the
predictions of an ensemble of “weak” (i.e. very simple) decision function learners to form a
more effective learning algorithm. Gradient boosting is a generalisation that allows for any
differentiable loss function to be used by the ensemble. Boosting methods often entail a reduction
in both bias and variance as compared to single-learner methods.
   The output of the time-independent part of the survival function calculated by the gradient-
boosting survival analysis method is mapped to the interval (0, 1), as per the task requirements,
via a sigmoid function. Any set of input subject data can then be ranked according to this
function in a consistent way.
   To estimate the time-to-event, we use a regression model based on Accelerated Gradient
Boosting (AGB) [7]. This being a standard regression model, it does not take censoring into
account. We therefore use class predictions based on the Task 1 survival model to “censor” the
time-to-event predictions. More details on this process are provided in sections 3.3.1 and 3.3.2.


2. Related Work
Machine learning and survival analysis techniques have been widely experimented with in the
biomedical informatics research community, in particular for the prediction of disease progres-
sion. Recent developments have of course focused on deep learning techniques - Convolutional
Neural Networks (CNNs) have been particularly effective, for example in the work of Jarret
et al. [8], which involved the development of a generalisable framework for the prediction of
disease trajectories, or Storelli et al. [9], who used CNNs to predict Multiple Sclerosis (MS) pro-
gression from MRI data. Other advanced DL techniques have also been shown to be applicable,
however, such as multimodal Recurrent Neural Networks (RNNs) for Alzheimer’s progression
[10], autoencoders for the prediction of head & neck squamous cell carcinoma from multi-omics
data [11], or custom deep networks for the prediction of the progression of Parkinson’s from
telemonitoring data [12].
   While deep learning is a highly promising avenue for the development of more effective
computational prognostic tools that could have profound effects on the quality of care provided
to patients with these diseases, these studies all required the use of relatively large, high-
dimensional, high-frequency datasets of measurements of biomedical data (multi-omics data,
RNA/miRNA sequencing data, neuroimaging data, genetic biomarkers, etc.), in order to exploit
the potential of deep learning. It is not clear that clinical visit data, which tends to be less
detailed, lower-dimensional and more sparse than biological data, is as amenable to deep
learning, particularly in the survival analysis paradigm. Indeed, certain recent studies have
shown that simpler machine learning techniques such as gradient boosting are often the most
effective choice for survival prediction tasks [13, 14]. Comparative studies that use large
clinical-trial datasets often show that there are many disease prediction tasks for which Support Vector
Machines, Random Forest classifiers, and even Logistic Regression outperform more advanced
algorithms [15, 16]. Ensemble learning techniques including random forests and gradient
boosting have even shown success in the prediction of genes associated with certain diseases
[17] and K-means clustering in the survival prognosis of patients with cervical cancer from
genomic data [18].
   In recent years the COVID-19 pandemic has attracted a wealth of research into effective
prediction of disease progression from limited clinical data and with limited medical knowledge
of the causal processes involved. Again, while CNNs tended to be the dominant approach
for prediction of COVID severity from medical imagery [19, 20, 21], for lower-dimensional
clinical data, simpler techniques such as ensemble methods [22, 23], k-nearest neighbours [24],
or logistic regression [25, 26] seem to often constitute the most effective choice.
   The progression of ALS, which is the focus of this work, has long been a difficult prediction
problem, due mainly to the way in which the disease exhibits a variability of onset and rate
of progression, resulting in an inherent heterogeneity in the relevant clinical data. There has
therefore been a wealth of investigation into the application of statistical techniques to this
issue, from manual statistical analyses to determine the main clinical predictors of survival rates
[27] all the way to deep learning methods based on data gathered from induced pluripotent
stem cells of ALS patients [28]. Many different statistical models have been experimented with
for the prediction of ALS progression from clinical trial data, including correlation analysis of
genetic biomarkers [29], the Weibull statistical model [30], Kaplan-Meier survival analysis [31],
the Royston-Palmer parametric survival model [32], Bayesian network classifiers [33], logistic
regression [34], and random forest classifiers [35]. Given the success of Taylor et al. [35],
the fact that gradient-boosted decision trees often prove more efficient and performant than
random forest classifiers, and the fact that, to the best of our knowledge, there is no published
research testing gradient-boosting approaches on the problem of ALS progression prediction, we
elected to use an approach based on gradient-boosted survival analysis.


3. Methodology
3.1. Exploratory Data Analysis
The datasets on which this work is based are of two different types: static patient information
(fixed with respect to time) and visit records (varying across time). Each of the training datasets
comes in three overlapping variants, one for each task; the dataset for Task A (Non-Invasive
Ventilation/Death) is referred to henceforth as dataset A; likewise datasets B & C for the other
two tasks. The static-variable dataset contains 94 different variables, many of which were not
used for analysis due to a very low number of discriminative samples and/or low correlation with
outcome variables, as discussed in section 3.2.1. The temporal visit records contain two types
of visit: ALSFRS-R questionnaire results [3], integer scores from 1 to 4 representing the answers to
twelve diagnostic questions about the progression of a patient’s ALS (4 being preferable), and
spirometry data.
   It is of particular interest in the context of this challenge to compare the results of a model
that takes into account only the information available at a patient’s first visit with the results
obtained using all of the available visit data, which covers a period of at most six months. To
give an idea of the improvement that can be expected from the addition of all visit information,
we first inspect the temporal richness of the visit data. Intuitively, if there is a high proportion
of patients for which there exist a relatively frequent, regular series of measurements, we could
expect the temporal data to bring significant improvements to the results obtained using only
the static & first-visit data.
   Unfortunately, as the initial analysis suggests and as the experimental results bear out, the
temporal visit datasets contain neither enough measurements nor enough variation in measurements
over time to bring significant improvements to a statistical
model. Figure 1 shows that in all three datasets, the majority of patients have ≤ 3 ALSFRS-R
visits recorded in the dataset, while Figure 2 shows that even patients with a high number of
visits tend not to exhibit much variation over time, making it difficult to use these temporal
series to make inferences about disease progression and still more difficult for machine learning
models to learn generalisable functions that can make predictions about ALS progression based
on these time series. We can see that for the respiratory subscore in particular, the patients
tend to give exactly the same responses at each visit, with only one of the 42 questionnaires
visualised in Figure 2 having a respiratory score of 11 rather than 12.
   We also found that including the spirometry data (effectively a single floating-point variable
fvc_value) did not have any effect on classification performance in our experiments, so it was
excluded.
   Another important aspect of the initial data analysis is to inspect the class balance in the
training datasets, as highly imbalanced classification problems require adjustments to be made
to the training and evaluation process. As Figure 3 shows, the outcome variable classes are
reasonably balanced, but there are significantly fewer patients who are censored (no event occurred
as of the last visit on record). For this reason, sample weighting is used in the training process.
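One common weighting scheme is sketched below; the paper does not specify the exact scheme used, and the label counts are hypothetical. Each sample receives a weight inversely proportional to its class frequency, so the rarer censored (NONE) patients count for more during training:

```python
import numpy as np

# Hypothetical label counts echoing the imbalance described above
labels = np.array(["NIV"] * 47 + ["DEATH"] * 43 + ["NONE"] * 10)

classes, counts = np.unique(labels, return_counts=True)
# "Balanced" scheme: weight inversely proportional to class frequency
weight_per_class = {c: len(labels) / (len(classes) * n)
                    for c, n in zip(classes, counts)}
sample_weights = np.array([weight_per_class[y] for y in labels])

print(weight_per_class)
```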

3.2. Dataset Preprocessing
3.2.1. Feature Selection
Given that our training dataset is relatively small and contains many features in the form of
binary indicators, some of which only apply to a very small number of patients, it quickly became
clear that the prediction task would benefit from the removal of certain features containing very
little signal. The first step in the data analysis process was to attempt to identify columns in the
tabular dataset that are mainly uniform and thus would not be useful for model training.

Figure 1: The count of patients in each dataset, separated according to the number of ALSFRS-R
questionnaire results available.

Figure 4 shows that there are some binary variables in the dataset that have a reasonable (better
than 90%-10%) class balance of positive and negative cases and seem to be correlated with outcome
variables; for example, in dataset A:
    • while 10% of patients in total had the outcome NONE, i.e. did not die or need non-invasive
      ventilation during the observation period, only 4% of patients who experienced more than
      10% weight loss between diagnosis and their first visit are associated with this outcome
      (moreThan10PercentWeightloss indicator),
    • while 47% of the total patients required non-invasive ventilation during the observation
      period, 90% of the 63 patients with positive values for the major_trauma_before_onset
      indicator, and 89% of the 135 with positive values for the retired_at_diagnosis indicator,
      required NIV.
These observations seem to indicate that a patient’s weight loss in the period following their
diagnosis, their history of major trauma, and their retirement status (perhaps simply as a proxy
for age) are strong predictors of the outcomes of interest in this study, among other
variables. Datasets B and C showed similar patterns. We found however that the very rare
indicators (the long right tail of Figure 4) that refer to more specific surgical and trauma-related
history, do not have enough positive examples and, empirically, do not contribute to model
performance.
   After initial analysis of variable prevalence and downstream experimentation on the target
tasks, the input dataset to the prediction models used for evaluation uses the static patient
variables shown in Table 1 (before preprocessing).

3.2.2. Feature Engineering
ALSFRS-R Responses The visit data representing the patient responses to the ALSFRS-R
questionnaire are represented as individual answers to each of the 12 questions, 𝑞1 , 𝑞2 , . . . , 𝑞12 ,
as well as the sum of the responses over three different categories of question, and the total sum.

Figure 2: The change over time in the three subscores of the ALSFRS-R assessment for the six patients
that completed the questionnaire seven times over a six-month observation period.
In researching the design of the ALSFRS-R questionnaire, we observed that the motor subscore
category can be further divided into two subcategories:

    • Fine Motor Subscore: questions 4-6,
    • Gross Motor Subscore: questions 7-9

We therefore experimented with four different “levels” of representation of the ALSFRS-R data;
   1. Level 0: (𝑑 = 12) each question considered as an individual variable, 𝑞1 , . . . , 𝑞12
   2. Level 1: (𝑑 = 4):
          • Bulbar Subscore: 𝑆bulb = 𝑞1 + 𝑞2 + 𝑞3
          • Fine Motor Subscore: 𝑆FM = 𝑞4 + 𝑞5 + 𝑞6
          • Gross Motor Subscore: 𝑆GM = 𝑞7 + 𝑞8 + 𝑞9
          • Respiratory Subscore: 𝑆resp = 𝑞10 + 𝑞11 + 𝑞12
   3. Level 2: (𝑑 = 3) the same as Level 1, but combining the subcategories of motor-function-
      related scores: 𝑆mot = 𝑆FM + 𝑆GM
   4. Level 3: (𝑑 = 1) using only the total ALSFRS-R score, i.e. the alsfrs_r_tot_score
      field provided.

Figure 3: The percentage of patients for each target outcome in each of the training datasets.
We found in our experiments that using Level 0 as the input representation of the visit data
gave the best results, likely because it gave the learning algorithm more freedom to “focus”
more closely on the specific questions that were relevant to the outcome variables of interest,
disregarding those that were less important - we found that adding up the responses (particularly
at Level 3) serves mainly to hide the variation in individual responses.
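The four representation levels can be sketched as follows (the response vector below is hypothetical):

```python
import numpy as np

# One hypothetical questionnaire response q = (q1, ..., q12)
q = np.array([4, 3, 4, 2, 3, 3, 4, 4, 2, 4, 4, 4])

level0 = q                                  # Level 0, d = 12: individual answers
bulbar = q[0:3].sum()                       # S_bulb = q1 + q2 + q3
fine_motor = q[3:6].sum()                   # S_FM = q4 + q5 + q6
gross_motor = q[6:9].sum()                  # S_GM = q7 + q8 + q9
resp = q[9:12].sum()                        # S_resp = q10 + q11 + q12
level1 = np.array([bulbar, fine_motor, gross_motor, resp])   # Level 1, d = 4
level2 = np.array([bulbar, fine_motor + gross_motor, resp])  # Level 2, d = 3
level3 = q.sum()                            # Level 3, d = 1: total score

print(level1.tolist(), level2.tolist(), level3)
```

Each level collapses more of the per-question variation into sums, which is exactly the loss of signal noted above for Level 3.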

Imputation: Weight & Height One of the first problems to be addressed with the dataset of
static patient variables is that the weight & height variables (weight, weight_before_onset,
and height) have some missing values, detailed in Table 2, with 95 patients in Dataset A, 118
patients in Dataset B and 120 patients in Dataset C missing all three of these figures.
   The standard way to deal with such missingness is multiple imputation [36]. Imputation is the
process of filling in empty fields in a tabular dataset by estimating the distribution of the non-
missing values. There exist many different methods of imputation, from simple methods such
as mean substitution [37], to more sophisticated statistical methods like regression imputation
[38] and linear-algebraic methods such as non-negative matrix factorisation [39].
                Variable Name                                           Type
                   onsetDate                                      float (Months)
                 diagnosisDate                                    float (Months)
                       sex                                              binary
                     height                                           float (m)
             weight_before_onset                                      float (kg)
                     weight                                           float (kg)
       moreThan10PercentWeightloss                                      binary
         major_trauma_before_onset                                      binary
     surgical_interventions_before_onset                                binary
                   age_onset                                        float (years)
                    mixedMN                                             binary
                 onset_bulbar                                           binary
                   onset_axial                                          binary
               onset_limb_type                                       categorical
             retired_at_diagnosis                                       binary
            ALS_familiar_history                                        binary
                    smoking                                             binary
              turin_C9orf72_kind                                     categorical
                 hypertension                                           binary
                    diabetes                                            binary
                  dyslipidemia                                          binary
               thyroid_disorder                                         binary
             autoimmune_disease                                         binary
                     stroke                                             binary
                cardiac_disease                                         binary
              primary_neoplasm                                          binary
                      slope                   float (rate of change in ALSFRS-R score since onset)
Table 1
Static variables used in the final input datasets.


Table 2
Summary of missing values from the static-variable training datasets.
             Variable                              Number of Missing Values
                                   Dataset A (n=1454) Dataset B (n=1715) Dataset C (n=1756)
     height                               106                131                134
     weight_before_onset                  293                419                426
     weight                               111                141                143


   Multiple imputation is an extension of the imputation process that aims to reduce the bias
and quantify the uncertainty introduced by the estimation process. This is done by averaging
the outcomes across multiple imputed datasets; instead of estimating the missing values directly
from the non-missing ones, multiple imputation methods estimate the underlying distribution
of the dataset and creates 𝑚 different imputed datasets by drawing all of the imputed values
from this distribution 𝑚 times. These 𝑚 different imputed datasets can be used either to run
downstream analysis 𝑚 times, in order to quantify the uncertainty introduced by imputation
through the comparison of the results, or a single imputed dataset can be estimated by taking
the average across all 𝑚 datasets in order to mitigate the bias that would be introduced by
single imputation.

Figure 4: The number of positive examples of binary-indicator variables in dataset A (1454 patients),
and the type of outcome recorded for those examples.
   In order to fill in the missing weight & height values, we used the MIDAS multiple imputation
method [40], which is based on a denoising autoencoder architecture and has been shown to be
highly effective at approximating missing values in data with complex, non-linear relationships
among variables.
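A minimal multiple-imputation sketch is shown below; it is deliberately much simpler than the MIDAS autoencoder (independent per-column Gaussian draws rather than a learned joint distribution), but illustrates the draw-𝑚-datasets-and-average procedure described above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(50, 3))            # toy complete dataset
X[rng.random(X.shape) < 0.2] = np.nan            # knock out ~20% of values

m = 5                                            # number of imputed datasets
mask = np.isnan(X)
mu = np.nanmean(X, axis=0)                       # per-column distribution
sigma = np.nanstd(X, axis=0)                     # estimated from observed values

imputed = []
for _ in range(m):
    Xi = X.copy()
    draws = rng.normal(mu, sigma, size=X.shape)  # one draw per missing cell
    Xi[mask] = draws[mask]
    imputed.append(Xi)

X_filled = np.mean(imputed, axis=0)              # single averaged dataset
```

Observed entries are left untouched in every draw, so averaging only smooths the imputed cells.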

BMI The height and weight of a patient are not, it seems, useful health-related indicators in
and of themselves, i.e. knowing a patient’s height will tell us nothing about their level of risk
of anything unless we know their weight, and vice versa. A commonly used way to combine
information about a patient’s height and weight to generate an indicator of health is the Body
Mass Index (BMI), which is simply the ratio of a person’s weight in kilograms to the square of
their height in metres. Moreover, given that the moreThan10PercentWeightloss indicator
seems to be an important one from the point of view of the outcome variables of interest, we
decided to summarise the three variables height, weight and weight_before_onset with
a single variable bmi_change, representing the change in BMI since the onset of ALS.
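A sketch of this feature construction follows (the sign convention, post-onset BMI minus pre-onset BMI, is our assumption):

```python
def bmi(weight_kg, height_m):
    # Body Mass Index: weight in kilograms over the square of height in metres
    return weight_kg / height_m ** 2

def bmi_change(weight_before_onset, weight, height_m):
    # Change in BMI since onset; negative values indicate weight loss
    # (sign convention assumed, not specified in the paper)
    return bmi(weight, height_m) - bmi(weight_before_onset, height_m)

print(bmi_change(80.0, 72.0, 1.75))
```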
   We also generated one-hot encodings of the categorical variables onset_limb_type (𝑑 = 5)
and turin_C9orf72_kind (𝑑 = 2), resulting in an input dataset for the prediction models
with 30 static variables, to be combined with the ALSFRS-R questionnaire data as described in
the following section. It was assumed that missing values for binary or categorical variables
corresponded to “none” or negative values, and empty fields were filled in accordingly.

Figure 5: General overview of the prediction pipeline.

3.3. Combined Approach for Risk Estimation & Time-to-Event Prediction
The modelling approach taken works as follows: the questionnaire data is aggregated over time
(this being for the M6 tasks; for the M0 tasks only the first visit of each patient is used) and
joined to the static input variables, to be used as input to two different types of gradient-boosting
models:
   1. binary survival-analysis models trained using outcome events as the target variable
   2. regression models using the time-to-event as the target variable
The output of the first type of model can be used directly for Task 1, as described in section 3.3.1,
and its classification predictions are used in conjunction with the predictions of the regression
model to form predictions for Task 2, as described in 3.3.2. An overview of the general pipeline
is shown in Figure 5. For tasks A and B, the survival analysis and regression steps are run
separately for the two outcomes of interest. More details on the training implementation are
provided in section 4.

Temporal Aggregation Given the limitations of the temporal aspect of the visit data dis-
cussed in 3.1, it became clear that not enough data was available for more advanced sequence
modelling or time-series techniques to be effective. After experimenting with several ag-
gregation methods, we found that the best results for the M6 task were given by a simple
temporally-weighted average, formulated as follows: given 𝑛 scores 𝑠0 , . . . , 𝑠𝑛−1 from visits
taking place at times 𝑡0 = 0, 𝑡1 , . . . , 𝑡𝑛−1 , we calculate the ALSFRS-R feature as

                                              𝑛−1
                                              ∑
                                                    𝑠𝑖 exp(𝑡𝑖 )
                                              𝑖=0

This allows for more recent scores to have greater proportional influence on the aggregate
score.
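The aggregation can be sketched directly (the scores and visit times below are toy values):

```python
import numpy as np

scores = np.array([40.0, 38.0, 35.0])   # s_0, ..., s_{n-1}: toy ALSFRS-R totals
times = np.array([0.0, 2.0, 5.0])       # visit times in months, t_0 = 0

# Temporally-weighted aggregate: later visits dominate exponentially
feature = np.sum(scores * np.exp(times))
print(feature)
```

Note that the weights are unnormalised, so the feature grows with the number and recency of visits as well as the scores themselves.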
3.3.1. Task 1: Ranking Risk of Impairment
For task 1, the target data for the training of the survival models is the event type observed for
each subject; NONE or DEATH for Task 1c, along with NIV for Task 1a and PEG for Task 1b.
   As the survival analysis models we use only deal with a single event type at a time, for Tasks
1a and 1b we trained two separate models: one with the DEATH examples excluded and another
with the NIV/PEG examples excluded.
   The issue with using a survival model “as-is” for this task is that by default, survival models
assume that the event in question will indeed occur at some point for every subject. For this
task, our goal is not just to rank patients according to risk, but also to associate with each
patient the actual event type that is most likely to occur. Therefore, once we have chosen a
risk score for each subject in a dataset, regardless of whether that score represents the risk
of death, non-invasive ventilation, or percutaneous endoscopic gastrostomy, we would like
to identify whether in fact the patient is more likely not to experience any adverse event at
all, based on the patterns in the training dataset. This requires us to choose a classification
threshold below which we label all patients as NONE. We do this based on the ROC curve for
each set of predictions (again for Tasks 1a and 1b there will be two) - in each case we found it
was possible to choose a threshold such that the true positive rate was greater than 0.7 and the
false positive rate less than 0.5.
   As the risk score outputs are theoretically unbounded, we use the sigmoid function to generate
risk scores between 0 and 1, as it is a simple and effective way to map the entire real number line
into (0, 1). One issue that can arise with the sigmoid is that its non-linearity does not preserve
the proportions among its inputs. In particular, values outside the interval
$[\log(2-\sqrt{3}), \log(2+\sqrt{3})]$ (the points corresponding to the flexes in the sigmoid
S-curve) tend to get “squished” very close together by the function, which here may affect our
ability to choose an effective classification threshold. We therefore use an adapted sigmoid
function with an extra parameter, denoted $a$, that allows the S-curve produced by the sigmoid
function to be “stretched” such that the flex points occur at $\pm\frac{1}{a}$, which allows us to
create more space between values in a wider interval of inputs:

$$S(x, a) = \frac{1}{1 + \exp\left(-a \log\left(2 + \sqrt{3}\right) x\right)} \tag{1}$$

We experiment with different values of $a$ to find a better trade-off between true & false
positive rates.
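As a sanity check on equation 1, the stretched sigmoid can be implemented in a few lines (a sketch assuming NumPy); the value at x = 1/a is the same for every a, coinciding with the standard sigmoid evaluated at its own flex point log(2 + √3) ≈ 1.317:

```python
import numpy as np

LOG_2_PLUS_SQRT3 = np.log(2 + np.sqrt(3))  # ≈ 1.317, flex point of the standard sigmoid

def stretched_sigmoid(x, a=1.0):
    """Equation 1: a sigmoid rescaled so its flex points fall at x = ±1/a."""
    return 1.0 / (1.0 + np.exp(-a * LOG_2_PLUS_SQRT3 * x))

# S(0, a) = 0.5 for any a, and S(1/a, a) is independent of a:
# it equals the standard sigmoid at its own flex point (≈ 0.789).
values = [stretched_sigmoid(1.0 / a, a) for a in (0.5, 1.0, 4.0)]
```

Small values of a therefore flatten the S-curve, spreading a wider band of raw risk scores across the middle of (0, 1).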
   The trained risk models are used to produce a ranking of a set of subjects as follows:
   1. Generate time-independent risk scores for all subjects via the trained ensembles - for
      Tasks 1a and 1b, we will have two models and thus two risk scores for each subject.
   2. (Task 1a and 1b only) Combine the two sets of risk scores by choosing the highest one for
      each subject, labelling each subject with the corresponding event.
   3. Map all risks to pseudo-probabilities via equation 1.
   4. Choose classification thresholds based on the ROC curve for the corresponding training
      dataset, and label all subjects with risk scores below that value with NONE.
   5. Sort the full set of subjects according to the risk scores.
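For illustration, steps 1–5 can be sketched end-to-end for Task 1a (all names are placeholders: the raw score arrays stand in for the two trained ensembles of step 1, and any map into (0, 1) such as equation 1 can be passed as to_prob):

```python
import numpy as np

def rank_subjects(risk_niv, risk_death, threshold, to_prob):
    """Steps 1-5 for Task 1a: combine per-event risks, relabel, and rank.

    risk_niv, risk_death : raw risk scores from the two trained ensembles (step 1)
    threshold            : NONE cut-off chosen from the training ROC curve
    to_prob              : map from raw scores into (0, 1), e.g. equation 1
    """
    risks = np.maximum(risk_niv, risk_death)                   # step 2: keep higher risk
    events = np.where(risk_niv >= risk_death, "NIV", "DEATH")  # step 2: remember its event
    probs = to_prob(risks)                                     # step 3: pseudo-probabilities
    events = np.where(probs < threshold, "NONE", events)       # step 4: below cut-off -> NONE
    order = np.argsort(-probs)                                 # step 5: sort by descending risk
    return order, events[order], probs[order]

# Toy usage with a plain sigmoid as the probability map.
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
order, events, probs = rank_subjects(
    np.array([2.0, -1.0, 0.5]), np.array([1.0, -2.0, 1.5]),
    threshold=0.5, to_prob=sigmoid)
```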
3.3.2. Task 2: Predicting Time of Impairment
To predict the time window associated with the outcomes predicted by the algorithm described in
the previous section, we train another gradient boosting model, this time with a one-dimensional
continuous output (regression), on the same input data but with the time-to-event variables
as the target instead. Having already predicted event types for each patient, we can use those
predictions (DEATH, NONE, NIV, or PEG, as the case may be) to “censor” the regression outputs.
   The task defines six time windows into which subjects must be classified (in months): 6-12,
12-18, 18-24, 24-30, 30-36, and 36+. Thus, we use the output of the gradient-boosting regression
for each subject to associate a time window with that subject, unless the predicted outcome for
that patient is NONE, in which case the predicted time window is always “36+”.
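A minimal sketch of this bucketing rule (function name hypothetical; clamping predictions below six months into the 6-12 window is our assumption, as the task defines no earlier window):

```python
def time_window(pred_months, outcome):
    """Map a regression output (months) to one of the six task windows.

    Subjects predicted NONE always fall in the open-ended final window.
    """
    if outcome == "NONE" or pred_months >= 36:
        return "36+"
    # Round down to the nearest 6-month boundary; clamp below-6 predictions into 6-12.
    lower = max(6, int(pred_months // 6) * 6)
    return f"{lower}-{lower + 6}"
```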


4. Experimental Setup
For each of the survival analysis and regression models, we carried out a hyperparameter
search using 5-fold cross-validation. For both model types, the variables tested in the
search were as follows:

    • learning_rate: scaling of the gradient descent steps; search space
      $\{a \times 10^{b} \mid b \in \{-5, \dots, -1\} \subset \mathbb{Z},\; a \in \{1, 2, 5\}\}$

    • n_estimators: number of regression trees to create; 50, 100, 150, 200
    • max_depth: the maximum depth of each of the individual regression trees; 3, 5, or 10
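For concreteness, the search space above expands to the following grid (a sketch in plain Python; the sksurv and xgboost estimator objects the grid is applied to are omitted):

```python
from itertools import product

# learning_rate space: {a * 10**b | b in {-5, ..., -1}, a in {1, 2, 5}}
learning_rates = sorted(a * 10.0 ** b for b in range(-5, 0) for a in (1, 2, 5))

param_grid = {
    "learning_rate": learning_rates,        # 15 values, 1e-5 ... 0.5
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [3, 5, 10],
}

# Every configuration evaluated per model (5-fold CV then multiplies the fits by 5).
configs = [dict(zip(param_grid, vals)) for vals in product(*param_grid.values())]
```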

The best-performing survival analysis model was selected based on the concordance index, and
the regression model based on the mean absolute error, i.e. in each case the parameterisation
that gave the best value of the selection criterion (highest concordance index, lowest error)
on the held-out data for each cross-validation fold. Each
best-performing model was retrained on the full training dataset before being saved for use on
the test set.
  The training pipeline was implemented in Python 3.8, using the libraries scikit-learn
[41], sksurv [42], and xgboost [43]. Our code is made available on BitBucket1 .
  For each of the 12 variants of the input data, the full hyperparameter search, metric evaluation
& retraining process took around 11 minutes on 4 Intel i5-4400 3.3GHz CPUs.


5. Results
Hyperparameters The best results in the hyperparameter grid searches turned out to be
given by a learning rate of 0.05 (for the survival analysis models) or 0.01 (for the regression
models) and a maximum tree depth of 3, while the optimal number of estimators tended to vary
from problem to problem. This uniformity is unsurprising given that each task uses many of the
same training examples due to the overlap in the datasets, although overall there was very little
variation in the evaluation metrics among the different hyperparameter configurations tested.
   1
       https://bitbucket.org/brainteaser-health/idpp2022-lig-getalp/src/master/
Evaluation Metrics The development set metrics reported in this section (“Dev” column in
the results tables) are calculated as the average over 5-fold cross-validation on the training set.
It should be noted that these were calculated using different implementations of the scoring
functions to those used to calculate the results on the test set. For each metric reported, we show
the results on all the data on which the model was trained (to be compared to test/development
results to judge the extent to which the model overfit to the training dataset), calculated using
the metric calculation script provided2 , as well as our test-set results alongside the best test-
set results in the evaluation campaign (both of these from the results files provided by the
organisers). For more details on how the submitted results were evaluated, see the task overview
[1, 2].
   For Task 1, the area under the ROC curve for a censoring time of 5 years (60 months) is shown
in Table 3, the Brier score for the same censoring time in Table 4 and the concordance index
scores (Harrel’s C-index) in Table 5. The most striking observation is that the AUROC scores
are relatively poor on the development sets compared to the test set (Table 3), but this pattern
is reversed in the case of the Brier score (Table 4). This is particularly unusual given that both
represent scores calculated on roughly the same amount of test data (the test set is about 1/4
the size of the training set).

Table 3
AUROC scores for Task 1 at a censoring time of 60 months.
                                   Task        Train    Dev      Test    Best
                                   T1a_M0      0.848    0.645    0.760   0.842
                                   T1a_M6      0.856    0.658    0.802   0.867
                                   T1b_M0      0.856    0.639    0.795   0.870
                                   T1b_M6      0.855    0.641    0.811   0.877
                                   T1c_M0      0.910    0.666    0.767   0.866
                                   T1c_M6      0.920    0.682    0.793   0.871



Table 4
Brier scores for Task 1 at censoring time 60.
                                   Task        Train    Dev      Test    Best
                                   T1a_M0      0.202    0.088    0.251   0.080
                                   T1a_M6      0.210    0.087    0.259   0.073
                                   T1b_M0      0.199    0.102    0.217   0.106
                                   T1b_M6      0.186    0.103    0.243   0.104
                                   T1c_M0      0.246    0.105    0.288   0.108
                                   T1c_M6      0.258    0.105    0.272   0.103

   For the results of Task 2, we focus on the “time interval” approach to classification evaluation,
i.e. we show & discuss metrics based only on the time windows predicted by the model, as
opposed to the “labels” approach, where the classification target categories are the combination
of the time window and the actual outcome predicted to happen in that time window, treating

    2
        https://bitbucket.org/brainteaser-health/idpp2022-performance-computation/src/master/
Table 5
Concordance index scores for Task 1.
                             Task        Train   Dev     Test    Best
                             T1a_M0      0.719   0.676   0.664   0.696
                             T1a_M6      0.733   0.700   0.704   0.748
                             T1b_M0      0.736   0.705   0.694   0.725
                             T1b_M6      0.740   0.732   0.714   0.745
                             T1c_M0      0.752   0.686   0.674   0.713
                             T1c_M6      0.770   0.711   0.701   0.741


the evaluation as a (𝑐 × 5)-label problem, where 𝑐 is the number of outcomes for a given task.
Given that the event we associate with each time window prediction is the same one predicted
for Task 1, we found it somewhat redundant to re-evaluate these predictions in the context of
Task 2, and that it is more instructive as to the performance of the time-to-event regression
portion of the system to concentrate on the extent to which its predictions were within the
correct time windows. Table 6 shows the average precision, or specificity, of the time window
classification across all 6 possible windows, while Table 7 does the same for recall (the asterisk in
these tables denotes the case where our score was the highest among the challenge participants).
   The main conclusion to be drawn from the results of Task 2 is that our models, as well as the
other submissions to this challenge, are much better at ensuring that the predictions that they
do make are correct as opposed to finding all correct predictions in the dataset. This is deduced
from the fact that specificity (precision) is much higher than recall for all models and for almost
all classes. This trend suggests that survival modelling of the kind undertaken in this work can
be reasonably successful at identifying risk factors for adverse events related to ALS, but
captures only a small proportion of the risk factors present. We hypothesise that
the models tend to miss more high-level, complex interactions between variables that constitute
risks, resulting in the low recall scores.

Table 6
Macro-average specificity scores for Task 2.
                              Task       Train   Dev     Test     Best
                              T2a_M0     0.812   0.798   0.854   0.864
                              T2a_M6     0.782   0.663   0.850   0.876
                              T2b_M0     0.812   0.628   0.865   0.865*
                              T2b_M6     0.763   0.647   0.865   0.872
                              T2c_M0     0.817   0.684   0.851   0.863
                              T2c_M6     0.820   0.652   0.864   0.866



6. Conclusions and Future Work
This paper describes the development of a gradient boosting-based approach to the prediction
of the progression of Amyotrophic Lateral Sclerosis. The goal of the work was to evaluate the
effectiveness of gradient-boosting survival analysis on the ranking of ALS patients according
Table 7
Macro-average recall scores for Task 2.
                              Task        Train   Dev     Test     Best
                              T2a_M0      0.230   0.249   0.250   0.272
                              T2a_M6      0.220   0.338   0.216   0.341
                              T2b_M0      0.214   0.373   0.298   0.298*
                              T2b_M6      0.266   0.382   0.281   0.316
                              T2c_M0      0.252   0.374   0.223   0.275
                              T2c_M6      0.210   0.388   0.268   0.284


to their risk of impairment, and the combination of the classification outputs derived from
the survival analysis with the regression outputs based on time-to-event modelling for the
classification of the patients into time windows corresponding to the adverse event most likely
to befall them.
   The main conclusions we can draw from this work are as follows:
    • Gradient-boosted survival analysis shows promise as a method for ranking patients
      according to risk; while the evaluation metrics are not yet satisfactory for use in a real
      clinical setting, performance can be expected to improve with the addition of further
      training data.
    • The hybrid survival-classification/regression approach for time window classification
      seems not to be an appropriate model for time-to-event analysis in this case. We suggest
      other approaches to this problem below.
    • Performance on the M6 task was not significantly different from that observed on the
      M0 task, which may suggest that ALSFRS-R measurements need to be recorded across a
      longer observation interval than six months, or that more detailed data is needed to create
      time-series in which the progression of the disease could be effectively pattern-matched.
  In our estimation, advances could be made on this problem in future work in the following
ways:
    • Stratification of training data: it is well known that ALS is highly variable in its rate of
      onset and progression, so it is reasonable to hypothesise that it may be better to identify
      distinct sub-populations in the training data and train separate survival models on each
      one, as a single analysis may not be enough to capture the complex dependencies in
      the dataset. This is in fact a commonly-used approach in survival analysis that can often
      lead to significant improvements.
    • Comparison of a wider range of models: due to various constraints, this work focuses on a
      single learning strategy for the parameterisation of the survival and regression functions,
      but further work should compare this approach with random forests, support vector
      machines, Bayesian classifiers and neural network-based approaches. This is particularly
      relevant given the limitations imposed by the proportional hazards assumption, which
      may in fact not be ideal for a task as complex as the prediction of the progression of ALS.
    • Reformulation of the time-to-event task: in this work we decided to base the training for
      the time-to-event prediction, which has a discrete output space, on a regression model,
      which has a continuous output space. It would perhaps be more sensible and efficient,
      and give better results, to train the model directly on the multi-class classification task of
      associating each subject with a time window.


References
 [1] A. Guazzo, I. Trescato, E. Longato, E. Hazizaj, D. Dosso, G. Faggioli, G. M. Di Nunzio,
     G. Silvello, M. Vettoretti, E. Tavazzi, C. Roversi, P. Fariselli, S. C. Madeira, M. de Carvalho,
     M. Gromicho, A. Chiò, U. Manera, A. Dagliati, G. Birolo, H. Aidos, B. Di Camillo, N. Ferro,
     Overview of iDPP@CLEF 2022: The Intelligent Disease Progression Prediction Challenge,
     in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), CLEF 2022 Working Notes, CEUR
     Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, 2022.
 [2] A. Guazzo, I. Trescato, E. Longato, E. Hazizaj, D. Dosso, G. Faggioli, G. M. Di Nunzio,
     G. Silvello, M. Vettoretti, E. Tavazzi, C. Roversi, P. Fariselli, S. C. Madeira, M. de Car-
     valho, M. Gromicho, A. Chiò, U. Manera, A. Dagliati, G. Birolo, H. Aidos, B. Di Camillo,
     N. Ferro, Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2022, in:
     A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald,
     G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Mul-
     tilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International
     Conference of the CLEF Association (CLEF 2022), Lecture Notes in Computer Science
     (LNCS) 13390, Springer, Heidelberg, Germany, 2022.
 [3] J. Cedarbaum, N. Stambler, E. Malta, C. Fuller, D. Hilt, B. Thurmond, A. Nakanishi,
     The ALSFRS-R: a revised ALS functional rating scale that incorporates assessments
     of respiratory function., BDNF ALS Study Group (Phase III). J Neurol Sci. (1999).
     doi:10.1016/s0022-510x(99)00210-5.
 [4] D. R. Cox, Regression models and life-tables, Journal of the Royal Statistical Society (1972).
 [5] N. Breslow, Analysis of survival data under the proportional hazards model, International
     Statistical Review (1975).
 [6] J. Friedman, Greedy function approximation: a gradient boosting machine, The Annals of
     Statistics (2001).
 [7] G. Biau, B. Cadre, L. Rouvière, Accelerated gradient boosting, Machine Learning 108 (2019)
     971–992.
 [8] D. Jarrett, J. Yoon, M. van der Schaar, Dynamic prediction in clinical survival analysis using
      temporal convolutional networks, IEEE Journal of Biomedical and Health Informatics (2019).
     doi:10.1109/JBHI.2019.2929264.
 [9] L. Storelli, M. Azzimonti, M. Gueye, C. Vizzino, P. Preziosa, G. Tedeschi, N. De Stefano,
     P. Pantano, M. Filippi, M. Rocca, A deep learning approach to predicting disease progression
     in multiple sclerosis using magnetic resonance imaging, Invest Radiol. (2022). doi:10.
     1097/RLI.0000000000000854.
[10] G. Lee, K. Nho, B. Kang, K. Sohn, D. Kim, Predicting alzheimer’s disease progres-
     sion using a multi-modal deep learning approach, Sci Rep (2019). doi:10.1038/
     s41598-018-37769-z.
[11] Z. Zhao, Y. Li, Y. Wu, R. Chen, Deep learning-based model for predicting progression in
     patients with head and neck squamous cell carcinoma, Cancer Biomark. (2020). doi:10.
     3233/CBM-190380.
[12] A. Hussain Shahid, M. Prasad Singh, A deep learning approach for prediction of
     Parkinson’s disease progression, Biomedical Engineering Letters (2020). doi:10.1007/
     s13534-020-00156-7.
[13] A. Abuhelwa, G. Kichenadasse, R. McKinnon, A. Rowland, A. Hopkins, M. Sorich, Ma-
     chine learning for prediction of survival outcomes with immune-checkpoint inhibitors in
     urothelial cancer, Cancers (Basel) (2021). doi:10.3390/cancers13092001.
[14] M. Konerman, L. Beste, T. Van, B. Liu, X. Zhang, J. Zhu, S. Saini, G. Su, B. Nallamothu,
     G. Ioannou, A. Waljee, Machine learning models to predict disease progression among
     veterans with hepatitis C virus, Machine Learning in Health and Biomedicine (2019).
     doi:10.1371/journal.pone.0208141.
[15] S. Grampurohit, C. Sagarnal, Disease prediction using machine learning algorithms, in:
     2020 International Conference for Emerging Technology (INCET), 2020, pp. 1–7. doi:10.
     1109/INCET49848.2020.9154130.
[16] M. Ferjani, Disease prediction using machine learning (2020). doi:10.13140/RG.2.2.
     18279.47521.
[17] D. Le, Machine learning-based approaches for disease gene prediction, Brief Funct Genomics
     (2020). doi:10.1093/bfgp/elaa013.
[18] D. Ding, T. Lang, D. Zou, J. Tan, J. Chen, L. Zhou, D. Wang, R. Li, Y. Li, J. Liu, C. Ma,
     Q. Zhou, Machine learning-based prediction of survival prognosis in cervical cancer, BMC
     Bioinformatics (2021). doi:10.1186/s12859-021-04261-x.
[19] D. Haritha, N. Swaroop, M. Mounika, Prediction of COVID-19 cases using CNN with
     X-rays, in: 2020 5th International Conference on Computing, Communication and Security
     (ICCCS), 2020, pp. 1–6. doi:10.1109/ICCCS49678.2020.9276753.
[20] A. Chaddad, L. Hassan, C. Desrosiers, Deep CNN models for predicting COVID-19 in
     CT and X-ray images, J Med Imaging (Bellingham) (2021). doi:10.1117/1.JMI.8.S1.
     014502.
[21] A. A. A. Musleh, A. Y. Maghari, COVID-19 detection in X-ray images using CNN algorithm,
     in: 2020 International Conference on Promising Electronic Technologies (ICPET), 2020, pp.
     5–9. doi:10.1109/ICPET51420.2020.00010.
[22] R. Mohammad, M. Aljabri, M. Aboulnour, S. Mirza, Classifying the mortality of people with
     underlying health conditions affected by COVID-19 using machine learning techniques,
     Applied Computational Intelligence and Soft Computing (2022). doi:10.1155/2022/
     3783058.
[23] S. Aljameel, I. Ullah Kahn, N. Aslam, M. Aljabri, E. Alsulmi, Machine learning-based model
     to predict the disease severity and outcome in COVID-19 patients, Scientific Programming
     (2021). doi:10.1155/2021/5587188.
[24] F. Xu, X. Chen, X. Yin, Q. Qiu, J. Xiao, L. Qiao, M. He, L. Tang, X. Li, Q. Zhang, Y. Lu,
     S. Xiao, R. Zhao, Y. Guo, M. Chen, D. Chen, L. Wen, B. Wang, Y. Nian, K. Liu, Prediction of
     disease progression of COVID-19 based upon machine learning, International Journal of
      General Medicine (2021). doi:10.2147/IJGM.S294872.
[25] R. Mojjada, A. Yadav, A. Prabhu, Y. Natarajan, Machine learning models for COVID-19
     future forecasting, Mater Today Proc. (2020). doi:10.1016/j.matpr.2020.10.962.
[26] A. Das, S. Mishra, S. Saraswathy Gopalan, Predicting CoVID-19 community mortality
     risk using machine learning and development of an online prognostic tool, PeerJ. (2020).
     doi:10.7717/peerj.10083.
[27] T. Magnus, M. Beck, R. Giess, I. Puls, M. Naumann, K. Toyka, Disease progression in
     amyotrophic lateral sclerosis: predictors of survival, Muscle Nerve. (2002). doi:10.1002/
     mus.10090.
[28] K. Imamura, Y. Yada, Y. Izumi, M. Morita, A. Kawata, T. Arisato, A. Nagahashi, T. Enami,
     K. Tsukita, H. Kawakami, M. Nakagawa, R. Takahashi, H. Inoue, Prediction model of
     amyotrophic lateral sclerosis by deep learning with patient induced pluripotent stem cells,
     Annals of Neurology (2021). doi:10.1002/ana.26047.
[29] N. G. Simon, M. R. Turner, S. Vucic, A. Al-Chalabi, J. Shefner, C. Lomen-Hoerth, M. C.
     Kiernan, Quantifying disease progression in amyotrophic lateral sclerosis, Annals of
     Neurology (2014). doi:10.1002/ana.24273.
[30] R. Gomeni, M. Fava, Amyotrophic lateral sclerosis disease progression model, Pooled
     Resource Open-Access ALS Clinical Trials Consortium (2014). doi:10.3109/21678421.
     2013.838970.
[31] A.-L. Kjældgaard, K. Pilely, K. S. Olsen, A. H. Jessen, A. Ø. Lauritsen, S. W. Pedersen,
     K. Svenstrup, M. Karlsborg, H. Thagesen, M. Blaabjerg, Á. Theódórsdóttir, E. G. Elmo, A. T.
     Møller, L. Bonefeld, M. Berg, P. Garred, K. Møller, Prediction of survival in amyotrophic
     lateral sclerosis: a nationwide Danish cohort study, BMC Neurology (2021). doi:10.1186/
     s12883-021-02187-8.
[32] H. Westeneng, T. Debray, A. Visser, R. van Eijk, J. Rooney, A. Calvo, S. Martin, C. Mc-
     Dermott, A. Thompson, S. Pinto, X. Kobeleva, A. Rosenbohm, B. Stubendorff, H. Sommer,
     B. Middelkoop, A. Dekker, J. van Vugt, W. van Rheenen, A. Vajda, M. Heverin, M. Ka-
     zoka, H. Hollinger, M. Gromicho, S. Körner, T. Ringer, A. Rödiger, A. Gunkel, C. Shaw,
     A. Bredenoord, M. van Es, P. Corcia, P. Couratier, M. Weber, J. Grosskreutz, A. Ludolph,
     S. Petri, M. de Carvalho, P. Van Damme, K. Talbot, M. Turner, P. Shaw, A. Al-Chalabi,
     A. Chiò, O. Hardiman, K. Moons, J. Veldink, L. van den Berg, Prognosis for patients with
     amyotrophic lateral sclerosis: development and validation of a personalised prediction
     model, Lancet Neurology (2018). doi:10.1016/S1474-4422(18)30089-9.
[33] J. Gordon, B. Lerner, Insights into amyotrophic lateral sclerosis from a machine learning
     perspective, Journal of Clinical Medicine (2019). doi:10.3390/jcm8101578.
[34] R. Khosla, M. Rain, S. Sharma, A. Anand, Amyotrophic Lateral Sclerosis (ALS) prediction
     model derived from plasma and CSF biomarkers, PLoS One (2021). doi:10.1371/journal.
     pone.0247025.
[35] A. Taylor, C. Fournier, M. Polak, L. Wang, N. Zach, M. Keymer, J. Glass, D. Ennist, Pooled
     resource open-access ALS clinical trials consortium. Predicting disease progression in
     Amyotrophic Lateral Sclerosis, Annals of Clinical and Translational Neurology (2016).
     doi:10.1002/acn3.348.
[36] D. Rubin, Multiple imputation for nonresponse in surveys., 1986. doi:10.1002/
     9780470316696.
[37] G. Kalton, D. Kasprzyk, Imputing for missing survey responses, Proceedings of the Section
     on Survey Research Methods (1982).
[38] C. K. Enders, Applied Missing Data Analysis, Guilford Press New York, 2010.
[39] B. Ren, L. Pueyo, C. Chen, E. Choquet, J. H. Debes, G. Duchene, F. Menard, M. D. Perrin,
     Using data imputation for signal separation in high contrast imaging, The Astrophysical
     Journal (2020). doi:10.3847/1538-4357/ab7024.
[40] R. Lall, T. Robinson, The MIDAS touch: Accurate and scalable missing-data imputation
     with deep learning, Political Analysis (2021). doi:10.1017/pan.2020.49.
[41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
     P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
     M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine
     Learning Research 12 (2011) 2825–2830.
[42] S. Pölsterl, scikit-survival: A library for time-to-event analysis built on top of scikit-learn,
     Journal of Machine Learning Research 21 (2020) 1–6. URL: http://jmlr.org/papers/v21/
     20-729.html.
[43] T. Chen, C. Guestrin, XGBoost, in: Proceedings of the 22nd ACM SIGKDD International
     Conference on Knowledge Discovery and Data Mining, ACM, 2016. URL: https://doi.org/
     10.1145%2F2939672.2939785. doi:10.1145/2939672.2939785.