=Paper=
{{Paper
|id=Vol-3740/paper-126
|storemode=property
|title=Predicting the Functional Rating Scale and Self-Assessment Status of ALS Patients with Sensor Data
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-126.pdf
|volume=Vol-3740
|authors=Andreia S. Martins,Daniela M. Amaral,Eduardo N. Castanho,Diogo F. Soares,Ruben Branco,Sara C. Madeira,Helena Aidos
|dblpUrl=https://dblp.org/rec/conf/clef/MartinsACSBMA24
}}
==Predicting the Functional Rating Scale and Self-Assessment Status of ALS Patients with Sensor Data==
Notebook for the iDPP@CLEF Lab at CLEF 2024
Andreia S. Martins†, Daniela M. Amaral†, Eduardo N. Castanho†, Diogo F. Soares, Ruben Branco*, Sara C. Madeira and Helena Aidos
LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal
Abstract
Amyotrophic Lateral Sclerosis (ALS) is a neurodegenerative disease causing progressive loss of cognitive and
motor functions. Due to limited understanding of its mechanisms, there is no cure. Prognosis is nevertheless crucial for the
effective planning of symptom treatment; however, the heterogeneity in patient progression drives the need for
precision medicine research. iDPP@CLEF 2024 aims to develop novel methodologies for predicting ALS disease
progression, enabling the community to combine efforts and improve current prognostic methods. This report
discusses our participation in tasks 1 and 2, evaluating the impact of sensor data on improving the prediction of
ALSFRS-R scores. The proposed methodology combines temporal summarization techniques (extracting relevant
statistics from the sensors), feature selection and extraction methods, and state-of-the-art classifiers for each
ALSFRS-R question independently. Results show that random forest models yield the best overall performance,
and selecting the k-best features and biclustering were the best overall feature selection and extraction strategies
for tasks 1 and 2, respectively.
Keywords
Amyotrophic Lateral Sclerosis, Prognostic Prediction, Time Series Data, Biclustering, Multi-Class Classification
1. Introduction
Amyotrophic Lateral Sclerosis (ALS) is a devastating neurodegenerative disease characterized by the
progressive degeneration of motor neurons, leading to muscle weakness, atrophy, and eventual paral-
ysis [1]. The progression of ALS varies significantly among patients, with some experiencing rapid
deterioration while others decline more slowly [2]. This variability complicates the ability to predict
disease trajectory, making it challenging for clinicians to offer accurate prognoses and for patients to
make informed decisions about their future care [3].
Traditionally, clinical assessments of ALS progression rely on periodic evaluations using scales
like the ALS Functional Rating Scale-Revised (ALSFRS-R) [4]. Although essential, these assessments
provide only snapshots of a patient’s condition at discrete time points and can miss subtle but critical
changes between visits. This intermittent data collection limits the ability to detect early signs of disease
worsening and delays the implementation of necessary interventions.
Recent advancements in sensor technology present a promising solution to these limitations. Sensors
can generate a rich, real-time dataset by continuously monitoring physiological parameters such as
muscle activity, respiratory function, and movement patterns [5]. This continuous data capture offers a
detailed and dynamic view of a patient’s condition, potentially revealing early indicators of disease
progression that would otherwise go unnoticed between clinical visits [6].
However, to fully understand and predict ALS progression, it is essential to complement sensor data
with patients’ self-assessment data [7]. Self-assessments provide critical insights into subjective symp-
toms such as pain, fatigue, and emotional well-being, which are not easily quantifiable through sensors
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
asmartins@ciencias.ulisboa.pt (A. S. Martins); daniela.amaral@tecnico.ulisboa.pt (D. M. Amaral);
ejcastanho@ciencias.ulisboa.pt (E. N. Castanho); dfsoares@ciencias.ulisboa.pt (D. F. Soares); rmbranco@ciencias.ulisboa.pt
(R. Branco); sacmadeira@ciencias.ulisboa.pt (S. C. Madeira); haidos@ciencias.ulisboa.pt (H. Aidos)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
alone. Integrating objective sensor data with subjective self-assessment data creates a comprehensive,
multidimensional dataset encompassing measurable physical changes and the patient’s lived disease
experience [8].
In this context, within the iDPP@CLEF 2024 challenge¹ framework, we tackled Tasks 1 and 2,
which target predicting the twelve scores of the ALSFRS-R from sensor data. Task 1 aims to predict
the score assigned by the clinician at the second visit, while Task 2 targets the patient’s second self-
assessment score. This paper reports the work done to overcome this challenge. We frame this
challenge as a multi-label, multi-class classification problem with high-dimensional data. To handle
the longitudinal datasets, we consider a double-step approach that transforms the time series sensor
data using statistics computed from a time window period. Additionally, we test two feature selection
strategies (K-Best features in all sensors and K-Best features in each sensor) and one feature extraction
strategy (Biclustering-based features). To classify the ALSFRS-R scores, we train several state-of-the-art
classifiers for each question independently.
2. Related Work
Sensor technology has gained significant traction in recent years for monitoring ALS patients. Wearable
sensors, such as accelerometers and gyroscopes, have been used to continuously monitor motor function, gait,
and other physical activities [5, 9, 10, 11]. Accelerometer studies demonstrated their effectiveness in
capturing detailed movement patterns, providing valuable data for assessing motor decline in ALS
patients [6, 10, 9]. Vieira et al. [12] developed a model targeting ALS progression prediction based on
voice samples and accelerometer measurements from a four-year longitudinal dataset. This model was
used to predict bulbar-related and limb-related ALSFRS-R scores. Straczkiewicz et al. [11] used wrist
wearables and ALSFRS-R self-entries data to propose new measures to quantify the count and duration
of upper limb movements.
In addition to sensor data, integrating patients’ self-assessment data has proven beneficial in un-
derstanding ALS progression. Studies have shown that self-reported pain, fatigue, and quality of life
measures can provide critical insights that complement objective sensor data [13, 7].
Machine learning techniques have been increasingly applied to predict disease progression in ALS [14].
Predicting the progression of the functional domains (twelve questions) assessed by the well-known
ALSFRS-R functional scale was also investigated by Gordon and Lerner [15]. They modeled a
multiclass classifier using demographic, respiratory assessments, genetic data, and other dynamic data
to predict the values of each ALSFRS-R question at the time of the last patient visit.
Subspace techniques, such as pattern mining, biclustering, and triclustering, discover local patterns
with non-constant coherencies that have potential for predictive tasks. Martins et al. [16] recently proposed
combining itemset mining with sequential pattern mining to uncover disease presentation and pro-
gression patterns in ALS patients and utilize these patterns to forecast the need for non-invasive ventilation (NIV). In a similar
approach with the same prognostic target, Matos et al. [17] suggested a classifier based on biclustering.
Biclustering [18, 19] was used to locate groups of patients with similar values in subsets of clinical
features (biclusters), which were then combined with static data as features. Although promising,
none of these methods considered the temporal relationship of features. Soares et al. [20] proposed
BicTric, a classifier capable of learning predictive models from both static and temporal data using
discriminative patterns obtained through biclustering and triclustering [21, 22, 23]. Recently, Soares et al.
[24] enhanced BicTric with TCtriCluster, a triclustering algorithm incorporating temporal contiguity
constraints. These approaches utilized temporal preprocessing with snapshots and the time windows
method proposed by Carreiro et al. [25] to learn predictive models for various clinically relevant ALS
endpoints.
Integrating multi-modal data sources, including sensor data, self-assessments, and traditional clinical
metrics, has shown potential in providing a more comprehensive understanding of ALS progression.
Johnson et al. [8] conducted a study combining wearable sensor data with patient-reported outcomes
¹ http://brainteaser.dei.unipd.it/challenges/idpp2024/
and clinical assessments, demonstrating that multi-modal data fusion could enhance predictive accuracy
and offer deeper insights into disease dynamics.
3. Methodology
The objective of Tasks 1 and 2 of the iDPP@CLEF 2024 challenge is to predict the values of the
ALSFRS-R sub-scores of a second evaluation, given the values of the first evaluation. This would imply
a reduced set of training instances (52 patients, in total), so we decided to generalize the challenge to
predict the ALSFRS-R sub-scores of any evaluation given a previous evaluation, resulting in 121 training
instances for Task 1 and 220 instances for Task 2.
The dataset made available [26, 27] with this challenge contains information on ALS patients com-
prising the following data: static (including demographic and clinical information), all the ALSFRS-R
evaluations (comprising the scores of the 12 questions for each patient), and sensor data (collected from
the sensors of a fitness smartwatch). Figure 1 illustrates the processing of the dataset.
Figure 1: Data processing pipeline. Addressing the challenge implies handling data from three sources: static
clinical variables, ALSFRS-R scores, and sensor data. To handle the highly dimensional sensor time series data,
we computed statistics for each sensor and then applied feature selection or extraction strategies to reduce the
dimensionality of the sensor dataset. The final dataset (that feeds the classifiers) aggregates these data sources.
Tasks 1 and 2 face a significant hurdle due to the sensor dataset’s high dimensionality, stemming
from a large number of sensor features (90 in total) and the numerous time points (approximately 268
sensor records per patient). To address this issue, we used a two-step processing of the dataset: first,
we extracted temporal statistics from the longitudinal datasets. Second, we used feature selection or
extraction techniques to obtain a representation of the sensor statistics with smaller dimensionality.
3.1. Time Series Statistics
We derived new features from the longitudinal sensor data for each evaluation using summarization
techniques, consisting of statistical metrics such as mean, standard deviation, minimum and maximum
Table 1
Number of excluded sensor features, by category. The sensor data can be grouped into 6 distinct categories
(Category). For each original sensor feature within these categories, 6 statistical metrics - mean, standard
deviation, minimum value, maximum value, first value, and last value - were computed (#Computed Features).
Features exhibiting zero or near-zero variance (#Low Variance) and those highly correlated with other features
within the same category (#High Correlation) were removed from the dataset.
Category       #Computed Features   Task 1: #Low Variance / #High Correlation   Task 2: #Low Variance / #High Correlation
calories       18                   0 / 10                                      0 / 9
steps          24                   0 / 3                                       0 / 3
beat_to_beat   240                  13 / 116                                    10 / 108
heart_rate     60                   10 / 1                                      9 / 2
respiration    108                  0 / 27                                      0 / 31
SpO2           90                   0 / 20                                      0 / 16
values, and the first and last values of each feature (as in Branco et al. [28]). To avoid the bias introduced
by considering the entire sensor data history, these metrics were computed within fixed time intervals,
specifically considering the interval [𝑡 − 𝛿, 𝑡], where 𝑡 represents the day of the target appointment
and 𝛿 is the number of days within the interval (set to 15 days for Task 1 and 7 days for Task 2). This
computation resulted in 540 new sensor features (90 original sensor features × 6 statistical metrics).
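As an illustration of this temporal summarization, the sketch below computes the six statistics over the window [𝑡 − 𝛿, 𝑡] for every sensor column of a per-patient data frame; the column layout (a day column plus one column per sensor feature) and the function name are assumptions for illustration, not the original implementation.

```python
import pandas as pd

def window_stats(sensor_df: pd.DataFrame, t: int, delta: int) -> pd.Series:
    """Summarize each sensor column over the window [t - delta, t].

    Assumes `sensor_df` holds one row per day ('day' column) and one
    column per sensor feature (hypothetical layout).
    """
    window = sensor_df[(sensor_df["day"] >= t - delta) & (sensor_df["day"] <= t)]
    feats = {}
    for col in window.columns.drop("day"):
        values = window[col].dropna()
        feats[f"{col}_mean"] = values.mean()
        feats[f"{col}_std"] = values.std()
        feats[f"{col}_min"] = values.min()
        feats[f"{col}_max"] = values.max()
        feats[f"{col}_first"] = values.iloc[0] if len(values) else float("nan")
        feats[f"{col}_last"] = values.iloc[-1] if len(values) else float("nan")
    return pd.Series(feats)

# e.g., delta = 15 for Task 1 or delta = 7 for Task 2, with t the day of the
# target appointment:
# stats = window_stats(patient_sensors, t=120, delta=15)
```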
Another issue encountered with the dataset was missing values, even after the aforementioned
computations. To address this, various interpolation and imputation techniques were explored, with
polynomial interpolation of degree 5 proving to be the most effective in minimizing variance decrease
across the feature sets.
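A minimal sketch of this imputation step, assuming the windowed statistics are gathered in a hypothetical DataFrame X (one row per evaluation): pandas' polynomial interpolation (backed by SciPy) needs at least six non-missing points for a degree-5 fit, so sparser columns fall back to linear interpolation here.

```python
import pandas as pd

def impute(X: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values with degree-5 polynomial interpolation (sketch)."""
    X = X.copy()
    for col in X.columns:
        if X[col].notna().sum() > 5:           # a degree-5 fit needs >= 6 points
            X[col] = X[col].interpolate(method="polynomial", order=5)
        else:
            X[col] = X[col].interpolate(method="linear")
    return X.ffill().bfill()                   # fill any remaining edge gaps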
After the interpolation step, sensor features exhibiting zero or near-zero variance (less than 10⁻⁵)
were deemed uninformative and consequently removed. Furthermore, highly correlated sensor fea-
tures within the same category (calories, steps, beat_to_beat, heart_rate, respiration, and SpO2 ) were
also eliminated to mitigate redundancy. The selection of features for removal was based on Pearson
correlation, with a correlation threshold set at 0.95 (see Table 1).
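The variance and correlation filters can be sketched as follows; X is assumed to hold the columns of a single sensor category, since the correlation filter is applied within each category.

```python
import numpy as np
import pandas as pd

def drop_uninformative(X: pd.DataFrame, var_thr: float = 1e-5,
                       corr_thr: float = 0.95) -> pd.DataFrame:
    """Drop near-constant columns, then one column of each highly correlated pair."""
    X = X.loc[:, X.var() > var_thr]                       # zero / near-zero variance
    corr = X.corr(method="pearson").abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_thr).any()]
    return X.drop(columns=to_drop)
```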
3.2. Feature Selection and Extraction Techniques
The sensor statistics obtained from the previously discussed step are still high dimensional, as there are
340 features for Task 1 and 352 features for Task 2. Subsequently, we applied three techniques, two for
feature selection, (i) and (ii), and one for feature extraction, (iii), to reduce the dataset dimensionality:
(i) K-Best features in all sensors;
(ii) K-Best features in each target;
(iii) Biclustering-based features.
The first two feature selection techniques are based on a k-best selection strategy. First, we selected
the top 5 features for predicting each target question based on ANOVA F-value between labels and
features. Predictions were then made using the set of highest-ranked sensor statistical features across
all questions (All Sensors). Alternatively, a specialized prediction approach was also adopted wherein
the top 5 features were selected independently for each ALSFRS-R question based on mutual infor-
mation (Each Target) (see Table 2). These selections were made using the SelectKBest class of the
sklearn.feature_selection Python module.
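A sketch of the two selection strategies with scikit-learn's SelectKBest; the exact way the per-question rankings are merged for the All Sensors variant is our reading of the description above (a union of the top-5 sets), so treat the helper names and that detail as assumptions.

```python
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

def kbest_all_sensors(X, y_per_question, k=5):
    """All Sensors: union of the top-k features (ANOVA F-value) over all questions."""
    selected = set()
    for y in y_per_question:                  # one label vector per ALSFRS-R question
        mask = SelectKBest(f_classif, k=k).fit(X, y).get_support()
        selected.update(X.columns[mask])
    return sorted(selected)

def kbest_each_target(X, y, k=5):
    """Each Target: top-k features by mutual information for a single question."""
    mask = SelectKBest(mutual_info_classif, k=k).fit(X, y).get_support()
    return list(X.columns[mask])
```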
As an alternative to these aforementioned feature selection strategies, we used a feature extraction
strategy based on biclustering to reduce the dataset dimensionality. Biclustering, the simultaneous
clustering of rows and columns of a data matrix, has shown its ability to discover local patterns with
non-constant coherencies in both descriptive and predictive learning tasks [21, 18]. Our approach,
illustrated in Figure 2, applies biclustering to the Patient×Sensor Feature training matrix to obtain the
Table 2
Number of selected top-ranked features, by category. Predictions were made using the pairs of strategy-models
of highest-ranked computed sensor features, based on the ANOVA F-value, across all questions (All Sensors).
Additionally, a specialized prediction method was employed, wherein the top 5 features were independently
selected for each ALSFRS-R question based on mutual information (Each Target).
                    Task 1                                                          Task 2
                    calories  steps  beat_to_beat  heart_rate  respiration  SpO2   calories  steps  beat_to_beat  heart_rate  respiration  SpO2
                    (n=8)     (n=21) (n=111)       (n=49)      (n=81)       (n=70) (n=9)     (n=21) (n=122)       (n=49)      (n=77)       (n=74)
All Sensors         6         6      13            5           3            3      6         5      19            1           5            5
Each Target   Q1    3         0      1             1           0            0      1         0      0             0           1            3
              Q2    0         0      0             3           0            2      4         0      1             0           0            0
              Q3    5         0      0             0           0            0      5         0      0             0           0            0
              Q4    0         0      4             0           1            0      0         0      3             0           0            2
              Q5    0         0      5             0           0            0      0         0      5             0           0            0
              Q6    0         3      2             0           0            0      0         0      5             0           0            0
              Q7    0         4      1             0           0            0      0         0      5             0           0            0
              Q8    0         5      0             0           0            0      0         4      1             0           0            0
              Q9    0         5      0             0           0            0      0         4      1             0           0            0
              Q10   0         0      2             1           2            0      1         0      3             0           1            0
              Q11   1         0      3             0           0            1      0         0      4             1           0            0
              Q12   0         0      5             0           0            0      0         0      0             0           5            0
biclusters, with the row pattern of each bicluster being computed as the mean value of each column.
Then, the Euclidean distance between each training (and test) sample and the row pattern of each
bicluster is computed to obtain a reduced representation of the training (and test) set.
Figure 2: We used an approach based on biclustering-computed features. First, we apply a biclustering algorithm
to obtain a set of biclusters (sub-matrices) from the dataset. Second, we compute the row pattern of each bicluster.
Finally, we compute the distance between each row of the dataset and each bicluster to obtain the new reduced
dataset. To simplify the representation of this methodology, we illustrate the pattern of a bicluster by the mode
of each column (instead of the mean value) and use the Manhattan distance between each row and bicluster
instead of the Euclidean distance.
We considered Spectral Biclustering to mine the biclusters as implemented in scikit-learn [29, 30].
The number of biclusters influences the number of features in the reduced dataset. In our approach,
we tested values for the number of biclusters and selected the value that maximizes the number of
non-trivial biclusters (biclusters with more than 2 rows and columns).
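A minimal sketch of this mapping with scikit-learn's SpectralBiclustering, assuming X_train and X_test are numeric Patient×Sensor Feature matrices; the number of biclusters and the helper name are illustrative, not the tuned values.

```python
import numpy as np
from sklearn.cluster import SpectralBiclustering

def bicluster_features(X_train, X_test, n_clusters=3):
    """Encode samples as Euclidean distances to bicluster row patterns (sketch)."""
    model = SpectralBiclustering(n_clusters=n_clusters, random_state=0).fit(X_train)
    train_feats, test_feats = [], []
    for b in range(model.rows_.shape[0]):
        rows, cols = model.rows_[b], model.columns_[b]
        if rows.sum() <= 2 or cols.sum() <= 2:   # keep only non-trivial biclusters
            continue
        # Row pattern: column-wise mean of the bicluster's submatrix.
        pattern = X_train[np.ix_(rows, cols)].mean(axis=0)
        # Distance of every sample to the pattern, restricted to the bicluster's columns.
        train_feats.append(np.linalg.norm(X_train[:, cols] - pattern, axis=1))
        test_feats.append(np.linalg.norm(X_test[:, cols] - pattern, axis=1))
    return np.column_stack(train_feats), np.column_stack(test_feats)
```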
3.3. Modeling and Hyperparameter Optimization
In this section, we discuss our classification methodology, as illustrated in Figure 3.
Figure 3: The challenge implies multi-label, multi-class tasks. To simplify the training, we train classifiers for
each question independently. We use SMOTE to compensate for a lack of sufficient representation across each
scale value when possible. We train several traditional classifiers for each question, optimized considering the
mean absolute error.
The tasks at hand are multi-label, multi-class tasks, which add complexity to the standard modeling
techniques. Furthering the difficulty, the labels, which are the ALSFRS-R questions, are not completely
independent, as the sub-scores are correlated within the different domains (bulbar, fine motor / upper
limb, gross motor / lower limb, and respiratory).
Despite this intricacy, we decided to simplify the tasks by separating them into independent multi-class
problems, where a given patient’s ALSFRS-R evaluation and their sensor data are used to predict each
sub-score individually. Despite not modeling the correlation between questions, we assume that the
models could still connect a patient’s condition in time with their ability to perform a single function.
We train 12 models and combine their predictions to predict the full set of sub-scores.
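A minimal sketch of this per-question decomposition, assuming a final feature matrix X and a label matrix Y with one column per ALSFRS-R question (hypothetical names); a Random Forest stands in here for whichever classifier wins the later model selection.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_per_question(X, Y, n_questions=12):
    """Train one independent multi-class classifier per ALSFRS-R question."""
    return [RandomForestClassifier(random_state=0).fit(X, Y[:, q])
            for q in range(n_questions)]

def predict_all(models, X_new):
    """Combine the 12 per-question predictions into the full set of sub-scores."""
    return np.column_stack([m.predict(X_new) for m in models])
```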
We consider a set of well-known classifiers covering a diverse range of model types, using scikit-
learn [29]: Logistic Regression (LR), Random Forest (RF), XGBoost, and Support Vector Machines (SVC).
Each model undergoes a model-appropriate pre-processing if required, and the optimal hyperparameters
are searched for, as will be described later on.
For questions that have a sufficient representation across each of the scale values (0 to 4),² we employ
imblearn [31]’s implementation of SMOTE [32], to alleviate the issue of small training sample size.
It is common to scale the input data for linear models to avoid widely different magnitudes across
features that can hurt learning and performance. We use a standard scaler for Logistic Regression and
Support Vector Machines to scale the input data.
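For a single question, this model-specific pre-processing can be expressed as an imbalanced-learn pipeline; the SMOTE neighbour count and the regularization strength below are placeholders, not the tuned values.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# SMOTE oversamples under-represented score values on the training data only;
# the scaler is included for the linear models (LR, SVC).
pipeline = Pipeline(steps=[
    ("smote", SMOTE(random_state=0, k_neighbors=3)),   # placeholder neighbour count
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=100000, C=1.0)),
])
# pipeline.fit(X_train, y_train_q)   # y_train_q: scores of one ALSFRS-R question
```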
We optimize the models using the Mean Absolute Error metric, both as a loss function for model
optimization and as the hyperparameter optimization objective, searching for the hyperparameter
configuration that yields the best performance on the validation set. We use Optuna [33]
for hyperparameter optimization, with the Tree-Structured Parzen Estimator algorithm (as a sampler),
avoiding a grid search brute-force approach to more efficiently sweep the hyperparameter space (see
Table 3 for hyperparameter range of each model). The best-performing model is then used for the
submissions in the challenge.
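As an illustration, an Optuna study for the Random Forest of a single question could look like the sketch below, using the TPE sampler to minimize validation MAE over the ranges listed in Table 3; the variables X_tr, y_tr, X_val and y_val are assumed to come from the split described next.

```python
import optuna
from optuna.samplers import TPESampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error

def objective(trial):
    # Sample hyperparameters within the Table 3 ranges and score on validation MAE.
    clf = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 10, 1000),
        max_depth=trial.suggest_int("max_depth", 1, 20),
        random_state=0,
    ).fit(X_tr, y_tr)
    return mean_absolute_error(y_val, clf.predict(X_val))

study = optuna.create_study(direction="minimize", sampler=TPESampler(seed=0))
# study.optimize(objective, n_trials=100)
# best_params = study.best_params
```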
To assess the generalization of trained models and to optimize hyperparameters, we split the provided
dataset into two sets: a train set and a validation set. As the dataset is multi-label and multi-class, regular
² Two questions in each task did not qualify, which were questions 11 and 12 for Task 1, and 3 and 11 for Task 2.
Table 3
The hyperparameter space for each model. Int and Float Distributions describe a search space between two
integers or floating values, whereas CategoricalDistribution specifies a set of discrete values.
Model                                    Hyperparameter   Distribution Space
XGBoost Classifier                       n_estimators     IntDistribution(100, 1000)
                                         max_depth        IntDistribution(1, 20)
                                         learning_rate    FloatDistribution(0.01, 1)
Random Forest                            n_estimators     IntDistribution(10, 1000)
                                         max_depth        IntDistribution(1, 20)
Logistic Regression(max_iter=100000)     C                FloatDistribution(0.01, 10)
SVC(max_iter=100000, cache_size=1000)    C                FloatDistribution(0.01, 10)
                                         gamma            FloatDistribution(0.01, 10)
                                         kernel           CategoricalDistribution(["linear", "rbf", "poly", "sigmoid"])
stratified train-test splits do not guarantee a representative proportion of each scale value for each
question for both splits. We resort to a variant termed iterative stratified train-test splitting [34, 35],
implemented in the scikit-multilearn package [36]. This method works by iteratively populating both
splits and assigning data points at each step to the split that requires them the most to maintain balance.
Ultimately, we ensure each split is as similar to the overall dataset as possible. We split the provided
training set following a 70/30 ratio, with 70% becoming the training set and 30% the validation set.
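In code, the split can be obtained with scikit-multilearn's iterative_train_test_split, assuming X is the feature matrix and Y the (n_samples, 12) matrix of sub-scores, both as NumPy arrays.

```python
from skmultilearn.model_selection import iterative_train_test_split

# 70/30 iterative stratified split over the 12 question labels.
X_train, Y_train, X_val, Y_val = iterative_train_test_split(X, Y, test_size=0.3)
```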
All the experiments were run on a Desktop Computer with an AMD Ryzen 9 7950X 16-Core with
64GB of RAM and Ubuntu 22.04.2. The code was run using Python 3.10.11.
4. Results & Discussion
In this section, we cover the results obtained in Tasks 1 and 2 in the challenge, as reported and computed
with the private test set made available by the lab organizers.
To examine the impact of our design choices on feature selection or extraction, we define an experi-
mental space beyond the basic analysis of the challenge results. First, for each question, we select the
pairs of feature selection or extraction strategy and classification model with the top-k (we consider
𝑘 ∈ {1, 2, 3}) highest validation metric values for both Mean Absolute Error (MAE) and Root Mean
Squared Error (RMSE) (see section 4.1). Next, to determine which feature selection or extraction per-
forms best, we consider the mean RMSE and MAE across the four classifiers for each question (see
section 4.2). Lastly, we will assess whether there is a significant advantage in using one classifier over
another. Given that the classifiers are all different types, identifying specific model properties suited for
this particular task could lead to improvements for each question (see section 4.3).
4.1. Selecting the best combination of feature strategy and classification model
We conducted experiments to predict the ALSFRS-R questions of a subsequent assessment by combining
the best models for each target question based on their validation set performance. Specifically, we
submitted the three best-performing pairs for both Tasks (see Table 4).
Table 5 presents the results of the models trained in each feature selection or extraction strategy
for predicting each target question, along with the global results (average RMSE and MAE values
across all questions). For both Tasks 1 and 2, the best-performing combination of feature selection or
extraction strategy and classification model (strategy-model pair) in the test set was the second-best
strategy-model pair in predicting the ALSFRS-R questions in the validation set. This suggests that
the training and validation sets used for optimizing and validating the classifiers were unsuitable for
Table 4
Results on the validation set for each combination of feature selection or extraction strategy and classification
model. RF stands for Random Forest, SVC for Support Vector Machine Classifier, and LR for Logistic Regression.
Best pair 2nd best pair 3rd best pair
Question Strategy Model Strategy Model Strategy Model
Task 1
Q1 All Sensors XGBoost Each Target XGBoost Each Target RF
Q2 Each Target RF Biclustering RF All Sensors RF
Q3 Biclustering RF Biclustering XGBoost All Sensors RF
Q4 Each Target SVC All Sensors RF All Sensors SVC
Q5 Each Target RF All Sensors XGBoost Each Target SVC
Q6 Biclustering XGBoost Biclustering RF All Sensors XGBoost
Q7 Each Target SVC Each Target RF All Sensors XGBoost
Q8 All Sensors SVC All Sensors LR All Sensors XGBoost
Q9 Biclustering XGBoost Biclustering RF All Sensors RF
Q10 All Sensors SVC Biclustering SVC Each Target LR
Q11 Biclustering XGBoost Each Target XGBoost Biclustering RF
Q12 All Sensors RF Biclustering RF All Sensors LR
Task 2
Q1 All Sensors SVC Biclustering RF Each Target RF
Q2 Each Target XGBoost Biclustering SVC Each Target LR
Q3 Biclustering XGBoost Biclustering SVC All Sensors XGBoost
Q4 Each Target XGBoost All Sensors XGBoost All Sensors RF
Q5 All Sensors RF Biclustering RF Each Target RF
Q6 All Sensors XGBoost Each Target RF All Sensors RF
Q7 All Sensors RF Biclustering RF Each Target XGBoost
Q8 Biclustering RF Each Target XGBoost Each Target RF
Q9 Biclustering XGBoost Biclustering RF Biclustering SVC
Q10 All Sensors SVC Each Target RF Each Target XGBoost
Q11 All Sensors XGBoost Biclustering RF Each Target RF
Q12 Biclustering RF All Sensors SVC Each Target RF
predicting the ALSFRS-R questions in the second evaluation. These sets included all evaluations made
available for the challenge, leading the models to be trained for predicting the next evaluation rather
than specifically the second evaluation.
In Task 1, two questions related to the bulbar domain, Q1 and Q2, and one respiratory question,
Q11, were the easiest to predict (RMSE 0.309, MAE 0.095). Specifically, Q1 and Q11 were best predicted
using the XGBoost classifier with the All Sensors (Best strategy-model pair) and Each Target (2nd best
strategy-model) feature selection strategies, respectively. Question Q2 was best predicted using the
RF classifier with the All Sensors strategy (3rd best strategy-model pair). In contrast, motor-related
questions, Q7 (trunk domain) and Q9 (lower limb domain) had the highest prediction errors (RMSE
0.873, MAE 0.476).
For Task 2, questions Q11 and Q12 were correctly classified for all the evaluations (RMSE 0.000
and MAE 0.000). Both questions used the RF classifier and the Biclustering strategy (2nd best
strategy-model and Best strategy-model pair, respectively). Question Q11 was also correctly classified
for all evaluations using the Each Target strategy (3rd best strategy-model pair). Conversely, Q4 had
the most misclassified evaluations (RMSE 1.044, MAE 0.545).
Table 5
Results of the submitted strategy-model pairs. Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE)
metrics for the three best strategy-model pairs presented in Table 4. The performance metrics are provided for
each target question and averaged across all the 12 questions (Global).
Model Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Global
Task 1
Best strategy-model pair RMSE 0.309 0.577 0.436 0.900 0.655 0.900 1.113 0.655 1.291 0.756 0.378 0.577 0.712
MAE 0.095 0.238 0.190 0.524 0.429 0.619 0.857 0.429 0.810 0.381 0.143 0.238 0.413
2nd best strategy-model pair RMSE 0.787 0.690 0.655 0.816 0.756 0.900 0.900 0.873 0.873 0.378 0.309 0.577 0.709
MAE 0.429 0.286 0.333 0.476 0.476 0.619 0.619 0.572 0.381 0.143 0.095 0.238 0.389
3rd best strategy-model pair RMSE 0.787 0.309 0.756 0.976 1.000 0.690 0.873 0.787 1.069 0.926 0.378 0.845 0.783
MAE 0.429 0.095 0.381 0.571 0.619 0.476 0.476 0.429 0.667 0.476 0.143 0.429 0.433
Task 2
Best strategy-model pair RMSE 0.905 0.739 0.522 1.206 1.279 1.000 0.674 0.798 0.302 1.414 1.206 0.000 0.837
MAE 0.636 0.364 0.273 0.727 0.909 0.818 0.455 0.636 0.091 1.091 0.364 0.000 0.530
2nd best strategy-model pair RMSE 1.000 0.798 0.739 1.044 0.953 0.739 0.522 0.798 0.302 1.000 0.000 0.603 0.708
MAE 0.636 0.455 0.364 0.545 0.545 0.545 0.273 0.636 0.091 0.636 0.000 0.182 0.409
3rd best strategy-model pair RMSE 0.953 0.798 0.522 1.044 0.853 0.905 0.905 0.739 0.603 1.679 0.000 0.302 0.775
MAE 0.545 0.455 0.273 0.545 0.545 0.818 0.636 0.545 0.364 1.182 0.000 0.909 0.500
4.2. Feature Selection and Extraction Comparison
As previously mentioned, one feature extraction and two feature selection strategies were assessed:
biclustering and K-Best selection, both globally for all questions (All Sensors) and individually for each
question (Each Target).
Table 6 presents the average model performance in the test set for each ALSFRS-R question and
feature selection or extraction method. Overall, no strategy clearly outperformed the others, with the
metrics typically not differing much between models with the same target question. However, the
preferred strategy does change with the target.
In Task 1, the best overall method was individual k-best selection, Each Target (RMSE 0.780, MAE
0.474). It gathered the best average metrics in 6 out of the 12 questions, followed by the biclustering
approach (RMSE 0.815, MAE 0.515) with 4 questions. Notably, there may be a preferred strategy by
domain: the All Sensors approach performed best in the trunk domain questions (Q6 and Q7), and Each
Target yielded the best metrics in the lower limb domain (Q8 and Q9). However, this behavior does not
seem to occur for the upper limb domain (Q4 and Q5). For the bulbar (Q1-Q3) and respiratory (Q10-Q12)
areas, the Biclustering and Each Target approaches achieved the best performance in two of the three
targets. The best average performance was obtained for Q11 (RMSE 0.361, MAE 0.131) and the worst
for Q6 (RMSE 0.909, MAE 0.667) and Q9 (RMSE 0.934, MAE 0.560).
For Task 2, the best overall strategy was feature transformation through Biclustering (RMSE 0.805,
MAE 0.483), with the best average metrics in 8 out of 12 targets. Compared to Task 1, there is more
overlap in the outcome of the three strategies, and as such, the second best method (Each Target; RMSE
0.836, MAE 0.507) had the best average metrics in 5 questions. Also, unlike Task 1, there is no preferred
strategy by domain, save for the respiratory questions (Q10-Q12) that are most easily predicted by
biclustering-based models. The best average performance was attained in Q12 (RMSE 0.419, MAE 0.318)
and the worst in Q10 (RMSE 1.191, MAE 0.818).
4.3. Model Comparison
We conducted experiments to predict the ALSFRS-R questions in the second evaluation using four
machine-learning classifiers - Logistic Regression (LR), Random Forests (RF), Support Vector Machine
(SVC), and XGBoost (XGB). We optimized their hyperparameters and validated their performance on a
validation set derived from the provided training set as described in section 3.3. In addition to these
classifiers, we also submitted two naïve approaches: Last Observation Carried Forward (LOCF) and
Majority Class.
Table 6
Model results’ summary, by feature selection and extraction strategy. Presented Root Mean Squared Error
(RMSE) and Mean Absolute Error (MAE) report to the average performance of the 4 tested classifiers (LR, RF,
SVC, XGBoost), in the test set. The performance metrics are provided for each target question and averaged
across all of the strategy’s models (Global), with the best outcome in bold.
Strategy Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Global
Task 1
Biclustering RMSE 0.730 0.775 0.616 0.820 0.909 1.008 1.015 0.843 1.095 0.825 0.378 0.761 0.815
MAE 0.393 0.488 0.298 0.440 0.643 0.702 0.810 0.571 0.667 0.548 0.143 0.476 0.515
All Sensors RMSE 0.744 0.836 0.883 0.975 0.804 0.909 0.884 0.733 1.205 0.842 0.477 0.849 0.845
MAE 0.417 0.571 0.488 0.560 0.536 0.667 0.571 0.429 0.821 0.548 0.190 0.536 0.528
Each Target RMSE 0.826 0.813 0.548 0.906 0.765 0.959 0.948 0.595 0.934 1.004 0.361 0.703 0.780
MAE 0.464 0.536 0.262 0.488 0.488 0.714 0.679 0.310 0.560 0.631 0.131 0.429 0.474
Task 2
Biclustering RMSE 0.738 0.910 0.726 1.115 1.028 0.892 0.574 0.698 0.452 1.191 0.914 0.419 0.805
MAE 0.432 0.523 0.364 0.659 0.614 0.705 0.341 0.500 0.227 0.818 0.295 0.318 0.483
All Sensors RMSE 0.820 0.799 0.749 1.217 1.310 1.034 0.811 0.797 0.689 1.388 1.383 0.603 0.967
MAE 0.545 0.477 0.432 0.727 0.977 0.864 0.554 0.500 0.386 0.977 0.568 0.364 0.614
Each Target RMSE 0.808 0.860 0.686 1.147 0.993 0.875 0.696 0.665 0.518 1.232 1.128 0.433 0.836
MAE 0.455 0.477 0.295 0.727 0.636 0.636 0.455 0.455 0.273 0.864 0.500 0.318 0.507
Table 7 presents the performance results for each model in predicting each target question, along with
the overall results (average RMSE and MAE values across all questions). Notably, the LOCF approach
performed the best for both tasks, indicating minimal variability between the ALSFRS-R scores of the
first and second evaluations. On the other hand, the Majority Class approach was the worst performer,
with RMSE values of 1.092 for Task 1 and 1.471 for Task 2. A potential reason for the classifiers’ overall
poor performance is that they were trained to predict the next score rather than specifically the second
score, making the models too general for this particular task.
Regarding Task 1, questions Q3 (bulbar domain) and Q10 (respiratory domain) had the lowest
prediction error using the LOCF approach (RMSE 0.218, MAE 0.048). Conversely, question Q9 (lower
limb domain) predictions were the poorest, with the best classifier being RF (RMSE 0.873, MAE 0.381).
For Task 2, the conclusions are similar to those of Section 4.1. Questions regarding the respiratory
domain, Q11, and Q12, were correctly predicted for all the evaluations. Particularly, the LOCF approach
correctly predicted all the scores of question Q11, and the LR and RF classifiers accurately predicted all
scores for question Q12 (RMSE 0.000, MAE 0.000). The most misclassified question was Q4 (upper limb
domain), with an RMSE of 1.044 and MAE of 0.545 using the best-performing model (LOCF).
5. Conclusion
In a fast-acting and debilitating disease like ALS, the ability to predict how it evolves can be critical for
clinical decision-making and life-prolonging therapy administration. Thus, the collection of sensor data
can be a valuable resource for improving prognosis prediction, as it provides continuous monitoring of
the patient’s physiological status. This information can complement the periodic clinical assessments
and possibly hint at the imminent occurrence of critical events, such as needing ventilation support.
Machine learning techniques allow for meaningful insight to be extracted from these large datasets,
which can potentially improve the performance of current prognosis prediction approaches or lead
to the development of new ones. In the iDPP@CLEF 2024 challenge, the main goal was to predict
the ALSFRS-R scores (both clinical and self-assessed) of a patient’s second assessment, given the first
assessment and the sensor records between evaluations.
Our methodology consisted of independent multi-class models, each predicting an ALSFRS-R question.
Four classification models were tested: Logistic Regression, Random Forest, XGBoost, and Support
Table 7
Results of the models. Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) metrics of four ML
classifiers and two naïve approaches across the 12 target questions. The classifiers include Logistic Regression
(LR), Random Forest (RF), Support Vector Classifier (SVC), and XGBoost. The naïve approaches are the Last
Observation Carried Forward (LOCF) and Majority Class. The performance metrics are provided for each target
question and averaged across all the questions (Global).
Model Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Global
Task 1
LOCF RMSE 0.488 0.309 0.218 0.690 0.535 0.577 0.488 0.535 0.951 0.218 0.309 0.577 0.491
MAE 0.143 0.095 0.048 0.286 0.286 0.333 0.238 0.190 0.429 0.048 0.095 0.238 0.202
Majority Class RMSE 1.512 0.976 1.512 1.254 1.113 1.34 1.327 1.175 1.690 0.309 0.378 0.724 1.092
MAE 0.857 0.476 0.762 0.814 0.762 0.905 0.810 0.810 1.238 0.0952 0.143 0.333 0.659
LR RMSE 1.000 0.756 0.756 0.900 0.787 1.000 1.024 0.873 0.926 0.816 0.378 0.845 0.838
MAE 0.619 0.381 0.381 0.524 0.524 0.714 0.762 0.571 0.571 0.476 0.143 0.429 0.508
RF RMSE 0.690 0.578 0.436 0.926 0.655 0.900 0.900 0.617 0.873 0.577 0.378 0.577 0.676
MAE 0.381 0.238 0.190 0.476 0.429 0.619 0.619 0.286 0.381 0.238 0.143 0.238 0.353
SVC RMSE 0.976 0.787 0.617 0.900 1.000 1.234 1.113 0.655 1.291 0.756 0.378 0.951 0.888
MAE 0.571 0.429 0.286 0.524 0.619 0.857 0.857 0.429 0.905 0.381 0.143 0.524 0.544
XGBoost RMSE 0.309 1.134 0.655 0.900 0.617 0.900 0.756 0.787 1.291 1.215 0.378 1.024 0.830
MAE 0.095 1.095 0.333 0.429 0.381 0.619 0.476 0.429 0.810 1.095 0.143 0.952 0.571
Task 2
LOCF RMSE 0.674 0.674 0.426 1.044 0.739 0.603 0.739 0.603 0.302 0.522 0.000 0.603 0.577
MAE 0.455 0.273 0.182 0.545 0.364 0.364 0.364 0.364 0.091 0.273 0.000 0.182 0.288
Majority Class RMSE 1.348 0.905 1.168 1.314 1.477 1.809 1.651 1.044 1.883 1.758 2.089 1.206 1.471
MAE 0.909 0.455 0.636 0.818 1.091 1.636 1.273 0.909 1.545 1.273 1.091 0.727 1.030
LR RMSE 0.798 0.790 0.905 1.168 1.168 0.953 0.674 0.798 0.603 1.279 1.537 0.000 0.890
MAE 0.455 0.455 0.455 0.818 0.818 0.727 0.455 0.636 0.364 0.909 0.727 0.000 0.568
RF RMSE 0.905 0.905 0.739 1.087 1.279 0.905 0.674 0.798 0.302 1.128 1.508 0.000 0.852
MAE 0.636 0.455 0.364 0.636 0.909 0.818 0.455 0.636 0.091 0.727 0.636 0.000 0.530
SVC RMSE 0.905 1.000 0.739 1.128 1.624 1.279 0.853 0.674 0.603 1.414 1.279 0.674 1.014
MAE 0.636 0.636 0.364 0.727 1.364 1.091 0.545 0.455 0.364 1.091 0.545 0.273 0.674
XGBoost RMSE 0.674 0.739 0.522 1.206 1.168 1.000 1.044 0.522 0.302 1.732 1.206 1.000 0.926
MAE 0.455 0.364 0.273 0.727 0.818 0.818 0.727 0.273 0.091 1.182 0.364 1.000 0.591
Vector Machine. The sensor data was handled first by deriving static features from the longitudinal
ones using summarization techniques, i.e., by calculating summary statistics within an observation
window before the target date. Then, the feature set was reduced using three methods: K-Best selection
across all questions, K-Best selection by question, and biclustering. These models were also compared
to baseline approaches Last Observation Carried Forward (LOCF) and Majority Class.
In both tasks, Random Forest yielded the best overall results but did not outperform LOCF, save for a
few individual questions. Additionally, there was no consensus regarding the best feature selection or
extraction approach. Independent K-Best selection and Biclustering were the best overall methods in
tasks 1 and 2, respectively. However, further research is needed to capture the temporal patterns of
sensors to fully understand their potential in tracking disease progression as measured by ALSFRS-R
scores.
Acknowledgments
This work was partially supported by Fundação para a Ciência e a Tecnologia (FCT) through project
AIpALS ref. PTDC/CCI-CIF/4613/2020 (https://doi.org/10.54499/PTDC/CCI-CIF/4613/2020), LASIGE Re-
search Unit, ref. UIDB/00408/2020 (https://doi.org/10.54499/UIDB/00408/2020) and ref. UIDP/00408/2020
(https://doi.org/10.54499/UIDP/00408/2020), and PhD Research Scholarships to RB (2022.10727.BD),
DFS ref. 2020.05100.BD (https://doi.org/10.54499/2020.05100.BD) and ENC ref. 2021.07810.BD (https:
//doi.org/10.54499/2021.07810.BD); and by BRAINTEASER project, which has received funding from
the European Union’s Horizon 2020 research and innovation program under grant agreement No.
101017598.
References
[1] L. C. Wijesekera, P. Nigel Leigh, Amyotrophic lateral sclerosis, Orphanet journal of rare diseases
4 (2009) 1–22.
[2] J. Morris, Amyotrophic lateral sclerosis (ALS) and related motor neuron diseases: an overview,
The Neurodiagnostic Journal 55 (2015) 180–194.
[3] S. R. Pfohl, R. B. Kim, G. S. Coan, C. S. Mitchell, Unraveling the complexity of amyotrophic lateral
sclerosis survival prediction, Frontiers in neuroinformatics 12 (2018) 36.
[4] J. M. Cedarbaum, N. Stambler, E. Malta, C. Fuller, D. Hilt, B. Thurmond, A. Nakanishi, BDNF ALS
Study Group, The ALSFRS-R: a revised ALS functional
rating scale that incorporates assessments of respiratory function, Journal of the neurological
sciences 169 (1999) 13–21.
[5] E. Beswick, T. Fawcett, Z. Hassan, D. Forbes, R. Dakin, J. Newton, S. Abrahams, A. Carson,
S. Chandran, D. Perry, et al., A systematic review of digital technology to evaluate motor function
and disease progression in motor neuron disease, Journal of Neurology 269 (2022) 6254–6268.
[6] R. P. van Eijk, J. N. Bakers, T. M. Bunte, A. J. de Fockert, M. J. Eijkemans, L. H. van den Berg,
Accelerometry for remote monitoring of physical activity in amyotrophic lateral sclerosis: a
longitudinal cohort study, Journal of neurology 266 (2019) 2387–2395.
[7] A. Maier, T. Holm, P. Wicks, L. Steinfurth, P. Linke, C. Münch, R. Meyer, T. Meyer, Online
assessment of als functional rating scale compares well to in-clinic evaluation: a prospective trial,
Amyotrophic Lateral Sclerosis 13 (2012) 210–216.
[8] S. A. Johnson, M. Karas, K. M. Burke, M. Straczkiewicz, Z. A. Scheier, A. P. Clark, S. Iwasaki,
A. Lahav, A. S. Iyer, J.-P. Onnela, et al., Wearable device and smartphone data quantify als
progression and may provide novel outcome measures, NPJ Digital Medicine 6 (2023) 34.
[9] J. W. van Unnik, M. Meyjes, M. R. J. van Mantgem, L. H. van den Berg, R. P. van Eijk, Remote
monitoring of amyotrophic lateral sclerosis using wearable sensors detects differences in disease
progression and survival: a prospective cohort study, Ebiomedicine 103 (2024).
[10] A. S. Gupta, S. Patel, A. Premasiri, F. Vieira, At-home wearables and machine learning sensitively
capture disease progression in amyotrophic lateral sclerosis, Nature Communications 14 (2023)
5080.
[11] M. Straczkiewicz, M. Karas, S. A. Johnson, K. M. Burke, Z. Scheier, T. B. Royse, N. Calcagno,
A. Clark, A. Iyer, J. D. Berry, et al., Upper limb movements as digital biomarkers in people with
als, EBioMedicine 101 (2024).
[12] F. G. Vieira, S. Venugopalan, A. S. Premasiri, M. McNally, A. Jansen, K. McCloskey, M. P. Brenner,
S. Perrin, A machine-learning based objective measure for als disease severity, NPJ digital medicine
5 (2022) 45.
[13] S. B. Rutkove, P. Narayanaswami, V. Berisha, J. Liss, S. Hahn, K. Shelton, K. Qi, S. Pandeya, J. M.
Shefner, Improved als clinical trials through frequent at-home self-assessment: a proof of concept
study, Annals of Clinical and Translational Neurology 7 (2020) 1148–1157.
[14] E. Tavazzi, E. Longato, M. Vettoretti, H. Aidos, I. Trescato, C. Roversi, A. S. Martins, E. N. Castanho,
R. Branco, D. F. Soares, et al., Artificial intelligence and statistical methods for stratification
and prediction of progression in amyotrophic lateral sclerosis: A systematic review, Artificial
Intelligence in Medicine (2023) 102588.
[15] J. Gordon, B. Lerner, Insights into amyotrophic lateral sclerosis from a machine learning perspective,
Journal of Clinical Medicine 8 (2019) 1578.
[16] A. S. Martins, M. Gromicho, S. Pinto, M. de Carvalho, S. C. Madeira, Learning prognostic models
using disease progression patterns: Predicting the need for non-invasive ventilation in amyotrophic
lateral sclerosis, IEEE/ACM Transactions on Computational Biology and Bioinformatics (2021).
[17] J. Matos, S. Pires, H. Aidos, M. Gromicho, S. Pinto, M. de Carvalho, S. C. Madeira, Unravelling
disease presentation patterns in als using biclustering for discriminative meta-features discovery,
in: International Work-Conference on Bioinformatics and Biomedical Engineering, Springer, 2020,
pp. 517–528.
[18] S. C. Madeira, A. L. Oliveira, Biclustering algorithms for biological data analysis: a survey,
IEEE/ACM transactions on computational biology and bioinformatics 1 (2004) 24–45.
[19] E. N. Castanho, H. Aidos, S. C. Madeira, Biclustering fmri time series: a comparative study, BMC
bioinformatics 23 (2022) 192.
[20] D. F. Soares, R. Henriques, M. Gromicho, M. de Carvalho, S. C. Madeira, Learning prognostic
models using a mixture of biclustering and triclustering: Predicting the need for non-invasive
ventilation in amyotrophic lateral sclerosis, Journal of Biomedical Informatics 134 (2022) 104172.
[21] R. Henriques, S. C. Madeira, Flebic: Learning classifiers from high-dimensional biomedical data
using discriminative biclusters with non-constant patterns, Pattern Recognition 115 (2021) 107900.
[22] R. Henriques, S. C. Madeira, Triclustering algorithms for three-dimensional data analysis: a
comprehensive survey, ACM Computing Surveys (CSUR) 51 (2018) 1–43.
[23] D. F. Soares, R. Henriques, S. C. Madeira, Comprehensive assessment of triclustering algorithms
for three-way temporal data analysis, Pattern Recognition (2024) 110303.
[24] D. F. Soares, R. Henriques, M. Gromicho, M. de Carvalho, S. C. Madeira, Triclustering-based
classification of longitudinal data for prognostic prediction: targeting relevant clinical endpoints
in amyotrophic lateral sclerosis, Scientific Reports 13 (2023) 6182.
[25] A. V. Carreiro, P. M. Amaral, S. Pinto, P. Tomás, M. de Carvalho, S. C. Madeira, Prognostic
models based on patient snapshots and time windows: Predicting disease progression to assisted
ventilation in amyotrophic lateral sclerosis, Journal of biomedical informatics 58 (2015) 133–144.
[26] G. Birolo, P. Bosoni, G. Faggioli, H. Aidos, R. Bergamaschi, P. Cavalla, A. Chiò, A. Dagliati, M. de
Carvalho, G. Di Nunzio, P. Fariselli, J. García Dominguez, A. G. Marta Gromicho, E. Longato,
S. Madeira, U. Manera, S. Marchesin, L. Menotti, G. Silvello, E. Tavazzi, E. Tavazzi, I. Trescato,
M. Vettoretti, B. D. Camillo, N. Ferro, Overview of iDPP@CLEF 2024: The Intelligent Disease
Progression Prediction Challenge, in: Working Notes of the Conference and Labs of the Evaluation
Forum (CLEF 2024), Grenoble, France, September 9th to 12th, 2024, 2024.
[27] G. Birolo, P. Bosoni, G. Faggioli, H. Aidos, R. Bergamaschi, P. Cavalla, A. Chiò, A. Dagliati, M. de
Carvalho, G. Di Nunzio, P. Fariselli, J. García Dominguez, A. G. Marta Gromicho, E. Longato,
S. Madeira, U. Manera, S. Marchesin, L. Menotti, G. Silvello, E. Tavazzi, E. Tavazzi, I. Trescato,
M. Vettoretti, B. D. Camillo, N. Ferro, Intelligent Disease Progression Prediction: Overview of
iDPP@CLEF 2024, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction -
15th International Conference of the CLEF Association, CLEF 2024, Grenoble, France, September
9-12, 2024, Proceedings, 2024.
[28] R. Branco, J. Valente, A. Martins, D. Soares, E. Castanho, S. Madeira, H. Aidos, Survival analysis
for multiple sclerosis: predicting risk of disease worsening, in: CLEF, 2023.
[29] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay,
Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–
2830.
[30] Y. Kluger, R. Basri, J. T. Chang, M. Gerstein, Spectral biclustering of microarray data: Coclustering
genes and conditions, Genome Research 13 (2003) 703–716. doi:10.1101/gr.648603.
[31] G. Lemaître, F. Nogueira, C. K. Aridas, Imbalanced-learn: A python toolbox to tackle the curse of
imbalanced datasets in machine learning, Journal of Machine Learning Research 18 (2017) 1–5.
URL: http://jmlr.org/papers/v18/16-365.html.
[32] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: synthetic minority over-sampling
technique, Journal of artificial intelligence research 16 (2002) 321–357.
[33] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter
optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2019.
[34] K. Sechidis, G. Tsoumakas, I. Vlahavas, On the stratification of multi-label data, Machine Learning
and Knowledge Discovery in Databases (2011) 145–158.
[35] P. Szymański, T. Kajdanowicz, A network perspective on stratification of multi-label data, in:
L. Torgo, B. Krawczyk, P. Branco, N. Moniz (Eds.), Proceedings of the First International Workshop
on Learning with Imbalanced Domains: Theory and Applications, volume 74 of Proceedings of
Machine Learning Research, PMLR, ECML-PKDD, Skopje, Macedonia, 2017, pp. 22–35.
[36] P. Szymański, T. Kajdanowicz, A scikit-based Python environment for performing multi-label
classification, ArXiv e-prints (2017). arXiv:1702.01460.