Prognosis Prediction in Covid-19 Patients from Lab Tests and X-ray Data through Randomized Decision Trees

Alfonso E. Gerevini*, Roberto Maroldi*†, Matteo Olivato*, Luca Putelli*, Ivan Serina*
* Università degli Studi di Brescia, † ASST Spedali Civili di Brescia
{alfonso.gerevini, roberto.maroldi, ivan.serina, m.olivato, l.putelli002}@unibs.it

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. AI and machine learning can offer powerful tools to help in the fight against Covid-19. In this paper we present a study and a concrete tool, based on machine learning, to predict the prognosis of hospitalised patients with Covid-19. In particular, we address the task of predicting the risk of death of a patient at different times of the hospitalisation, on the basis of demographic information, chest X-ray scores and several laboratory findings. Our machine learning models use ensembles of decision trees trained and tested using data from more than 2000 patients. An experimental evaluation of the models shows good performance in solving the addressed task.

1 Introduction

The fight against Covid-19 is a new and important challenge for the world, one that AI and machine learning can help to face at various levels [15, 28, 29]. In March 2020, at the time of the coronavirus emergency in Italy, we started working in close collaboration with one of the hospitals with the most Covid-19 patients in Italy, Spedali Civili di Brescia, to help predict the prognosis of hospitalised patients. Our work focused on the task of predicting the risk of death of a patient at different times of the hospitalisation. As discussed in [28], predicting whether a patient is at risk of death or adverse events can help the hospital, for instance, to organize the allocation of limited health resources in a more efficient way.
Our predictive models are built on the basis of demographic information (sex and age), the values of ten laboratory tests, and the chest X-ray score, an innovative measure developed and used at Spedali Civili di Brescia to assess the severity of the pulmonary conditions [3]. Other important information, such as the patient comorbidities or the time and duration of the symptoms related to Covid-19, was not used because it was not available to us.

Using raw data from more than 2000 patients, we built several datasets describing the "clinical history" of each patient during the hospitalisation. In particular, each dataset contains a "snapshot" of the infection conditions of every considered patient at a certain day after the start of the hospitalisation. For each dataset, we built a different predictor, allowing progressive predictions over time that take into account the evolution of the disease severity in a patient, which helps the formulation of a personalized prognosis. A change of the predicted risk over time for a patient could also hint at a link between specific events or treatments and the increase or decrease of the risk for the patient. As snapshot times for a patient, in our experiments we considered the 2nd, 4th, 6th, 8th and 10th hospitalisation day, and the day before the end of the hospitalisation.

Our datasets were engineered to cope with a number of practical issues, including missing values and feature value categorization, and to add some helpful artificial features. We also addressed the "concept drift" issue [6, 23], since we observed that the risk of death was clearly sensitive to the time period when the patient was hospitalised; the risk was significantly higher during the earlier period of the emergency (March 2020), when in northern Italy the spread of the virus infection was very high and many people were hospitalised. Moreover, given the very sensitive nature of our task, we introduced a threshold to discard the model predictions that have a low estimated probability. Such a threshold is a parameter that is automatically calculated and optimised during the training phase.

We considered several machine learning algorithms. A first experimental comparison of their performance on our datasets showed that methods based on forests of trees have the most promising performance, and so we decided to focus on this approach. The obtained prediction models perform well over a randomly chosen test set of 200 patients for each considered period, in terms of both F2 and ROC-AUC scores. In particular, overall the system makes very few errors in predicting patient survival, i.e., the specificity of the prediction is very high.

In the following, after discussing related work, we describe our datasets, present our prediction models and their experimental evaluation, and finally give conclusions and mention future work.

2 Related work

Artificial Intelligence and machine learning techniques can be used for tackling different aspects of the Covid-19 pandemic. However, given that the pandemic started only a few months ago, most works are still preliminary, often only pre-printed and not properly peer-reviewed, and lack a clear description of the developed techniques and of their results.

A preliminary study is presented in [15]. Given a set of only 53 patients with mild symptoms and their lab tests, comorbidities and treatment, the authors train several machine learning models (Logistic Regression, Decision Trees, Random Forests, Support Vector Machines, KNN) to predict whether a patient will be subject to more severe symptoms, obtaining a prediction accuracy score of up to 0.8 using 10-fold cross validation. The generalizability and strength of these results are questionable, given the very small set of considered patients.

Another example is the pre-printed work by Li Yan et al. [29], which uses lab tests for predicting the mortality risk; the proposed model is a very simple decision tree based on the three most important features. While the performance seems promising, the test set used for evaluation was very small (29 patients).

Various AI and machine learning techniques have been developed for prognosis and disease progression prediction [7] in the context of diseases other than Covid-19 [20, 21, 22]. In particular, in the last few years, several works about predicting mortality risk or adverse events and on the use of AI in critical care [19] have been published.
The survey in [1] presents a review of statistical and ML systems for predicting the mortality risk, the need for beds in intensive care units [30], or the length of the patient hospitalisation. In particular, it is worth mentioning the work by Harutyunyan et al. [11], which uses LSTM neural networks for predicting both the mortality risk and the length of the hospitalisation.

An overview of the issues and challenges of applying ML in a critical-care context is available in [16]. This work stresses the need to deal with corrupted data, such as missing values, imprecision, and errors, which can increase the complexity of prediction tasks.

Lab test findings and their variation over time are the main focus of the work by Hyland et al. [14], which describes a system that processes these data to generate an alarm predicting, 2 hours in advance, that a patient will have a circulatory failure.

3 Available Data Sources

During the Covid-19 outbreak, from February to April 2020, more than two thousand patients were hospitalised in the hospital Spedali Civili di Brescia. During their hospitalisation, the medical staff performed several exams in order to monitor the patients' conditions, check the response to treatments, verify the need to transfer a patient to the ICU, etc. We had data from a total of 2015 hospitalised patients; for each of these patients, the specific data that were made available to us are:

• the age and sex;
• the values and dates of several lab tests (see Table 1);
• the scores (each one from 0 to 18), assigned by the physicians, assessing the severity of the pulmonary conditions resulting from the X-ray exams [3];
• the values and dates of the throat-swab exams for Covid-19;
• the final outcome of the hospitalisation at the end of the stay, which is the classification value of our application (either in-hospital death, released survivor, or transferred to another hospital or rehabilitation center).

Table 1 specifies the considered lab tests, their normal range of values, and their median values in our set of patients.

Table 1: Lab tests performed during the hospitalisation. The second column shows the range which is considered clinically normal for a specific exam. The third column shows the median value extracted considering the lab test findings for our set of 2015 patients.

Lab test | Normal Range | Median Value
C-Reactive Protein (PCR) | ≤ 10 | 34.3
Lactate dehydrogenase (LDH) | [80, 300] | 280
Ferritin (Male) | [30, 400] | 1030
Ferritin (Female) | [13, 150] | 497
Troponin-T | ≤ 14 | 19
White blood cell (WBC) | [4, 11] | 7.1
D-dimer | ≤ 250 | 553
Fibrinogen | [180, 430] | 442
Lymphocyte (patients over 18 years old) | [20, 45] | 1.0
Neutrophils/Lymphocytes | [0.8, 3.5] | 4.9
Chest X-ray Score (RX) | < 7 | 8

We had no further information about symptoms, their timing, comorbidities, general health conditions or clinical treatments. Moreover, we have no CT images or text reports associated with the X-ray exams. The available information about whether a patient was or had been in the ICU was not clear enough to be used. Finally, of course, the names of the patients and of the involved medical staff were not provided.

3.1 Data Quality Issues

When applying machine learning to raw real-world data, there are some non-trivial practical issues to deal with, such as the quality of the available data and related aspects, which are especially important in biomedical applications given the very sensitive domain [12].

In our case, one such issue is that the length of the hospitalisation period can differ considerably from one patient to another (from a few days to two months), due to different reasons including the novelty and the characteristics of the disease, its high contagiousness, and the absence of an effective treatment. Therefore, the number of performed lab tests and relative findings varies significantly among the considered set of patients (from only three to hundreds).

Moreover, the lab tests and X-ray exams are not performed at a regular frequency due, e.g., to the different kinds and timing of the relative procedures, the need for different resources (X-ray machines, lab equipment, technical staff, etc.), or the different severity of the health conditions of the patients. For example, in our data we see that a patient can be tested for PCR every day and not be subject to a Ferritin exam for two weeks. This leads to the need of handling missing values and outdated values. When we consider a snapshot of a patient at a certain day, we have a missing value for a lab test (or X-ray) feature if that test (X-ray) has not been performed. We have an outdated value for a feature if the corresponding lab test (X-ray) was performed several days earlier: since in the meanwhile the disease has progressed, the findings of the lab test could be inconsistent with the current conditions of the patient, and so they could mislead the prediction.

Data quality issues arise especially for patients hospitalised in the period of the highest emergency, when several hundreds of patients were in the hospital at the same time.

3.2 Concept Drift

An examination of the data available for our cohort of patients revealed that their prognostic risk is influenced by multiple factors, such as the number of patients currently hospitalised and the consequent availability of ICU beds or other resources, the experimentation of new therapies, and the increase of clinical knowledge. In machine learning, this change of data distribution is known as concept drift [6, 23]. A classical method to deal with this problem is training the algorithm using only a subset of samples, depending on the data distribution that we are considering [6, 24].

For this reason, we divided the considered set of patients into two groups: the High Contagion Phase (HCP) group, composed of the patients admitted during the last weeks of February and the first weeks of March (the most critical period of the pandemic outbreak in Italy), and the Moderate Contagion Phase (MCP) group, composed of the patients admitted from the last ten days of March to the end of April.
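As a minimal illustration, such a grouping can be expressed as a simple date cutoff. The sketch below is ours, not the authors' code: the DataFrame layout (an admission_date column) and the exact cutoff date (March 21, approximating "the last ten days of March") are assumptions.

```python
import pandas as pd

def split_by_contagion_phase(patients: pd.DataFrame,
                             cutoff: str = "2020-03-21"):
    """Split a patient table into HCP (admitted before `cutoff`)
    and MCP (admitted on or after `cutoff`) groups.

    The column name `admission_date` and the cutoff are hypothetical.
    """
    admitted = pd.to_datetime(patients["admission_date"])
    hcp = patients[admitted < pd.Timestamp(cutoff)]
    mcp = patients[admitted >= pd.Timestamp(cutoff)]
    return hcp, mcp
```

Training one model per group is a simple instance of the subset-selection strategy for concept drift mentioned above: each model sees only samples drawn from one data distribution.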
The main differences between these groups of patients are:

1. the mortality rate of the HCP patients is about twice the mortality rate of the MCP patients;
2. in HCP patients the median value of the hospitalisation period is 8 days, while in MCP patients it is 14 days; further details are given in Figure 1;
3. for many of the considered lab tests, the mortality rate associated with having values in a particular range changes significantly between the two groups. For example, in HCP patients the mortality rate for the patients with a PCR value 10 times above the normal range is 40.1%, while in MCP patients it is 21.1%.

Figure 1: Length of stay in hospital (left) and weekly death rate histograms for the High Contagion Phase (in blue) and for the Moderate Contagion Phase (in orange). On the x-axis, for the length of stay we indicate the range of days, and for the death rate we indicate the week when the patient was released. On the y-axis we indicate the percentage of patients.

These differences clearly indicate that the data in the HCP and MCP groups represent different target (concept) functions; therefore predicting mortality during the high infection phase and during the moderate phase can be considered two different tasks. If we had only the patients hospitalised during the high infection phase, using these data for training an algorithm that predicts the mortality during the moderate phase would lead to many errors.

In our case, we generated two different systems, one for each of the two groups of patients. We are currently investigating ways to automatically select the set of patients for training, starting from the latest ones and keeping the less recent ones until we find significant changes in the mortality rate or in the data distribution.

4 Datasets for Training and Testing

The main task of our work is to provide survival/death predictions at different days of the patient hospitalisation, according to the current patient conditions as reflected by the available lab findings and X-ray scores. In this section we describe the specific extracted features and the (training and testing) datasets that we built for this purpose.

4.1 Pre-processing and Feature Extraction

The issues presented in Section 3.1 compel us to perform a robust pre-processing phase, with the goal of extracting features that summarize the patients' conditions so that a machine learning algorithm can process them. The pre-processing is applied to both the HCP and MCP data.

Given that we have no information about the survival or death of a patient after a transfer (which can be due to limited availability of beds or ICU places), we exclude from our training and test sets the 142 patients who were admitted to Spedali Civili di Brescia and then transferred to another hospital. However, the 74 patients who were transferred to a rehabilitation center can be considered not at risk of death; therefore we include them in our datasets and consider these transferred patients as released alive.

4.1.1 Patient Snapshot and Feature Engineering

In order to provide a prediction for a patient at different hospitalisation times, we introduced the concept of patient snapshot to represent the patient's health conditions at a given day.

In this snapshot, for each lab test of Table 1, we consider its most recent value. In the ideal case, we would know the lab test findings at every day. However, as explained in Section 3.1, in a real-world context the situation is very different. For example, in our data, if we take a snapshot of a patient 14 days after the admission into the hospital, we have cases with very recent values of PCR, LDH or WBC (obtained one or a few days before), very old values for Fibrinogen or Troponin-T (obtained on the first day of the hospitalisation), and even no value for Ferritin.

Given the difficulty of setting a predefined threshold that separates recent and old values of the lab tests (e.g., for Fibrinogen and Troponin-T), we choose to always use the most recent value, even if it could be outdated. In order to allow the learning algorithm to capture that a value may not be significant to represent the current status of the patient (because too old), we introduce a feature called ageing for each test finding. If a lab test was performed at day d0, and the snapshot of a patient is taken at day d1, the ageing is defined as the number of days between d1 and d0. If there is no available value for a lab test, its ageing is considered a missing value.
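A sketch of how the most recent finding and its ageing could be derived for one lab test follows. This is our illustration, not the paper's code; the representation of findings as a date-sorted list of (day, value) pairs is an assumption.

```python
from datetime import date

def latest_value_and_ageing(findings, snapshot_day: date):
    """Return (value, ageing) for one lab test at a given snapshot day.

    `findings` is a hypothetical list of (day, value) pairs sorted by day.
    Returns (None, None) when the test was never performed before the
    snapshot: both the value and its ageing are missing.
    """
    past = [(d, v) for d, v in findings if d <= snapshot_day]
    if not past:
        return None, None              # missing value, missing ageing
    d0, value = past[-1]               # most recent finding, even if outdated
    ageing = (snapshot_day - d0).days  # days between snapshot and the test
    return value, ageing
```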
For example, we divide the D-Dimer vales into 4.1 Pre-processing and Feature Extraction 6 categories: the normal range, up to 2 times the maximum value of the normal range, up to 4 times, 6 times, 10 times and over 10 The issues presented in Section 3.1 compel us to a robust pre- times. The categorical form could help the algorithm to have a clearer processing phase with the goal of extracting features in order to sum- understanding of the data and improve performance. marize the patients conditions and process them by a machine learn- Monitoring the conditions of a patient means knowing not only the ing algorithm. The pre-processing is applied to both HCP and MCP patient status at a specific time, but also how the conditions evolve data. during the hospitalisation. For this purpose, we introduce a feature Given that we have no information about the survival or the de- called trend that is defined as follows: cease of a patient after a transfer (which can be due to limited avail- ability of beds or ICU places), we exclude from our training and test For each lab test, if there is no available value for a lab test or if set the 142 patients which were admitted in Spedali Civili di Bres- the patient has not performed the lab test at least two times, the cia and then transferred to another hospital. However, the 74 patients trend is a missing value. Otherwise, given the values v1 and v2 who were transferred to a rehabilitation center can be considered not of the findings for the lab test performed at days d1 and d2 and a at risk of death; therefore we include them in our datasets and con- threshold T that we set to 15% of v1 , if v2 > (1 + T ) ∗ v1 , then sider the transferred patients as released alive. the trend is increasing, while if v2 < (1 − T ) ∗ v1 the trend is decreasing; otherwise the trend is stable. 5.1 Classification Algorithms We distinguish two types of trends: the start trend, that uses the Decision trees distance between the most recent value and the first available value, Decision Trees [25] are one of the most popular learning methods and the last trend, that uses the distance between the last one and for solving classification tasks. In a decision tree, the root and each the penultimate one. We are currently investigating techniques for internal node provides a condition for splitting the training samples including more than two values in the trend calculation. into two subsets depending on whether the condition holds for a sam- To summarize, for each lab test in a patient snapshot, we have the ple or not. In our context, for each numerical feature f , a candidate most recent finding and the relative ageing and trend, as well as the splitting condition is f ≤ C, where C is called cut point. The final static features age and sex. splitting condition is chosen by finding the f and C providing the best split according to one of some possible measures like Informa- 4.2 Training and Test Sets Generation tion Gain, Entropy index or Gini index. A subset of samples at a tree node can either be split again by In this section we describe how we generated the training and test further feature conditions forming a new internal node, or form a sets for the purpose of predicting, at different days from the start of leaf node labelled with a specific classification (prediction) value; in the patient hospitalization, the final outcome of her/his stay. 
4.2 Training and Test Sets Generation

In this section we describe how we generated the training and test sets for the purpose of predicting, at different days from the start of the patient's hospitalisation, the final outcome of her/his stay.

First, for both the HCP and MCP sets, we used stratified sampling to select 80% of the patients for training the models and 20% for testing them. Then, we created specific training and test sets for each element in a sequence of times when the model is used to make the prediction¹:

• 2 days of hospitalisation. We include all the patients' snapshots containing the first values of each lab test conducted in the first two days after the hospital admission. Note that if a patient has performed a lab test more than once in the first two days, the snapshot considers the oldest value. In fact, the purpose of the model we want to build is to provide the prediction as soon as possible, with the first information available. Furthermore, in these snapshots the ageing and trend values are not included.
• 4 days and 6 days of hospitalisation. In these cases, the corresponding snapshots also contain the ageing and trend features, and the lab values are the most recent ones in the available data. Given that only a few days have passed since admission, we consider the start trend.
• 8 days and 10 days of hospitalisation. The procedure for creating the corresponding snapshots is the same as for the 4-day and 6-day snapshots, except that we consider the last trend instead of the start trend.
• End day (the last day before the patient's release or decease). In this case, for each lab test the snapshot includes both the start trend and the last trend features.

¹ While we chose 2, 4, 6, 8, 10 days after the hospitalisation, plus the day before the patient's release, of course other sequences could be considered.

It is important to observe that, while the datasets of the later days contain more information about the single patients (more lab test findings, fewer missing values), the overall number of patients in the datasets decreases as the prediction day increases. This is due to the fact that more patients are released or die within longer periods of hospitalisation, and therefore such patients are not included in the corresponding datasets.

Finally, note that the splitting of the data between training and testing is done only once, considering all patients. Thus if, for instance, a patient belongs to the training set of 2 days, then it does not belong to the test set of the following days.

5 Machine Learning Algorithms

In this section we briefly describe the machine learning algorithms used in our prognosis prediction system.

5.1 Classification Algorithms

Decision trees. Decision Trees [25] are one of the most popular learning methods for solving classification tasks. In a decision tree, the root and each internal node provide a condition for splitting the training samples into two subsets, depending on whether the condition holds for a sample or not. In our context, for each numerical feature f, a candidate splitting condition is f ≤ C, where C is called the cut point. The final splitting condition is chosen by finding the f and C providing the best split according to a measure such as Information Gain, entropy, or the Gini index.

A subset of samples at a tree node can either be split again by further feature conditions, forming a new internal node, or form a leaf node labelled with a specific classification (prediction) value; in our application domain the label is either the alive class or the dead class. Let us consider a decision tree with a leaf node l and a subset S of associated training samples. A test instance X that reaches l from the tree root is classified (predicted) y with probability

P(y|X) = \frac{TP}{TP + FP}

where TP (True Positives) is the number of training samples in S that have class value y, and FP (False Positives) is the number of samples in S that do not have class value y [5]. Given that in our task we have only two classes (y and ȳ), P(ȳ|X) = 1 − P(y|X). The classification outcome of a decision tree for X is the class value with the highest probability.

Random Forests. Random Forests (RF) [4] is an ensemble learning method [32] that builds a number of decision trees at training time. For building each individual tree of the random forest, a randomly chosen subset of the data features is used. While in the standard implementation of random forests the final classification label is provided using the statistical mode of the class values predicted by each individual tree, in the well-known tool Scikit-Learn [18], which we used for our system implementation, the probability of the classification output is obtained by averaging the probabilities provided by all trees. Hence, given a random forest with n decision trees, a class (prediction) value y is assigned to an instance X with the following probability:

P(y|X) = \frac{1}{n} \sum_{i=1}^{n} P_i(y|X)
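In Scikit-Learn [18] this averaging is exactly what RandomForestClassifier.predict_proba computes: the forest probability equals the mean of the per-tree leaf probabilities. A small self-contained check on synthetic data (our sketch, not part of the paper's system):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, standing in for patient snapshots.
X, y = make_classification(n_samples=200, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# The forest probability P(y|X) is the average of the per-tree probabilities.
forest_proba = rf.predict_proba(X[:1])
mean_tree_proba = np.mean([t.predict_proba(X[:1]) for t in rf.estimators_],
                          axis=0)
assert np.allclose(forest_proba, mean_tree_proba)
```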
Extra Trees. Extremely Randomized Trees (Extra Trees or ET) [8] are another ensemble learning method based on decision trees. The main differences between Extra Trees and Random Forests are:

• In the original description of Extra Trees [8], each tree is built using the entire training dataset. However, in most implementations of Extra Trees, including Scikit-Learn [18], the decision trees are built exactly as in Random Forests.
• In standard decision trees and Random Forests, the cut point is chosen by first computing the optimal cut point for each feature, and then choosing the best feature for branching the tree. In Extra Trees, instead, the algorithm first randomly chooses k features and then, for each chosen feature f, randomly selects a cut point C_f in the range of the possible f values. This generates a set of k pairs {(f_i, C_i) | i = 1, ..., k}. Then, the algorithm compares the splits generated by each pair (i.e., under the split test f_i ≤ C_i) to select the best one using a split quality measure such as the Gini index.

The probability P(y|X) of assigning a class value y to an instance X is computed as in Random Forests (see the equation above).

5.2 Hyperparameter Search

Most machine learning algorithms have several hyperparameters to tune, such as, in a Random Forest, the number of decision trees to create and their maximum depth. Since in our application handling missing values is an important issue, we also used a hyperparameter for this with three possible settings: a missing value is set to either the average value, the median value, or a special constant (−1).

In order to find the best performing configuration of the hyperparameters, we used the Random Search optimization approach [2], which consists of the following main steps (a sketch follows the list):

1. We divide our training sets into k folds, with either k = 10 or k = 5, depending on the dimension of the considered dataset.
2. For each randomly selected combination of hyperparameters, we run the learning algorithm in k-fold cross validation.
3. For each fold, we evaluate the performance of the algorithm with that configuration using the Macro F-β score metric with β = 2. The F-β score is the weighted harmonic mean of the precision and recall measures, where the β parameter indicates how many times the recall is more important than the precision:

F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}

We choose β = 2 in order to give particular importance to false negatives, i.e., those patients whom our system could not identify as at risk of death. Given that we can compute the F2 score for both the alive class and the dead class, we considered the Macro F2 score, which is the arithmetic mean of the scores for the two classes.
4. The overall evaluation score of the k-fold cross validation for a configuration of the parameters is obtained by averaging the scores obtained for each fold.
5. The hyperparameter configuration with the best overall score is selected.
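With Scikit-Learn, steps 1-5 correspond closely to RandomizedSearchCV with a Macro F2 scorer. The sketch below is illustrative only: the sampled hyperparameter ranges and n_iter are ours, not the paper's, and the missing-value setting (average/median/−1) would have to be handled in a separate preprocessing step, since it is not a RandomForestClassifier parameter.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Macro F2: F-beta with beta=2, averaged over the alive and dead classes.
macro_f2 = make_scorer(fbeta_score, beta=2, average="macro")

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={
        "n_estimators": randint(50, 500),   # illustrative ranges
        "max_depth": randint(3, 20),
    },
    n_iter=100,      # the paper reports 4096 random configurations
    scoring=macro_f2,
    cv=5,            # k = 5 or 10 depending on dataset size
)
# search.fit(X_train, y_train); search.best_params_ is the selected config.
```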
5.3 Handling Prediction Uncertainty

The output for an instance X of every generated classification model is an array of two probabilities, P(alive|X) and P(dead|X), defined as described in Section 5.1. We can see them as "degrees of certainty" of the prediction: the higher the probability, the more reliable the prediction. Given the very sensitive nature of our task, the system discards potential predictions supported by a low probability. This is achieved using a prediction threshold under which the system considers the prediction uncertain (and the patient risk unpredictable). Note that if we used a threshold value that is too high, many patients could be classified uncertain, and our model would be much less useful for clinical practice. To avoid this, at training time we impose a maximum percentage of samples that can be considered uncertain (unpredictable); we implemented this with an input parameter called max_u. For our experimental analysis we used max_u = 25%.

We designed an algorithm called FindUncertainThreshold that is used in the training phase to decide the threshold and optimize the prediction performance on the training samples that pass it, under the max_u constraint. The pseudocode of the algorithm is in Figure 2.

Figure 2: Pseudocode of algorithm FindUncertainThreshold, which computes, during the training phase, an optimised prediction threshold under which the model labels an instance as uncertain.

Input:
– L: array of labels (alive or dead), with L[i] the label of sample i of the validation data (fold);
– P = [p_i = (p_alive, p_dead)_i | i is the sample index in the validation set];
– max_u: the maximum percentage of the samples in the validation set that can be labeled as uncertain (not predictable);
– n: the maximum number of thresholds to try;
– EvaluateScore: the score function to maximize by dropping the uncertain samples.
Output: a pair (v, th) where v is the score function value after dropping the uncertain samples and th is the optimized threshold value.

1   L_pred ← array of labels such that L_pred[i] is the predicted label (the label with highest probability) of validation sample i;
2   P_max ← [max(p_alive, p_dead)_i | (p_alive, p_dead)_i ∈ P];
3   v ← EvaluateScore(L, L_pred);
4   th ← min value in P_max;
5   δ ← [(max value in P_max) − (min value in P_max)] / n;
6   for i ← 0 to n − 1 do
7       th′ ← (min value in P_max) + i · δ;
8       S ← {i | i is a sample id such that P_max[i] > th′};
9       u ← 1 − (|S| / |P_max|);
10      if u ≥ max_u then return (v, th);
11      L′ ← array of labels such that L[i] is the label of validation sample i and i ∈ S;
12      L′_pred ← array of labels such that L_pred[i] is the predicted label of validation sample i and i ∈ S;
13      v′ ← EvaluateScore(L′, L′_pred);
14      if v′ > v then
15          th ← th′;
16          v ← v′;
17      end
18  end

Given the original labels L of the validation samples and their prediction probabilities P derived by the learning algorithm, FindUncertainThreshold first computes: the predicted labels L_pred (i.e., the class values with highest probabilities) and the relative P_max probabilities; the original score v obtained using the input score function on all samples; and an initial value of the threshold th, set to the minimum probability in P_max.

The next loop finds an optimal value of the threshold th and computes the score function for the validation set reduced to the validation samples whose predicted labels have probabilities above th. The considered threshold values are obtained using the δ-increments defined at lines 5 and 7. First we compute the new threshold th′ by increasing the current threshold by δ, and then we derive the set S of sample ids with prediction probabilities higher than th′. Next we compute the percentage u of samples labeled as uncertain using threshold th′. If u ≥ max_u, we terminate, returning the current best score v and the corresponding threshold value th (a greater threshold value could only label more samples as uncertain). Otherwise (u < max_u), we compute the correct sample labels L′ and the predicted sample labels L′_pred for the samples identified by S, and we compute the new score value v′ using L′ and L′_pred. If v′ is a better score than v, we update both the threshold and the score values.
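A direct Python transcription of the pseudocode follows (our sketch, not the authors' implementation; encoding the labels as column indices of P is an assumption):

```python
import numpy as np

def find_uncertain_threshold(L, P, max_u, n, evaluate_score):
    """Sketch of FindUncertainThreshold (Figure 2).

    L: true labels as column indices (0 = alive, 1 = dead, an assumed
    encoding); P: per-sample (p_alive, p_dead) pairs; max_u: maximum
    fraction of samples that may be dropped as uncertain; n: number of
    candidate thresholds; evaluate_score: e.g. the Macro F2 score.
    """
    P, L = np.asarray(P), np.asarray(L)
    L_pred = P.argmax(axis=1)          # line 1: label with highest probability
    p_max = P.max(axis=1)              # line 2
    v = evaluate_score(L, L_pred)      # line 3
    th = p_max.min()                   # line 4
    delta = (p_max.max() - p_max.min()) / n   # line 5
    for i in range(n):                 # lines 6-18
        th_new = p_max.min() + i * delta
        keep = p_max > th_new          # S: samples still considered certain
        u = 1.0 - keep.mean()          # fraction labelled uncertain
        if u >= max_u:
            return v, th
        v_new = evaluate_score(L[keep], L_pred[keep])
        if v_new > v:
            th, v = th_new, v_new
    return v, th
```

A Macro F2 score function can be passed in as, e.g., `lambda y, yp: fbeta_score(y, yp, beta=2, average="macro")`.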
FindUncertainThreshold is executed during the training phase. In particular, during the hyperparameter search, for each attempted hyperparameter configuration, we compute through FindUncertainThreshold an optimized threshold and the relative score function value. These two values are obtained by averaging the optimal thresholds and corresponding scores over all folds of the cross validation for the attempted configuration. The hyperparameter search returns the best configuration together with the relative (averaged) threshold.
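At prediction time, the averaged threshold can then be applied by discarding any prediction whose highest class probability does not exceed it. A minimal sketch of this step (our illustration, assuming a Scikit-Learn-style classifier):

```python
def predict_with_uncertainty(model, X, threshold):
    """Label instances, marking as 'uncertain' those whose highest
    class probability is not above the learned threshold."""
    proba = model.predict_proba(X)
    labels = proba.argmax(axis=1)
    uncertain = proba.max(axis=1) <= threshold
    return [("uncertain" if u else lbl) for lbl, u in zip(labels, uncertain)]
```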
6 Experimental Evaluation and Discussion

In this section, we evaluate the performance of the machine learning models that we built. Our system was implemented using the Scikit-Learn [18] library for Python, and the experimental tests were conducted using an Intel(R) Xeon(R) Gold 6140M CPU @ 2.30GHz.

The performance of the learning algorithms with the relative optimized hyperparameters was evaluated on the test set in terms of F2 score and ROC-AUC score. The second metric is defined as the area under the Receiver Operating Characteristic curve, which plots the true positive rate against the false positive rate; it also takes into account the probability that the predictive system produces false positives (i.e., false alarms). This metric is a standard method for evaluating medical tests and risk models [9, 10].

In a preliminary study we examined various machine learning approaches and compared their average performances over the HCP datasets. Figure 3 shows a summary of the relative performance in terms of F2 score. We considered Decision Trees [25], Extra Trees (ET) [8], Gaussian Naive Bayes [31], Multilayer Perceptron with two layers (MLP) [13], Quadratic Discriminant Analysis [26], Random Forests (RF) [4] and Support Vector Machines [27]. The best performance was obtained with RF and ET. The MLP and SVM performed much worse and with a much higher variability over the datasets, probably related to the missing values and the scarcity of data. For the MCP datasets the relative performance was similar. Given the observed better performance of RF and ET, we focused the evaluation of our system on these learning algorithms.

Figure 3: Average performance (F2 score) of seven machine learning algorithms for the HCP datasets. The line over each bar represents the standard deviation.

Regarding the training time, including the hyperparameter search over 4096 random configurations and the optimization of the uncertainty threshold, for any specific dataset (e.g., the MCP numerical dataset for 2 days) the overall training time is between 20 and 30 minutes. Therefore, we can build all four of the most promising models, generated by RF and ET using the numerical version (RF-N, ET-N) or the categorical version (RF-C, ET-C) of the dataset, in less than two hours, and then select the best performing model among them. It is also worth noting that in our system the models for predicting the prognostic risk at different days are completely independent from each other, and so we can consider the prediction tasks at different days as different tasks.

In Figure 4 and in Table 2 we show the performance of our system at each considered day for both the High Contagion Phase and the Moderate Contagion Phase. As we can see, we obtain promising results in terms of F2 score for an early evaluation of the risk during the HCP (with score 77.1% at day 2), while we encounter some problems at the 6th and 10th days. For the MCP datasets, the system performs better at the later days; in particular, for the 10th day the F2 score is 80.4% and the ROC-AUC is 90.2%. For HCP, both RF and ET obtain good results with both the numerical and the categorical versions of the datasets. Instead, for MCP, using the categorical datasets does not give good performance, and we do not observe an improvement for the later prediction days (the F2 score is always below 70%).

In all but one case, the models using the uncertainty threshold increase the performance in terms of both F2 and ROC-AUC scores. In particular, in the most problematic cases of HCP, such as the 6-day and 10-day datasets, the prediction performance improves by over 7 F2 points. The improvement is less significant for MCP.

Note that, while the threshold value under which the system labels an instance (patient risk) as uncertain is derived at training time by imposing a maximum percentage of uncertain samples (we used 25%), there is no formal guarantee that this percentage limit is satisfied on the test set. However, in most cases the percentage of uncertain test samples (indicated with % Unc in Table 2) is well below the limit imposed during training, except for the test set of the 6th day in HCP, where the unpredicted (labelled as uncertain) patients are 26.1%. The performance for the "end" dataset is good for both HCP and MCP even without omitting the uncertain patients (F2 score 86.6% for HCP, and 86.9% for MCP).

Figure 4 gives a graphical comparison of the performance of our system for HCP and MCP in terms of F2 and ROC-AUC. The performance behaviour over time differs significantly in the two contagion periods, reflecting the concept drift discussed in Section 3.2. For HCP, considering the results without omitting the uncertain test instances (blue curves), the prediction performance is very good at the 2nd day and decreases at the 6th and 10th days. Instead, for MCP the performance improves over time, reaching 90.2% in terms of ROC-AUC at the 10th day, as also reported in Table 2. This is due to several factors:

• MCP includes patients with hospitalisation periods much longer than those of the patients in HCP, which can make it more difficult to predict the mortality risk for some patients with only a few days of hospitalisation;
• on the contrary, in HCP half of the patients stayed in hospital for less than 8 days. This significantly decreases the size of the 8-day and 10-day training sets, which contain respectively only 431 and 339 patients.
The lack of training data in these datasets is only partially compensated by the larger number of lab test findings per patient;
• as described in Section 3.2, the MCP patients are much more unbalanced (with only 11% deceased patients) than the HCP patients, and this increases the difficulty of learning a high-performing model [17].

Figure 4: Graphical representation of the prediction performance (F2 and ROC-AUC scores) over hospitalisation time for HCP and MCP.

Table 2: Predictive performance for the High Contagion Phase (HCP, top) and the Moderate Contagion Phase (MCP, bottom) in terms of F2 and ROC-AUC scores, considering all instances in the test set (columns F2 and ROC) and omitting the instances classified uncertain (columns F2-U and ROC-U). The percentages of instances that the system classifies as uncertain are in the column % Unc. Column Model indicates the method selected for generating the model; ET stands for Extra Trees, RF for Random Forests, C for categorical and N for numerical.

HCP data | F2   | ROC  | F2-U | ROC-U | % Unc | Model
2 days   | 77.1 | 77.8 | 80.1 | 83.3  | 18.3  | ET-C
4 days   | 74.1 | 79.4 | 76.7 | 81.9  | 13.8  | RF-N
6 days   | 68.7 | 75.6 | 75.9 | 83.6  | 26.1  | RF-N
8 days   | 74.8 | 76.5 | 78.2 | 82.5  | 22.1  | ET-C
10 days  | 68.9 | 75.5 | 80.6 | 83.9  | 24.8  | RF-C
end      | 86.6 | 89.4 | 94.3 | 95.5  | 19.3  | RF-C

MCP data | F2   | ROC  | F2-U | ROC-U | % Unc | Model
2 days   | 60.0 | 75.4 | 61.0 | 78.1  | 13.9  | ET-N
4 days   | 63.5 | 78.5 | 65.4 | 82.4  | 21.1  | RF-N
6 days   | 74.1 | 86.0 | 77.2 | 88.1  | 9.8   | ET-N
8 days   | 73.2 | 85.0 | 76.1 | 86.5  | 12.3  | ET-N
10 days  | 80.4 | 90.2 | 75.3 | 89.0  | 12.7  | ET-N
end      | 86.9 | 93.9 | 95.8 | 98.4  | 19.4  | RF-N

Figure 5 shows the confusion matrices for the test sets generated using our predictive models. Above the line we have the HCP datasets and below the MCP datasets. Although the training phase was optimised (through the use of the F2 metric) to avoid false negatives, for the HCP datasets there are several false negatives (bottom-left of the matrices). This can be explained by the scarcity of lab test and X-ray data in the HCP data, which affects prediction.

However, false negatives are significantly reduced with the models that can classify a patient as uncertain. For example, at day 6, the system classifies as uncertain 4 patients who otherwise would be false negatives. Moreover, when there are fewer false negatives, such as at days 8 and 10, classifying some patients as uncertain also helps to avoid false positives and so to generate fewer false alarms.

Remarkably, especially for the MCP datasets, we have very few false negatives even at the early days, which is quite important in our application context. On the other hand, especially for days 2 and 4, our system produces many false positives. This type of error is reduced in the models with uncertain patients, down to only 5 false alarms for the end dataset (e.g., at day 2 we avoid 16 false positives).

Figure 5: Confusion matrices for the HCP datasets (above the line) and the MCP datasets (below the line) at different days, with dead/alive predictions for all patients (Complete) and omitting patients classified uncertain (No Unc). In each 2×2 matrix, the main diagonal holds the correct predictions (alive class in the top-left corner and dead class in the bottom-right corner); the anti-diagonal holds the incorrect predictions (false positives in the top-right corner and false negatives in the bottom-left corner).
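The scores and matrices reported above correspond to standard Scikit-Learn metrics. A minimal sketch of the test-set evaluation (our illustration; the variable names are assumptions, with y_pred the predicted labels and p_dead the predicted probabilities of the dead class):

```python
from sklearn.metrics import confusion_matrix, fbeta_score, roc_auc_score

def evaluate(y_test, y_pred, p_dead):
    """Compute the three figures reported in Table 2 and Figure 5."""
    return {
        "Macro F2": fbeta_score(y_test, y_pred, beta=2, average="macro"),
        "ROC-AUC": roc_auc_score(y_test, p_dead),
        # Scikit-Learn layout for binary labels {0, 1}: [[TN, FP], [FN, TP]]
        "confusion": confusion_matrix(y_test, y_pred),
    }
```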
7 Conclusions and Future Work

We have presented a system for predicting the prognosis of Covid-19 patients, focusing on the death risk. We built and engineered datasets from lab test and X-ray data of more than 2000 patients in a hospital in northern Italy that was severely hit by Covid-19. Our predictive system uses a collection of machine learning algorithms and a new method for setting, at training time, an uncertainty threshold for prediction that helps to significantly reduce the prediction errors.

Overall, the experimental results are quite promising, and show that our system often obtains high ROC-AUC scores. The observed predictive performance is especially good in terms of false negatives (patients erroneously predicted survivors), which are very few. This gives a predictive test for patient survival with very good specificity, in particular when the system can classify a patient as uncertain. On the other hand, in terms of false positives, there is room for significant improvements. We are confident that the availability of more information, such as patient comorbidities or clinical treatments, will help to improve performance, reducing the number of both false positives and (the few) false negatives.

For future work we plan to extend our datasets with more information (both additional features and patients), to consider further methods for dealing with the observed concept drift, and to address other prediction tasks such as the duration of the hospitalisation or the need for ICU beds and critical hospital resources. Moreover, we are analyzing the importance of the features used in our models, and we intend to investigate additional learning techniques.

Acknowledgements. The work of the first author has been supported by Fondazione Garda Valley.

REFERENCES

[1] Aya Awad, Mohamed Bader-El-Den, and James McNicholas, 'Patient length of stay and mortality prediction: A survey', Health Services Management Research, 30(2), 105–120, (2017). PMID: 28539083.
[2] James Bergstra and Yoshua Bengio, 'Random search for hyper-parameter optimization', Journal of Machine Learning Research, 13(Feb), 281–305, (2012).
[3] Andrea Borghesi and Roberto Maroldi, 'Covid-19 outbreak in Italy: experimental chest X-ray scoring system for quantifying and monitoring disease progression', La radiologia medica, (05 2020).
[4] Leo Breiman, 'Random forests', Machine Learning, 45(1), 5–32, (2001).
[5] N. V. Chawla, 'Evaluating probability estimates from decision trees', in Proc. AAAI Workshop on Evaluation Methods for Machine Learning, Boston, MA, pp. 18–23, (2006).
[6] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia, 'A survey on concept drift adaptation', ACM Comput. Surv., 46(4), (March 2014).
[7] Alfonso Emilio Gerevini, Alberto Lavelli, Alessandro Maffi, Roberto Maroldi, Anne-Lyse Minard, Ivan Serina, and Guido Squassina, 'Automatic classification of radiological reports for clinical care', in Proceedings of the 16th Conference on Artificial Intelligence in Medicine, AIME 2017, Vienna, Austria, June 21-24, 2017, volume 10259 of Lecture Notes in Computer Science, pp. 149–159. Springer, (2017).
[8] Pierre Geurts, Damien Ernst, and Louis Wehenkel, 'Extremely randomized trees', Machine Learning, 63(1), 3–42, (2006).
[9] Gary L Grunkemeier and Ruyun Jin, 'Receiver operating characteristic curve analysis of clinical risk models', (2001).
[10] Karimollah Hajian-Tilaki, 'Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation', Caspian Journal of Internal Medicine, 4(2), 627, (2013).
[11] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan, 'Multitask learning and benchmarking with clinical time series data', Scientific Data, 6(1), 1–18, (2019).
[12] Sharique Hasan and Rema Padman, 'Analyzing the effect of data quality on the accuracy of clinical decision support systems: a computer simulation approach', in AMIA Annual Symposium Proceedings, volume 2006, p. 324. American Medical Informatics Association, (2006).
[13] Simon Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall PTR, 1994.
[14] Stephanie Hyland, Martin Faltys, Matthias Hüser, Xinrui Lyu, Thomas Gumbsch, Cristóbal Esteban, Christian Bock, Max Horn, Michael Moor, Bastian Rieck, Marc Zimmermann, Dean Bodenham, Karsten Borgwardt, Gunnar Rätsch, and Tobias Merz, 'Early prediction of circulatory failure in the intensive care unit using machine learning', Nature Medicine, 26, 1–10, (03 2020).
[15] Xiangao Jiang, Megan Coffee, Anasse Bari, Junzhang Wang, Xinyue Jiang, Jianping Huang, Jichan Shi, Jianyi Dai, Jing Cai, Tianxiao Zhang, et al., 'Towards an artificial intelligence framework for data-driven prediction of coronavirus clinical severity', CMC: Computers, Materials & Continua, 63, 537–51, (2020).
[16] Alistair EW Johnson, Mohammad M Ghassemi, Shamim Nemati, Katherine E Niehaus, David A Clifton, and Gari D Clifford, 'Machine learning and decision support in critical care', Proceedings of the IEEE, 104(2), 444–466, (2016).
[17] Bartosz Krawczyk, 'Learning from imbalanced data: open challenges and future directions', Progress in Artificial Intelligence, 5(4), 221–232, (2016).
[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, 'Scikit-learn: Machine learning in Python', Journal of Machine Learning Research, 12, 2825–2830, (2011).
[19] Tom J Pollard and Leo Anthony Celi, 'Enabling machine learning in critical care', ICU Management & Practice, 17(3), 198, (2017).
[20] Luca Putelli, Alfonso Gerevini, Alberto Lavelli, Matteo Olivato, and Ivan Serina, 'Deep learning for classification of radiology reports with a hierarchical schema', in Proceedings of the 24th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, (2020).
[21] Luca Putelli, Alfonso Gerevini, Alberto Lavelli, and Ivan Serina, 'The impact of self-interaction attention on the extraction of drug-drug interactions', in Proceedings of the Sixth Italian Conference on Computational Linguistics, (2019).
[22] Luca Putelli, Alfonso Emilio Gerevini, Alberto Lavelli, and Ivan Serina, 'Applying self-interaction attention for extracting drug-drug interactions', in XVIIIth International Conference of the Italian Association for Artificial Intelligence, Rende, Italy, November 19-22, 2019, Proceedings, (11 2019).
[23] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence, Dataset Shift in Machine Learning, The MIT Press, 2009.
[24] Anna S Rakitianskaia and Andries Petrus Engelbrecht, 'Training feedforward neural networks with dynamic particle swarm optimisation', Swarm Intelligence, 6(3), 233–270, (2012).
[25] Lior Rokach and Oded Maimon, Data Mining with Decision Trees: Theory and Applications, World Scientific Publishing Co., Inc., River Edge, NJ, USA, 2008.
[26] Santosh Srivastava, Maya R Gupta, and Béla A Frigyik, 'Bayesian quadratic discriminant analysis', Journal of Machine Learning Research, 8(Jun), 1277–1305, (2007).
[27] Johan AK Suykens and Joos Vandewalle, 'Least squares support vector machine classifiers', Neural Processing Letters, 9(3), 293–300, (1999).
[28] Mihaela van der Schaar and Ahmed Alaa, 'How artificial intelligence and machine learning can help healthcare systems respond to covid-19', https://www.vanderschaar-lab.com/covid-19/, (2020).
[29] Li Yan, Hai-Tao Zhang, Yang Xiao, Maolin Wang, et al., 'Prediction of criticality in patients with severe covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in Wuhan', medRxiv preprint, (2020).
[30] Jinsung Yoon, Ahmed Alaa, Scott Hu, and Mihaela Schaar, 'ForecastICU: a prognostic decision support system for timely prediction of intensive care unit admission', in International Conference on Machine Learning, pp. 1680–1689, (2016).
[31] Harry Zhang, 'The optimality of naive bayes', in Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference, Miami Beach, Florida, USA, eds., Valerie Barr and Zdravko Markov, pp. 562–567. AAAI Press, (2004).
[32] Zhi-Hua Zhou, Ensemble Methods: Foundations and Algorithms, Chapman & Hall/CRC, 1st edn., 2012.