<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Prognosis Prediction in Covid-19 Patients from Lab Tests and X-ray Data through Randomized Decision Trees</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alfonso E. Gerevini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Maroldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Olivato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Putelli</string-name>
          <email>l.putelli002g@unibs.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Serina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Brescia</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>AI and Machine Learning can offer powerful tools to help in the fight against Covid-19. In this paper we present a study and a concrete tool based on machine learning to predict the prognosis of hospitalised patients with Covid-19. In particular, we address the task of predicting the risk of death of a patient at different times of the hospitalisation, on the basis of demographic information, chest X-ray scores and several laboratory findings. Our machine learning models use ensembles of decision trees trained and tested using data from more than 2000 patients. An experimental evaluation of the models shows good performance in solving the addressed task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The fight against Covid-19 is an important new challenge for the world
that AI and machine learning can help to face at various levels [
        <xref ref-type="bibr" rid="ref15 ref28 ref29">15,
28, 29</xref>
        ]. In March 2020, at the time of the coronavirus emergency
in Italy, we started working in close collaboration with Spedali Civili di
Brescia, one of the hospitals with the most Covid-19 patients in Italy,
to help predict the prognosis of hospitalised patients. Our
work focused on the task of predicting the risk of death of a
patient at different times of the hospitalisation. As discussed in [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ],
predicting whether a patient is at risk of death or adverse events can help
the hospital, for instance, to allocate limited health
resources more efficiently.
      </p>
      <p>
        Our predictive models are built on the basis of demographic
information (sex and age), the values of ten laboratory tests and the
chest X-ray score(s), an innovative measure developed and
used at Spedali Civili di Brescia to assess the severity of the
pulmonary conditions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Other important information, such as the
patient comorbidities or the time and duration of the symptoms related
to Covid-19, was not used because it was not available to us.
      </p>
      <p>Using raw data from more than 2000 patients, we built several
datasets describing the “clinical history” of each patient during the
hospitalisation. In particular, each dataset contains a “snapshot” of the
infection conditions of every considered patient at a certain day after
the start of the hospitalisation. For each dataset, we built a different
predictor; this allows us to make progressive predictions over time that
take into account the evolution of the disease severity in a patient,
which helps to formulate a personalized prediction of the
prognosis. A change of the predicted risk over time for a patient could also
hint at a link between specific events or treatments and the increase or
decrease of the risk for the patient. As snapshot times for a patient, in
our experiments we considered the 2nd, 4th, 6th, 8th and 10th
day of hospitalisation, and the day before the end of the hospitalisation.</p>
      <p>
        Our datasets were engineered to cope with a number of practical
issues, including missing values and feature values categorization,
and to add some helpful artificial features. We also addressed the
“concept drift” issue [
        <xref ref-type="bibr" rid="ref23 ref6">6, 23</xref>
        ], since we observed that the risk of death
was clearly sensitive to the time period when the patient was
hospitalised; the risk was significantly higher during the earlier period
of the emergency (March 2020), when in northern Italy the spread of
the virus infection was very high and many people were hospitalised.
Moreover, given the very sensitive nature of our task, we introduced
a threshold to discard the model predictions that have a low
estimated probability. This threshold is a parameter that is
automatically calculated and optimised during the training phase.
      </p>
      <p>We considered several machine learning algorithms. A first
experimental comparison of their performance on our datasets showed that
methods based on forests of trees perform better,
and so we decided to focus on this approach. The obtained
prediction models have good performance over a randomly chosen test set
of 200 patients for each considered period, in terms of both F2 and
ROC-AUC scores. In particular, overall the system makes very few
errors in predicting patient survival, i.e., the specificity of the
prediction is very high.</p>
      <p>In the following, after discussing related work, we describe our
datasets, we present our prediction models and their experimental
evaluation, and finally we give conclusions and mention future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>Artificial Intelligence and Machine Learning techniques can be used
for tackling different aspects of the Covid-19 pandemic. However,
given that the pandemic started only a few months ago, most works
are still preliminary, and there is not yet a clear description of the
developed techniques and of their results (which are often only pre-printed and not
properly peer-reviewed).</p>
      <p>
        A preliminary study is presented in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Given a set of only 53
patients with mild symptoms and their lab tests, comorbidities and
treatment, the authors train several machine learning models
(Logistic Regression, Decision Trees, Random Forests, Support Vector
Machines, KNN) to predict whether a patient will develop more severe
symptoms, obtaining a prediction accuracy score of up to 0.8 using
10-fold cross validation. The generalizability and strength of these
results are questionable, given the very small set of considered
patients.
      </p>
      <p>
        Another example is the pre-printed work by Li Yan et al. [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] that
uses lab tests for predicting the mortality risk; the proposed model
is a very simple decision tree based on the three most important
features. While the performance seems promising, the test set used for
evaluation was very small (29 patients).
      </p>
      <p>
        Various AI and machine learning techniques have been developed
for prognosis and disease progression prediction [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] in the context of
diseases different from Covid-19 [
        <xref ref-type="bibr" rid="ref20 ref21 ref22">20, 21, 22</xref>
        ]. In particular, in the last
few years, several works about predicting mortality risk or adverse
events and on the use of AI in critical care [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] have been published.
The survey in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] presents a review of statistical and ML systems for
predicting the mortality risk, the need for beds in intensive care units
[
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] or the length of the patient hospitalisation. In particular, it is
worth mentioning the work by Harutyunyan et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] which uses
LSTM Neural Networks for predicting both the mortality risk and
the length of the hospitalisation.
      </p>
      <p>
        An overview of the issues and challenges for applying ML in a
critical-care context is available in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This work stresses the need
to deal with corrupted data, like missing values, imprecision, and
errors that can increase the complexity of prediction tasks.
      </p>
      <p>
        Lab test findings and their variation over time are the main focus
of the work by Hyland et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which describes a system that
processes these data to generate an alarm predicting, 2 hours in advance,
that a patient will have a circulatory failure.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Available Data Sources</title>
      <p>
        From February to April 2020, during the Covid-19 outbreak, more than
two thousand patients were hospitalised in Spedali Civili di Brescia.
During their hospitalisation, the medical staff performed several exams
on them in order to monitor their conditions, check the response to
treatments, verify the need to transfer a patient to the ICU, etc. We had
data from a total of 2015 hospitalised patients; for each of these
patients, the specific data made available to us are:
        <list list-type="bullet">
          <list-item><p>the age and sex;</p></list-item>
          <list-item><p>the values and dates of several lab tests (see Table 1);</p></list-item>
          <list-item><p>the scores (each one from 0 to 18), assigned by the physicians, assessing the severity of the pulmonary conditions resulting from the X-ray exams [<xref ref-type="bibr" rid="ref3">3</xref>];</p></list-item>
          <list-item><p>the values and dates of the throat-swab exams for Covid-19;</p></list-item>
          <list-item><p>the final outcome of the hospitalisation at the end of the stay, which is the classification value of our application (either in-hospital death, released survivor, or transferred to another hospital or rehabilitation center).</p></list-item>
        </list>
      </p>
      <p>Table 1 specifies the considered lab tests, their normal range of
values, and their median values in our set of patients. We had no further
information about symptoms, their timing, comorbidities, generic
health conditions or clinical treatment. Moreover, we had no CT
images or text reports associated with the X-ray exams. The available
information about whether a patient was or had been in ICU was not
clear enough to be used. Finally, the names of the patients and of the
medical staff involved were, of course, not provided.</p>
    </sec>
    <sec id="sec-4">
      <title>Data Quality Issues</title>
      <p>
        When applying machine learning to raw real-world data, there are
some non-trivial practical issues to deal with, such as the quality of
the available data and related aspects, that in biomedical applications
are especially important given the very sensitive domain [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>In our case, one such issue is that the length of the
hospitalisation period can differ considerably from one patient to another (from
a few days to two months), for different reasons including the
novelty and the characteristics of the disease, its high contagiousness and
the absence of an effective treatment. Therefore, the number of
performed lab tests and related findings varies significantly among the
considered set of patients (from only three to hundreds).</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Lab test</th><th>Normal range</th></tr>
          </thead>
          <tbody>
            <tr><td>C-Reactive Protein (PCR)</td><td>10</td></tr>
            <tr><td>Lactate dehydrogenase (LDH)</td><td>[80, 300]</td></tr>
            <tr><td>Ferritin (Male)</td><td>[30, 400]</td></tr>
            <tr><td>Ferritin (Female)</td><td>[13, 150]</td></tr>
            <tr><td>Troponin-T</td><td>14</td></tr>
            <tr><td>White blood cell (WBC)</td><td>[4, 11]</td></tr>
            <tr><td>D-dimer</td><td>250</td></tr>
            <tr><td>Fibrinogen</td><td>[180, 430]</td></tr>
            <tr><td>Lymphocyte (over 18 years old patients)</td><td>[20, 45]</td></tr>
            <tr><td>Neutrophils/Lymphocytes</td><td>[0.8, 3.5]</td></tr>
            <tr><td>Chest XRay-Score (RX)</td><td>&lt; 7</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Moreover, the lab tests and X-ray exams are not performed at a
regular frequency due, e.g., to the different kinds and timing of the
related procedures, the need for different resources (X-ray machines,
lab equipment, technical staff, etc.), or the different severity of
the health conditions of the patients. For example, in our data we see
that a patient can be tested for PCR every day and not undergo
a Ferritin exam for two weeks. This leads to the need to handle
the issues of missing values and outdated values. When we consider a
snapshot of a patient at a certain day, we have a missing value for a
lab test (or X-ray) feature if that test (X-ray) has not been performed.
We have an outdated value for a feature if the corresponding lab test
(X-ray) was performed several days earlier: since in the meantime
the disease has progressed, the findings of the lab test could be
inconsistent with the current conditions of the patient, and so they could
mislead the prediction.</p>
      <p>Data quality issues arise especially for patients hospitalised in the
period of the highest emergency, when several hundred patients
were in the hospital at the same time.</p>
    </sec>
    <sec id="sec-5">
      <title>Concept Drift</title>
      <p>An examination of the data available for our cohort of patients
revealed that their prognostic risk is influenced by multiple factors
that change over time, such as the number of patients currently
hospitalised and the consequent availability of ICU beds or other resources,
the experimentation with new therapies, and the growth of clinical knowledge.</p>
      <p>
        In machine learning, this change of data distribution is known as
concept drift [
        <xref ref-type="bibr" rid="ref23 ref6">6, 23</xref>
        ]. A classical method to deal with this problem is
training the algorithm using only a subset of samples, depending on
the data distribution that we are considering [
        <xref ref-type="bibr" rid="ref24 ref6">6, 24</xref>
        ].
      </p>
      <p>For this reason, we divided the considered set of patients into two
groups: the High Contagion Phase (HCP) group, which
is composed of the patients admitted during the last weeks of
February and the first weeks of March (the most critical period of the
pandemic outbreak in Italy), and the Moderate Contagion Phase (MCP)
group, which is composed of the patients admitted from
the last ten days of March to the end of April.</p>
      <p>The main differences between these groups of patients are:</p>
      <list list-type="order">
        <list-item><p>the mortality rate of the HCP patients is about twice the mortality rate of the MCP patients;</p></list-item>
        <list-item><p>in HCP patients the median length of the hospitalisation period is 8 days, while in MCP patients it is 14 days. Further details are given in Figure 1;</p></list-item>
        <list-item><p>for many of the considered lab tests, the mortality rate associated with having values in a particular range changes significantly between the two groups. For example, in HCP patients the mortality rate for the patients who had a PCR value 10 times above the normal range is 40.1%, while in MCP patients it is 21.1%.</p></list-item>
      </list>
      <p>These differences clearly indicate that the data in the HCP and
MCP groups represent different target (concept) functions; therefore
predicting mortality during the high contagion phase and during the
moderate contagion phase can be considered as two different tasks. If we had
only the patients hospitalised during the high contagion phase, using
these data for training an algorithm that predicts the mortality during
the moderate contagion phase would lead to many errors.</p>
      <p>In our case, we generated two different systems, one for each of
the two groups of patients. We are currently investigating ways to
automatically select the set of patients for training starting from the
latest ones, and keeping the less recent ones until we find significant
changes in the mortality rate or in the data distribution.</p>
    </sec>
    <sec id="sec-6">
      <title>Datasets for Training and Testing</title>
      <p>The main task of our work is to provide survival/death predictions at
different days of the patient hospitalisation, according to the current
patient conditions reflected by the available lab findings and X-ray
scores. In this section we describe the specific extracted features and
the (training and testing) datasets that we built for this purpose.</p>
    </sec>
    <sec id="sec-7">
      <title>Pre-processing and Feature Extraction</title>
      <p>The issues presented in Section 3.1 require a robust
pre-processing phase, whose goal is to extract features that
summarize the patients' conditions and can be processed by a machine
learning algorithm. The pre-processing is applied to both HCP and MCP
data.</p>
      <p>Given that we have no information about the survival or the
death of a patient after a transfer (which can be due to limited
availability of beds or ICU places), we exclude from our training and test
sets the 142 patients who were admitted to Spedali Civili di
Brescia and then transferred to another hospital. However, the 74 patients
who were transferred to a rehabilitation center can be considered not
at risk of death; therefore we include them in our datasets and
consider them as released alive.</p>
      <sec id="sec-7-1">
        <title>Patient Snapshot and Feature Engineering</title>
        <p>In order to provide a prediction for a patient at different
hospitalisation times, we introduced the concept of patient snapshot to
represent the patient health conditions at a given day.</p>
        <p>In this snapshot, for each lab test of Table 1, we consider its most
recent value. In the ideal case, we should know the lab test findings
at every day. However, as explained in Section 3.1, in a real-world
context the situation is very different. For example, in our data, if we
take a snapshot of a patient 14 days after the admission
into the hospital, we have cases with very recent values of PCR, LDH
or WBC (obtained one or a few days before), very old values for
Fibrinogen or Troponin-T (obtained the first day of the hospitalisation)
and even no value for Ferritin.</p>
        <p>Given the difficulty of setting a predefined threshold that separates
recent and old values of the lab tests (e.g., for Fibrinogen and
Troponin-T), we choose to always use the most recent value, even if it could
be outdated. In order to allow the learning algorithm to capture that
a value may not be significant to represent the current status of the
patient (because it is too old), we introduce a feature called ageing for
each test finding. If a lab test has been performed at a day <inline-formula><tex-math>d_0</tex-math></inline-formula>, and the
snapshot of a patient is taken at day <inline-formula><tex-math>d_1</tex-math></inline-formula>, the ageing is defined as the
number of days between <inline-formula><tex-math>d_1</tex-math></inline-formula> and <inline-formula><tex-math>d_0</tex-math></inline-formula>. If there is no available value for
a lab test, its ageing is considered a missing value.</p>
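        <p>As an illustration only, the most-recent-value rule and the ageing feature could be sketched as follows. The function names and the representation of test results as (date, value) pairs are hypothetical and are not taken from the system described here.</p>
        <preformat>
```python
from datetime import date
from typing import Optional


def ageing(test_day: Optional[date], snapshot_day: date) -> Optional[int]:
    """Days between the snapshot day d1 and the lab-test day d0."""
    if test_day is None:
        return None  # no value for the lab test: ageing is a missing value
    return (snapshot_day - test_day).days


def most_recent(results, snapshot_day):
    """Return (value, ageing) of the latest test on or before the snapshot day.

    `results` is a hypothetical list of (date, value) pairs for one lab test.
    """
    past = [(d, v) for d, v in results if snapshot_day >= d]
    if not past:
        return None, None  # test never performed before the snapshot
    d0, value = max(past, key=lambda r: r[0])
    return value, ageing(d0, snapshot_day)
```
        </preformat>
        <p>In this sketch a test performed after the snapshot day is ignored, so a snapshot never uses information from its own future.</p>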
        <p>A patient snapshot can contain the values of the lab test findings in
two forms: either numerical, in which we report the value itself, or
categorical, in which the value is transformed into an integer number
expressing the gravity of the test finding within a partition of the
possible real values. This partition is based on the range of values for
normal conditions and on how the test values are distributed over the
data of all patients. For example, we divide the D-dimer values into
6 categories: the normal range, up to 2 times the maximum value
of the normal range, up to 4 times, 6 times, 10 times and over 10
times. The categorical form could help the algorithm to have a clearer
understanding of the data and improve performance.</p>
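        <p>A minimal sketch of this categorisation for the D-dimer example is given below. It assumes that 250 (the D-dimer entry of Table 1) is the upper bound of the normal range and that each bound is inclusive; both points, like the function name, are assumptions made for illustration.</p>
        <preformat>
```python
def severity_category(value, normal_max, multipliers=(1, 2, 4, 6, 10)):
    """Map a lab value to an integer category.

    Category 0 is the normal range, category k means "up to multipliers[k]
    times the normal maximum", and the last category is "over 10 times".
    """
    for category, m in enumerate(multipliers):
        if m * normal_max >= value:
            return category
    return len(multipliers)


# Six categories for D-dimer, assuming a normal maximum of 250.
d_dimer_categories = [severity_category(v, 250) for v in (200, 400, 900, 1400, 2400, 3000)]
```
        </preformat>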
        <p>Monitoring the conditions of a patient means knowing not only the
patient status at a specific time, but also how the conditions evolve
during the hospitalisation. For this purpose, we introduce a feature
called trend that is defined as follows:</p>
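        <p>For each lab test, if there is no available value or if
the patient has not performed the lab test at least twice, the
trend is a missing value. Otherwise, given the values <inline-formula><tex-math>v_1</tex-math></inline-formula> and <inline-formula><tex-math>v_2</tex-math></inline-formula>
of the findings for the lab test performed at days <inline-formula><tex-math>d_1</tex-math></inline-formula> and <inline-formula><tex-math>d_2</tex-math></inline-formula>, and a
relative threshold <inline-formula><tex-math>T = 0.15</tex-math></inline-formula> (i.e., 15% of <inline-formula><tex-math>v_1</tex-math></inline-formula>), if <inline-formula><tex-math>v_2 &gt; (1 + T)\,v_1</tex-math></inline-formula> then
the trend is increasing, while if <inline-formula><tex-math>v_2 &lt; (1 - T)\,v_1</tex-math></inline-formula> the trend is
decreasing; otherwise the trend is stable.</p>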
        <p>We distinguish two types of trends: the start trend, that uses the
distance between the most recent value and the first available value,
and the last trend, that uses the distance between the last one and
the penultimate one. We are currently investigating techniques for
including more than two values in the trend calculation.</p>
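        <p>The trend definition and its two variants can be sketched as follows. This is a hedged illustration under the 15% threshold stated above; the function names and the list-based input are hypothetical.</p>
        <preformat>
```python
def trend(v1, v2, threshold=0.15):
    """Compare a later value v2 with an earlier value v1."""
    if v1 is None or v2 is None:
        return None  # missing value
    if v2 > (1 + threshold) * v1:
        return "increasing"
    if (1 - threshold) * v1 > v2:
        return "decreasing"
    return "stable"


def start_and_last_trend(values, threshold=0.15):
    """`values` holds the findings of one lab test in chronological order.

    The start trend compares the first and the most recent value; the last
    trend compares the penultimate and the most recent value.
    """
    if 2 > len(values):
        return None, None  # fewer than two tests: both trends are missing
    start = trend(values[0], values[-1], threshold)
    last = trend(values[-2], values[-1], threshold)
    return start, last
```
        </preformat>
        <p>For instance, the sequence 100, 130, 125 has an increasing start trend (125 exceeds 115) and a stable last trend (125 lies within 15% of 130).</p>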
        <p>To summarize, for each lab test in a patient snapshot, we have the
most recent finding and the relative ageing and trend, as well as the
static features age and sex.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Training and Test Sets Generation</title>
      <p>In this section we describe how we generated the training and test
sets for the purpose of predicting, at different days from the start of
the patient hospitalization, the final outcome of her/his stay.</p>
      <p>First, for both the HCP and MCP sets, we used stratified sampling
for selecting 80% of the patients for training the models and 20% for
testing them. Then, we created specific training and test sets for each
element in a sequence of times when the model is used to make the
prediction<sup>1</sup>:</p>
      <list list-type="bullet">
        <list-item><p>2 days of hospitalisation. We include all the patients' snapshots containing the first values for each lab test conducted in the first two days after the hospital admission. Note that if a patient has performed a lab test more than once in the first two days, the snapshot will consider the oldest value. In fact, the purpose of the model we want to build is to provide the prediction as soon as possible, with the first information available. Furthermore, in these snapshots the ageing and trend values are not included.</p></list-item>
        <list-item><p>4 days and 6 days of hospitalisation. In these cases, the corresponding snapshots also contain the ageing and trend features, and the lab values are the most recent ones in the available data. Given that only a few days have passed since admission, we consider the start trend.</p></list-item>
        <list-item><p>8 days and 10 days of hospitalisation. The procedure for creating the corresponding snapshots is the same as for the 4 days and 6 days cases, except that we consider the last trend instead of the start trend.</p></list-item>
        <list-item><p>End day (the last day before the patient's release or death). In this case, for each lab test the snapshot includes both the start trend and the last trend features.</p></list-item>
      </list>
      <p>It is important to observe that, while the datasets of the later days
contain more information about the single patients (more lab
test findings, fewer missing values), the overall number of patients in
the datasets decreases as the prediction day increases. This is due to
the fact that more patients are released or die within longer periods
of hospitalisation, and therefore such patients are not included in the
corresponding datasets.</p>
      <p>Finally, note that the splitting of the data between training and
testing is done only once, considering all patients. Thus if, for instance,
a patient belongs to the training set of 2 days, then he or she does not belong
to the test set of the following days.</p>
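      <p>The single patient-level split can be illustrated with the following sketch, which is not the code used for the experiments. It assumes outcomes stored in a dictionary keyed by a hypothetical patient identifier, and then restricts each day-specific snapshot to the fixed partitions.</p>
      <preformat>
```python
import random
from collections import defaultdict


def stratified_patient_split(outcomes, test_fraction=0.2, seed=0):
    """Split patients once, preserving the proportion of each outcome class.

    `outcomes` maps a patient id to its class label (e.g. alive or dead).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pid, label in sorted(outcomes.items()):
        by_class[label].append(pid)
    train, test = set(), set()
    for pids in by_class.values():
        rng.shuffle(pids)
        n_test = round(test_fraction * len(pids))
        test.update(pids[:n_test])
        train.update(pids[n_test:])
    return train, test


def split_snapshot(snapshot_ids, train, test):
    """Restrict a day-specific snapshot to the patients of each fixed partition."""
    present = set(snapshot_ids)
    return present.intersection(train), present.intersection(test)
```
      </preformat>
      <p>Because every snapshot reuses the same two partitions, a patient used for training at day 2 can never appear in the test set of a later day.</p>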
    </sec>
    <sec id="sec-9">
      <title>Machine Learning Algorithms</title>
      <p>In this section we briefly describe the machine learning algorithms
used in our prognosis prediction system.</p>
      <p><sup>1</sup> While we chose 2, 4, 6, 8 and 10 days after the hospitalisation, plus the day
before the patient release, other sequences could of course be considered.</p>
    </sec>
    <sec id="sec-10">
      <title>Classification Algorithms</title>
      <sec id="sec-10-1">
        <title>Decision trees</title>
        <p>
          Decision Trees [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] are one of the most popular learning methods
for solving classification tasks. In a decision tree, the root and each
internal node provide a condition for splitting the training samples
into two subsets depending on whether the condition holds for a
sample or not. In our context, for each numerical feature f, a candidate
splitting condition is <inline-formula><tex-math>f \leq C</tex-math></inline-formula>, where C is called the cut point. The final
splitting condition is chosen by finding the f and C providing the
best split according to a measure such as
Information Gain, Entropy index or Gini index.
        </p>
        <p>A subset of samples at a tree node can either be split again by
further feature conditions, forming a new internal node, or form a
leaf node labelled with a specific classification (prediction) value; in
our application domain the label is either the alive class or the dead
class. Let us consider a decision tree with a leaf node l and a subset S
of associated training samples. A test instance X that reaches l from
the root of the tree is classified (predicted) as y with probability</p>
        <disp-formula><tex-math>P(y \mid X) = \frac{TP}{TP + FP}</tex-math></disp-formula>
        <p>
          where TP (True Positives) is the number of training samples in S
that have class value y, and FP (False Positives) is the number of
samples in S that do not have class value y [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Given that in our task
we have only two classes (<inline-formula><tex-math>y</tex-math></inline-formula> and <inline-formula><tex-math>\bar{y}</tex-math></inline-formula>), <inline-formula><tex-math>P(\bar{y} \mid X) = 1 - P(y \mid X)</tex-math></inline-formula>. The
classification outcome of a decision tree for X is the class value with
the highest probability.
        </p>
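        <p>The leaf probability above amounts to a class frequency among the training samples of the leaf. A minimal sketch, with hypothetical function names and class labels written as strings, is:</p>
        <preformat>
```python
from collections import Counter


def leaf_probabilities(leaf_labels, classes=("alive", "dead")):
    """P(y|X) for a test instance reaching a leaf whose training labels are given.

    For class y, TP is the count of y in the leaf and TP + FP is the leaf size.
    """
    counts = Counter(leaf_labels)
    total = len(leaf_labels)
    return {c: counts[c] / total for c in classes}


def leaf_prediction(leaf_labels, classes=("alive", "dead")):
    """Return the class with the highest probability and all probabilities."""
    probs = leaf_probabilities(leaf_labels, classes)
    return max(classes, key=lambda c: probs[c]), probs
```
        </preformat>
        <p>For a leaf holding three dead and one alive training sample, the sketch predicts dead with probability 0.75, and the alive probability is the complement 0.25.</p>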
      </sec>
      <sec id="sec-10-2">
        <title>Random Forests</title>
        <p>
          Random Forests (RF) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is an ensemble learning method [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] that
builds a number of decision trees at training time. For building each
individual tree of the random forest, a randomly chosen subset of the
data features is used. While in the standard implementation of
random forests the final classification label is provided using the
statistical mode of the class values predicted by each individual tree, in the
well-known tool Scikit-Learn [
          <xref ref-type="bibr" rid="ref18">18</xref>
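          ] that we used for our system
implementation, the probability of the classification output is obtained
by averaging the probabilities provided by all trees. Hence, given a
random forest with n decision trees, a class (prediction) value y is
assigned to an instance X with the following probability:
          <disp-formula><tex-math>P(y \mid X) = \frac{1}{n} \sum_{i=1}^{n} P_i(y \mid X)</tex-math></disp-formula>
where <inline-formula><tex-math>P_i(y \mid X)</tex-math></inline-formula> is the probability provided by the i-th tree.
        </p>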
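        <p>The difference between averaging probabilities and taking the statistical mode of the per-tree labels can be seen in a small sketch. The per-tree probabilities are represented here as plain dictionaries, which is an assumption made for illustration and not the Scikit-Learn interface.</p>
        <preformat>
```python
def forest_probability(tree_probs):
    """Average the per-tree probabilities P_i(y|X) over the n trees."""
    n = len(tree_probs)
    classes = tree_probs[0].keys()
    return {c: sum(p[c] for p in tree_probs) / n for c in classes}


def majority_vote(tree_probs):
    """Statistical mode of the labels predicted by the individual trees."""
    votes = [max(p, key=p.get) for p in tree_probs]
    return max(set(votes), key=votes.count)


# Two weakly confident "alive" trees and one strongly confident "dead" tree.
trees = [
    {"alive": 0.55, "dead": 0.45},
    {"alive": 0.60, "dead": 0.40},
    {"alive": 0.05, "dead": 0.95},
]
averaged = forest_probability(trees)
```
        </preformat>
        <p>In this example the majority vote is alive, whereas the averaged probability favours dead (0.6 against 0.4), because averaging takes the confidence of each tree into account.</p>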
      </sec>
      <sec id="sec-10-3">
        <title>Extra Trees</title>
        <p>
          Extremely Randomized Trees (Extra Trees or ET) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] are another
ensemble learning method based on decision trees. The main
differences between Extra Trees and Random Forests are:
        </p>
        <p>
          In the original description of Extra Trees [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] each tree is built
using the entire training dataset. However in most implementations
of Extra Trees, including Scikit-Learn [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], the decision trees are
built exactly as in Random Forests.
        </p>
        <p>In standard decision trees and Random Forests, the cut point is
chosen by first computing the optimal cut point for each feature,
and then choosing the best feature for branching the tree; in
Extra Trees, instead, we first randomly choose k features and then, for
each chosen feature f, the algorithm randomly selects a cut point
<inline-formula><tex-math>C_f</tex-math></inline-formula> in the range of the possible f values. This generates a set of k
pairs <inline-formula><tex-math>\{(f_i, C_i) \mid i = 1, \dots, k\}</tex-math></inline-formula>. Then, the algorithm compares
the splits generated by each pair (e.g., under the split test <inline-formula><tex-math>f_i \leq C_i</tex-math></inline-formula>)
to select the best one using a split quality measure such as the Gini
index.</p>
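        <p>A simplified sketch of this split selection at a single node is shown below. It uses the weighted Gini index as the quality measure and draws each cut point uniformly between the minimum and maximum observed value of the feature; the function names and the tuple-based data layout are hypothetical, and the sketch does not reproduce any library implementation.</p>
        <preformat>
```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def split_impurity(rows, labels, feature, cut):
    """Weighted Gini impurity of the split `row[feature] at most cut`."""
    left = [y for x, y in zip(rows, labels) if cut >= x[feature]]
    right = [y for x, y in zip(rows, labels) if x[feature] > cut]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)


def extra_trees_split(rows, labels, k, rng):
    """Draw k random (feature, cut point) pairs and keep the best one."""
    features = rng.sample(range(len(rows[0])), k)
    candidates = []
    for f in features:
        lo = min(x[f] for x in rows)
        hi = max(x[f] for x in rows)
        candidates.append((f, rng.uniform(lo, hi)))  # random cut, not optimised
    return min(candidates, key=lambda fc: split_impurity(rows, labels, fc[0], fc[1]))
```
        </preformat>
        <p>Only the comparison among the k random candidates is optimised; no search over all possible cut points is performed, which is what distinguishes this procedure from the split selection of standard decision trees.</p>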
        <p>The probability <inline-formula><tex-math>P(y \mid X)</tex-math></inline-formula> of assigning a class value y to an instance
X is computed as in Random Forests (see equation above).</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Hyperparameter Search</title>
      <p>Most machine learning algorithms have several hyperparameters to
tune such as, for instance, in a Random Forest the number of decision
trees to create and their maximum depth. Since in our application
handling the missing values is an important issue, we also used a
hyperparameter for this with three possible settings: a missing value is
set to either the average value, the median value or a special constant
(-1).</p>
      <p>
        In order to find the best performing configuration of the
hyperparameters, we used the Random Search optimization approach [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
which consists of the following main steps:
      </p>
      <list list-type="order">
        <list-item><p>We divide our training sets into k folds, with either k = 10 or k = 5, depending on the size of the considered dataset.</p></list-item>
        <list-item><p>For each randomly selected combination of hyperparameters, we run the learning algorithm in k-fold cross validation.</p></list-item>
        <list-item><p>For each fold, we evaluate the performance of the algorithm with that configuration using the Macro <inline-formula><tex-math>F_\beta</tex-math></inline-formula>-score metric with <inline-formula><tex-math>\beta = 2</tex-math></inline-formula>. The <inline-formula><tex-math>F_\beta</tex-math></inline-formula>-score is the weighted harmonic mean of the precision and recall measures. The parameter <inline-formula><tex-math>\beta</tex-math></inline-formula> indicates how many times the recall is more important than the precision:
          <disp-formula><tex-math>F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}</tex-math></disp-formula>
We choose <inline-formula><tex-math>\beta = 2</tex-math></inline-formula> in order to give particular importance to false negatives, i.e. those patients whom our system could not identify as at risk of death. Given that we can compute the F2-score for both the alive class and the dead class, we considered the Macro F2-score, which is the arithmetic mean of the scores for the two classes.</p></list-item>
        <list-item><p>The overall evaluation score of the k-fold cross validation for a configuration of the hyperparameters is obtained by averaging the scores obtained for each fold.</p></list-item>
        <list-item><p>The hyperparameter configuration with the best overall score is selected.</p></list-item>
      </list>
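      <p>The Macro F2-score used in step 3 can be computed directly from the confusion counts, as in the following sketch. The function names are hypothetical and a score of zero is assumed when a class has no true positives.</p>
      <preformat>
```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from the confusion counts of one class."""
    if tp == 0:
        return 0.0  # assumption: the score is zero without true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def macro_f_beta(y_true, y_pred, classes=("alive", "dead"), beta=2.0):
    """Arithmetic mean of the per-class F-beta scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        scores.append(f_beta(tp, fp, fn, beta))
    return sum(scores) / len(scores)
```
      </preformat>
      <p>With these definitions, a class with precision 0.5 and recall 1 scores higher than a class with precision 1 and recall 0.5, which reflects the extra weight that the choice of 2 for the parameter gives to recall.</p>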
    </sec>
    <sec id="sec-12">
      <title>Handling Prediction Uncertainty</title>
      <p>The output for an instance X of every generated classification model
is an array of two probabilities, <inline-formula><tex-math>P(\mathrm{alive} \mid X)</tex-math></inline-formula> and <inline-formula><tex-math>P(\mathrm{dead} \mid X)</tex-math></inline-formula>,
defined as described in Section 5.1. We can see them as “degrees of
certainty” of the prediction: the higher the probability is, the more
reliable the prediction is. Given the very sensitive nature of our task,
the system discards potential predictions supported by a low
probability. This is achieved using a prediction threshold under which the
system considers the prediction uncertain (and the patient risk
unpredictable). Note that if we used a threshold value that is too high,
many patients could be classified as uncertain, and our model would be
much less useful for clinical practice. To avoid this, at training time
we impose a maximum percentage of samples that can be
considered uncertain (unpredictable), and we implemented this with a
parameter, called max_u, that is given as input; for our experimental
analysis we used max_u = 25%.</p>
      <p>Figure 2. FINDUNCERTAINTHRESHOLD: Algorithm for computing,
during the training phase, an optimised prediction threshold
under which the model labels an instance as uncertain.</p>
      <p>Input:
– L: array of labels (alive or dead), with L[i] the label of sample i
of the validation data (fold);
– P = [p_i = (p_alive, p_dead)_i | i is a sample index in the validation set];
– max_u: the maximum percentage of the samples in the validation
set that can be labeled as uncertain (not predictable);
– n: the maximum number of thresholds to try;
– EvaluateScore: the score function to maximize by dropping the
uncertain samples;
Output: A pair (v, th) where v is the score function value
after dropping the uncertain samples and th is the
optimized threshold value.</p>
      <preformat>
1  Lpred ← array of labels such that Lpred[i] is the predicted
   label (the label with the highest probability) of the validation sample i;
2  Pmax ← [max(p_alive, p_dead)_i | (p_alive, p_dead)_i ∈ P];
3  v ← EvaluateScore(L, Lpred);
4  th ← min value in Pmax;
5  δ ← [(max value in Pmax) − (min value in Pmax)] / n;
6  for i ← 0 to n − 1 do
7      th′ ← min value in Pmax + i · δ;
8      S ← {j | j is a sample id such that Pmax[j] &gt; th′};
9      u ← 1 − (|S| / |Pmax|);
10     if u ≥ max_u then return (v, th);
11     L′ ← array of the labels L[j] of the validation samples j with j ∈ S;
12     L′pred ← array of the predicted labels Lpred[j] of the validation
       samples j with j ∈ S;
13     v′ ← EvaluateScore(L′, L′pred);
14     if v′ &gt; v then
15         th ← th′;
16         v ← v′;
17 return (v, th);
      </preformat>
      <p>We designed an algorithm called FINDUNCERTAINTHRESHOLD
that is used in the training phase to set the threshold and
optimize the prediction performance on the training samples that pass it,
under the max_u constraint. The pseudocode of the algorithm is in
Figure 2.</p>
      <p>Given the original labels L of the validation samples and their
prediction probabilities P derived by the learning algorithm,
FINDUNCERTAINTHRESHOLD first computes: the predicted labels Lpred
(i.e., the class values with the highest probabilities) and the corresponding
Pmax probabilities; the original score v obtained by applying the input
score function to all samples; and an initial value of the threshold
(th), defined as the minimum probability in Pmax.</p>
      <p>The subsequent loop searches for an optimal value of the threshold th and computes
the score function for the validation set reduced to the validation
samples whose predicted labels have probabilities above th. The
considered threshold values are obtained by using the δ-increments
defined at lines 5 and 7. First we compute the new threshold th′ by
increasing the previous candidate value by δ, and then we derive the set S
of sample ids with prediction probabilities higher than th′. Next we
compute the percentage u of samples that are labeled as uncertain
using threshold th′. If u ≥ max_u, we can terminate, returning the
current best score v and the corresponding threshold value th (a
greater threshold value cannot lead to fewer samples being labeled as
uncertain). Otherwise (u &lt; max_u), we compute
the correct sample labels L′ and the predicted sample labels L′pred
for the samples identified by S, and we compute the new score value
v′ using L′ and L′pred. If v′ is a better score than v, we update both
the threshold and the score values.</p>
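      <p>The threshold search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names find_uncertain_threshold and accuracy are hypothetical, max_u is expressed as a fraction (0.25 for 25%), ties between the two probabilities are resolved as "alive", and plain accuracy is used as a stand-in for the Macro F2-score that the paper passes as EvaluateScore.</p>
      <preformat><![CDATA[
```python
def accuracy(y_true, y_pred):
    """Stand-in score function; the paper uses the Macro F2-score instead."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def find_uncertain_threshold(labels, probs, max_u, n, evaluate_score):
    """Return (v, th) following the numbered steps of the pseudocode.

    labels: true labels of the validation samples ("alive" or "dead")
    probs:  list of (p_alive, p_dead) pairs, one per sample
    max_u:  maximum fraction of uncertain samples, e.g. 0.25 for 25%
    n:      number of candidate thresholds to try (assumed >= 1)
    """
    # Lines 1-2: predicted labels and the probabilities supporting them
    # (ties are resolved as "alive", an assumption of this sketch).
    pred = ["alive" if p_alive >= p_dead else "dead"
            for p_alive, p_dead in probs]
    p_max = [max(pair) for pair in probs]
    # Lines 3-5: baseline score, initial threshold and step size.
    v = evaluate_score(labels, pred)
    lowest = min(p_max)
    th = lowest
    delta = (max(p_max) - lowest) / n
    for i in range(n):                                    # line 6
        th_new = lowest + i * delta                       # line 7
        kept = [j for j, p in enumerate(p_max) if p > th_new]  # line 8
        u = 1 - len(kept) / len(p_max)                    # line 9
        if u >= max_u:                                    # line 10
            return v, th
        v_new = evaluate_score([labels[j] for j in kept],
                               [pred[j] for j in kept])   # lines 11-13
        if v_new > v:                                     # lines 14-16
            th, v = th_new, v_new
    return v, th


# Toy validation fold: the two least confident predictions are wrong.
labels = ["dead", "alive", "alive", "dead", "alive", "dead", "alive", "alive"]
probs = [(0.45, 0.55), (0.40, 0.60), (0.70, 0.30), (0.20, 0.80),
         (0.90, 0.10), (0.62, 0.38), (0.95, 0.05), (0.88, 0.12)]
score, threshold = find_uncertain_threshold(labels, probs, 0.5, 4, accuracy)
print(score, threshold)
```
]]></preformat>
      <p>In this toy fold the search stops at the third candidate threshold, because discarding more samples would reach the max_u limit, and it returns the threshold that removed the misclassified low-confidence samples.</p>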
      <p>FINDUNCERTAINTHRESHOLD is executed during the training
phase. In particular, during the hyperparameter search, for each
attempted hyperparameter configuration, we compute through
FINDUNCERTAINTHRESHOLD an optimized threshold and the corresponding
score function value. These two values are obtained by averaging
the optimal thresholds and corresponding scores over all folds of the
cross validation for the attempted configuration. The hyperparameter
search returns the best configuration together with the corresponding
(averaged) threshold.</p>
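      <p>The averaging step can be sketched as follows, assuming that each attempted configuration has already produced one (score, threshold) pair per fold. The function name select_configuration and the dictionary layout are hypothetical and serve only to illustrate the selection rule.</p>
      <preformat><![CDATA[
```python
from statistics import mean


def select_configuration(fold_results):
    """Pick the configuration with the best fold-averaged score.

    fold_results maps a configuration name to a list of
    (score, threshold) pairs, one pair per cross-validation fold.
    Returns (configuration, averaged score, averaged threshold).
    """
    best = None
    for config, results in fold_results.items():
        avg_score = mean(score for score, _ in results)
        avg_threshold = mean(threshold for _, threshold in results)
        if best is None or avg_score > best[1]:
            best = (config, avg_score, avg_threshold)
    return best


# Two hypothetical configurations evaluated on two folds each.
fold_results = {
    "config_a": [(0.70, 0.60), (0.80, 0.70)],
    "config_b": [(0.85, 0.65), (0.75, 0.55)],
}
print(select_configuration(fold_results))
```
]]></preformat>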
    </sec>
    <sec id="sec-13">
      <title>Experimental Evaluation and Discussion</title>
      <p>
        In this section, we evaluate the performance of the machine
learning models that we built. Our system was implemented using the
Scikit-Learn [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] library for Python, and the experimental tests were
conducted using an Intel(R) Xeon(R) Gold 6140M CPU @ 2.30GHz.
      </p>
      <p>
        The performance of the learning algorithms with the corresponding
optimized hyperparameters was evaluated on the test set in terms
of F2 score and ROC-AUC score. The second metric is defined as
the area under the Receiver Operating Characteristic curve, which
plots the true positive rate against the false positive rate; it
therefore also takes into account the probability that the predictive system produces
false positives (i.e. false alarms). This metric is a standard method
for evaluating medical tests and risk models [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ].
      </p>
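      <p>To make the second metric concrete, the sketch below computes the ROC-AUC through its rank interpretation: the area under the ROC curve equals the probability that a randomly chosen positive sample (here, a deceased patient) receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name roc_auc and the choice of "dead" as the positive class are assumptions of this example; the quadratic pairwise loop is chosen for readability rather than efficiency.</p>
      <preformat><![CDATA[
```python
def roc_auc(y_true, scores, positive="dead"):
    """Area under the ROC curve computed by pairwise comparison.

    y_true: true labels; scores: predicted probability of the positive class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: one deceased patient is scored below one survivor.
y_true = ["dead", "alive", "dead", "alive"]
p_dead = [0.9, 0.8, 0.4, 0.3]
print(roc_auc(y_true, p_dead))  # 3 of the 4 pairs are ordered correctly
```
]]></preformat>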
      <p>
        In a preliminary study we examined various machine learning
approaches and we compared their average performances over the HCP
datasets. Figure 3 shows a summary of the relative performance in
terms of F2 score. We considered Decision Trees [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], ExtraTrees
(ET) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Gaussian Naive Bayes [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], Multilayer Perceptron with two
layers (MLP) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Quadratic Discriminant Analysis [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], Random
Forests (RF) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Support Vector Machines [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. The best
performance was obtained with RF and ET. The MLP and SVM performed much
worse and with a much higher variability over the datasets, probably
because of the missing values and the scarcity of data. For the MCP
datasets the relative performance was similar. Given the observed
better performance of RF and ET, we focused the evaluation of our
system on these learning algorithms.
      </p>
      <p>Regarding the training time, including the hyperparameter search
over 4096 random configurations and the optimization of the
uncertainty threshold, for any specific dataset (e.g., the MCP numerical
dataset for 2 days), the overall training time is between 20 and 30
minutes. Therefore, we can build all four of the most promising models
generated by RF and ET using the numerical version (RF-N, ET-N)
or the categorical version (RF-C, ET-C) of the dataset in less than
two hours, and then select the best performing model among them.</p>
      <p>It is also worth noting that in our system the models for predicting
the prognostic risk at different days are completely independent from
each other, and so we can consider prediction tasks at different days
as different tasks.</p>
      <p>In Figure 4 and in Table 2 we show the performance of our
system at each considered day for both the High Contagion Phase and
the Moderate Contagion Phase. As we can see, we obtain promising
results in terms of F2 score for an early evaluation of the risk
during the HCP (with score 77.1% at day 2), while we encounter some
problems at the 6th and 10th days. For the MCP datasets, the system
performs better at the later days; in particular, for the 10th day F2
is 80.4% and ROC-AUC is 90.2%. For HCP, both RF and ET
obtain good results in both the numerical and categorical versions of
the datasets. For MCP, instead, using the categorical datasets does not
give good performance, and we do not observe an improvement for
the later prediction days (the F2 score is always below 70%).</p>
      <p>In all but one case, the models using the uncertainty threshold
improve the performance in terms of both F2 and ROC-AUC scores.
In particular, in the most problematic cases of HCP, such as the
6-days and 10-days datasets, the prediction performance improves in
terms of F2 by more than 7 points. The improvement is less significant
for MCP.</p>
      <p>Note that, while the threshold value under which the system labels
an instance (patient risk) as uncertain is derived at training time by
imposing a maximum percentage of uncertain samples (we used 25%),
there is no formal guarantee that this percentage limit is satisfied for
the test set. However, in most cases the percentage of uncertain test
samples (indicated with % Unc in Table 2) is well below the limit
imposed during training, except for the test set of the 6th day in HCP,
where the unpredicted (labelled as uncertain) patients are 26.1%. The
performance for the “end” dataset is good for both HCP and MCP
even without omitting the uncertain patients (F2 score 86.6% for
HCP, and F2 score 86.9% for MCP).</p>
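      <p>The following sketch shows how a threshold learned at training time could be applied to test-set probabilities and how the resulting percentage of uncertain patients could be measured. The function names and the toy probabilities are hypothetical; a prediction is treated as uncertain when its supporting probability does not exceed the threshold, mirroring the strict comparison at line 8 of the pseudocode, and ties between the two classes are resolved as "alive" by assumption.</p>
      <preformat><![CDATA[
```python
def predict_with_uncertainty(probs, threshold):
    """Label each (p_alive, p_dead) pair as alive, dead or uncertain."""
    predictions = []
    for p_alive, p_dead in probs:
        if max(p_alive, p_dead) <= threshold:
            predictions.append("uncertain")
        elif p_alive >= p_dead:
            predictions.append("alive")
        else:
            predictions.append("dead")
    return predictions


def uncertain_percentage(predictions):
    """Percentage of samples labelled as uncertain."""
    return 100.0 * predictions.count("uncertain") / len(predictions)


# Toy test set with a threshold of 0.6 (illustrative values only).
test_probs = [(0.9, 0.1), (0.55, 0.45), (0.3, 0.7), (0.5, 0.5)]
predictions = predict_with_uncertainty(test_probs, 0.6)
print(predictions, uncertain_percentage(predictions))
```
]]></preformat>
      <p>Because the threshold is fixed before the test data are seen, nothing in this procedure bounds the measured percentage by max_u, which is consistent with the 26.1% observed for the 6th day in HCP.</p>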
      <p>Figure 4 graphically compares the performance of
our system for HCP and MCP in terms of F2 and ROC-AUC. The
behaviour of the performance over time differs significantly in the two
contagion periods, reflecting the concept drift we discussed in Section
3.2. For HCP, considering the results without omitting the uncertain
test instances (blue curves), the prediction performance is very good
at the 2nd day and it decreases at the 6th and 10th days. For
MCP, instead, the performance improves over time, reaching 90.2% in terms
of ROC-AUC at the 10th day, as also reported in Table 2. This is due
to several factors:</p>
      <p>
        MCP includes patients that have hospitalisation periods much
longer than the patients in HCP, which can make it more difficult to
predict the mortality risk for some patients with only a few days
of hospitalisation;
on the contrary, in HCP half of the patients stayed in hospital for
less than 8 days. This significantly decreases the size of the
8-days and 10-days training sets, which contain only
431 and 339 patients, respectively. The lack of training data in these datasets
is only partially compensated by the increase in the number of lab tests for a
single patient in the datasets;
as described in Section 3.2, the MCP patients are much more
unbalanced (with only 11% deceased patients) than the HCP
patients, and this increases the difficulty of learning a
high-performing model [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>Figure 5 shows the confusion matrices for the test sets
generated using our predictive models. Above the line we have the HCP
datasets and below it the MCP datasets. Although the training phase was
optimised (through the use of the F2 metric) to avoid false negatives,
for the HCP datasets there are several false negatives (bottom-left of
the matrices). This can be explained by the scarcity of lab test and
X-ray data in the HCP data, which affects prediction.</p>
      <p>However, false negatives are significantly reduced with the
models that can classify a patient as uncertain. For example, at day 6,
the system classifies as uncertain 4 patients who otherwise would be
false negatives. Moreover, when there are fewer false negatives, such
as at days 8 and 10, classifying some patients as uncertain also helps to
avoid false positives and so to generate fewer false alarms.</p>
      <p>Remarkably, especially for the MCP datasets, we have very few
false negatives even at the early days, which is quite important in our
application context. On the other hand, especially for days 2 and 4,
our system produces many false positives. This type of error is
reduced in the models with uncertain patients, down to only 5 false alarms
for the end dataset (e.g., at day 2 we avoid 16 false positives).</p>
    </sec>
    <sec id="sec-14">
      <title>Conclusions and Future Work</title>
      <p>We have presented a system for predicting the prognosis of
Covid-19 patients focusing on the death risk. We built and engineered some
datasets from lab test and X-ray data of more than 2000 patients in
a hospital in northern Italy that was severely hit by Covid-19. Our
predictive system uses a collection of machine learning algorithms
and a new method for setting, at training time, an uncertainty threshold
for prediction that helps to significantly reduce the prediction errors.</p>
      <p>Overall, the experimental results are quite promising, and show
that our system often obtains high ROC-AUC scores. The observed
predictive performance is especially good in terms of false
negatives (patients erroneously predicted to survive), which are very few. This
gives a predictive test for patient survival with very good specificity,
in particular when the system can classify a patient as uncertain.</p>
      <p>On the other hand, in terms of false positives, there is room for
significant improvements. We are confident that the availability of more
information, such as patient comorbidities or clinical treatments, will
help to improve performance, reducing the number of both false
positives and (few) false negatives.</p>
      <p>For future work we plan to extend our datasets with more
information (both additional features and patients), to consider further
methods for dealing with the observed concept drift, and to address other
prediction tasks such as the duration of the hospitalisation or the need
for ICU beds and critical hospital resources. Moreover, we are
analyzing the importance of the features used in our models, and we intend
to investigate additional learning techniques.</p>
      <p>Acknowledgements. The work of the first author has been
supported by Fondazione Garda Valley.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Aya</given-names>
            <surname>Awad</surname>
          </string-name>
          , Mohamed
          <string-name>
            <surname>Bader-El-Den</surname>
          </string-name>
          , and
          <string-name>
            <surname>James</surname>
            <given-names>McNicholas</given-names>
          </string-name>
          , '
          <article-title>Patient length of stay and mortality prediction: A survey'</article-title>
          ,
          <source>Health Services Management Research</source>
          ,
          <volume>30</volume>
          (
          <issue>2</issue>
          ),
          <fpage>105</fpage>
          -
          <lpage>120</lpage>
          , (
          <year>2017</year>
          ). PMID: 28539083
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>James</given-names>
            <surname>Bergstra</surname>
          </string-name>
          and Yoshua Bengio, '
          <article-title>Random search for hyperparameter optimization'</article-title>
          ,
          <source>Journal of machine learning research</source>
          ,
          <volume>13</volume>
          (Feb),
          <fpage>281</fpage>
          -
          <lpage>305</lpage>
          , (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Borghesi</surname>
          </string-name>
          and Roberto Maroldi, '
          <article-title>Covid-19 outbreak in italy: experimental chest x-ray scoring system for quantifying and monitoring disease progression'</article-title>
          ,
          <source>La radiologia medica</source>
          , (05
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Leo</given-names>
            <surname>Breiman</surname>
          </string-name>
          , 'Random forests',
          <source>Machine learning</source>
          ,
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          , (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>NV</surname>
            <given-names>CHAWLA</given-names>
          </string-name>
          , '
          <article-title>Evaluating probability estimates from decision trees'</article-title>
          ,
          <source>in Proc. AAAI Workshop on Evaluation Methods for Machine Learning</source>
          , Boston, MA,
          <year>2006</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>23</lpage>
          , (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>João</given-names>
            <surname>Gama</surname>
          </string-name>
          , Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia, '
          <article-title>A survey on concept drift adaptation'</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          ,
          <volume>46</volume>
          (
          <issue>4</issue>
          ),
          (March
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Alfonso</given-names>
            <surname>Emilio</surname>
          </string-name>
          <string-name>
            <given-names>Gerevini</given-names>
            , Alberto Lavelli, Alessandro Maffi, Roberto Maroldi,
            <surname>Anne-Lyse</surname>
          </string-name>
          <string-name>
            <surname>Minard</surname>
          </string-name>
          , Ivan Serina, and Guido Squassina, '
          <article-title>Automatic classification of radiological reports for clinical care'</article-title>
          ,
          <source>in Proceedings of the 16th Conference on Artificial Intelligence in Medicine, AIME 2017</source>
          , Vienna, Austria, June 21-24,
          <year>2017</year>
          , volume
          <volume>10259</volume>
          of Lecture Notes in Computer Science, pp.
          <fpage>149</fpage>
          -
          <lpage>159</lpage>
          . Springer, (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Geurts</surname>
          </string-name>
          , Damien Ernst, and Louis Wehenkel, '
          <article-title>Extremely randomized trees'</article-title>
          ,
          <source>Machine learning</source>
          ,
          <volume>63</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          -
          <lpage>42</lpage>
          , (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Gary</surname>
            <given-names>L Grunkemeier</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Ruyun</given-names>
            <surname>Jin</surname>
          </string-name>
          .
          <article-title>Receiver operating characteristic curve analysis of clinical risk models</article-title>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Karimollah</given-names>
            <surname>Hajian-Tilaki</surname>
          </string-name>
          , '
          <article-title>Receiver operating characteristic (roc) curve analysis for medical diagnostic test evaluation'</article-title>
          ,
          <source>Caspian journal of internal medicine</source>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ),
          <fpage>627</fpage>
          , (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Hrayr</surname>
            <given-names>Harutyunyan</given-names>
          </string-name>
          , Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan, '
          <article-title>Multitask learning and benchmarking with clinical time series data'</article-title>
          ,
          <source>Scientific data</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Sharique</given-names>
            <surname>Hasan</surname>
          </string-name>
          and Rema Padman, '
          <article-title>Analyzing the effect of data quality on the accuracy of clinical decision support systems: a computer simulation approach'</article-title>
          ,
          <source>in AMIA annual symposium proceedings</source>
          , volume
          <year>2006</year>
          , p.
          <fpage>324</fpage>
          . American Medical Informatics Association, (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Simon</surname>
            <given-names>Haykin</given-names>
          </string-name>
          ,
          <article-title>Neural networks: a comprehensive foundation</article-title>
          ,
          <source>Prentice Hall PTR</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Stephanie</surname>
            <given-names>Hyland</given-names>
          </string-name>
          , Martin Faltys, Matthias Hüser, Xinrui Lyu, Thomas Gumbsch, Cristóbal Esteban, Christian Bock, Max Horn, Michael Moor, Bastian Rieck, Marc Zimmermann, Dean Bodenham, Karsten Borgwardt, Gunnar Rätsch, and Tobias Merz, '
          <article-title>Early prediction of circulatory failure in the intensive care unit using machine learning'</article-title>
          ,
          <source>Nature Medicine</source>
          ,
          <volume>26</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          , (03
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Xiangao</surname>
            <given-names>Jiang</given-names>
          </string-name>
          , Megan Coffee, Anasse Bari, Junzhang Wang, Xinyue Jiang, Jianping Huang, Jichan Shi, Jianyi Dai, Jing Cai,
          <string-name>
            <given-names>Tianxiao</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al., '
          <article-title>Towards an artificial intelligence framework for data-driven prediction of coronavirus clinical severity'</article-title>
          , CMC: Computers, Materials &amp; Continua,
          <volume>63</volume>
          ,
          <fpage>537</fpage>
          -
          <lpage>51</lpage>
          , (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Alistair</surname>
            <given-names>EW Johnson</given-names>
          </string-name>
          , Mohammad M Ghassemi,
          <string-name>
            <given-names>Shamim</given-names>
            <surname>Nemati</surname>
          </string-name>
          , Katherine E Niehaus,
          David A Clifton,
          and
          <string-name>
            <surname>Gari D Clifford</surname>
          </string-name>
          , '
          <article-title>Machine learning and decision support in critical care'</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          ,
          <volume>104</volume>
          (
          <issue>2</issue>
          ),
          <fpage>444</fpage>
          -
          <lpage>466</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Bartosz</surname>
            <given-names>Krawczyk</given-names>
          </string-name>
          , '
          <article-title>Learning from imbalanced data: open challenges and future directions'</article-title>
          ,
          <source>Progress in Artificial Intelligence</source>
          ,
          <volume>5</volume>
          (
          <issue>4</issue>
          ),
          <fpage>221</fpage>
          -
          <lpage>232</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and E. Duchesnay, '
          <article-title>Scikit-learn: Machine learning in Python'</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          , (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Tom J Pollard and Leo Anthony</surname>
          </string-name>
          Celi, '
          <article-title>Enabling machine learning in critical care'</article-title>
          ,
          <source>ICU management &amp; practice</source>
          ,
          <volume>17</volume>
          (
          <issue>3</issue>
          ),
          <fpage>198</fpage>
          , (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Luca</surname>
            <given-names>Putelli</given-names>
          </string-name>
          , Alfonso Gerevini, Alberto Lavelli, Matteo Olivato, and Ivan Serina, '
          <article-title>Deep learning for classification of radiology reports with a hierarchical schema'</article-title>
          ,
          <source>in Proceedings of 24th International Conference on Knowledge-Based and Intelligent Information &amp; Engineering Systems</source>
          , (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Luca</surname>
            <given-names>Putelli</given-names>
          </string-name>
          , Alfonso Gerevini, Alberto Lavelli, and Ivan Serina, '
          <article-title>The impact of self-interaction attention on the extraction of drug-drug interactions'</article-title>
          ,
          <source>in Proceedings of the Sixth Italian Conference on Computational Linguistics</source>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Luca</surname>
            <given-names>Putelli</given-names>
          </string-name>
          , Alfonso Emilio Gerevini, Alberto Lavelli, and Ivan Serina, '
          <article-title>Applying self-interaction attention for extracting drug-drug interactions'</article-title>
          ,
          <source>in XVIIIth International Conference of the Italian Association for Artificial Intelligence</source>
          , Rende, Italy,
          <source>November 19-22</source>
          ,
          <year>2019</year>
          , Proceedings, (
          <volume>11</volume>
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Joaquin</given-names>
            <surname>Quiñonero-Candela</surname>
          </string-name>
          , Masashi Sugiyama, Anton Schwaighofer, and
          <string-name>
            <surname>Neil D Lawrence</surname>
          </string-name>
          ,
          <article-title>Dataset shift in machine learning</article-title>
          , The MIT Press,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Anna</surname>
            <given-names>S Rakitianskaia</given-names>
          </string-name>
          and Andries Petrus Engelbrecht, '
          <article-title>Training feedforward neural networks with dynamic particle swarm optimisation'</article-title>
          ,
          <source>Swarm Intelligence</source>
          ,
          <volume>6</volume>
          (
          <issue>3</issue>
          ),
          <fpage>233</fpage>
          -
          <lpage>270</lpage>
          , (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Lior</given-names>
            <surname>Rokach</surname>
          </string-name>
          and Oded Maimon,
          <article-title>Data Mining with Decision Trees: Theory and Applications</article-title>
          , World Scientific Publishing Co., Inc., River Edge, NJ, USA,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Santosh</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Maya R</given-names>
            <surname>Gupta</surname>
          </string-name>
          , and Béla A Frigyik, '
          <article-title>Bayesian quadratic discriminant analysis'</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>8</volume>
          (Jun),
          <fpage>1277</fpage>
          -
          <lpage>1305</lpage>
          , (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Johan AK</given-names>
            <surname>Suykens</surname>
          </string-name>
          and Joos Vandewalle, '
          <article-title>Least squares support vector machine classifiers'</article-title>
          ,
          <source>Neural Processing Letters</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ),
          <fpage>293</fpage>
          -
          <lpage>300</lpage>
          , (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Mihaela</given-names>
            <surname>van der Schaar</surname>
          </string-name>
          and Ahmed Alaa, '
          <article-title>How artificial intelligence and machine learning can help healthcare systems respond to covid-19'</article-title>
          , https://www.vanderschaar-lab.com/covid-19/, (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Li</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hai-Tao</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Yang Xiao,
          <string-name>
            <given-names>Maolin</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al., '
          <article-title>Prediction of criticality in patients with severe covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in Wuhan'</article-title>
          ,
          <source>medRxiv preprint</source>
          , (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Jinsung</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ahmed</given-names>
            <surname>Alaa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Scott</given-names>
            <surname>Hu</surname>
          </string-name>
          , and Mihaela van der Schaar, '
          <article-title>ForecastICU: a prognostic decision support system for timely prediction of intensive care unit admission'</article-title>
          ,
          <source>in International Conference on Machine Learning</source>
          , pp.
          <fpage>1680</fpage>
          -
          <lpage>1689</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Harry</given-names>
            <surname>Zhang</surname>
          </string-name>
          , '
          <article-title>The optimality of naive bayes'</article-title>
          ,
          <source>in Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference</source>
          , Miami Beach, Florida, USA, eds., Valerie Barr and Zdravko Markov, pp.
          <fpage>562</fpage>
          -
          <lpage>567</lpage>
          . AAAI Press, (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Zhi-Hua</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <source>Ensemble Methods: Foundations and Algorithms</source>
          , Chapman &amp; Hall/CRC, 1st edn.,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>