Prognosis Prediction in Covid-19 Patients from Lab Tests and X-ray Data through Randomized Decision Trees

Alfonso E. Gerevini*, Roberto Maroldi*†, Matteo Olivato*, Luca Putelli*, Ivan Serina*
* Università degli Studi di Brescia, † ASST Spedali Civili di Brescia
{alfonso.gerevini, roberto.maroldi, ivan.serina, m.olivato, l.putelli002}@unibs.it

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. AI and machine learning can offer powerful tools to help in the fight against Covid-19. In this paper we present a study and a concrete tool, based on machine learning, to predict the prognosis of hospitalised patients with Covid-19. In particular, we address the task of predicting the risk of death of a patient at different times of the hospitalisation, on the basis of demographic information, chest X-ray scores and several laboratory findings. Our machine learning models use ensembles of decision trees trained and tested using data from more than 2000 patients. An experimental evaluation of the models shows good performance in solving the addressed task.

1 Introduction

The fight against Covid-19 is a new and important challenge for the world, one that AI and machine learning can help to face at various levels [15, 28, 29]. In March 2020, at the time of the coronavirus emergency in Italy, we started working in close collaboration with one of the hospitals with the most Covid-19 patients in Italy, Spedali Civili di Brescia, to help predict the prognosis of hospitalised patients. Our work focused on the task of predicting the risk of death of a patient at different times of the hospitalisation. As discussed in [28], predicting whether a patient is at risk of death or adverse events can help the hospital, for instance, to organize the allocation of limited health resources in a more efficient way.
Our predictive models are built on the basis of demographic information (sex and age), the values of ten laboratory tests, and the chest X-ray score, an innovative measure developed and used at Spedali Civili di Brescia to assess the severity of the pulmonary conditions [3]. Other important information, such as the patient comorbidities or the time and duration of the symptoms related to Covid-19, was not used because it was not available to us.

Using raw data from more than 2000 patients, we built several datasets describing the "clinical history" of each patient during the hospitalisation. In particular, each dataset contains a "snapshot" of the infection conditions of every considered patient at a certain day after the start of the hospitalisation. For each dataset, we built a different predictor, allowing progressive predictions over time that take into account the evolution of the disease severity in a patient, which helps the formulation of a personalized prognosis. A change of the predicted risk over time for a patient could also hint at a link between specific events or treatments and the increase or decrease of the risk for the patient. As snapshot times for a patient, in our experiments we considered the 2nd, 4th, 6th, 8th and 10th hospitalisation day, and the day before the end of the hospitalisation.

Our datasets were engineered to cope with a number of practical issues, including missing values and feature value categorization, and to add some helpful artificial features. We also addressed the "concept drift" issue [6, 23], since we observed that the risk of death was clearly sensitive to the time period when the patient was hospitalised; the risk was significantly higher during the earlier period of the emergency (March 2020), when in northern Italy the spread of the virus infection was very high and many people were hospitalised. Moreover, given the very sensitive nature of our task, we introduced a threshold to discard the model predictions that have a low estimated probability. Such a threshold is a parameter that is automatically calculated and optimised during the training phase.

We considered several machine learning algorithms. A first experimental comparison of their performance on our datasets showed that methods based on forests of trees have the most promising performance, and so we decided to focus on this approach. The obtained prediction models perform well over a randomly chosen test set of 200 patients for each considered period, in terms of both F2 and ROC-AUC scores. In particular, overall the system makes very few errors in predicting patient survival, i.e., the specificity of the prediction is very high.

In the following, after discussing related work, we describe our datasets, present our prediction models and their experimental evaluation, and finally give conclusions and mention future work.

2 Related work

Artificial Intelligence and machine learning techniques can be used for tackling different aspects of the Covid-19 pandemic. However, given that the pandemic started only a few months ago, most works are still preliminary, often only pre-printed and not properly peer-reviewed, and lack a clear description of the developed techniques and of their results.

A preliminary study is presented in [15]. Given a set of only 53 patients with mild symptoms and their lab tests, comorbidities and treatment, the authors train several machine learning models (Logistic Regression, Decision Trees, Random Forests, Support Vector Machines, KNN) to predict whether a patient will be subject to more severe symptoms, obtaining a prediction accuracy score of up to 0.8 using 10-fold cross validation. The generalizability and strength of these results are questionable, given the very small set of considered patients.

Another example is the pre-printed work by Li Yan et al. [29], which uses lab tests for predicting the mortality risk; the proposed model is a very simple decision tree based on the three most important features. While the performance seems promising, the test set used for evaluation was very small (29 patients).

Various AI and machine learning techniques have been developed for prognosis and disease progression prediction [7] in the context of diseases other than Covid-19 [20, 21, 22]. In particular, in the last few years, several works about predicting mortality risk or adverse events and on the use of AI in critical care [19] have been published.
The survey in [1] presents a review of statistical and ML systems for predicting the mortality risk, the need for beds in intensive care units [30], or the length of the patient hospitalisation. In particular, it is worth mentioning the work by Harutyunyan et al. [11], which uses LSTM neural networks for predicting both the mortality risk and the length of the hospitalisation.

An overview of the issues and challenges of applying ML in a critical-care context is available in [16]. This work stresses the need to deal with corrupted data, such as missing values, imprecision, and errors, which can increase the complexity of prediction tasks.

Lab test findings and their variation over time are the main focus of the work by Hyland et al. [14], which describes a system that processes these data to generate an alarm predicting, 2 hours in advance, that a patient will have a circulatory failure.

3 Available Data Sources

During the Covid-19 outbreak, from February to April 2020, more than two thousand patients were hospitalised in the hospital Spedali Civili di Brescia. During their hospitalisation, the medical staff performed several exams in order to monitor the patients' conditions, check the response to treatments, verify the need to transfer a patient to the ICU, etc. We had data from a total of 2015 hospitalised patients; for each of these patients, the specific data that were made available to us are:

• the age and sex;
• the values and dates of several lab tests (see Table 1);
• the scores (each one from 0 to 18), assigned by the physicians, assessing the severity of the pulmonary conditions resulting from the X-ray exams [3];
• the values and dates of the throat-swab exams for Covid-19;
• the final outcome of the hospitalisation at the end of the stay, which is the classification value of our application (either in-hospital death, released survivor, or transferred to another hospital or rehabilitation center).

Table 1 specifies the considered lab tests, their normal range of values, and their median values in our set of patients.

Table 1: Lab tests performed during the hospitalisation. The second column shows the range which is considered clinically normal for a specific exam. The third column shows the median value extracted considering the lab test findings for our set of 2015 patients.

Lab test | Normal Range | Median Value
C-Reactive Protein (PCR) | ≤ 10 | 34.3
Lactate dehydrogenase (LDH) | [80, 300] | 280
Ferritin (Male) | [30, 400] | 1030
Ferritin (Female) | [13, 150] | 497
Troponin-T | ≤ 14 | 19
White blood cell (WBC) | [4, 11] | 7.1
D-dimer | ≤ 250 | 553
Fibrinogen | [180, 430] | 442
Lymphocyte (patients over 18 years old) | [20, 45] | 1.0
Neutrophils/Lymphocytes | [0.8, 3.5] | 4.9
Chest X-ray Score (RX) | < 7 | 8

We had no further information about symptoms, their timing, comorbidities, general health conditions or clinical treatments. Moreover, we have no CT images or text reports associated with the X-ray exams. The available information about whether a patient was or had been in the ICU was not clear enough to be used. Finally, of course, the names of the patients and of the involved medical staff were not provided.

3.1 Data Quality Issues

When applying machine learning to raw real-world data, there are some non-trivial practical issues to deal with, such as the quality of the available data and related aspects, which are especially important in biomedical applications given the very sensitive domain [12].

In our case, one such issue is that the length of the hospitalisation period can differ considerably from one patient to another (from a few days to two months), due to different reasons including the novelty and the characteristics of the disease, its high contagiousness, and the absence of an effective treatment. Therefore, the number of performed lab tests and relative findings varies significantly among the considered set of patients (from only three to hundreds).

Moreover, the lab tests and X-ray exams are not performed at a regular frequency due, e.g., to the different kinds and timing of the relative procedures, the need for different resources (X-ray machines, lab equipment, technical staff, etc.), or the different severity of the health conditions of the patients. For example, in our data we see that a patient can be tested for PCR every day and not be subject to a Ferritin exam for two weeks. This leads to the need of handling missing values and outdated values. When we consider a snapshot of a patient at a certain day, we have a missing value for a lab test (or X-ray) feature if that test (X-ray) has not been performed. We have an outdated value for a feature if the corresponding lab test (X-ray) was performed several days earlier: since in the meanwhile the disease has progressed, the findings of the lab test could be inconsistent with the current conditions of the patient, and so they could mislead the prediction.

Data quality issues arise especially for patients hospitalised in the period of the highest emergency, when several hundreds of patients were in the hospital at the same time.

3.2 Concept Drift

An examination of the data available for our cohort of patients revealed that their prognostic risk is influenced by multiple factors, such as the number of patients currently hospitalised and the consequent availability of ICU beds or other resources, the experimentation of new therapies, and the increase of clinical knowledge. In machine learning, this change of data distribution is known as concept drift [6, 23]. A classical method to deal with this problem is training the algorithm using only a subset of samples, depending on the data distribution that we are considering [6, 24].

For this reason, we divided the considered set of patients into two groups: the High Contagion Phase (HCP) group, composed of the patients admitted during the last weeks of February and the first weeks of March (the most critical period of the pandemic outbreak in Italy), and the Moderate Contagion Phase (MCP) group, composed of the patients admitted from the last ten days of March to the end of April.
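As a minimal illustration, such a grouping can be expressed as a simple date cutoff. The sketch below is ours, not the authors' code: the DataFrame layout (an admission_date column) and the exact cutoff date (March 21, approximating "the last ten days of March") are assumptions.

```python
import pandas as pd

def split_by_contagion_phase(patients: pd.DataFrame,
                             cutoff: str = "2020-03-21"):
    """Split a patient table into HCP (admitted before `cutoff`)
    and MCP (admitted on or after `cutoff`) groups.

    The column name `admission_date` and the cutoff are hypothetical.
    """
    admitted = pd.to_datetime(patients["admission_date"])
    hcp = patients[admitted < pd.Timestamp(cutoff)]
    mcp = patients[admitted >= pd.Timestamp(cutoff)]
    return hcp, mcp
```

Training one model per group is a simple instance of the subset-selection strategy for concept drift mentioned above: each model sees only samples drawn from one data distribution.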
The main differences between these groups of patients are:

1. the mortality rate of the HCP patients is about twice the mortality rate of the MCP patients;
2. in HCP patients the median value of the hospitalisation period is 8 days, while in MCP patients it is 14 days; further details are given in Figure 1;
3. for many of the considered lab tests, the mortality rate associated with having values in a particular range changes significantly between the two groups. For example, in HCP patients the mortality rate for the patients with a PCR value 10 times above the normal range is 40.1%, while in MCP patients it is 21.1%.

Figure 1: Length of stay in hospital (left) and weekly death rate histograms for the High Contagion Phase (in blue) and for the Moderate Contagion Phase (in orange). On the x-axis, for the length of stay we indicate the range of days, and for the death rate we indicate the week when the patient was released. On the y-axis we indicate the percentage of patients.

These differences clearly indicate that the data in the HCP and MCP groups represent different target (concept) functions; therefore predicting mortality during the high infection phase and during the moderate phase can be considered two different tasks. If we had only the patients hospitalised during the high infection phase, using these data for training an algorithm that predicts the mortality during the moderate phase would lead to many errors.

In our case, we generated two different systems, one for each of the two groups of patients. We are currently investigating ways to automatically select the set of patients for training, starting from the latest ones and keeping the less recent ones until we find significant changes in the mortality rate or in the data distribution.

4 Datasets for Training and Testing

The main task of our work is to provide survival/death predictions at different days of the patient hospitalisation, according to the current patient conditions as reflected by the available lab findings and X-ray scores. In this section we describe the specific extracted features and the (training and testing) datasets that we built for this purpose.

4.1 Pre-processing and Feature Extraction

The issues presented in Section 3.1 compel us to perform a robust pre-processing phase, with the goal of extracting features that summarize the patients' conditions so that a machine learning algorithm can process them. The pre-processing is applied to both the HCP and MCP data.

Given that we have no information about the survival or death of a patient after a transfer (which can be due to limited availability of beds or ICU places), we exclude from our training and test sets the 142 patients who were admitted to Spedali Civili di Brescia and then transferred to another hospital. However, the 74 patients who were transferred to a rehabilitation center can be considered not at risk of death; therefore we include them in our datasets and consider these transferred patients as released alive.

4.1.1 Patient Snapshot and Feature Engineering

In order to provide a prediction for a patient at different hospitalisation times, we introduced the concept of patient snapshot to represent the patient's health conditions at a given day.

In this snapshot, for each lab test of Table 1, we consider its most recent value. In the ideal case, we would know the lab test findings at every day. However, as explained in Section 3.1, in a real-world context the situation is very different. For example, in our data, if we take a snapshot of a patient 14 days after the admission into the hospital, we have cases with very recent values of PCR, LDH or WBC (obtained one or a few days before), very old values for Fibrinogen or Troponin-T (obtained on the first day of the hospitalisation), and even no value for Ferritin.

Given the difficulty of setting a predefined threshold that separates recent and old values of the lab tests (e.g., for Fibrinogen and Troponin-T), we choose to always use the most recent value, even if it could be outdated. In order to allow the learning algorithm to capture that a value may not be significant to represent the current status of the patient (because too old), we introduce a feature called ageing for each test finding. If a lab test was performed at day d0, and the snapshot of a patient is taken at day d1, the ageing is defined as the number of days between d1 and d0. If there is no available value for a lab test, its ageing is considered a missing value.
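A sketch of how the most recent finding and its ageing could be derived for one lab test follows. This is our illustration, not the paper's code; the representation of findings as a date-sorted list of (day, value) pairs is an assumption.

```python
from datetime import date

def latest_value_and_ageing(findings, snapshot_day: date):
    """Return (value, ageing) for one lab test at a given snapshot day.

    `findings` is a hypothetical list of (day, value) pairs sorted by day.
    Returns (None, None) when the test was never performed before the
    snapshot: both the value and its ageing are missing.
    """
    past = [(d, v) for d, v in findings if d <= snapshot_day]
    if not past:
        return None, None              # missing value, missing ageing
    d0, value = past[-1]               # most recent finding, even if outdated
    ageing = (snapshot_day - d0).days  # days between snapshot and the test
    return value, ageing
```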
For example, we divide the D-Dimer vales into 4.1 Pre-processing and Feature Extraction 6 categories: the normal range, up to 2 times the maximum value of the normal range, up to 4 times, 6 times, 10 times and over 10 The issues presented in Section 3.1 compel us to a robust pre- times. The categorical form could help the algorithm to have a clearer processing phase with the goal of extracting features in order to sum- understanding of the data and improve performance. marize the patients conditions and process them by a machine learn- Monitoring the conditions of a patient means knowing not only the ing algorithm. The pre-processing is applied to both HCP and MCP patient status at a specific time, but also how the conditions evolve data. during the hospitalisation. For this purpose, we introduce a feature Given that we have no information about the survival or the de- called trend that is defined as follows: cease of a patient after a transfer (which can be due to limited avail- ability of beds or ICU places), we exclude from our training and test For each lab test, if there is no available value for a lab test or if set the 142 patients which were admitted in Spedali Civili di Bres- the patient has not performed the lab test at least two times, the cia and then transferred to another hospital. However, the 74 patients trend is a missing value. Otherwise, given the values v1 and v2 who were transferred to a rehabilitation center can be considered not of the findings for the lab test performed at days d1 and d2 and a at risk of death; therefore we include them in our datasets and con- threshold T that we set to 15% of v1 , if v2 > (1 + T ) ∗ v1 , then sider the transferred patients as released alive. the trend is increasing, while if v2 < (1 − T ) ∗ v1 the trend is decreasing; otherwise the trend is stable. 5.1 Classification Algorithms We distinguish two types of trends: the start trend, that uses the Decision trees distance between the most recent value and the first available value, Decision Trees [25] are one of the most popular learning methods and the last trend, that uses the distance between the last one and for solving classification tasks. In a decision tree, the root and each the penultimate one. We are currently investigating techniques for internal node provides a condition for splitting the training samples including more than two values in the trend calculation. into two subsets depending on whether the condition holds for a sam- To summarize, for each lab test in a patient snapshot, we have the ple or not. In our context, for each numerical feature f , a candidate most recent finding and the relative ageing and trend, as well as the splitting condition is f ≤ C, where C is called cut point. The final static features age and sex. splitting condition is chosen by finding the f and C providing the best split according to one of some possible measures like Informa- 4.2 Training and Test Sets Generation tion Gain, Entropy index or Gini index. A subset of samples at a tree node can either be split again by In this section we describe how we generated the training and test further feature conditions forming a new internal node, or form a sets for the purpose of predicting, at different days from the start of leaf node labelled with a specific classification (prediction) value; in the patient hospitalization, the final outcome of her/his stay. 
4.2 Training and Test Sets Generation

In this section we describe how we generated the training and test sets for the purpose of predicting, at different days from the start of the patient's hospitalisation, the final outcome of her/his stay.

First, for both the HCP and MCP sets, we used stratified sampling to select 80% of the patients for training the models and 20% for testing them. Then, we created specific training and test sets for each element in a sequence of times when the model is used to make the prediction¹:

• 2 days of hospitalisation. We include all the patients' snapshots containing the first values of each lab test conducted in the first two days after the hospital admission. Note that if a patient has performed a lab test more than once in the first two days, the snapshot considers the oldest value. In fact, the purpose of the model we want to build is to provide the prediction as soon as possible, with the first information available. Furthermore, in these snapshots the ageing and trend values are not included.
• 4 days and 6 days of hospitalisation. In these cases, the corresponding snapshots also contain the ageing and trend features, and the lab values are the most recent ones in the available data. Given that only a few days have passed since admission, we consider the start trend.
• 8 days and 10 days of hospitalisation. The procedure for creating the corresponding snapshots is the same as for the 4-day and 6-day snapshots, except that we consider the last trend instead of the start trend.
• End day (the last day before the patient's release or decease). In this case, for each lab test the snapshot includes both the start trend and the last trend features.

¹ While we chose 2, 4, 6, 8, 10 days after the hospitalisation, plus the day before the patient's release, of course other sequences could be considered.

It is important to observe that, while the datasets of the later days contain more information about the single patients (more lab test findings, fewer missing values), the overall number of patients in the datasets decreases as the prediction day increases. This is due to the fact that more patients are released or die within longer periods of hospitalisation, and therefore such patients are not included in the corresponding datasets.

Finally, note that the splitting of the data between training and testing is done only once, considering all patients. Thus if, for instance, a patient belongs to the training set of 2 days, then it does not belong to the test set of the following days.

5 Machine Learning Algorithms

In this section we briefly describe the machine learning algorithms used in our prognosis prediction system.

5.1 Classification Algorithms

Decision trees. Decision Trees [25] are one of the most popular learning methods for solving classification tasks. In a decision tree, the root and each internal node provide a condition for splitting the training samples into two subsets, depending on whether the condition holds for a sample or not. In our context, for each numerical feature f, a candidate splitting condition is f ≤ C, where C is called the cut point. The final splitting condition is chosen by finding the f and C providing the best split according to a measure such as Information Gain, entropy, or the Gini index.

A subset of samples at a tree node can either be split again by further feature conditions, forming a new internal node, or form a leaf node labelled with a specific classification (prediction) value; in our application domain the label is either the alive class or the dead class. Let us consider a decision tree with a leaf node l and a subset S of associated training samples. A test instance X that reaches l from the tree root is classified (predicted) y with probability

P(y|X) = \frac{TP}{TP + FP}

where TP (True Positives) is the number of training samples in S that have class value y, and FP (False Positives) is the number of samples in S that do not have class value y [5]. Given that in our task we have only two classes (y and ȳ), P(ȳ|X) = 1 − P(y|X). The classification outcome of a decision tree for X is the class value with the highest probability.

Random Forests. Random Forests (RF) [4] is an ensemble learning method [32] that builds a number of decision trees at training time. For building each individual tree of the random forest, a randomly chosen subset of the data features is used. While in the standard implementation of random forests the final classification label is provided using the statistical mode of the class values predicted by each individual tree, in the well-known tool Scikit-Learn [18], which we used for our system implementation, the probability of the classification output is obtained by averaging the probabilities provided by all trees. Hence, given a random forest with n decision trees, a class (prediction) value y is assigned to an instance X with the following probability:

P(y|X) = \frac{1}{n} \sum_{i=1}^{n} P_i(y|X)
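In Scikit-Learn [18] this averaging is exactly what RandomForestClassifier.predict_proba computes: the forest probability equals the mean of the per-tree leaf probabilities. A small self-contained check on synthetic data (our sketch, not part of the paper's system):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, standing in for patient snapshots.
X, y = make_classification(n_samples=200, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# The forest probability P(y|X) is the average of the per-tree probabilities.
forest_proba = rf.predict_proba(X[:1])
mean_tree_proba = np.mean([t.predict_proba(X[:1]) for t in rf.estimators_],
                          axis=0)
assert np.allclose(forest_proba, mean_tree_proba)
```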
Extra Trees. Extremely Randomized Trees (Extra Trees or ET) [8] are another ensemble learning method based on decision trees. The main differences between Extra Trees and Random Forests are:

• In the original description of Extra Trees [8], each tree is built using the entire training dataset. However, in most implementations of Extra Trees, including Scikit-Learn [18], the decision trees are built exactly as in Random Forests.
• In standard decision trees and Random Forests, the cut point is chosen by first computing the optimal cut point for each feature, and then choosing the best feature for branching the tree. In Extra Trees, instead, the algorithm first randomly chooses k features and then, for each chosen feature f, randomly selects a cut point C_f in the range of the possible f values. This generates a set of k pairs {(f_i, C_i) | i = 1, ..., k}. Then, the algorithm compares the splits generated by each pair (i.e., under the split test f_i ≤ C_i) to select the best one using a split quality measure such as the Gini index.

The probability P(y|X) of assigning a class value y to an instance X is computed as in Random Forests (see the equation above).

5.2 Hyperparameter Search

Most machine learning algorithms have several hyperparameters to tune, such as, in a Random Forest, the number of decision trees to create and their maximum depth. Since in our application handling missing values is an important issue, we also used a hyperparameter for this with three possible settings: a missing value is set to either the average value, the median value, or a special constant (−1).

In order to find the best performing configuration of the hyperparameters, we used the Random Search optimization approach [2], which consists of the following main steps (a sketch follows the list):

1. We divide our training sets into k folds, with either k = 10 or k = 5, depending on the dimension of the considered dataset.
2. For each randomly selected combination of hyperparameters, we run the learning algorithm in k-fold cross validation.
3. For each fold, we evaluate the performance of the algorithm with that configuration using the Macro F-β score metric with β = 2. The F-β score is the weighted harmonic mean of the precision and recall measures, where the β parameter indicates how many times the recall is more important than the precision:

F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}

We choose β = 2 in order to give particular importance to false negatives, i.e., those patients whom our system could not identify as at risk of death. Given that we can compute the F2 score for both the alive class and the dead class, we considered the Macro F2 score, which is the arithmetic mean of the scores for the two classes.
4. The overall evaluation score of the k-fold cross validation for a configuration of the parameters is obtained by averaging the scores obtained for each fold.
5. The hyperparameter configuration with the best overall score is selected.
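With Scikit-Learn, steps 1-5 correspond closely to RandomizedSearchCV with a Macro F2 scorer. The sketch below is illustrative only: the sampled hyperparameter ranges and n_iter are ours, not the paper's, and the missing-value setting (average/median/−1) would have to be handled in a separate preprocessing step, since it is not a RandomForestClassifier parameter.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Macro F2: F-beta with beta=2, averaged over the alive and dead classes.
macro_f2 = make_scorer(fbeta_score, beta=2, average="macro")

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={
        "n_estimators": randint(50, 500),   # illustrative ranges
        "max_depth": randint(3, 20),
    },
    n_iter=100,      # the paper reports 4096 random configurations
    scoring=macro_f2,
    cv=5,            # k = 5 or 10 depending on dataset size
)
# search.fit(X_train, y_train); search.best_params_ is the selected config.
```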
5.3 Handling Prediction Uncertainty

The output for an instance X of every generated classification model is an array of two probabilities, P(alive|X) and P(dead|X), defined as described in Section 5.1. We can see them as "degrees of certainty" of the prediction: the higher the probability, the more reliable the prediction. Given the very sensitive nature of our task, the system discards potential predictions supported by a low probability. This is achieved using a prediction threshold under which the system considers the prediction uncertain (and the patient risk unpredictable). Note that if we used a threshold value that is too high, many patients could be classified uncertain, and our model would be much less useful for clinical practice. To avoid this, at training time we impose a maximum percentage of samples that can be considered uncertain (unpredictable); we implemented this with an input parameter called max_u. For our experimental analysis we used max_u = 25%.

We designed an algorithm called FindUncertainThreshold that is used in the training phase to decide the threshold and optimize the prediction performance on the training samples that pass it, under the max_u constraint. The pseudocode of the algorithm is in Figure 2.

Figure 2: Pseudocode of algorithm FindUncertainThreshold, which computes, during the training phase, an optimised prediction threshold under which the model labels an instance as uncertain.

Input:
– L: array of labels (alive or dead), with L[i] the label of sample i of the validation data (fold);
– P = [p_i = (p_alive, p_dead)_i | i is the sample index in the validation set];
– max_u: the maximum percentage of the samples in the validation set that can be labeled as uncertain (not predictable);
– n: the maximum number of thresholds to try;
– EvaluateScore: the score function to maximize by dropping the uncertain samples.
Output: a pair (v, th) where v is the score function value after dropping the uncertain samples and th is the optimized threshold value.

1   L_pred ← array of labels such that L_pred[i] is the predicted label (the label with highest probability) of validation sample i;
2   P_max ← [max(p_alive, p_dead)_i | (p_alive, p_dead)_i ∈ P];
3   v ← EvaluateScore(L, L_pred);
4   th ← min value in P_max;
5   δ ← [(max value in P_max) − (min value in P_max)] / n;
6   for i ← 0 to n − 1 do
7       th′ ← (min value in P_max) + i · δ;
8       S ← {i | i is a sample id such that P_max[i] > th′};
9       u ← 1 − (|S| / |P_max|);
10      if u ≥ max_u then return (v, th);
11      L′ ← array of labels such that L[i] is the label of validation sample i and i ∈ S;
12      L′_pred ← array of labels such that L_pred[i] is the predicted label of validation sample i and i ∈ S;
13      v′ ← EvaluateScore(L′, L′_pred);
14      if v′ > v then
15          th ← th′;
16          v ← v′;
17      end
18  end

Given the original labels L of the validation samples and their prediction probabilities P derived by the learning algorithm, FindUncertainThreshold first computes: the predicted labels L_pred (i.e., the class values with highest probabilities) and the relative P_max probabilities; the original score v obtained using the input score function on all samples; and an initial value of the threshold th, set to the minimum probability in P_max.

The next loop finds an optimal value of the threshold th and computes the score function for the validation set reduced to the validation samples whose predicted labels have probabilities above th. The considered threshold values are obtained using the δ-increments defined at lines 5 and 7. First we compute the new threshold th′ by increasing the current threshold by δ, and then we derive the set S of sample ids with prediction probabilities higher than th′. Next we compute the percentage u of samples labeled as uncertain using threshold th′. If u ≥ max_u, we terminate, returning the current best score v and the corresponding threshold value th (a greater threshold value could only label more samples as uncertain). Otherwise (u < max_u), we compute the correct sample labels L′ and the predicted sample labels L′_pred for the samples identified by S, and we compute the new score value v′ using L′ and L′_pred. If v′ is a better score than v, we update both the threshold and the score values.
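A direct Python transcription of the pseudocode follows (our sketch, not the authors' implementation; encoding the labels as column indices of P is an assumption):

```python
import numpy as np

def find_uncertain_threshold(L, P, max_u, n, evaluate_score):
    """Sketch of FindUncertainThreshold (Figure 2).

    L: true labels as column indices (0 = alive, 1 = dead, an assumed
    encoding); P: per-sample (p_alive, p_dead) pairs; max_u: maximum
    fraction of samples that may be dropped as uncertain; n: number of
    candidate thresholds; evaluate_score: e.g. the Macro F2 score.
    """
    P, L = np.asarray(P), np.asarray(L)
    L_pred = P.argmax(axis=1)          # line 1: label with highest probability
    p_max = P.max(axis=1)              # line 2
    v = evaluate_score(L, L_pred)      # line 3
    th = p_max.min()                   # line 4
    delta = (p_max.max() - p_max.min()) / n   # line 5
    for i in range(n):                 # lines 6-18
        th_new = p_max.min() + i * delta
        keep = p_max > th_new          # S: samples still considered certain
        u = 1.0 - keep.mean()          # fraction labelled uncertain
        if u >= max_u:
            return v, th
        v_new = evaluate_score(L[keep], L_pred[keep])
        if v_new > v:
            th, v = th_new, v_new
    return v, th
```

A Macro F2 score function can be passed in as, e.g., `lambda y, yp: fbeta_score(y, yp, beta=2, average="macro")`.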
FindUncertainThreshold is executed during the training phase. In particular, during the hyperparameter search, for each attempted hyperparameter configuration, we compute through FindUncertainThreshold an optimized threshold and the relative score function value. These two values are obtained by averaging the optimal thresholds and corresponding scores over all folds of the cross validation for the attempted configuration. The hyperparameter search returns the best configuration together with the relative (averaged) threshold.
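At prediction time, the averaged threshold can then be applied by discarding any prediction whose highest class probability does not exceed it. A minimal sketch of this step (our illustration, assuming a Scikit-Learn-style classifier):

```python
def predict_with_uncertainty(model, X, threshold):
    """Label instances, marking as 'uncertain' those whose highest
    class probability is not above the learned threshold."""
    proba = model.predict_proba(X)
    labels = proba.argmax(axis=1)
    uncertain = proba.max(axis=1) <= threshold
    return [("uncertain" if u else lbl) for lbl, u in zip(labels, uncertain)]
```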
6 Experimental Evaluation and Discussion

In this section, we evaluate the performance of the machine learning models that we built. Our system was implemented using the Scikit-Learn [18] library for Python, and the experimental tests were conducted using an Intel(R) Xeon(R) Gold 6140M CPU @ 2.30GHz.

The performance of the learning algorithms with the relative optimized hyperparameters was evaluated on the test set in terms of F2 score and ROC-AUC score. The second metric is defined as the area under the Receiver Operating Characteristic curve, which plots the true positive rate against the false positive rate; it also takes into account the probability that the predictive system produces false positives (i.e., false alarms). This metric is a standard method for evaluating medical tests and risk models [9, 10].

In a preliminary study we examined various machine learning approaches and compared their average performances over the HCP datasets. Figure 3 shows a summary of the relative performance in terms of F2 score. We considered Decision Trees [25], Extra Trees (ET) [8], Gaussian Naive Bayes [31], Multilayer Perceptron with two layers (MLP) [13], Quadratic Discriminant Analysis [26], Random Forests (RF) [4] and Support Vector Machines [27]. The best performance was obtained with RF and ET. The MLP and SVM performed much worse and with a much higher variability over the datasets, probably related to the missing values and the scarcity of data. For the MCP datasets the relative performance was similar. Given the observed better performance of RF and ET, we focused the evaluation of our system on these learning algorithms.

Figure 3: Average performance (F2 score) of seven machine learning algorithms for the HCP datasets. The line over each bar represents the standard deviation.

Regarding the training time, including the hyperparameter search over 4096 random configurations and the optimization of the uncertainty threshold, for any specific dataset (e.g., the MCP numerical dataset for 2 days) the overall training time is between 20 and 30 minutes. Therefore, we can build all four of the most promising models, generated by RF and ET using the numerical version (RF-N, ET-N) or the categorical version (RF-C, ET-C) of the dataset, in less than two hours, and then select the best performing model among them. It is also worth noting that in our system the models for predicting the prognostic risk at different days are completely independent from each other, and so we can consider the prediction tasks at different days as different tasks.

In Figure 4 and in Table 2 we show the performance of our system at each considered day for both the High Contagion Phase and the Moderate Contagion Phase. As we can see, we obtain promising results in terms of F2 score for an early evaluation of the risk during the HCP (with score 77.1% at day 2), while we encounter some problems at the 6th and 10th days. For the MCP datasets, the system performs better at the later days; in particular, for the 10th day the F2 score is 80.4% and the ROC-AUC is 90.2%. For HCP, both RF and ET obtain good results with both the numerical and the categorical versions of the datasets. Instead, for MCP, using the categorical datasets does not give good performance, and we do not observe an improvement for the later prediction days (the F2 score is always below 70%).

In all but one case, the models using the uncertainty threshold increase the performance in terms of both F2 and ROC-AUC scores. In particular, in the most problematic cases of HCP, such as the 6-day and 10-day datasets, the prediction performance improves by over 7 F2 points. The improvement is less significant for MCP.

Note that, while the threshold value under which the system labels an instance (patient risk) as uncertain is derived at training time by imposing a maximum percentage of uncertain samples (we used 25%), there is no formal guarantee that this percentage limit is satisfied on the test set. However, in most cases the percentage of uncertain test samples (indicated with % Unc in Table 2) is well below the limit imposed during training, except for the test set of the 6th day in HCP, where the unpredicted (labelled as uncertain) patients are 26.1%. The performance for the "end" dataset is good for both HCP and MCP even without omitting the uncertain patients (F2 score 86.6% for HCP, and 86.9% for MCP).

Figure 4 gives a graphical comparison of the performance of our system for HCP and MCP in terms of F2 and ROC-AUC. The performance behaviour over time differs significantly in the two contagion periods, reflecting the concept drift discussed in Section 3.2. For HCP, considering the results without omitting the uncertain test instances (blue curves), the prediction performance is very good at the 2nd day and decreases at the 6th and 10th days. Instead, for MCP the performance improves over time, reaching 90.2% in terms of ROC-AUC at the 10th day, as also reported in Table 2. This is due to several factors:

• MCP includes patients with hospitalisation periods much longer than those of the patients in HCP, which can make it more difficult to predict the mortality risk for some patients with only a few days of hospitalisation;
• on the contrary, in HCP half of the patients stayed in hospital for less than 8 days. This significantly decreases the size of the 8-day and 10-day training sets, which contain respectively only 431 and 339 patients.
The lack of training data in these datasets is only partially compensated by the larger number of lab test findings per patient;
• as described in Section 3.2, the MCP patients are much more unbalanced (with only 11% deceased patients) than the HCP patients, and this increases the difficulty of learning a high-performing model [17].

Figure 4: Graphical representation of the prediction performance (F2 and ROC-AUC scores) over hospitalisation time for HCP and MCP.

Table 2: Predictive performance for the High Contagion Phase (HCP, top) and the Moderate Contagion Phase (MCP, bottom) in terms of F2 and ROC-AUC scores, considering all instances in the test set (columns F2 and ROC) and omitting the instances classified uncertain (columns F2-U and ROC-U). The percentages of instances that the system classifies as uncertain are in the column % Unc. Column Model indicates the method selected for generating the model; ET stands for Extra Trees, RF for Random Forests, C for categorical and N for numerical.

HCP data | F2   | ROC  | F2-U | ROC-U | % Unc | Model
2 days   | 77.1 | 77.8 | 80.1 | 83.3  | 18.3  | ET-C
4 days   | 74.1 | 79.4 | 76.7 | 81.9  | 13.8  | RF-N
6 days   | 68.7 | 75.6 | 75.9 | 83.6  | 26.1  | RF-N
8 days   | 74.8 | 76.5 | 78.2 | 82.5  | 22.1  | ET-C
10 days  | 68.9 | 75.5 | 80.6 | 83.9  | 24.8  | RF-C
end      | 86.6 | 89.4 | 94.3 | 95.5  | 19.3  | RF-C

MCP data | F2   | ROC  | F2-U | ROC-U | % Unc | Model
2 days   | 60.0 | 75.4 | 61.0 | 78.1  | 13.9  | ET-N
4 days   | 63.5 | 78.5 | 65.4 | 82.4  | 21.1  | RF-N
6 days   | 74.1 | 86.0 | 77.2 | 88.1  | 9.8   | ET-N
8 days   | 73.2 | 85.0 | 76.1 | 86.5  | 12.3  | ET-N
10 days  | 80.4 | 90.2 | 75.3 | 89.0  | 12.7  | ET-N
end      | 86.9 | 93.9 | 95.8 | 98.4  | 19.4  | RF-N

Figure 5 shows the confusion matrices for the test sets generated using our predictive models. Above the line we have the HCP datasets and below the MCP datasets. Although the training phase was optimised (through the use of the F2 metric) to avoid false negatives, for the HCP datasets there are several false negatives (bottom-left of the matrices). This can be explained by the scarcity of lab test and X-ray data in the HCP data, which affects prediction.

However, false negatives are significantly reduced with the models that can classify a patient as uncertain. For example, at day 6, the system classifies as uncertain 4 patients who otherwise would be false negatives. Moreover, when there are fewer false negatives, such as at days 8 and 10, classifying some patients as uncertain also helps to avoid false positives and so to generate fewer false alarms.

Remarkably, especially for the MCP datasets, we have very few false negatives even at the early days, which is quite important in our application context. On the other hand, especially for days 2 and 4, our system produces many false positives. This type of error is reduced in the models with uncertain patients, down to only 5 false alarms for the end dataset (e.g., at day 2 we avoid 16 false positives).

Figure 5: Confusion matrices for the HCP datasets (above the line) and the MCP datasets (below the line) at different days, with dead/alive predictions for all patients (Complete) and omitting patients classified uncertain (No Unc). In each 2×2 matrix, the main diagonal holds the correct predictions (alive class in the top-left corner and dead class in the bottom-right corner); the anti-diagonal holds the incorrect predictions (false positives in the top-right corner and false negatives in the bottom-left corner).
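The scores and matrices reported above correspond to standard Scikit-Learn metrics. A minimal sketch of the test-set evaluation (our illustration; the variable names are assumptions, with y_pred the predicted labels and p_dead the predicted probabilities of the dead class):

```python
from sklearn.metrics import confusion_matrix, fbeta_score, roc_auc_score

def evaluate(y_test, y_pred, p_dead):
    """Compute the three figures reported in Table 2 and Figure 5."""
    return {
        "Macro F2": fbeta_score(y_test, y_pred, beta=2, average="macro"),
        "ROC-AUC": roc_auc_score(y_test, p_dead),
        # Scikit-Learn layout for binary labels {0, 1}: [[TN, FP], [FN, TP]]
        "confusion": confusion_matrix(y_test, y_pred),
    }
```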
7 Conclusions and Future Work

We have presented a system for predicting the prognosis of Covid-19 patients, focusing on the death risk. We built and engineered datasets from lab test and X-ray data of more than 2000 patients in a hospital in northern Italy that was severely hit by Covid-19. Our predictive system uses a collection of machine learning algorithms and a new method for setting, at training time, an uncertainty threshold for prediction that helps to significantly reduce the prediction errors.

Overall, the experimental results are quite promising, and show that our system often obtains high ROC-AUC scores. The observed predictive performance is especially good in terms of false negatives (patients erroneously predicted survivors), which are very few. This gives a predictive test for patient survival with very good specificity, in particular when the system can classify a patient as uncertain. On the other hand, in terms of false positives, there is room for significant improvements. We are confident that the availability of more information, such as patient comorbidities or clinical treatments, will help to improve performance, reducing the number of both false positives and (the few) false negatives.

For future work we plan to extend our datasets with more information (both additional features and patients), to consider further methods for dealing with the observed concept drift, and to address other prediction tasks such as the duration of the hospitalisation or the need for ICU beds and critical hospital resources. Moreover, we are analyzing the importance of the features used in our models, and we intend to investigate additional learning techniques.

Acknowledgements. The work of the first author has been supported by Fondazione Garda Valley.

REFERENCES

[1] Aya Awad, Mohamed Bader-El-Den, and James McNicholas, 'Patient length of stay and mortality prediction: A survey', Health Services Management Research, 30(2), 105–120, (2017). PMID: 28539083.
[2] James Bergstra and Yoshua Bengio, 'Random search for hyper-parameter optimization', Journal of Machine Learning Research, 13(Feb), 281–305, (2012).
[3] Andrea Borghesi and Roberto Maroldi, 'Covid-19 outbreak in Italy: experimental chest X-ray scoring system for quantifying and monitoring disease progression', La radiologia medica, (05 2020).
[4] Leo Breiman, 'Random forests', Machine Learning, 45(1), 5–32, (2001).
[5] N. V. Chawla, 'Evaluating probability estimates from decision trees', in Proc. AAAI Workshop on Evaluation Methods for Machine Learning, Boston, MA, pp. 18–23, (2006).
[6] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia, 'A survey on concept drift adaptation', ACM Comput. Surv., 46(4), (March 2014).
[7] Alfonso Emilio Gerevini, Alberto Lavelli, Alessandro Maffi, Roberto Maroldi, Anne-Lyse Minard, Ivan Serina, and Guido Squassina, 'Automatic classification of radiological reports for clinical care', in Proceedings of the 16th Conference on Artificial Intelligence in Medicine, AIME 2017, Vienna, Austria, June 21-24, 2017, volume 10259 of Lecture Notes in Computer Science, pp. 149–159. Springer, (2017).
[8] Pierre Geurts, Damien Ernst, and Louis Wehenkel, 'Extremely randomized trees', Machine Learning, 63(1), 3–42, (2006).
[9] Gary L Grunkemeier and Ruyun Jin, 'Receiver operating characteristic curve analysis of clinical risk models', (2001).
[10] Karimollah Hajian-Tilaki, 'Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation', Caspian Journal of Internal Medicine, 4(2), 627, (2013).
[11] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan, 'Multitask learning and benchmarking with clinical time series data', Scientific Data, 6(1), 1–18, (2019).
[12] Sharique Hasan and Rema Padman, 'Analyzing the effect of data quality on the accuracy of clinical decision support systems: a computer simulation approach', in AMIA Annual Symposium Proceedings, volume 2006, p. 324. American Medical Informatics Association, (2006).
[13] Simon Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall PTR, 1994.
[14] Stephanie Hyland, Martin Faltys, Matthias Hüser, Xinrui Lyu, Thomas Gumbsch, Cristóbal Esteban, Christian Bock, Max Horn, Michael Moor, Bastian Rieck, Marc Zimmermann, Dean Bodenham, Karsten Borgwardt, Gunnar Rätsch, and Tobias Merz, 'Early prediction of circulatory failure in the intensive care unit using machine learning', Nature Medicine, 26, 1–10, (03 2020).
[15] Xiangao Jiang, Megan Coffee, Anasse Bari, Junzhang Wang, Xinyue Jiang, Jianping Huang, Jichan Shi, Jianyi Dai, Jing Cai, Tianxiao Zhang, et al., 'Towards an artificial intelligence framework for data-driven prediction of coronavirus clinical severity', CMC: Computers, Materials & Continua, 63, 537–51, (2020).
[16] Alistair EW Johnson, Mohammad M Ghassemi, Shamim Nemati, Katherine E Niehaus, David A Clifton, and Gari D Clifford, 'Machine learning and decision support in critical care', Proceedings of the IEEE, 104(2), 444–466, (2016).
[17] Bartosz Krawczyk, 'Learning from imbalanced data: open challenges and future directions', Progress in Artificial Intelligence, 5(4), 221–232, (2016).
[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, 'Scikit-learn: Machine learning in Python', Journal of Machine Learning Research, 12, 2825–2830, (2011).
[19] Tom J Pollard and Leo Anthony Celi, 'Enabling machine learning in critical care', ICU Management & Practice, 17(3), 198, (2017).
[20] Luca Putelli, Alfonso Gerevini, Alberto Lavelli, Matteo Olivato, and Ivan Serina, 'Deep learning for classification of radiology reports with a hierarchical schema', in Proceedings of the 24th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, (2020).
[21] Luca Putelli, Alfonso Gerevini, Alberto Lavelli, and Ivan Serina, 'The impact of self-interaction attention on the extraction of drug-drug interactions', in Proceedings of the Sixth Italian Conference on Computational Linguistics, (2019).
[22] Luca Putelli, Alfonso Emilio Gerevini, Alberto Lavelli, and Ivan Serina, 'Applying self-interaction attention for extracting drug-drug interactions', in XVIIIth International Conference of the Italian Association for Artificial Intelligence, Rende, Italy, November 19-22, 2019, Proceedings, (11 2019).
[23] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence, Dataset Shift in Machine Learning, The MIT Press, 2009.
[24] Anna S Rakitianskaia and Andries Petrus Engelbrecht, 'Training feedforward neural networks with dynamic particle swarm optimisation', Swarm Intelligence, 6(3), 233–270, (2012).
[25] Lior Rokach and Oded Maimon, Data Mining with Decision Trees: Theory and Applications, World Scientific Publishing Co., Inc., River Edge, NJ, USA, 2008.
[26] Santosh Srivastava, Maya R Gupta, and Béla A Frigyik, 'Bayesian quadratic discriminant analysis', Journal of Machine Learning Research, 8(Jun), 1277–1305, (2007).
[27] Johan AK Suykens and Joos Vandewalle, 'Least squares support vector machine classifiers', Neural Processing Letters, 9(3), 293–300, (1999).
[28] Mihaela van der Schaar and Ahmed Alaa, 'How artificial intelligence and machine learning can help healthcare systems respond to covid-19', https://www.vanderschaar-lab.com/covid-19/, (2020).
[29] Li Yan, Hai-Tao Zhang, Yang Xiao, Maolin Wang, et al., 'Prediction of criticality in patients with severe covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in Wuhan', medRxiv preprint, (2020).
[30] Jinsung Yoon, Ahmed Alaa, Scott Hu, and Mihaela Schaar, 'ForecastICU: a prognostic decision support system for timely prediction of intensive care unit admission', in International Conference on Machine Learning, pp. 1680–1689, (2016).
[31] Harry Zhang, 'The optimality of naive bayes', in Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference, Miami Beach, Florida, USA, eds., Valerie Barr and Zdravko Markov, pp. 562–567. AAAI Press, (2004).
[32] Zhi-Hua Zhou, Ensemble Methods: Foundations and Algorithms, Chapman & Hall/CRC, 1st edn., 2012.