=Paper=
{{Paper
|id=Vol-2845/Paper_34.pdf
|storemode=property
|title=Machine Learning Algorithms for Predicting the Results of COVID-19 Coronavirus Infection
|pdfUrl=https://ceur-ws.org/Vol-2845/Paper_34.pdf
|volume=Vol-2845
|authors=Yuri Kravchenko,Nataliia Dakhno,Olga Leshchenko,Anastasiia Tolstokorova
|dblpUrl=https://dblp.org/rec/conf/iti2/KravchenkoDLT20
}}
==Machine Learning Algorithms for Predicting the Results of COVID-19 Coronavirus Infection==
Yuri Kravchenko, Nataliia Dakhno, Olga Leshchenko, Anastasiia Tolstokorova

Taras Shevchenko National University of Kyiv, Volodymyrs’ka str. 64/13, Kyiv, 01601, Ukraine

IT&I-2020 Information Technology and Interactions, December 02–03, 2020, KNU Taras Shevchenko, Kyiv, Ukraine
EMAIL: kr34@ukr.net (Y. Kravchenko); nataly.dakhno@ukr.net (N. Dakhno); lesolga@ukr.net (O. Leshchenko); tlstkr@gmail.com (A. Tolstokorova)
ORCID: 0000-0002-0281-4396 (Y. Kravchenko); 0000-0003-3892-4543 (N. Dakhno); 0000-0002-3997-2785 (O. Leshchenko)

Abstract

The paper analyzes data collected from around the world on patients with COVID-19. The patients studied were men and women of different ages, with different chronic diseases and symptoms. A binary classifier has been developed that takes data on a person's health, symptoms, age, and other properties and determines the outcome of the patient's disease by assigning it to one of two categories: fatal or not. The practical value of the work is to help hospitals and health facilities decide who needs care first when the system is overcrowded and to eliminate delays in providing the necessary care.

Keywords

supervised learning, classification problem, model fitting, feature selection, feature engineering, data normalization, model validation, confusion matrix, logistic regression, naive Bayes, decision tree, random forest

1. Introduction

In March 2020, the World Health Organization officially declared COVID-19 a global pandemic. COVID-19 is an infectious disease caused by the recently discovered coronavirus SARS-CoV-2. The danger of this disease lies not so much in its lethality as in the rate of its spread after a long incubation period: an infected person does not yet experience any symptoms and continues to contact people, spreading the infection. The best way to prevent and slow down transmission is to be well informed about the COVID-19 virus, the disease it causes, and how it spreads. By visualizing the development of the disease in countries where the outbreak has already passed, it is possible to build a truly effective behavior strategy that saves lives while harming the economy as little as circumstances allow.

The coronavirus outbreak originated in the Chinese city of Wuhan but has since spread to the rest of the world. Cases of infection continue to grow exponentially. As a result, workers are transferred to telecommuting, pupils and students study at home, conferences are canceled, store shelves are emptied, and the global economy is under serious threat. Coronavirus infection has irreparably affected all spheres of human life, from education to global economic change. It is safe to say that this is one of the most severe health crises in decades, if not centuries.

However, there are currently almost no systematic reviews in the available literature that describe the accumulated data on COVID-19 and suggest methods for their analysis. Data analysis will help to better understand the basic patterns in the data, and a thorough analysis based on data and sound forecasts can be useful for decision-making and policy-making [1,2,3].

To analyze the situation and build an optimal strategy, a decision support system (DSS) is needed [4,5,6,7]. A DSS is an automated expert assistant that helps the operator select decisions at the stages of analysis, forming possible scenarios, selecting the best of them, and evaluating the results of the decision [8,9,10]. There are currently significant advances in the development and widespread practical application of mathematical models and methods for different classes of problems [11,12,13,14]. The rapid development of information technology, in particular advances in data collection, storage, and processing methods, has allowed many organizations to accumulate vast amounts of data that need to be analyzed [15,16,17]. The development of technology also raises the requirements for the quality and accuracy of decision-making, which makes it necessary to further develop and improve DSS methods.

In this work, a DSS model was developed to help identify patterns between the characteristics of patients who contracted COVID-19 (sex, age, types of symptoms, and chronic diseases) and mortality. This study offers an artificial intelligence model that can provide hospitals and medical facilities with the information they need to address congestion. It also makes it possible to develop a patient triage strategy that addresses hospitalization priorities and eliminates delays in providing the necessary care.

2. Main part

To get started, we need to form a training set that meets the criteria of the goal. The training set is the data on which the algorithm is trained; what training looks like depends on the algorithm used. After training a model, its effectiveness must be assessed according to certain metrics.

2.1. Input data

Before the study, relevant data meeting the criteria of the work were found [18]. They contain a set of characteristics of patients with coronavirus, sufficient to be divided into training and test sets, as well as information about the outcome of the disease, which serves as the label vector of the binary classifier. The dataset collects data on more than 920,000 patients from around the world, of all ages, with various chronic diseases and symptoms, including men and women.

The dataset has many gaps and much redundant data, so it was cleaned before use [19]. At the data cleansing stage, all redundant and useless patient attributes are removed: only sex, age, information on symptoms, chronic diseases, and treatment outcomes are used for classification. All records that lack information about the outcome of the patient's illness are also deleted from the dataset.

In order to use binary classification, all the necessary features were encoded. Each symptom was extracted as a separate feature and marked with 1 or 0 depending on whether it is present or absent in a given patient. The result was a dataset of the form shown in Figure 1.

Figure 1: The dataset after encoding

In the input data, age differs from the other features in scale. Features on different scales can negatively affect the convergence of gradient descent because the cost function becomes very stretched [20]. Therefore, min-max normalization was applied to the data using the formula

$$x_n = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

where $x_n$ is the normalized value of the feature, $x$ is its current value, $x_{\min}$ is the minimum value of the feature, and $x_{\max}$ is its maximum value. A minimal sketch of this encoding and scaling is given below.
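As an illustration, here is a sketch of how the symptom encoding and min-max scaling might look in pandas. The toy records and the column names (`age`, `symptoms`, `outcome`) are assumptions made for this example, not the actual schema of the dataset from [18].

```python
import pandas as pd

# Toy records standing in for the cleaned patient data (illustrative only).
df = pd.DataFrame({
    "sex": [1, 0, 1, 0],
    "age": [34.0, 67.0, 51.0, 82.0],
    "symptoms": ["fever; cough", "cough", "fever", "fever; pneumonia"],
    "outcome": ["discharged", "died", "discharged", "died"],
})

# Binary label vector: 1 for a fatal outcome, 0 otherwise.
y = (df["outcome"] == "died").astype(int)

# Encode each symptom as a separate 0/1 feature.
symptom_flags = df["symptoms"].str.get_dummies(sep="; ")
X = pd.concat([df[["sex", "age"]], symptom_flags], axis=1)

# Min-max normalization x_n = (x - x_min) / (x_max - x_min):
# age is mapped to [0, 1], matching the range of the binary features.
X["age"] = (X["age"] - X["age"].min()) / (X["age"].max() - X["age"].min())
print(X)
```

Scikit-Learn's `MinMaxScaler` performs the same scaling step; the explicit formula is written out here only to match the text.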
With this min-max normalization, the values of the age feature were brought into the range [0, 1], since the other features already lie in this range. The data after normalization are shown in Figure 2.

Figure 2: The dataset after normalization

2.2. Data classification

After obtaining the required data, the algorithms most suitable for achieving the goal were selected. In this study, five classification algorithms were chosen that had previously proven to be among the best for this type of problem [21]: logistic regression, the k-nearest neighbors algorithm, decision trees, the support vector method, and the naive Bayesian classifier.

For data analysis we use the Python programming language along with packages for data visualization and analysis. The Anaconda distribution was used as the software environment, since it installs Python together with the necessary libraries, and Jupyter Notebook was chosen as the environment for executing the task. The main packages used during the research are Pandas, NumPy, Matplotlib, Seaborn, SciPy, and Scikit-Learn.

The results of solving the binary classification problem are presented in a confusion matrix, which consists of four cells: TP (True Positive), objects that were classified as positive and are actually positive (belong to the class); FP (False Positive), objects that were classified as positive but are actually negative; FN (False Negative), objects that were classified as negative but are actually positive; and TN (True Negative), objects that were classified as negative and are actually negative (do not belong to the class).

Logistic regression. Logistic regression is a well-known statistical method for determining the influence of several factors on a binary outcome. The name reflects the fact that the data curve is compressed by applying a logistic transformation, which reduces the effect of extreme values. Instead of predicting a binary variable directly, a continuous variable with values on the interval [0, 1] is produced for any values of the independent variables. This is achieved by applying the regression equation

$$P = \frac{1}{1 + e^{-y}},$$

where $P$ is the probability that the event of interest will occur and $y$ is the standard regression equation (a linear combination of the independent variables). The normalized confusion matrix for the implemented logistic regression is presented in Figure 3.

Figure 3: Normalized logistic regression confusion matrix

The k-nearest neighbors algorithm. The method of k-nearest neighbors (KNN classification) defines the dividing boundaries locally. In the simplest variant, 1NN, each object is assigned to a class based on the information of its single nearest neighbor. In the KNN variant, each object is assigned to the class preferred by its nearest neighbors, where k is the parameter of the method.

Consider some pros and cons of the KNN algorithm. One advantage of KNN is that it is a straightforward algorithm, which makes it much faster to apply than many other algorithms. One disadvantage is that the algorithm does not work well with extensive data: with a large number of dimensions it becomes difficult to calculate the distance in each of them, and KNN has a high prediction cost for large data sets. The normalized confusion matrix for the implemented k-nearest neighbors algorithm is presented in Figure 4.

Figure 4: Normalized K-neighbors classifier confusion matrix

A minimal sketch of fitting these first two classifiers and computing such normalized confusion matrices is given below.
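This sketch is an illustration, not the authors' actual code: since the patient data is not reproduced here, Scikit-Learn's `make_classification` generates a synthetic, imbalanced stand-in, and the printed numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Synthetic, imbalanced stand-in for the encoded patient data.
X, y = make_classification(n_samples=1000, n_features=12, weights=[0.85],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

for model in (LogisticRegression(max_iter=1000),
              KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)
    # Normalized by true class, so each row sums to 1: [[TN, FP], [FN, TP]].
    cm = confusion_matrix(y_test, model.predict(X_test), normalize="true")
    print(type(model).__name__)
    print(cm.round(3))
```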
Decision trees. The decision tree is one of the most common and widely used supervised machine learning algorithms; it can perform both regression and classification tasks. For each attribute in the data set, the decision tree algorithm forms a node, with the most important attribute placed in the root node. To evaluate an example, we start at the root node and work down the tree, following the node that matches our condition or "decision". This process continues until a leaf node containing the prediction, the result of the decision tree, is reached. The normalized confusion matrix for the implemented decision tree method is presented in Figure 5.

Figure 5: Normalized decision tree method confusion matrix

The support vector method. If the training set contains two classes of data that allow linear separation, then there are many linear classifiers that can divide this data. The support vector method looks for a separating surface (hyperplane) that lies as far as possible from any of the data points. The separating hyperplane is given by the offset parameter $b$ and the normal vector $w$ to the hyperplane. Since the separating hyperplane is perpendicular to the normal vector $w$, all points $x$ on the hyperplane satisfy the equation

$$w^T x + b = 0.$$

Now suppose we have a training set $D = \{(x_i, y_i)\}$ in which each element is a pair consisting of a point $x_i$ and the corresponding class label $y_i$. In the support vector method the two classes are always called +1 and −1 (not 1 and 0). Therefore, the linear classifier is described by the formula

$$f(x) = \operatorname{sign}(w^T x + b),$$

where −1 indicates one class and +1 the other. Next, we want to choose $w$ and $b$ that maximize the distance to each class. We can calculate that this distance equals $1/\|w\|$, so the problem of maximizing it is equivalent to the problem of minimizing $\frac{1}{2}\|w\|^2$. Written as an optimization problem:

$$\underset{w,\,b}{\arg\min}\ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1,\ i = 1,\dots,m.$$

The normalized confusion matrix for the implemented support vector method is presented in Figure 6.

Figure 6: The SVC method confusion matrix

The naive Bayesian classifier. Naive Bayesian classifiers belong to the family of simple probabilistic classifiers. They are based on Bayes' theorem with naive assumptions about the independence of the features. Bayes' theorem allows us to calculate the conditional probability

$$P(C \mid x) = \frac{P(C)\, P(x \mid C)}{P(x)}.$$

If we classify an object that is a vector $x = (x_1, x_2, \dots, x_n)$ with $n$ properties, the classifier finds the probability of each of the $k$ possible classes for this object. Taking into account the "naive" assumption of conditional independence of the features, the Bayesian formula takes the form

$$P(C_k \mid x_1, x_2, \dots, x_n) = \frac{P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)}{\sum_k P(C_k)\, P(x \mid C_k)}.$$

The corresponding classifier is a function that assigns a class label $C_k$ for some $k$ in the following way:

$$\hat{y} = \underset{k}{\arg\max}\ P(C_k \mid x_1, x_2, \dots, x_n) = \underset{k}{\arg\max}\ P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k).$$

The normalized confusion matrix for the implemented naive Bayesian classifier is presented in Figure 7.

Figure 7: The Naive Bayes classifier confusion matrix

To assess the quality of the classification models, the following metrics were chosen: accuracy, precision, recall, F-measure, logarithmic loss (logloss), and area under the ROC curve. After evaluating the models, the classification report was analyzed for each of the metrics (Figures 3–7); a sketch of fitting all five models and producing such a report is shown below.
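The following sketch fits all five models in one loop and prints a Scikit-Learn classification report (per-class precision, recall, and F-measure) for each, again on the synthetic stand-in data rather than the real patient set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Synthetic, imbalanced stand-in for the encoded patient data.
X, y = make_classification(n_samples=1000, n_features=12, weights=[0.85],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "K-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Support vector method": SVC(kernel="linear"),
    "Naive Bayesian": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```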
The accuracy metric is intuitive and obvious:

$$accuracy = \frac{TP + TN}{TP + TN + FP + FN}.$$

Since the accuracy metric does not work well on unbalanced data, it was decided to allocate a separate balanced sample for the correct evaluation of the algorithms. Table 1 shows how the accuracy differs between the models. After analyzing the work of the different algorithms, we can conclude that the decision tree classifier was the most accurate by this metric.

Table 1: Classifier accuracy score comparison

Classifier model          Accuracy score
Logistic regression       0.8518518518518519
K-nearest neighbors       0.8641975308641975
Decision tree             0.9012345679012346
Support vector method     0.7777777777777778
Naive Bayesian            0.8148148148148148

Precision can be interpreted as the proportion of objects called positive by the classifier that are actually positive [22]:

$$precision = \frac{TP}{TP + FP}.$$

Table 2 shows how the precision differs between the models. By the precision metric, the best result was again achieved by the decision tree classifier, which reached the highest possible precision.

Table 2: Classifier precision score comparison

Classifier model          Precision score
Logistic regression       0.9565217391304348
K-nearest neighbors       0.9310344827586207
Decision tree             1.00
Support vector method     0.9565217391304348
Naive Bayesian            0.7575757575757576

Recall shows what proportion of the positive class objects the algorithm found out of all positive class objects [22]:

$$recall = \frac{TP}{TP + FN}.$$

Table 3 shows how the recall differs between the models. By this metric the decision tree classifier was the most complete, with a recall of 0.8.

Table 3: Classifier recall comparison

Classifier model          Recall
Logistic regression       0.55
K-nearest neighbors       0.675
Decision tree             0.8
Support vector method     0.55
Naive Bayesian            0.625

There are several ways to combine precision and recall into an aggregate quality criterion. The F-measure is the harmonic mean of precision and recall [23]:

$$F = \frac{2 \cdot precision \cdot recall}{precision + recall}.$$

Table 4 shows how the F-measure differs between the models. By this metric, the most effective classifier was again the decision tree.

Table 4: Classifier F-measure comparison

Classifier model          F-measure
Logistic regression       0.78
K-nearest neighbors       0.58
Decision tree             0.89
Support vector method     0.47
Naive Bayesian            0.68

The logarithmic loss (logloss) metric was also chosen to study the efficiency of the classification models [24]. Logloss assigns a weight to each predicted probability: the farther the probability is from the actual value, the greater the weight. The goal is to minimize the total sum of all error weights:

$$logloss = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right].$$

The results of the calculations by this metric for each of the studied algorithms are given in Table 5.

Table 5: Classifier logarithmic loss comparison

Classifier model          Logloss
Logistic regression       0.22168696855844788
K-nearest neighbors       0.3694783949797303
Decision tree             0.11823286741946416
Support vector method     0.4285948286894619
Naive Bayesian            0.33992223100658514

One way to evaluate a model as a whole, without being tied to a specific threshold, is the area under the ROC curve [25]. This curve is a line from (0, 0) to (1, 1) in the coordinates True Positive Rate (TPR) and False Positive Rate (FPR):

$$TPR = \frac{TP}{TP + FN}; \qquad FPR = \frac{FP}{FP + TN}.$$

The area under this curve indicates the quality of the algorithm; a sketch of computing all of these metrics is given below.
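A minimal sketch of computing these metrics and the ROC curve with Scikit-Learn, once more on the synthetic stand-in data. The decision tree here serves only as an example model, and the values it prints are not the paper's results.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, log_loss, roc_curve, auc)

X, y = make_classification(n_samples=1000, n_features=12, weights=[0.85],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard 0/1 predictions
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F-measure:", f1_score(y_test, y_pred))
print("logloss  :", log_loss(y_test, y_score))

# ROC curve: TPR against FPR over all decision thresholds; AUC is its area.
fpr, tpr, _ = roc_curve(y_test, y_score)
print("ROC AUC  :", auc(fpr, tpr))
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance diagonal from (0,0) to (1,1)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```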
The closer the area is to 1, the more accurately the model works. Figure 8 shows the ROC curves for each of the selected models, making the relative efficiency of the algorithms clearly visible.

Figure 8: Graph of ROC curves for different classifiers

To compare the created classifiers by the area under the ROC curve, it must be calculated for each of the algorithms. The results of the calculations are given in Table 6. According to this evaluation, the decision tree algorithm has the largest share of correctly classified objects on the balanced data set. Moreover, this method has the lowest logarithmic loss and the largest area under the ROC curve, which indicates a good balance of sensitivity and specificity.

Table 6: Classifier ROC AUC comparison

Classifier model          Area under ROC curve
Logistic regression       0.824782324771441
K-nearest neighbors       0.7120646495428821
Decision tree             0.9
Support vector method     0.662064649542882
Naive Bayesian            0.8107585981715281

3. Conclusions

After analyzing each of the algorithms and comparing the results, we can say that, among all the proposed machine learning methods for solving this binary classification problem, the decision tree algorithm coped best. Evaluating it by the selected metrics and comparing the results with those of the other algorithms, we can speak of its greatest effectiveness in a DSS. The developed classifier and its application in a DSS can help hospitals and health facilities decide who needs attention first when the system is overcrowded, as well as eliminate delays in providing the necessary care. This study could be scaled up to other diseases to help the health care system respond more effectively to an outbreak or pandemic.

4. References

[1] O. Pysarchuk, A. Gizun, A. Dudnik, T. V. Griga, Domkiv, S. Gnatyuk. "Bifurcation prediction method for the emergence and development dynamics of information conflicts in cybernetic space." CEUR Workshop Proceedings, 2654 (2020): 692–709. http://ceur-ws.org/Vol-2654/paper54.pdf.

[2] O. Barabash, H. Shevchenko, N. Dakhno, Y. Kravchenko and O. Leshchenko. "Effectiveness of Targeting Informational Technology Application." 2020 IEEE 2nd International Conference on System Analysis & Intelligent Computing (SAIC), Kyiv, Ukraine, 2020, pp. 193–196. doi: 10.1109/SAIC51296.2020.9239154.

[3] S. Toliupa, I. Tereikovskiy, I. Dychka, L. Tereikovska and A. Trush. "The Method of Using Production Rules in Neural Network Recognition of Emotions by Facial Geometry." 2019 3rd International Conference on Advanced Information and Communications Technologies (AICT) (2019): 323–327. doi: 10.1109/AIACT.2019.8847847.

[4] O. Barabash, N. Dakhno, H. Shevchenko and V. Sobchuk. "Unmanned Aerial Vehicles Flight Trajectory Optimisation on the Basis of Variational Enequality Algorithm and Projection Method." 2019 IEEE 5th International Conference Actual Problems of Unmanned Aerial Vehicles Developments (APUAVD) (2019): 136–139. doi: 10.1109/APUAVD47061.2019.8943869.

[5] K. Kolesnikova, O. Mezentseva and O. Savielieva. "Modeling of Decision Making Strategies in Management of Steelmaking Processes." 2019 IEEE International Conference on Advanced Trends in Information Theory (ATIT), Kyiv, Ukraine, 2019, pp. 455–460. doi: 10.1109/ATIT49449.2019.9030524.
"Target Programming with Multicriterial Restrictions Application to the Defense Budget Optimization. " Advances in Military Technology. 14.2 (2019): 213–229. ISSN 1802-2308, eISSN 2533-4123. doi: 10.3849/aimt.01291, http://aimt.unob.cz/articles/19_02/1291.pdf. [7] N. Dakhno, O. Barabash, H. Shevchenko, O. Leshchenko and A. Musienko. "Modified Gradient Method for K-positive Operator Models for Unmanned Aerial Vehicle Control." 2020 IEEE 6th 380 International Conference on Methods and Systems of Navigation and Motion Control (MSNMC), KYIV, Ukraine, 2020, pp. 81-84, doi: 10.1109/MSNMC50359.2020.9255516. [8] V. Tuyrin, O. Barabash, P. Openko, I. Sachuk and A. Dudush. “Informational support system for technical state control of military equipment.” 2017 IEEE 4th International Conference Actual Problems of Unmanned Aerial Vehicles Developments (APUAVD) (2017): 230–232, doi: 10.1109/APUAVD.2017.8308817. [9] D. Obidin, V. Ardelyan, N. Lukova-Chuiko and A. Musienko. "Estimation of functional stability of special purpose networks located on vehicles." 2017 IEEE 4th International Conference Actual Problems of Unmanned Aerial Vehicles Developments (APUAVD) (2017): 167–170, doi: 10.1109/APUAVD.2017.8308801. [10] O. Barabash, N. Lukova-Chuiko, V. Sobchuk and A. Musienko. "Application of Petri Networks for Support of Functional Stability of Information Systems." 2018 IEEE First International Conference on System Analysis & Intelligent Computing (SAIC) (2018): 1–4, doi: 10.1109/SAIC.2018.8516747. [11] M. Prats’ovytyi, O. Svynchuk. "Spread of Values of a Cantor-Type Fractal Continuous Nonmonotone Function." J Math Sci 240 (2019): 342–357. https://doi.org/10.1007/s10958-019- 04356-0. [12] A. Rokochinskiy, P. Volk, L. Kuzmych, V. Turcheniuk, L. Volk and A. Dudnik. "Mathematical Model of Meteorological Software for Systematic Flood Control in the Carpathian Region." 2019 IEEE International Conference on Advanced Trends in Information Theory (ATIT), (2019): 143– 148. doi: 10.1109/ATIT49449.2019.9030455. [13] V. Mukhin, V. Zavgorodnii, O. Barabash, R. Mykolaichuk, Y. Kornaga, A. Zavgorodnya, V. Statkevych. "Method of restoring parameters of information objects in a unified information space based on computer networks." International Journal of Computer Network and Information Security, 12(2) (2020): 11–21. DOI:10.5815/ijcnis.2020.02.02. [14] H. Hnatiienko, V. Kudin, A. Onyshchenko, V. Snytyuk and A. Kruhlov, "Greenhouse Gas Emission Determination Based on the Pseudo-Base Matrix Method for Environmental Pollution Quotas Between Countries Allocation Problem," 2020 IEEE 2nd International Conference on System Analysis & Intelligent Computing (SAIC), Kyiv, Ukraine, 2020, pp. 1-8, doi: 10.1109/SAIC51296.2020.9239125. [15] D. Lukianov, M. Mazeika, V. Gogunskii, K. Kolesnikova. "SWOT analysis as an effective way to obtain primary data for mathematical modeling in project risk management." CEUR Workshop Proceedings, 2711, 2020: 79 – 92. http://ceur-ws.org/Vol-2711/paper7.pdf. [16] Hu Zhenbing, V. Mukhin, Y. Kornaga, O. Herasymenko, Y. Bazaka. "The scheduler for the gridsystem based on the parameters monitoring of the computer components, Eastern-European Journal of Enterprise Technologies." 1(2017): 31–39. doi: https://doi.org/10.15587/1729- 4061.2017.91271. [17] Y. Kravchenko, O. Leshchenko, N. Dakhno, O. Trush, O. Makhovych. "Evaluating the Effectiveness of Cloud Services." 2019 IEEE International Conference on Advanced Trends in Information Theory (ATIT) (2019): 120–124. doi: 10.1109/ATIT49449.2019.9030430. [18] B. 
[18] B. Xu, B. Gutierrez, S. Mekaru et al. "Epidemiological data from the COVID-19 outbreak, real-time case information." Scientific Data 7 (2020): 106. doi: 10.1038/s41597-020-0448-0.

[19] C. Sammut, G. I. Webb. "Encyclopedia of Machine Learning and Data Mining." Springer Science+Business Media, New York, 2017. doi: 10.1007/978-1-4899-7687-1.

[20] D. T. Larose, C. D. Larose. "Discovering Knowledge in Data: An Introduction to Data Mining." John Wiley & Sons, 2014.

[21] T. M. Mitchell. "Machine Learning." McGraw Hill, 1997.

[22] S. Geman, E. Bienenstock and R. Doursat. "Neural Networks and the Bias/Variance Dilemma." Neural Computation 4/1 (1992): 1–58. doi: 10.1162/neco.1992.4.1.1.

[23] Y. Sasaki. "The truth of the F-measure." School of Computer Science, University of Manchester, 2007.

[24] A. L. Samuel. "Some Studies in Machine Learning Using the Game of Checkers." IBM Journal of Research and Development 3/3 (1959): 210–229. doi: 10.1147/rd.33.0210.

[25] T. Fawcett. "An introduction to ROC analysis." Pattern Recognition Letters 27/8 (2006): 861–874. doi: 10.1016/j.patrec.2005.10.010.