=Paper=
{{Paper
|id=Vol-2786/Paper50
|storemode=property
|title=CURE: An Effective COVID-19 Remedies based on Machine Learning Prediction Models
|pdfUrl=https://ceur-ws.org/Vol-2786/Paper50.pdf
|volume=Vol-2786
|authors=Poonam Phogat,Rajat Chaudhary
|dblpUrl=https://dblp.org/rec/conf/isic2/PhogatC21a
}}
==CURE: An Effective COVID-19 Remedies based on Machine Learning Prediction Models==
Poonam Phogat (a), Rajat Chaudhary (b)

(a) Computer Science & Engineering, SGT University, Gurugram, Haryana (India)

(b) Computer Science & Engineering, Bharat Institute of Engineering & Technology, Hyderabad, Telangana (India)

ISIC'21: International Semantic Intelligence Conference, February 25-27, 2021, Delhi, India. Contact: poonamphogat07@gmail.com (P. Phogat); rajat@biet.ac.in (R. Chaudhary). ORCID: 0000-0002-6554-918X (R. Chaudhary). © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

===Abstract===
Coronavirus disease (COVID-19) is a severe pandemic infectious disease caused by a virus that enters healthy cells of a living body. The virus copies itself in the organs of the host body, which ultimately leads to the death of some healthy cells and therefore weakens the immune system. In the mild stage it mainly affects the respiratory tract; in the last stage it leads to pneumonia, organ failure, and death. This paper focuses on the early detection of COVID-19 patients based on the positive symptoms of the disease. The COVID-19 Remedies (CURE) scheme is proposed, based on machine learning prediction models, for the treatment of COVID patients. For the experimental results, the performance of the CURE scheme is evaluated on the Python platform and tested using the Kaggle dataset from Johns Hopkins University.

Keywords: COVID-19, Machine Learning, Prediction Model

===1. Introduction===
The virus that induces COVID-19 is severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which was first diagnosed in late December 2019 during an investigation into an outbreak in Wuhan, China. As the cases were increasing rapidly throughout the world, the WHO declared the disease a pandemic on March 11, 2020. Currently, the transmission of COVID-19 has become uncontrollable because the number of cases has reached the threshold limit [1]. The virus enters healthy cells of a living body and copies itself in the organs of the host body, which ultimately leads to the death of some healthy cells and therefore weakens the immune system. In the mild stage it mainly affects the respiratory tract; in the last stage it leads to pneumonia, organ failure, and death [2]. The disease is prominent in elderly people with a weak immune system who already have other underlying diseases such as diabetes, high blood pressure, and cardiovascular and respiratory diseases [3].

Figure 1 shows the global statistics till July 30, 2020, on the total confirmed cases, active cases, total deaths, and total cured cases of COVID-19. Figure 1(a) presents the total number of coronavirus cases across different countries, which shows that the virus is spreading rapidly, with the highest number of cases in the USA followed by India. The total confirmed positive cases across the world are 2,18,69,976, out of which 26,47,663 cases are in India. Figure 1(b) shows the statistics of the active cases, with 65,04,303 active cases globally. Figure 1(c) presents the total death cases, and finally, Figure 1(d) shows the total cured cases [4].

{| class="wikitable"
! !! Active Cases !! Total Cases !! Total Deaths !! Total Cured
|-
| India || 6,76,900 || 26,47,663 || 50,921 || 19,19,842
|-
| India's Share (%) || 10.4% || 12.1% || 6.6% || 13.3%
|-
| World || 65,04,303 || 2,18,69,976 || 7,73,741 || 1,45,91,932
|}

Figure 1: Data statistics of total, active, death, and cured cases on COVID-19 (panels (a) to (d)).

This is a communicable virus that spreads through respiratory droplets present in the air. These aerosols are released into the open environment when an infected person sneezes or coughs; they enter other persons through the mouth and nostrils and reach the lungs. There is no precise treatment to cure COVID-19. Some steps are being taken to eliminate the virus using different medicines such as Hydroxychloroquine, an antimalarial drug. It is currently used to treat coronavirus patients; it helps to inhibit infection by increasing the endosomal pH, which gives the immune system enough strength to fight the viral disease [5].

Some preventive measures are necessary to deal with this pandemic. From the very beginning of COVID-19, the governments of almost all countries have taken strict actions such as complete lockdown, social distancing, and the use of sanitizer and masks to reduce all the causing elements [6]. Across the various studies explored, Machine Learning (ML) appears to be the best prediction approach for forecasting the increasing number of COVID-19 infected cases. The regression and classification approaches of ML are applied according to the availability of data to diagnose this problem.

====1.1. Contributions====
The contributions of the paper are summarized below.
* Diagnose the symptoms of COVID-19 patients based on the classification of the diseases.
* To recover COVID-19 patients, the CURE scheme is proposed, based on a machine learning prediction model, to forecast the most suitable treatment for COVID disease.
* For simulation, the proposed scheme is tested using the Kaggle dataset.
* Finally, the performance of five classifiers is compared and the most efficient outcome is predicted using the Python platform.

====1.2. Paper Organization====
The rest of the paper is structured as follows: Section II discusses the literature review of the existing schemes. Section III presents the system model, followed by the proposed CURE scheme in Section IV. Section V comprises the prediction performance evaluation, and finally, Section VI concludes the paper.

===2. Literature Review===
Researchers have introduced several Machine Learning methods for classification. The simplest is the Linear Regression method, which minimizes the sum of squared differences between real and predicted data. The drawbacks of this model are that it is ineffective on non-linear data and sensitive to deviations [7]. In the Logistic Regression model, the probability of the outcome is based on the logistic function. The advantage of this model is that it is free of complications, but it fails when the linearity assumption does not hold. The Naive Bayes model needs only limited training data to estimate the necessary parameters and deals effectively with real-world data. Another model, K-Nearest Neighbour, works efficiently with modest amounts of data and is applicable to multi-class problems [8], [9].

Pinter et al. [10] proposed the Machine Learning approaches multi-layered perceptron-imperialist competitive algorithm (MLP-ICA) and adaptive network-based fuzzy inference system (ANFIS) for prediction of the COVID-19 confirmed positive and death cases. This model maintains its accuracy for the next 9 days, which gives reassuring results [11]. The government and the public have to appreciate the researchers and help in lowering the case counts by maintaining social distancing and following other precautions [12]. Hamzeh et al. [13] work with the Susceptible-Exposed-Infectious-Recovered (SEIR) model and report that it performs well on moderate amounts of data. The outbreak of this infectious disease may cause variations in the data prediction.

Jia et al. [14] define four stages for COVID-19 cases. In the first stage, a person with COVID-19 symptoms has a travel history, which leads to lockdown. When the infected person comes in contact with other persons, the virus reaches the second stage; social distancing is applied to prevent the cases from increasing. In the third stage there is neither travel history nor contact with an infected person, so the chances of the virus spreading through respiratory droplets become high. Hence, the use of masks and sanitizers is necessary. The last stage is an uncontrollable stage in which the cases have reached the threshold limit. Tuli et al. [15] improved COVID-19 prediction by using a Machine Learning model. In this model, a data-driven approach is used to help the government and the public. After processing the data with ML and AI, researchers can forecast the time scale and the regions where the possibility of the disease spreading is highest. It is predicted that, using different ML models, COVID-19 cases can be controlled or eliminated in all the countries of the world that are facing this critical situation.

===3. System Model===
Figure 2 presents the workflow of the proposed CURE scheme for the treatment of COVID patients. Initially, the input is the dataset taken from Johns Hopkins University. Then the symptoms of positive cases are analyzed and categorized into 3 sub-parts: Severe, Moderate, and Mild symptoms. A patient having severe symptoms, which include throttling, must face a harsh period. Moderate symptoms include shortness of breath, fever, and cough. Mild symptoms include fever, cough, and headache. The proposed scheme for COVID-19 outbreak analysis is trained and tested on real-time data using the symptoms of COVID-19 patients.

Figure 2: Workflow of the proposed CURE Scheme for the treatment of COVID patient. Blocks shown: COVID-19 dataset; data pre-processing; feature selection (diagnose COVID symptoms); training dataset; prediction models (1. Linear Regression, 2. SVM, 3. k-NN, 4. Naive Bayes, 5. Random Forest); trained model; performance metrics (output) (1. H-Measure, 2. Gini Index, 3. AUC, 4. AUCH, 5. KS, 6. MER, 7. MWL, 8. Spec.Sens95, 9. Sens.Spec95, 10. ER); comparison of performance analysis of prediction models.

The problem is heightened by unbalanced data. In medical data the class imbalance problem is frequent; it occurs when some classes have many more cases than others. To handle an imbalanced dataset, several solutions are available at both the algorithmic and the data level. In this paper, the performance of 5 classifiers and regressions is compared on the imbalanced dataset obtained while studying the prediction of COVID-19. On the basis of the results of these regression and classification models, the impact of SMOTE (Synthetic Minority Oversampling Technique), an approach that deals with imbalanced datasets, is thoroughly evaluated.

In this method, the k samples closest to each minority sample within the minority class are found, and the standard Euclidean distance is used to measure this proximity. The imbalanced dataset is taken with its numbers of cases in the minority and majority classes. Based on the independent variable, the original dataset is partitioned into two sets, a training set (80%) and a test set (20%), using stratified random sampling. By applying the SMOTE technique, the training set is oversampled to find the class distribution best suited to the dataset. This yields 8 training sets: the original set and 7 oversampled sets with different rates.

===4. Proposed CURE Scheme===
The proposed CURE scheme uses a wide range of methods and tools for prediction. By combining different models, namely SVM (Support Vector Machine), LR (Linear Regression), k-NN (k-Nearest Neighbors), and Naïve Bayes classification, together with the R tool, a machine learning model is proposed for forecasting the COVID-19 infection rate. The collected dataset is cleaned before further processing; this is considered the first step in knowledge discovery in databases. For written character classification problems, this data cleansing process is applied using Machine Learning techniques.
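The SMOTE oversampling step described in the System Model section (nearest minority neighbours found by Euclidean distance, then new samples synthesized between them) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation and not the API of any SMOTE library; the function name `smote`, its parameters, and the toy data are assumptions made for illustration only.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Simplified SMOTE sketch: interpolate between minority samples
    and their k nearest minority neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class samples with two numeric features each
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_synthetic=6, k=2)
```

In line with the text, such a step would be applied to the training split only, so that the test split keeps its original class distribution.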
Data cleaning is the process that implements methods to detect missing and incorrect data, correct errors, and explore databases; it involves reassembling and disintegrating data. Data cleansing is practiced on numerous merged databases in which duplicate records appear. Four quality dimensions are proposed: certainty, correctness, integrity, and consistency.

Primary symptoms of this disease include loss of taste and smell, headache, fever, dizziness, tiredness, and shortness of breath. According to their seriousness, symptoms are classified into three categories, i.e., mild, moderate, and severe. Mild symptoms include fever, cough, and headache; the seriousness is low at this stage. Then comes the moderate stage, in which shortness of breath is the main symptom along with high fever and cough. In the severe stage, the patient reaches a critical situation and becomes profoundly serious. Respiratory problems are the main problem these patients face. The virus mainly affects the lungs, damaging the alveoli that are responsible for supplying oxygen to all parts of the body through blood vessels and RBCs. The virus damages the alveolus wall and thickens it, which lowers the transfer of oxygen to RBCs and ultimately leads to hypoxia. Due to the insufficient intake of oxygen, the chances of organ failure remain high. The collected data is first trained and then tested using different models: SVM (Support Vector Machine), LR (Linear Regression), k-NN (k-Nearest Neighbors), and Naïve Bayes classification. These prediction methods are explained below.

====4.1. Linear Regression (LR)====
LR is the most widely used statistical technique for predictive analysis in Machine Learning. Linear regression is a supervised Machine Learning algorithm that performs a regression task. The LR prediction model uses the given data points to obtain the optimal fit line to train the dataset. A simple equation of a line is y = mx + c, where y is the dependent variable, x is the independent variable, and m and c are constants whose values are computed using calculus. Figure 3(a) shows an example of an LR prediction model that takes the features as input and predicts a continuous output by obtaining a linear curve for a given problem. The output of the LR model is computed by using the equation

:<math>y = \mu_0 + \mu_1 x_1 + \epsilon, \qquad (1)</math>

where <math>\mu_0</math> represents the y-intercept, <math>\mu_1</math> represents the slope, <math>x_1</math> is the input value, <math>\epsilon</math> represents the error term, and y is the output value of the model. At the start of training, <math>\mu</math> is initialized randomly; it is then corrected during training for each feature such that the loss (the deviation between the desired and predicted output) is minimized. The loss is measured using the mean squared error (MSE). The advantages of LR are that it is easy and simple to implement, fast to train, can be regularized to avoid overfitting, and is easily updated with new data using gradient descent. The disadvantages of the LR model are that it performs poorly for non-linear relationships, is not flexible enough to capture complex patterns, and adding polynomial terms can be time consuming.

To generate a discrete output, i.e., 0 or 1, the logistic regression (binary classification) model is used. Figure 3(b) shows an example of logistic regression, which calculates the aggregate sum of the input variables like the LR model but passes the result through a non-linear sigmoid function to generate the output:

:<math>y = \frac{1}{1 + e^{-x}}, \qquad (2)</math>

where x is the input value, y is the output value of the model, and e is the base of the exponential function. The LR prediction model can be implemented in Python.

====4.2. Support Vector Machine method (SVM)====
SVM is a supervised ML algorithm used for both classification and regression. An example of an SVM classifier is shown in Figure 3(c), which represents different classes in a decision plane or hyperplane in n-dimensional space. In this figure, the support vectors are the data points nearest to the hyperplane. The data points are divided into classes by the separating lines (<math>H_1, H_2, H_3</math>). A margin is defined as the gap, or perpendicular distance, from the line to the support vectors. The objective of SVM is to separate the dataset into classes by finding the maximum marginal hyperplane. SVM first finds hyperplanes iteratively that isolate the classes, and then selects the hyperplane that divides the classes in the best way. SVM can perform non-linear classification efficiently as well as linear classification. It is extremely effective in high-dimensional spaces and in cases where the number of dimensions is greater than the number of samples. SVM transforms the input vector into an n-dimensional space, known as the feature space (f), by using a non-linear function; a linear function is then applied in that space. It is implemented in Python by using SVM kernels. The types of SVM kernels are the linear kernel, the polynomial kernel, and the radial basis function (RBF) kernel.

Linear Kernel: It is the dot product between two observations. The linear kernel function is defined by the equation

:<math>f(v, v_i) = \mathrm{sum}(v * v_i), \qquad (3)</math>

where <math>v, v_i</math> are two vectors.

Polynomial Kernel: It discriminates a curved or non-linear input space and is defined by the equation

:<math>f(v, v_i) = \left(1 + \mathrm{sum}(v * v_i)\right)^d, \qquad (4)</math>

where d is the degree of the polynomial, which is set manually in the learning algorithm.
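As a quick numerical check, the linear and polynomial kernels of Eqs. (3) and (4) can be evaluated on two small vectors. This is only an illustrative sketch: the function names and example vectors are assumptions, and the polynomial kernel is written in the commonly used form (1 + v·v_i)^d.

```python
def linear_kernel(v, vi):
    """Eq. (3): sum of the element-wise products of the two vectors."""
    return sum(a * b for a, b in zip(v, vi))


def polynomial_kernel(v, vi, d=2):
    """Eq. (4) in its common form: (1 + v . vi) raised to the degree d."""
    return (1 + linear_kernel(v, vi)) ** d


# Hypothetical example vectors
v, vi = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(v, vi)             # 1*3 + 2*4 = 11.0
k_poly = polynomial_kernel(v, vi, d=2)   # (1 + 11)^2 = 144.0
```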
Radial Basis Function (RBF) Kernel: It transforms the input space into a multi-dimensional space and is defined by the equation

:<math>f(v, v_i) = \exp\left(-\gamma \, \mathrm{sum}\left((v - v_i)^2\right)\right), \qquad (5)</math>

where <math>\gamma</math> lies between 0 and 1; it is set manually and its default value is 0.1.

The steps to be followed in implementing an SVM classifier for text classification are as follows: (i) import the svm packages, (ii) load the input dataset, (iii) select features from the dataset, (iv) plot the SVM boundaries with the original data, (v) generate the values of the regularization parameter, (vi) create the SVM classifier object using a kernel (linear, polynomial, RBF), (vii) the final output is the text classification. The advantages of SVM classifiers are high accuracy in multi-dimensional spaces and low memory use, since only a subset of the training points is used. The disadvantages of SVM classifiers are that their performance does not scale to larger datasets because of the high training time, and that they do not perform well with overlapping classes. Thus, decision trees are usually preferred over SVM for large datasets.

====4.3. k-NN (k-Nearest Neighbors)====
The k-nearest neighbors (k-NN) algorithm is a supervised ML technique that is generally used for classification problems, although it can be used for both classification and regression. The k-NN method classifies documents based on resemblance measurements: using factors such as distance and proximity, the similarity between two data points is quantified, and each data point is classified based on its nearest neighbors. Figure 3(d) shows an example of a k-NN model, which assumes that similar data points lie close to each other. k-NN works on the principle of feature similarity in order to predict the values of new data points. Thus, a new data point is assigned a value based on how closely it matches the data points in the training set. The steps involved in the k-NN algorithm are as follows: (i) load the training and testing datasets, (ii) select the value of k (an integer), i.e., the number of closest data points, (iii) for each point in the test data, compute the distance between the test point and each row of the training data using the Euclidean or Hamming distance, and sort the distance values in ascending order, (iv) select the top k rows from the sorted array and assign a class to the test point based on the most frequent class of these rows, (v) final output.

The k-NN algorithm can be implemented in Python using the following approach: (i) import the necessary Python packages, (ii) download the Kaggle COVID-19 dataset, (iii) assign column names to the dataset, (iv) read the dataset into a pandas dataframe, (v) perform data preprocessing, (vi) split the data into train and test datasets (60% training data and 40% testing data), (vii) perform data scaling, (viii) train the model using the K-nearest neighbors classifier class of sklearn, (ix) obtain the prediction, (x) output the results: confusion matrix, classification report, and accuracy. The benefits of the k-NN algorithm are that it is simple, useful for non-linear data, and highly accurate. The limitations of the k-NN algorithm are that it is costly because it stores all the training data, so it requires more memory, and prediction is slow for large datasets.

====4.4. Naïve Bayes====
Naïve Bayes is a classification method based on Bayes' theorem. It works on the strong assumption of conditional independence: the existence of a feature in a class is independent of the existence of any other feature in the same class. Consider the example of a smart 4K TV: a TV is placed in the category "smart" if it has features such as an Internet connection, high definition, Bluetooth, USB ports, HDMI connectivity, and support for multiple applications. Even if these features depend on each other, each individual feature contributes independently to the probability that the 4K TV is a smart TV. Naïve Bayes is a highly scalable algorithm that can be trained on a small dataset. Figure 3(e) shows an example of a Naïve Bayes model that classifies the data points, based on the posterior probability of the class, into three different classes, i.e., classifier 1 (red data points), classifier 2 (orange data points), and classifier 3 (blue data points). The expression of the Naïve Bayes algorithm based on Bayes' theorem is

:<math>P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}, \qquad (6)</math>

where <math>P(A|B)</math> is the posterior probability of the class, <math>P(B|A)</math> is the likelihood of the predictor given the class, <math>P(A)</math> is the prior probability of the class, and <math>P(B)</math> is the marginal (prior) probability of the predictor. For building the prediction model, the Naïve Bayes classifier is categorized into three types: (i) Gaussian Naïve Bayes (GNB), (ii) Bernoulli Naïve Bayes (BNB), and (iii) Multinomial Naïve Bayes (MNB). The Python library scikit-learn is the most useful library for building a Naïve Bayes model in Python, and it provides the following three types of Naïve Bayes model.

GNB Classifier: It is based on the assumption that the data from each label is drawn from a simple Gaussian distribution.

MNB Classifier: Here, the features are assumed to be drawn from a simple multinomial distribution, which is most suitable for features that represent discrete counts.
BNB classifier: BNB considers the features to be binary (0s and 1s), for example, text classification with the "bag of words" model.

Figure 3: Prediction models: (a) Linear Regression model, (b) Logistic Regression model, (c) SVM model, (d) k-NN classifier, (e) Naive Bayes classifier, and (f) Decision Tree Induction model based on three key features of COVID patients. Panel (f) shows the following tree, where True is the number of correctly classified patients, False is the number of misclassified patients, and Total is the number of patients in a dataset:
* Root: Total 600, split on lactic dehydrogenase (LDH) < 365 U/l
** No: Total 174, outcome death (True: 172, False: 2)
** Yes: Total 426, split on high-sensitivity C-reactive protein (hs-CRP) < 41.2 mg/l
*** Yes: Total 391, outcome cured (True: 391, False: 0)
*** No: Total 35, split on lymphocytes > 14.7%
**** Yes: Total 23, outcome cured (True: 22, False: 1)
**** No: Total 12, outcome death (True: 12, False: 0)

The steps involved in implementing the GNB classifier in Python are as follows: (i) import the GNB packages from the scikit-learn Python library, (ii) obtain blobs of points with a Gaussian distribution by using the make_blobs() function of scikit-learn, (iii) for the GNB model, import GaussianNB and create its object, (iv) perform prediction after obtaining some new data, (v) plot the new data to find its boundaries, (vi) compute the posterior probabilities of the labels, (vii) output the array. The benefits of the Naïve Bayes classifier are fast and easy implementation, the need for less training data, faster convergence than discriminative models like logistic regression, and suitability for both continuous and discrete data. The limitations of the Naïve Bayes classifier are (a) zero frequency: if a variable takes a category that was not observed in the training dataset, the classifier assigns it zero probability and cannot give a prediction, and (b) feature independence: in real-life applications it is difficult to have a set of features that are completely independent of each other. The applications of the Naïve Bayes classifier are real-time prediction, multi-class prediction, and text classification.

====4.5. Decision Tree Induction Classifier====
The decision tree induction classifier is a simple, easily understandable, non-parametric classifier based on the flexible decision tree algorithm. It can perform both classification and regression. The algorithms used to build this model select the training data at random from the original dataset. The steps involved in the working of the decision tree algorithm are as follows: (i) select random samples from a given dataset, (ii) construct a decision tree for every sample and compute the prediction result from every decision tree, (iii) perform voting over the predicted results, (iv) choose the most voted prediction result as the output of the prediction algorithm.

The decision tree is implemented in Python using the following approach: (i) import the necessary Python packages, (ii) download the Kaggle dataset, (iii) assign column names to the dataset, (iv) read the dataset into a pandas dataframe, (v) perform data pre-processing, (vi) divide the data into train and test splits (for example, 70% training data and 30% testing data), (vii) train the decision tree model with the help of the RandomForestClassifier class of sklearn, (viii) generate the prediction, and (ix) the final output is the confusion matrix and classification report. Figure 3(f) shows an example of the rule based on three key features of the COVID-19 patient dataset, i.e., lactic dehydrogenase (LDH), high-sensitivity C-reactive protein (hs-CRP), and lymphocytes. The decision tree was obtained by a random split of a total of 600 patients at the root of the forest, which is the number of patients in the training and validation datasets, whereas the leaf nodes return the outcome as the number of cured and dead patients.

The key benefits of the decision tree model are that it is suitable for a large range of datasets, overcomes the problem of overfitting by merging the results of different decision trees, is flexible and highly accurate, and does not require scaling of the data. The limitations of the decision tree algorithm are high complexity, being harder and more time-consuming than other prediction models, and requiring more computational resources.

===5. Prediction Models Performance Evaluation===
The performance of prediction models can be assessed using a variety of metrics, listed as follows: (1) H-measure, (2) Gini-Index, (3) Area Under Curve (AUC), (4) Area Under the convex Hull of the ROC Curve (AUCH), (5) Kolmogorov-Smirnov statistic (KS), (6) Minimum Error Rate (MER), (7) Minimum Cost Weighted Error Rate (MWL), (8) Specificity when Sensitivity is held fixed at 95% (Spec.Sens95), (9) Sensitivity when Specificity is held fixed at 95% (Sens.Spec95), and (10) Error Rate (ER).

H-measure: The H-measure is an important measure of classification performance that measures the accuracy of the model. The primary statistics of interest are the so-called misclassification counts, i.e., the number of False Negatives (FN) and False Positives (FP). There are four scenarios in prediction modeling. (i) True positives (TP): actuals are positives and are predicted as positives. (ii) False positives (FP): actuals are negatives and are predicted as positives. (iii) False negatives (FN): actuals are positives and are predicted as negatives. (iv) True negatives (TN): actuals are negatives and are predicted as negatives. An example of a false positive is an occurrence where a disease is mistakenly diagnosed, and an example of a false negative is an occurrence where the presence of a disease is missed.

Accuracy (<math>A_C</math>): The accuracy is the ratio of the total correct predictions by the classifier (TP + TN) to the total number of data points. As a ratio, the value of <math>A_C</math> lies between 0 and 1; Eq. (7) expresses it as a percentage.

:<math>A_C = \frac{TP + TN}{TP + TN + FP + FN} \times 100. \qquad (7)</math>

Area Under Curve (AUC): AUC measures the quality of models used for classification problems. It is a metric for binary classification that calculates the area under the curve of a given performance measure; for a classifier that is no worse than chance, its value lies between 0.5 and 1.

Gini-Index (GI): GI is used for the comparison of models. It is calculated from the Gini coefficient, and its value lies between 0 and 1.

:<math>GI = 2 \cdot AUC - 1. \qquad (8)</math>

KS: The KS chart measures the performance of classification models. More precisely, KS is a measure of the degree of separation between the positive and negative distributions.

:<math>KS = \left| \text{cumulative } \%\,(+ve) - \text{cumulative } \%\,(-ve) \right|. \qquad (9)</math>

Error Rate (ER): ER is defined as the ratio of the total misclassification count (FP + FN) to the number of samples n.

:<math>ER = \frac{FP + FN}{n} = \frac{FP + FN}{FN + FP + TN + TP}. \qquad (10)</math>

MER: It represents the Minimum Error Rate. Here the threshold value acts as a free parameter.

MWL: It is related to the KS statistic. Here, the cost guides the threshold value.

Specificity and Sensitivity: The True Positive Rate (TPR) is called Sensitivity (Sens), and the True Negative Rate (TNR) is called Specificity (Spec.).

:<math>Sens = \frac{TP}{TP + FN}, \qquad Spec. = \frac{TN}{TN + FP}. \qquad (11)</math>

Figure 7 computes the H-measure using five classifiers. The normalised cost is plotted on the X-axis and the weighted cost on the Y-axis. Let <math>c \in [0, 1]</math> denote the cost of misclassifying a class 0 object as class 1 (FP), and let <math>1 - c</math> represent the cost of misclassifying a class 1 object as class 0 (FN). This asymmetry can be seen to underlie the KS statistic, which is a simple linear transformation of the MWL when <math>c = \pi_1, 1 - c = \pi_0</math>. The severity ratio (SR) is defined as the ratio between the two costs, where SR = 1 represents symmetric costs.

:<math>SR = \frac{c}{1 - c}, \qquad \text{Normalised Cost} = \frac{SR}{1 + SR}. \qquad (12)</math>

The H-measure is computed for all five classifiers, and the mean value of the Severity Ratio (SR) is 1.12. We pre-process the data to make the experimental data more efficient and to remove redundancy.

Figure 4: Histogram of missing values.

====5.1. Dataset====
To validate the performance of the proposed CURE scheme, the dataset is collected from the Kaggle COVID-19 patient pre-condition dataset [16]. The Kaggle dataset is provided by Johns Hopkins University through a GitHub repository, which contains the real-time updated record of the total active cases, death cases, and recovered cases of the COVID-19 pandemic. In a time of technological advancement and all-round progress, such health issues and threatening diseases are a challenge that helps to make human beings, as well as medical science, more mentally and physically prepared and attentive. As per the reports disclosed by the World Health Organization (WHO), the health curve (infectious cases and cured cases) changes abruptly every day, so it becomes burdensome for the medical and other departments engaged in providing medical facilities and other necessities to estimate the total requirements of health-related equipment and resources. It is very helpful for the entire medical department and other concerned authorities if corona patients can be provided with all the resources they need to fight the lethal disease.
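The confusion-matrix metrics of Section 5 (Eqs. 7, 8, 10 and 11) can be checked numerically with a short standard-library sketch. The function names and the TP/TN/FP/FN counts below are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy (Eq. 7), error rate (Eq. 10), sensitivity and specificity (Eq. 11)."""
    n = tp + tn + fp + fn
    return {
        "accuracy_pct": 100.0 * (tp + tn) / n,  # Eq. (7), as a percentage
        "error_rate": (fp + fn) / n,            # Eq. (10)
        "sensitivity": tp / (tp + fn),          # Eq. (11), true positive rate
        "specificity": tn / (tn + fp),          # Eq. (11), true negative rate
    }


def gini_from_auc(auc):
    """Gini index from AUC, Eq. (8)."""
    return 2 * auc - 1


# Hypothetical counts for 100 patients (not taken from the paper)
m = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
# AUC value of the SVM row in Table 1
gini_svm = gini_from_auc(0.901)
```

Under Eqs. (7) and (10), the accuracy expressed as a ratio and the error rate always sum to 1, which gives a simple sanity check on any reported pair of values.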
In this context, the collected data contain 23 features of 5,66,603 patients.

5.2. Results and Discussion

The experiments are implemented in Python. The results are based on finding the missing values, plotting a heatmap, selecting features, and comparing the machine learning models. They are summarized below.

5.2.1. Missing Values

The initial step is to find the missing values in the Kaggle dataset [16] and plot them. Figure 4 shows the histogram of the missing values in the COVID dataset. As a substitute for these, we computed the mean and replaced each missing value with it. The default input is a numeric array with levels 0 and 1, where the minimum value is 0 and the maximum value is 1.

5.2.2. Heatmap Representation

Since the Kaggle COVID-19 dataset we collected does not contain any missing or redundant values, we represent the complete dataset in Figure 5. It is drawn using the heatmap function of Python and gives a diagrammatic view of the dataset. The parameters of the COVID patients are placed on the X and Y axes.

Figure 5: Heatmap of all the features of COVID-19 dataset.

5.2.3. Feature selection

As shown in Figure 6, we selected 10 of the 23 features of the COVID patient dataset. The selection is made by computing a feature importance score, in the form of the Gini index, with a decision tree method and then analyzing the features by that score.

Figure 6: Representation of selected values of dataset.

5.2.4. Machine Learning Model

As discussed in the CURE scheme, the machine learning models are applied to the pre-processed data. However, there are different methods to enhance the performance of the prediction models, depending on the technique involved. One such technique is to construct ensemble models: once several models each produce a score for a particular outcome, we can integrate them to produce ensemble scores. Figure 7 shows the H-measure of the ensembled model, which can be used to improve the area under the curve for these models even further. For example, consider a decision tree classifier and a logistic regression model, both predicting standard risks. A new score can be calculated as the average of the scores of these two classifiers and then assessed as a further model. The area under the curve usually improves for such ensemble models.

Figure 7: H-measure of ensembled model.

The experimental results are reported in Table 1.

{| class="wikitable"
|+ Table 1: Comparison of the performance analysis of various ML prediction models.
! Models !! H !! Gini Index !! AUC !! AUCH !! KS !! MER !! MWL !! Spec.Sens95 !! Sens.Spec95 !! ER
|-
| SVM || 0.687 || 0.802 || 0.901 || 0.901 || 0.802 || 0.099 || 0.098 || 0.443 || 0.447 || 0.46
|-
| LR || 0.672 || 0.791 || 0.896 || 0.896 || 0.791 || 0.104 || 0.104 || 0.421 || 0.506 || 0.482
|-
| k-NN || 0.655 || 0.781 || 0.891 || 0.891 || 0.781 || 0.109 || 0.109 || 0.478 || 0.49 || 0.469
|-
| Naïve Bayes || 0.632 || 0.765 || 0.882 || 0.882 || 0.765 || 0.117 || 0.117 || 0.494 || 0.52 || 0.47
|-
| Random Forest || 0.675 || 0.794 || 0.897 || 0.897 || 0.794 || 0.103 || 0.103 || 0.448 || 0.475 || 0.476
|}

6. Conclusion

In this paper, a CURE scheme is proposed based on machine learning prediction models for the treatment of COVID patients through remote e-healthcare. The performance of the proposed scheme is evaluated on the Python platform and tested using the Kaggle COVID-19 patient pre-condition dataset from Johns Hopkins University. First, features are extracted from the COVID patient dataset for diagnosing the symptoms of the coronavirus. Next, different machine learning prediction models (such as SVM, LR, k-NN, and Naïve Bayes) are trained and then tested on the collected data to classify the features of the COVID patients and forecast the infection rate. Finally, the performance of the prediction models is assessed using the following metrics: H-measure, Gini Index, Area Under Curve (AUC), AUCH, KS, Minimum Error Rate (MER), Minimum Cost Weighted Error Rate (MWL), Spec.Sens95, Sens.Spec95, and Error Rate (ER). The performance evaluation shows that the CURE scheme outperforms the existing approach, which deals with an imbalanced dataset.

In future, we will ensure the secrecy of the coronavirus data, as the patients' sensitive credentials can be leaked during data transmission through wireless channels (Internet).

References

[1] Punn, Narinder Singh, Sanjay Kumar Sonbhadra, and Sonali Agarwal. "COVID-19 Epidemic Analysis using Machine Learning and Deep Learning Algorithms." medRxiv (2020), doi: https://doi.org/10.1101/2020.04.08.20057679.

[2] Jamshidi, M., Lalbakhsh, A., Talla, J., Peroutka, Z., Hadjilooei, F., Lalbakhsh, P., Jamshidi, M., La Spada, L., Mirmozafari, M., Dehghani, M. and Sabet, A. "Artificial Intelligence and COVID-19: Deep Learning Approaches for Diagnosis and Treatment." IEEE Access, vol. 8, pp. 109581-109595, Jun. 2020.

[3] Yan, Li, Hai-Tao Zhang, Yang Xiao, Maolin Wang, Chuan Sun, Jing Liang, Shusheng Li et al. "Prediction of survival for severe Covid-19 patients with three clinical features: development of a machine learning-based prognostic model with clinical data in Wuhan." medRxiv (2020).

[4] "COVID-19 Worldwide Dashboard - WHO Live World Statistics." Online available: https://covid19.who.int/, accessed on 31 July, 2020.

[5] Rehman, Suriya, Tariq Majeed, Mohammad Azam Ansari, Uzma Ali, Hussein Sabit, and Ebtesam A. Al-Suhaimi. "Current scenario of COVID-19 in pediatric age group and physiology of immune and thymus response." Saudi Journal of Biological Sciences (2020).

[6] Nguyen, Thanh Thi. "Artificial intelligence in the battle against coronavirus (COVID-19): a survey and future research directions." Preprint, DOI 10 (2020).

[7] Zhang, Jian, and Yiming Yang. "Robustness of regularized linear classification methods in text categorization." In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 190-197. 2003.

[8] Tan, Yuxuan. "An improved KNN text classification algorithm based on K-medoids and rough set." In 2018 10th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), vol. 1, pp. 109-113. IEEE, 2018.

[9] Samuel, Jim, G. G. Ali, Md Rahman, Ek Esawi, and Yana Samuel. "Covid-19 public sentiment insights and machine learning for tweets classification." Information, vol. 11, no. 6, Jun. (2020).

[10] Pinter, Gergo, Imre Felde, Amir Mosavi, Pedram Ghamisi, and Richard Gloaguen. "COVID-19 Pandemic Prediction for Hungary; a Hybrid Machine Learning Approach." Mathematics, vol. 8, no. 6 (2020): 890.

[11] Yan, Li, Hai-Tao Zhang, Yang Xiao, Maolin Wang, Chuan Sun, Jing Liang, Shusheng Li et al. "Prediction of criticality in patients with severe Covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in Wuhan." medRxiv (2020).

[12] Lin, Leesa, Rachel F. McCloud, Cabral A. Bigman, and Kasisomayajula Viswanath. "Tuning in and catching on? Examining the relationship between pandemic communication and awareness and knowledge of MERS in the USA." Journal of Public Health 39, no. 2 (2017): 282-289.

[13] Hamzah, FA Binti, C. Lau, H. Nazri, D. V. Ligot, G. Lee, and C. L. Tan. "CoronaTracker: worldwide COVID-19 outbreak data analysis and prediction." Bull World Health Organ 1 (2020): 32.

[14] Jia, Lin, Kewen Li, Yu Jiang, and Xin Guo. "Prediction and analysis of Coronavirus Disease 2019." arXiv preprint arXiv:2003.05447 (2020).

[15] Tuli, Shreshth, Shikhar Tuli, Rakesh Tuli, and Sukhpal Singh Gill. "Predicting the Growth and Trend of COVID-19 Pandemic using Machine Learning and Cloud Computing." Internet of Things (2020): 100222.

[16] "COVID-19 patient pre-condition dataset", 2020. Online available: https://www.kaggle.com/tanmoyx/covid19-patient-precondition-dataset/notebooks