=Paper=
{{Paper
|id=Vol-2786/Paper50
|storemode=property
|title=CURE: An Effective COVID-19 Remedies based on Machine Learning Prediction Models
|pdfUrl=https://ceur-ws.org/Vol-2786/Paper50.pdf
|volume=Vol-2786
|authors=Poonam Phogat,Rajat Chaudhary
|dblpUrl=https://dblp.org/rec/conf/isic2/PhogatC21a
}}
==CURE: An Effective COVID-19 Remedies based on Machine Learning Prediction Models==
<pdf width="1500px">https://ceur-ws.org/Vol-2786/Paper50.pdf</pdf>
<pre>


CURE: An Effective COVID-19 Remedies based on Machine
Learning Prediction Models
Poonam Phogat (a), Rajat Chaudhary (b)
(a) Computer Science & Engineering, SGT University, Gurugram, Haryana (India)
(b) Computer Science & Engineering, Bharat Institute of Engineering & Technology, Hyderabad, Telangana (India)


                                             Abstract
                                             Coronavirus disease (COVID-19) is a severe pandemic infectious disease caused by a virus that enters the healthy cells of a
                                             living body. The virus multiplies in the organs of the host, which ultimately leads to the death of some healthy cells and
                                             therefore weakens the immune system. In the mild stage it mainly affects the respiratory tract; in the last stage it leads to
                                             pneumonia, organ failure, and death. This paper focuses on the early detection of COVID-19 patients based on the positive
                                             symptoms of the disease. The COVID-19 Remedies (CURE) scheme is proposed, based on machine learning prediction models,
                                             for the treatment of COVID patients. The performance of the CURE scheme is evaluated on the Python platform and tested
                                             using the Kaggle dataset from Johns Hopkins University.

                                             Keywords
                                             COVID-19, Machine Learning, Prediction Model


1. Introduction

The virus that induces COVID-19 is severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), which was
first diagnosed in late December 2019 during an investigation into an outbreak in Wuhan, China. As cases
increased rapidly throughout the world, the WHO declared the disease a pandemic on March 11, 2020. Currently,
the transmission of COVID-19 has become uncontrollable because the number of cases has reached the threshold
limit [1]. The virus enters the healthy cells of a living body and multiplies in the organs of the host, which
ultimately leads to the death of some healthy cells and therefore weakens the immune system. In the mild stage
it mainly affects the respiratory tract; in the last stage it leads to pneumonia, organ failure, and death [2].
The disease is prominent in elderly people with a weak immune system and in people who already have other
pre-existing diseases such as diabetes, high blood pressure, and cardiovascular and respiratory diseases [3].

Figure 1 shows the global statistics till July 30, 2020, on the total confirmed cases, active cases, total
deaths, and total cured cases of COVID-19. Figure 1(a) presents the total number of coronavirus cases across
different countries, which shows that the virus is spreading rapidly, with the highest number of cases in the
USA followed by India. The total confirmed positive cases across the world are 2,18,69,976, out of which
26,47,663 cases are in India. Figure 1(b) shows the statistics of the active cases; there are 65,04,303 active
cases globally. Figure 1(c) presents the total death cases, and finally, Figure 1(d) shows the total cured
cases [4].

This is a communicable virus that spreads through respiratory droplets present in the air. These aerosols are
released into the open environment when an infected person sneezes or coughs; they enter other persons through
the mouth and nostrils and reach the lungs. There is no precise treatment to cure COVID-19. Some steps are
being taken to eliminate the virus using different medicines such as Hydroxychloroquine, which is an
antimalarial drug. Currently, it is used to treat coronavirus patients; it helps to inhibit infection by
increasing the endosomal pH, which provides enough strength to the immune system to fight against the viral
disease [5].

Some preventive measures are necessary to manage this pandemic. From the very beginning of COVID-19, the
governments of almost all countries have taken strict actions such as complete lockdown, social distancing,
and the use of sanitizer and masks to reduce all the causing elements [6]. Across the studies explored,
Machine Learning seems to be the best approach for forecasting the increasing number of COVID-19 infected
cases. The regression and classification approaches of ML are applied to this problem according to the
availability of data.
ISIC’21: International Semantic Intelligence Conference, February 25-27, 2021, Delhi, India
Email: poonamphogat07@gmail.com (P. Phogat); rajat@biet.ac.in (R. Chaudhary)
ORCID: 0000-0002-6554-918X (R. Chaudhary)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1.1. Contributions

The contributions of the paper are summarized below.

     • Diagnose the symptoms of COVID-19 patients based on the classification of the diseases.
     • To recover the COVID-19 patients, CURE scheme


[Figure 1, panels (a)-(d): charts of (a) total confirmed cases by country, (b) active cases, (c) total deaths, and (d) total cured cases]

                      Active Cases    Total Cases    Total Deaths    Total Cured
India                 6,76,900        26,47,663      50,921          19,19,842
India's Share (%)     10.4%           12.1%          6.6%            13.3%
World                 65,04,303       2,18,69,976    7,73,741        1,45,91,932

Figure 1: Data statistics of total, active, death, and cured cases on COVID-19.
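
The share row of the table can be recomputed from the two count rows as share = 100 × India / World. The following is a minimal Python sketch of that arithmetic; the dictionary and function names are illustrative and do not come from the paper.

```python
# Counts copied from the table in Figure 1 (Indian digit grouping removed).
india = {"active": 676_900, "total": 2_647_663, "deaths": 50_921, "cured": 1_919_842}
world = {"active": 6_504_303, "total": 21_869_976, "deaths": 773_741, "cured": 14_591_932}


def share_percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# India's share of each world-wide count.
shares = {key: share_percent(india[key], world[key]) for key in india}

for key, value in shares.items():
    print(f"{key:>6}: {value:.2f}%")
```

The recomputed active, total, and death shares round to the tabulated 10.4%, 12.1%, and 6.6%; the cured share comes out near 13.2%, within about 0.1 percentage point of the tabulated 13.3%.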


        is proposed, based on a machine learning prediction model, to forecast the most suitable
        treatment for COVID disease.
     • For simulation, the proposed scheme is tested using the Kaggle dataset.
     • Finally, the performance of five classifiers is compared, and the most efficient outcome is
       predicted using the Python platform.

1.2. Paper Organization

The rest of the paper is structured as follows: Section II discusses the literature review of the existing
schemes. Section III presents the system model, followed by the proposed CURE scheme in Section IV. Section V
comprises the prediction performance evaluation, and finally, Section VI concludes the paper.

2. Literature Review

Researchers have introduced several Machine Learning methods for classification. The simplest is the Linear
Regression method, which minimizes the sum of squared differences between the real and predicted data. The
drawbacks of this model are its ineffectiveness on non-linear data and its sensitivity to outliers [7]. In the
Logistic Regression model, the probability of the outcome is modeled with the logistic function. The advantage
of this model is that it is free of complications; its limitation is the assumption of linearity.
The Naive Bayes model requires only limited


[Figure 2 (workflow diagram). Boxes: COVID-19 Dataset; Data Pre-processing; Feature Selection (Diagnose
 COVID Symptoms); Training Dataset; Prediction models; Trained Model; Performance Metrics; (Output)
 Comparison of performance analysis of prediction models.
 Prediction models: 1. Linear Regression, 2. SVM, 3. k-NN, 4. Naive Bayes, 5. Random Forest.
 Performance Metrics: 1. H-Measure, 2. Gini Index, 3. AUC, 4. AUCH, 5. KS, 6. MER, 7. MWL,
 8. Spec.Sens95, 9. Sens.Spec95, 10. ER.]
Figure 2: Workflow of the proposed CURE Scheme for the treatment of COVID patient.

training data to estimate the necessary parameters and
deals effectively with real-world data. Another model,
K-Nearest Neighbour, works efficiently with modest data
and is suitable for multi-class problems [8], [9].
   Pinter et al. [10] proposed the Machine Learning
approaches multi-layered perceptron-imperialist
competitive algorithm (MLP-ICA) and adaptive
network-based fuzzy inference system (ANFIS) for
prediction of the COVID-19 confirmed positive and death
cases. This model maintains its accuracy for the next
9 days, which gives
reassuring results [11]. The government and the public
have to support the researchers and help in lowering
the case numbers by maintaining social distancing and
following other precautions [12]. Hamzeh et al. [13]
work on the Susceptible-Exposed-Infectious-Recovered
(SEIR) model, which is reported to perform well on
moderate data. The outbreak of this infectious disease
may cause variations in the data prediction.
   Jia et al. [14] define four stages for COVID-19 cases.
In the first stage, a person with COVID-19 symptoms has
a travel history, which leads to lockdown. When the
infected person comes into contact with other persons,
the virus reaches the second stage. To prevent the
number of cases from increasing, social distancing is
applied. In the third stage, there is neither travel
history nor contact with an infected person, so the
chances of the virus spreading through respiratory
droplets become high. Hence, the use of masks and
sanitizers is necessary. The last stage is an
uncontrollable stage where the cases have reached the
threshold limit. Tuli et al. [15] improved COVID-19
prediction by using a Machine Learning model. In this
model a data-driven approach is used to help the
government and the public. After processing the data
with ML and AI, researchers can forecast the time scale
and the regions where the possibility of the disease
spreading is highest. It is predicted that, using
different ML models, COVID-19 cases can be controlled
or eliminated in all the countries of the world that
are facing this critical situation.

                                                             symptoms of COVID-19 patients.
                                                             The problem is heightened by imbalance in the data. In
                                                             medical data the class imbalance problem is frequent; it
                                                             occurs when some classes have many more cases than
                                                             others. To handle an imbalanced dataset, several
                                                             solutions are available at both the algorithmic and the
                                                             data level. In this paper, the performances of 5
                                                             classifiers and regressions are compared on the
                                                             imbalanced dataset obtained while studying the
                                                             prediction of COVID-19. On the basis of the performance
                                                             of these regressions and classifiers, the impact of
                                                             SMOTE (Synthetic Minority Oversampling Technique), an
                                                             approach that deals with imbalanced datasets, is
                                                             thoroughly evaluated.
                                                                With the algorithms used in this method, the k samples
                                                             closest to each minority sample within the minority
                                                             class are found, and the standard Euclidean distance
                                                             is used to measure this proximity. The imbalanced
                                                             dataset is taken with its numbers of cases in the
                                                             minority and majority classes. Based on the
                                                             independent variable, the original dataset is
                                                             partitioned into two sets, a training set (80%) and a
                                                             test set (20%), using stratified random sampling. By
                                                             applying the SMOTE technique, the training set is
                                                             oversampled to find the class distribution best suited
                                                             to the dataset, and 8 training sets are obtained: 1
                                                             original set and 7 oversampled sets with different
                                                             rates.
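
The split-then-oversample procedure described above (a stratified 80/20 split followed by SMOTE on the training set only) can be sketched with the Python standard library alone. The toy data, the function names `stratified_split` and `smote`, the seeds, and the choice k = 5 are assumptions made for illustration; the paper does not state its SMOTE settings, and in practice a library implementation such as imbalanced-learn would normally be used.

```python
import math
import random
from collections import defaultdict


def stratified_split(X, y, test_fraction=0.2, seed=0):
    """Split (X, y) so that each class keeps the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idx):
        return [X[i] for i in idx], [y[i] for i in idx]

    return pick(train_idx), pick(test_idx)


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by Euclidean distance.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy imbalanced data: 90 majority (label 0) and 10 minority (label 1) points.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(90)]
X += [[data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)] for _ in range(10)]
y = [0] * 90 + [1] * 10

(X_train, y_train), (X_test, y_test) = stratified_split(X, y, test_fraction=0.2)

# Oversample only the training set; the test set stays untouched.
minority_train = [x for x, label in zip(X_train, y_train) if label == 1]
new_points = smote(minority_train, n_synthetic=64, k=5)
X_balanced = X_train + new_points
y_balanced = y_train + [1] * len(new_points)

print(len(y_train), len(y_test))                 # sizes of the 80/20 split
print(y_balanced.count(0), y_balanced.count(1))  # class counts after oversampling
```

With this toy data the split yields 80 training and 20 test rows, and the oversampled training set holds 72 rows of each class. Repeating the `smote` call with different values of `n_synthetic` would give several oversampled training sets at different rates, in the spirit of the 7 oversampled sets mentioned in the text.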


3. System Model

Figure 2 presents the workflow of the proposed CURE
scheme for the treatment of COVID patients. Initially, the
input is the dataset taken from Johns Hopkins
University. Then the symptoms of positive cases are
analyzed and categorized into 3 sub-parts: Severe,
Moderate, and Mild symptoms. A patient with severe
symptoms, which include throttling, must face a harsh
period. Moderate symptoms include shortness of breath,
fever, and cough. Mild symptoms include fever, cough, and
headache. The proposed scheme for COVID-19 outbreak
analysis is trained and tested on real-time data using the

                                                                        4. Proposed CURE Scheme

                                                                        The proposed CURE scheme uses a wide range of methods
                                                                        and tools for prediction. By combining different
                                                                        models, namely SVM (Support Vector Machine), LR
                                                                        (Linear Regression), k-NN (k-Nearest Neighbors), and
                                                                        Naïve Bayes classification, together with the R tool,
                                                                        a machine learning model is proposed for forecasting
                                                                        the COVID-19 infection rate. The collected dataset is
                                                                        cleaned before further processing; this is considered
                                                                        the first step in knowledge discovery in databases.
                                                                        For written-character classification problems, this
                                                                        data cleansing process is applied using Machine
                                                                        Learning techniques. The process that implements
                                                                        methods to detect missing and incorrect data, error correction


and exploration of databases is called data cleaning; it involves reassembling and disintegrating
data. Data cleansing is practiced on numerous merged databases in which duplicate records appear.
Four quality dimensions are proposed: certainty, correctness, integrity, and consistency.
   Primary symptoms of this disease include loss of taste and smell, headache, fever, dizziness,
tiredness, and shortness of breath. According to their seriousness, symptoms are classified into three
categories, i.e. mild, moderate, and severe. Mild symptoms include fever, cough, and headache; the
level of seriousness is low at this stage. Then comes the moderate stage, in which shortness of breath
is the main symptom along with high fever and cough. In the severe stage, the patient reaches a
critical situation and becomes profoundly ill. Respiratory problems are the main problem the patients
must face. The virus mainly affects the lungs, damaging the alveoli responsible for supplying oxygen
to all parts of the body through blood vessels and RBCs. The virus damages the alveolar wall and
causes it to thicken, which lowers the transfer of oxygen to RBCs and ultimately leads to hypoxia. Due
to insufficient intake of oxygen, the chances of organ failure remain high. The collected data is used
first to train and then to test different models: SVM (Support Vector Machine), LR (Linear
Regression), k-NN (k-Nearest Neighbors), and Naïve Bayes classification. These prediction methods are
explained below.

4.1. Linear Regression (LR)

LR is the most widely used statistical technique for predictive analysis in Machine Learning. Linear
regression is a supervised Machine Learning algorithm that performs a regression task. The LR
prediction model uses the given data points to obtain the best-fit line for the training dataset. A
simple equation of a line is $y = mx + c$, where $y$ is the dependent variable, $x$ is the independent
variable, and $m$, $c$ are constants whose values are computed using calculus. Figure 3(a) shows an
example of an LR prediction model that takes the features as input and predicts a continuous output
by fitting a line to the given problem. The output of the LR model is computed using the equation

    $$ y = \mu_0 + \mu_1 x_1 + \epsilon, \qquad (1) $$

where $\mu_0$ represents the y-intercept, $\mu_1$ represents the slope, $x_1$ is the input value,
$\epsilon$ represents the error term, and $y$ is the output value of the model. At the start of
training, the coefficients $\mu$ are initialized randomly and are then corrected during training, one
per feature, such that the loss (the deviation between the desired and predicted output) is minimized.
The loss is measured using the mean squared error (MSE). The advantages of LR are its simple
implementation, fast training, the possibility of regularization to avoid overfitting, and easy
updating with new data using gradient descent. The disadvantages of the LR model are that it performs
poorly for non-linear relationships, is not flexible enough to capture complex patterns, and adding
polynomial terms can be time consuming. However, to generate a discrete output, i.e. 0 or 1, the
logistic regression (binary classification) model is used. Figure 3(b) shows an example of logistic
regression, which calculates the aggregate sum of the input variables as the LR model does but passes
the result through the non-linear sigmoid function to generate the output:

    $$ y = \frac{1}{1 + e^{-x}}, \qquad (2) $$

where $x$ is the input value, $y$ is the output value of the model, and $e$ is the base of the
exponential function. The LR prediction model can be implemented in Python.

4.2. Support Vector Machine method (SVM)

SVM is a supervised ML algorithm used for both classification and regression. An example of an SVM
classifier is shown in Figure 3(c), which represents different classes separated by a decision plane
or hyperplane in n-dimensional space. In this figure, the support vectors are the data points that
are nearest to the hyperplane. The data points are divided into classes by separating lines
($H_1$, $H_2$, $H_3$). Here, the margin is defined as the gap or perpendicular distance from the line
to the support vectors. The objective of SVM is to separate the dataset into classes by finding the
maximum marginal hyperplane. SVM first finds hyperplanes iteratively that isolate the classes and
then selects the hyperplane that divides the classes in the best way. SVM can perform non-linear
classification efficiently as well as linear classification. It is extremely effective in
high-dimensional spaces and in cases where the number of dimensions is greater than the number of
samples. SVM transforms the input vector into an n-dimensional space known as a feature space (f) by
using a non-linear function, and a linear function is then applied in that space. It is implemented in
Python by using SVM kernels. The types of SVM kernels are the linear kernel, the polynomial kernel,
and the radial basis function (RBF) kernel.
   Linear Kernel: It is the dot product between two observations, and the linear kernel function is
defined by the equation

    $$ f(v, v_i) = \mathrm{sum}(v * v_i), \qquad (3) $$

where $v$, $v_i$ are two vectors.
   Polynomial Kernel: It discriminates curved or non-linear input spaces and is defined by the equation

    $$ f(v, v_i) = (1 + \mathrm{sum}(v * v_i))^d, \qquad (4) $$


   where $d$ is the degree of the polynomial, which is set
manually in the learning algorithm.
   Radial Basis Function (RBF) Kernel: It transforms the
input space into a multi-dimensional space and is defined
by the equation

    $$ f(v, v_i) = \exp(-\gamma \cdot \mathrm{sum}((v - v_i)^2)), \qquad (5) $$

   where $\gamma$ lies between 0 and 1; it is set manually
and its default value is 0.1.
   The steps to be followed in implementing an SVM
classifier for text classification are as follows: (i)
import the svm packages; (ii) load the input dataset; (iii)
select features from the dataset; (iv) plot SVM boundaries
with the original data; (v) generate the values of the
regularization parameter; (vi) create SVM classifier
objects using a kernel (linear, polynomial, RBF); (vii) the
final output is the text classification. The advantages of
SVM classifiers are high accuracy in multi-dimensional
spaces and low memory use, since they use only a subset of
the training points. The disadvantage of SVM classifiers
is that their performance does not scale to larger datasets
due to high training time, and they do not perform well
with overlapping classes. Thus, decision trees are usually
preferred over SVM for large datasets.

4.3. k-NN (k-Nearest Neighbors)

The k-nearest neighbors (k-NN) algorithm is a supervised ML
technique which is generally used for classification
problems, although it can be used for both classification
and regression. The k-NN method classifies documents based
on resemblance measurements: by estimating factors such as
distance and proximity, the similarity between two data
points is quantified, and each data point is classified
based on its nearest neighbors. Figure 3(d) shows an
example of a k-NN model, which assumes the closeness of two
data points (similar data points). k-NN works on the
principle of feature similarity in order to predict the
values of new data points. Thus, a new data point is
allocated a value based on how closely it matches the data
points in the training set. The steps involved in the k-NN
algorithm are as follows: (i) Load the training and testing
dataset. (ii) Select the value of k (integer), i.e. the
number of closest data points. (iii) For each point in the
test data, compute the distance between the test point and
each row of the training data with the help of the
Euclidean or Hamming distance, and sort the distance values
in ascending order. (iv) Select the top k rows from the
sorted array. Next, allocate a class to the test point
based on the most frequent class of these rows. (v) final
output.

                                                                dataset to pandas dataframe, (v) perform data
                                                                preprocessing, (vi) split the data into train and test
                                                                datasets (60% training data and 40% testing data), (vii)
                                                                perform data scaling, (viii) train the model using the
                                                                K-nearest neighbors classifier class of sklearn, (ix)
                                                                obtain predictions, (x) output the results: confusion
                                                                matrix, classification report, and accuracy. The benefits
                                                                of the k-NN algorithm are that it is simple, useful for
                                                                nonlinear data, and highly accurate. The limitation of
                                                                the k-NN algorithm is that it is costly, as it stores all
                                                                the training data. In addition, it requires more memory
                                                                storage, and prediction is slow for large datasets.

                                                                4.4. Naïve Bayes

                                                                Naïve Bayes is a classification method based on Bayes'
                                                                theorem. It works on the strong assumption of
                                                                conditional independence: the existence of a feature in
                                                                a class is independent of the existence of any other
                                                                feature in the same class. Consider the example of a
                                                                smart 4K TV. A TV is placed in the smart category if it
                                                                has features such as an Internet connection, high
                                                                definition, Bluetooth, USB ports, HDMI connectivity, and
                                                                support for multiple applications. Although these
                                                                features may depend on each other, each feature
                                                                contributes independently to the probability that the
                                                                4K TV is a smart TV. Naïve Bayes is a highly scalable
                                                                algorithm that can readily be trained on a small dataset.
                                                                Figure 3(e) shows an example of a Naïve Bayes model
                                                                that classifies the data points, based on the posterior
                                                                probability of the class, into three different classes,
                                                                i.e. classifier 1 (red data points), classifier 2 (orange
                                                                data points), and classifier 3 (blue data points). The
                                                                expression of the Naïve Bayes algorithm based on Bayes'
                                                                theorem is defined as follows.

                                                                    $$ P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}, \qquad (6) $$

                                                                where $P(A|B)$ indicates the posterior probability of
                                                                the class, $P(B|A)$ indicates the likelihood of the
                                                                predictor given the class, $P(A)$ refers to the prior
                                                                probability of the class, and $P(B)$ refers to the
                                                                marginal probability or prior probability of the
                                                                predictor. For building the prediction model, the Naïve
                                                                Bayes classifier is categorized into three types: (i)
                                                                Gaussian Naïve Bayes (GNB), (ii) Bernoulli Naïve Bayes
                                                                (BNB), and (iii) Multinomial Naïve Bayes (MNB).
                                                                Scikit-learn is the most useful Python library for
                                                                building a Naïve Bayes model in Python. The following
                                                                three types of Naïve Bayes model are available in the
                                                                Scikit-learn Python library.
                                                                   GNB Classifier: It is based on the consideration that
   k-NN algorithm can be implemented in Python by               the  data from each label is drawn from a simple Gaus-
using the following approach: (i) importing necessary           sian  distribution. MNB Classifier: Here, the features are
python packages, (ii) download the Kaggle COVID-19              considered    to be drawn from a simple Multinomial dis-
dataset, (iii) assign column names to the dataset, (iv) read tribution which is most suitable for the features that
                                                                represents discrete counts. BNB classifier: BNB consider
[Figure 3 panels: (a) Linear Regression Model, (b) Logistic Regression Model, (c) SVM Model,
(d) k-NN Model, (e) Naive Bayes Classifier, (f) Decision Tree based on three key features of COVID
patient]

Panel (f), decision tree. Total: number of patients in a dataset; True: number of correctly
classified patients; False: number of misclassified patients.
   Total: 600, test: lactic dehydrogenase (LDH) < 365 U/L
      1: Total: 426, test: high-sensitivity C-reactive protein (hs-CRP) < 41.2 mg/L
         1: Total: 391, cured (True: 391, False: 0)
         0: Total: 35, test: lymphocytes > 14.7 %
            1: Total: 23, cured (True: 22, False: 1)
            0: Total: 12, death (True: 12, False: 0)
      0: Total: 174, death (True: 172, False: 2)

Figure 3: Prediction models: (a) Linear Regression model, (b) Logistic Regression model, (c) SVM model, (d) k-NN classifier,
(e) Naive Bayes classifier, and (f) Decision Tree Induction model.
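
The rule drawn in Figure 3(f) can be written as a short function. The following is a minimal sketch
that only restates the thresholds and leaf counts shown in the figure; the function name, argument
names, and the string labels are illustrative assumptions, not part of the original scheme.

```python
def predict_outcome(ldh_u_per_l, hs_crp_mg_per_l, lymphocyte_pct):
    """Apply the three-feature rule of Figure 3(f).

    The thresholds are read from the figure; names and labels are hypothetical.
    """
    if ldh_u_per_l < 365:               # root test: LDH < 365 U/L
        if hs_crp_mg_per_l < 41.2:      # second test: hs-CRP < 41.2 mg/L
            return "cured"
        if lymphocyte_pct > 14.7:       # third test: lymphocytes > 14.7 %
            return "cured"
        return "death"
    return "death"                      # LDH at or above 365 U/L


# Leaf counts reported in the figure as (True, False) = (correct, misclassified).
leaf_counts = [(391, 0), (22, 1), (12, 0), (172, 2)]
correct = sum(true for true, _ in leaf_counts)
total = sum(true + false for true, false in leaf_counts)
tree_accuracy = correct / total         # share of the 600 patients classified correctly
```

Summing the leaves recovers the 600 patients at the root, of which 597 are classified correctly.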


the features to be binary (0s and 1s), as for example in text classification with the 'bag of words'
model.
The steps involved in implementing the GNB classifier in Python are as follows: (i) Import the GNB
packages from the scikit-learn Python library. (ii) Obtain blobs of points with a Gaussian
distribution by using the make_blobs() function of scikit-learn. (iii) For the GNB model, import
GaussianNB and create an object of it. (iv) Perform prediction after obtaining some new data.
(v) Plot the new data to find its boundaries. (vi) Compute the posterior probabilities of the
labels. (vii) Output array. The benefits of the Naïve Bayes classifier are fast and easy
implementation, a need for less training data, faster convergence than discriminative models such as
logistic regression, and suitability for both continuous and discrete data. The limitations of the
Naïve Bayes classifier are (a) zero frequency: if a variable takes a category that was not observed
in the training data set, the classifier assigns it a zero probability and does not give a
prediction; and (b) feature independence: in real-life applications it is difficult to have a set of
features that are completely independent of each other. The applications of the Naïve Bayes
classifier are real-time prediction, multi-class prediction, and text classification.

4.5. Decision Tree Induction Classifier

The decision tree induction classifier is a simple, easily understandable non-parametric classifier
based on the flexible decision tree algorithm. It can perform both classification and regression. In
the algorithms used to formulate this model from the original dataset, the training data are
selected at random. The steps involved in the working of the algorithm, which combines several
decision trees, are as follows: (i) Select random samples from the given dataset. (ii) Construct a
decision tree for every sample and compute the prediction result from every decision tree.
(iii) Vote over the predicted results. (iv) Choose the most voted prediction result as the output of
the prediction algorithm.
   The decision tree is implemented in Python using the following approach: (i) import the necessary
Python packages, (ii) download the Kaggle dataset, (iii) assign column names to the dataset,
(iv) read the dataset into a
pandas dataframe, (v) perform data pre-processing, (vi) divide the data into train and test splits
(for example, 70% training data and 30% testing data), (vii) train the model with the help of the
RandomForestClassifier class of sklearn, (viii) generate predictions, and (ix) the final output is
the confusion matrix and the classification report.
Figure 3(f) shows an example of the rule based on three key features of the COVID-19 patient
dataset, i.e., lactic dehydrogenase (LDH), high-sensitivity C-reactive protein (hs-CRP), and
lymphocytes. The root of the tree holds the total of 600 patients, which were randomly split into
training and validation datasets, whereas each leaf node returns the outcome as the number of cured
and dead patients.
   The key benefits of the decision tree model are that it is suitable for a large range of
datasets, overcomes the problem of overfitting by merging the results of different decision trees,
is flexible and highly accurate, and does not require scaling of the data. The limitations of the
decision tree algorithm are high complexity, longer and harder training in comparison to other
prediction models, and a need for more computational resources.

5. Prediction Models Performance Evaluation

The performance of prediction models can be assessed using a variety of metrics, listed as follows:
(1) H-measure, (2) Gini-Index, (3) Area Under Curve (AUC), (4) Area Under the convex Hull of the ROC
Curve (AUCH), (5) Kolmogorov-Smirnov statistic (KS), (6) Minimum Error Rate (MER), (7) Minimum Cost
Weighted Error Rate (MWL), (8) Specificity when Sensitivity is held fixed at 95% (Spec.Sens95),
(9) Sensitivity when Specificity is held fixed at 95% (Sens.Spec95), and (10) Error Rate (ER).
   H-measure: The H-measure is an important measure of classification performance that measures the
accuracy of the model. The primary statistics of interest are the so-called misclassification
counts, i.e., the number of False Negatives (FN) and False Positives (FP). There are four scenarios
in prediction modeling. (i) True positives (TP): actuals are positives and are predicted as
positives. (ii) False positives (FP): actuals are negatives and are predicted as positives.
(iii) False negatives (FN): actuals are positives and are predicted as negatives. (iv) True
negatives (TN): actuals are negatives and are predicted as negatives. An example of a false positive
is an occurrence where a disease is mistakenly diagnosed, and an example of a false negative is an
occurrence where the presence of a disease is missed.
   Accuracy (A_C): The accuracy on a given dataset is the ratio of the total correct predictions by
the classifier (TP + TN) to the total number of data points. As a ratio, the value of A_C lies
between 0 and 1; Eq. (7) expresses it as a percentage.

    A_C = \frac{TP + TN}{TP + TN + FP + FN} \times 100.    (7)

   Area Under Curve (AUC): AUC measures the quality of models used for classification problems. It
is a metric for binary classification that calculates the area under the curve of a given
performance measure. For a classifier that is no worse than random guessing, its value lies between
0.5 and 1.
   Gini-Index (GI): GI is used for the comparison of models. It is calculated from the AUC by using
the Gini coefficient, and its value lies between 0 and 1.

    GI = 2 \cdot AUC - 1.    (8)

   KS: The KS chart measures the performance of classification models. More accurately, KS is a
measure of the degree of separation between the positive and negative distributions.

    KS = \left| \text{cumulative \% positive} - \text{cumulative \% negative} \right|.    (9)

   Error Rate (ER): ER is defined as the total misclassification count (FP + FN) divided by the
number of samples n.

    ER = \frac{FP + FN}{n} = \frac{FP + FN}{FN + FP + TN + TP}.    (10)

   MER: It represents the Minimum Error Rate. Here, the threshold value acts as a free parameter.
   MWL: It is related to the KS statistic. Here, the cost guides the threshold value in this measure.
   Specificity and Sensitivity: The True Positive Rate (TPR) is called Sensitivity (Sens), and the
True Negative Rate (TNR) is called Specificity (Spec.).

    Sens = \frac{TP}{TP + FN}, \qquad Spec. = \frac{TN}{TN + FP}.    (11)

Figure 7 shows the H-measure computed using five classifiers. The normalised cost is plotted on the
X-axis. Let c ∈ [0, 1] denote the cost of misclassifying a class 0 object as class 1 (FP), and let
1 − c represent the cost of misclassifying a class 1 object as class 0 (FN). This asymmetry can be
seen to underlie the KS statistic, which is a simple linear transformation of the MWL when c = π_1,
1 − c = π_0. The severity ratio (SR) is defined as the ratio between the two costs, where SR = 1
represents symmetric costs.

    SR = \frac{c}{1 - c}, \qquad \text{Normalised Cost} = \frac{SR}{1 + SR},    (12)
   where the Y-axis represents the weighted cost. The H-measure is computed for all five
classifiers, and the mean value of the Severity Ratio (SR) is 1.12. We pre-process the data to
remove redundancy and make the experimental data more efficient.
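
The count-based metrics of Eqs. (7), (10), and (11), the KS separation of Eq. (9), and the cost
quantities of Eq. (12) can be sketched in plain Python as below. This is an illustrative sketch and
not the authors' implementation: all function names are hypothetical, and taking the KS statistic as
the largest gap over all score thresholds is an assumption about how Eq. (9) is evaluated.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Metrics computed from the four confusion-matrix counts."""
    n = tp + tn + fp + fn
    return {
        "accuracy_pct": 100.0 * (tp + tn) / n,  # Eq. (7), as a percentage
        "error_rate": (fp + fn) / n,            # Eq. (10)
        "sensitivity": tp / (tp + fn),          # Eq. (11), true positive rate
        "specificity": tn / (tn + fp),          # Eq. (11), true negative rate
    }


def ks_statistic(scores, labels):
    """Largest gap between the cumulative fractions of positives and negatives.

    Samples are visited from the highest score down; labels are 1 (positive)
    or 0 (negative). Assumption: Eq. (9) is maximised over thresholds.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), reverse=True)
    cum_pos = cum_neg = 0
    best = 0.0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        # Compare only after every sample sharing this score has been counted.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return best


def severity_ratio(c):
    """Eq. (12): ratio of the two misclassification costs, for 0 <= c < 1."""
    return c / (1.0 - c)


def normalised_cost(sr):
    """Eq. (12): maps a severity ratio back onto the [0, 1) cost axis."""
    return sr / (1.0 + sr)


# Hypothetical counts, for illustration only.
example = confusion_metrics(tp=40, tn=50, fp=5, fn=5)
```

With the reported mean severity ratio of 1.12, the normalised cost is 1.12 / 2.12, i.e.
approximately 0.528.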

5.1. Dataset
To validate the performance of the proposed CURE scheme, the dataset is collected from the Kaggle
COVID-19 patient pre-condition dataset [16]. The Kaggle dataset is provided by the Johns Hopkins
University through a GitHub repository, which contains the real-time updated record of the total
active cases, death cases, and recovered cases of the COVID-19 pandemic. In modern times of
technological advancement and all-round progress, such threatening diseases are a challenge, and
data of this kind help to keep both people and medical science mentally and physically prepared and
attentive. As per the reports disclosed by the World Health Organization (WHO), the health curve
(infectious cases and cured cases) changes abruptly every day, which makes it burdensome for the
medical and other departments engaged in serving the world to estimate the total requirements of
health-related equipment and resources. It is very helpful for the entire medical department and
other concerned authorities if corona patients can be provided with all the resources they need to
fight the lethal disease. In this context, the data collected contains 23 features of 566,603
patients.

Figure 4: Histogram of missing values.

5.2. Results and Discussion
The experiments are implemented in Python. The results are computed based on finding the missing
values, the heatmap function, feature selection, and a comparison of the machine learning models.
The discussion of the results is summarized below.

5.2.1. Missing Values
The initial step is to find the missing values in the Kaggle dataset [16] and plot them. Figure 4
visualizes the histogram of the missing values in the COVID dataset. As a substitute for these, we
computed the mean and replaced each missing value with it. The default input is a numeric array with
levels 0 and 1, where the minimum value is 0 and the maximum value is 1.

5.2.2. Heatmap Representation
As the Kaggle COVID-19 dataset we collected does not contain any missing or redundant values after
pre-processing, we represent the complete dataset in Figure 5. It is drawn using the heatmap
function of Python and presents a diagrammatic view of the dataset. The parameters of the COVID
patients are placed on the X and Y axes.

Figure 5: Heatmap of all the features of COVID-19 dataset.

5.2.3. Feature selection
As shown in Figure 6, we have selected 10 features among the 23 features of the COVID patient
dataset. This selection is made by analyzing the features after computing the feature importance
score, in the form of the Gini index, through the implementation of the decision tree method.

5.2.4. Machine Learning Model
As discussed in the CURE scheme, the machine learning models are applied to the pre-processed data.
However,
Table 1
Comparison of the performance analysis of various ML prediction models.
           Models         H     Gini Index   AUC     AUCH      KS     MER     MWL     Spec.Sens95   Sens.Spec95    ER
            SVM         0.687     0.802      0.901    0.901   0.802   0.099   0.098       0.443        0.447       0.46
             LR         0.672     0.791      0.896    0.896   0.791   0.104   0.104       0.421        0.506       0.482
            k-NN        0.655     0.781      0.891    0.891   0.781   0.109   0.109       0.478         0.49       0.469
         Naïve Bayes    0.632     0.765      0.882    0.882   0.765   0.117   0.117       0.494         0.52       0.47
        Random Forest   0.675     0.794      0.897    0.897   0.794   0.103   0.103       0.448        0.475      0.476
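
As a quick consistency check, the Gini Index column of Table 1 can be compared with Eq. (8),
GI = 2 * AUC - 1, applied to the AUC column. The sketch below copies the two columns from the table;
the variable and function names are illustrative assumptions.

```python
# (model, AUC, Gini Index) as listed in Table 1.
table1 = [
    ("SVM", 0.901, 0.802),
    ("LR", 0.896, 0.791),
    ("k-NN", 0.891, 0.781),
    ("Naive Bayes", 0.882, 0.765),
    ("Random Forest", 0.897, 0.794),
]


def gini_from_auc(auc):
    """Eq. (8): Gini index derived from the area under the curve."""
    return 2.0 * auc - 1.0


# Absolute difference between the tabulated Gini index and Eq. (8) per model.
deviations = {name: abs(gini_from_auc(auc) - gini) for name, auc, gini in table1}
max_deviation = max(deviations.values())
```

The tabulated values agree with Eq. (8) to within rounding of the three-decimal AUC figures.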


there are different methods to enhance the performance of the prediction models, depending on the
technique involved. One such technique is to construct ensemble models: once several models each
produce a score for a particular outcome, we can integrate them to produce ensemble scores. Figure 7
shows the H-measure of the ensembled model, which can be used to improve the area under the curve
for these models even further. Let us assume a decision tree classifier and a logistic regression
model, both predicting standard risks. A new score can be calculated as the average of these two
classifiers and then assessed as a further model. Usually the area under the curve improves for
these ensemble models.
   After experimentation, the results are reported in Table 1.

Figure 6: Representation of selected values of dataset.

Figure 7: H-measure of ensembled model.

6. Conclusion

In this paper, a CURE scheme is proposed based on machine learning prediction models for the
treatment of COVID patients through remote e-healthcare. The performance of the proposed scheme is
evaluated on the Python platform and tested using the Kaggle dataset from Johns Hopkins University
on COVID-19 patient pre-conditions. The features are extracted from the datasets of the COVID
patients for diagnosing the symptoms of the coronavirus. Next, the collected data is first trained
and then tested using different machine learning prediction models (such as SVM, LR, k-NN, and Naive
Bayes) that classify the features of the COVID patient for forecasting of the infection rate.
Finally, the performance of the prediction models is assessed using a variety of metrics, listed as
follows: (1) H-measure, (2) Gini Index, (3) Area Under Curve (AUC), (4) AUCH, (5) KS, (6) Minimum
Error Rate (MER), (7) Minimum Cost Weighted Error Rate (MWL), (8) Spec.Sens95, (9) Sens.Spec95, and
(10) Error Rate (ER). The performance evaluation shows that the CURE scheme outperforms the existing
approach which deals with an imbalanced dataset.
   In future, we will ensure the secrecy of the coronavirus data, as the patients' sensitive
credentials can be leaked during data transmission through wireless channels (Internet).

References

[1] Punn, Narinder Singh, Sanjay Kumar Sonbhadra, and Sonali Agarwal. "COVID-19 Epidemic Analysis
   using Machine Learning and Deep
   Learning Algorithms." medRxiv (2020), doi: https://doi.org/10.1101/2020.04.08.20057679.
[2] Jamshidi, M., Lalbakhsh, A., Talla, J., Peroutka, Z., Hadjilooei, F., Lalbakhsh, P., Jamshidi,
   M., La Spada, L., Mirmozafari, M., Dehghani, M. and Sabet, A. "Artificial Intelligence and
   COVID-19: Deep Learning Approaches for Diagnosis and Treatment." IEEE Access, vol. 8,
   pp. 109581-109595, Jun. 2020.
[3] Yan, Li, Hai-Tao Zhang, Yang Xiao, Maolin Wang, Chuan Sun, Jing Liang, Shusheng Li et al.
   "Prediction of survival for severe Covid-19 patients with three clinical features: development
   of a machine learning-based prognostic model with clinical data in Wuhan." medRxiv (2020).
[4] "COVID-19 Worldwide Dashboard - WHO Live World Statistics." Online available:
   https://covid19.who.int/, accessed on 31 July, 2020.
[5] Rehman, Suriya, Tariq Majeed, Mohammad Azam Ansari, Uzma Ali, Hussein Sabit, and Ebtesam A.
   Al-Suhaimi. "Current scenario of COVID-19 in pediatric age group and physiology of immune and
   thymus response." Saudi Journal of Biological Sciences (2020).
[6] Nguyen, Thanh Thi. "Artificial intelligence in the battle against coronavirus (COVID-19): a
   survey and future research directions." Preprint, DOI 10 (2020).
[7] Zhang, Jian, and Yiming Yang. "Robustness of regularized linear classification methods in text
   categorization." In Proceedings of the 26th annual international ACM SIGIR conference on
   Research and development in informaion retrieval, pp. 190-197. 2003.
[8] Tan, Yuxuan. "An improved KNN text classification algorithm based on K-medoids and rough set."
   In 2018 10th International Conference on Intelligent Human-Machine Systems and Cybernetics
   (IHMSC), vol. 1, pp. 109-113. IEEE, 2018.
[9] Samuel, Jim, G. G. Ali, Md Rahman, Ek Esawi, and Yana Samuel. "Covid-19 public sentiment
   insights and machine learning for tweets classification." Information, vol. 11, no. 6,
   Jun. (2020).
[10] Pinter, Gergo, Imre Felde, Amir Mosavi, Pedram Ghamisi, and Richard Gloaguen. "COVID-19
   Pandemic Prediction for Hungary; a Hybrid Machine Learning Approach." Mathematics, vol. 8,
   no. 6 (2020): 890.
[11] Yan, Li, Hai-Tao Zhang, Yang Xiao, Maolin Wang, Chuan Sun, Jing Liang, Shusheng Li et al.
   "Prediction of criticality in patients with severe Covid-19 infection using three clinical
   features: a machine learning-based prognostic model with clinical data in Wuhan."
   MedRxiv (2020).
[12] Lin, Leesa, Rachel F. McCloud, Cabral A. Bigman, and Kasisomayajula Viswanath. "Tuning in and
   catching on? Examining the relationship between pandemic communication and awareness and
   knowledge of MERS in the USA." Journal of Public Health 39, no. 2 (2017): 282-289.
[13] Hamzah, FA Binti, C. Lau, H. Nazri, D. V. Ligot, G. Lee, and C. L. Tan. "CoronaTracker:
   worldwide COVID-19 outbreak data analysis and prediction." Bull World Health Organ 1
   (2020): 32.
[14] Jia, Lin, Kewen Li, Yu Jiang, and Xin Guo. "Prediction and analysis of Coronavirus Disease
   2019." arXiv preprint arXiv:2003.05447 (2020).
[15] Tuli, Shreshth, Shikhar Tuli, Rakesh Tuli, and Sukhpal Singh Gill. "Predicting the Growth and
   Trend of COVID-19 Pandemic using Machine Learning and Cloud Computing." Internet of Things
   (2020): 100222.
[16] "COVID-19 patient pre-condition dataset", 2020. Online available:
   https://www.kaggle.com/tanmoyx/covid19-patient-precondition-dataset/notebooks

</pre>