=Paper= {{Paper |id=Vol-2823/Paper11 |storemode=property |title=Features Contributing Towards Heart Disease Prediction Using Machine Learning |pdfUrl=https://ceur-ws.org/Vol-2823/Paper11.pdf |volume=Vol-2823 |authors=Chetan Sharma, Shankar Shambhu, Prasenjit Das, Shaily Jain, Sakshi }} ==Features Contributing Towards Heart Disease Prediction Using Machine Learning== https://ceur-ws.org/Vol-2823/Paper11.pdf




Features Contributing Towards Heart Disease Prediction Using
Machine Learning
Chetan Sharma (a), Shankar Shambhu (b), Prasenjit Das (b), Shaily Jain (c), Sakshi (d)

(a) Chitkara University Himachal Pradesh, India
(b) Chitkara University School of Computer Applications, Chitkara University, Himachal Pradesh, India
(c) Chitkara University Institute of Engineering and Technology, Chitkara University, Himachal Pradesh, India
(d) Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India


Abstract

The WHO and other health organizations report that cardiovascular disease accounts for about one-third of
deaths worldwide. Many researchers have worked in this direction to help medical professionals diagnose
this disease at an early stage. This paper aims to apply data mining algorithms to predict heart disease occurrence
in patients based on features such as diabetes, blood pressure, etc. We have implemented two data mining
algorithms, Naive Bayes and NB tree, on two different datasets from the UCI repository and evaluated their
accuracy, F-measure, precision, and recall. Our results show that NB tree outperforms Naive Bayes, with 84.46%
accuracy compared to 80.58% on the Cleveland dataset.

Keywords: Machine Learning, Classification, Heart Disease, WEKA


1.        Introduction

The heart is the central organ of the human body and supplies purified blood to every part of it. Without a
healthy working heart, a person cannot survive. Nowadays, however, heart diseases are increasing rapidly.
According to the WHO, over 17.9 million people die every year because of heart disease, and 80% of these
deaths are due to a heart attack [1]. Heart disease has been recognized as one of the world's most complex and
life-threatening human diseases. Typically, the diseased heart is unable to pump the amount of blood that the
other areas of the body need to function normally. Because of this, heart failure eventually occurs [2]. In the
United States, the incidence of heart illness is very high [3]. Swelling in the feet, chest pain, shortness of
breath, tiredness, and pain in the neck and shoulders are some significant symptoms of heart disease [4].
Techniques used to diagnose heart diseases at an early stage have been complicated, and the resulting
difficulty is one of the critical factors affecting the standard of living [5]. Because of the low availability of
instruments and the lack of physicians, the diagnosis and treatment of heart diseases are very complicated in
developing countries [6]. This affects the prediction results and the treatment of heart patients, which is the
main reason for their high mortality rate. Hence, to reduce the mortality rate of heart patients and provide the
best treatment of heart diseases, appropriate and accurate heart disease diagnosis techniques are required [7].
These techniques should be capable of detecting heart disease at an early stage [19-20].

The rest of the paper is organized as follows: Section 2 discusses the background and history of this work.
Section 3 explains the methodology, along with the tools, datasets, algorithms, and evaluation metrics used.
Section 4 discusses the results, and Section 5 gives the conclusion of the research done in this paper.

____________________________________
ACI'21: Workshop on Advances in Computational Intelligence at ISIC 2021, February 25-27, 2021, Delhi, India
EMAIL: chetan.sharma@chitkarauniversity.edu.in (C. Sharma); shankar.shambhu@chitkarauniversity.edu.in (S. Shambhu);
prasenjit.das@chitkarauniversity.edu.in (P. Das); shaily.jain@chitkarauniversity.edu.in (S. Jain);
sakshi@chitkara.edu.in (Sakshi)
ORCID: 0000-0001-5401-8503 (C. Sharma); 0000-0002-2348-1041 (S. Shambhu); 0000-0002-7988-2418 (P. Das);
0000-0001-6078-3607 (S. Jain); 0000-0002-8757-4001 (Sakshi)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2.   Literature Review and Related Work

In the last decade, many researchers have worked on heart disease datasets to predict heart diseases. They
used multiple machine learning and data mining algorithms and achieved different results. Yet heart disease
still poses many problems today. The following is a review of recent research:

The authors implemented three different algorithms, Naive Bayes (NB), Artificial Neural Network, and J48, to
find the best heart disease prediction results. The researchers used a dataset of 8 additional attributes and
210 instances of male persons. The WEKA tool was used to implement the algorithms. The achieved results
showed that the Naive Bayes algorithm provided the best results compared to Artificial Neural Network and J48.
Naive Bayes achieved an accuracy of 79.90% and took 0.01 second to build the model, whereas J48 attained an
accuracy of 77.03% and took 0.01 second to build the model. Artificial Neural Network achieved an

Singh et al. developed a new hybrid model named "Hybrid Genetic Naive Bayes Model". This model combines two
different techniques (Naive Bayes and a Genetic Algorithm) for the correct prediction of heart diseases. To
develop this model, the researchers used a dataset taken from the UCI repository with 303 instances and 14
important attributes. The implementation achieved an accuracy of 97.14%, with a precision of 98% and a recall
of 97.14% [10].

Krishnan et al. used two machine learning algorithms, Decision Tree and Naive Bayes, to predict heart
diseases. They used a dataset of 300 instances and 14 attributes taken from the UCI repository. The
researchers implemented the models in the Python programming language and achieved the highest accuracy of 91%
with a Decision Tree and 87% with Naive Bayes [11].

3.      Methodology
3.1     Proposed Work




                                Figure 1: Proposed Methodology of Study



3.2      Tool Used

The WEKA 3.8.4 machine learning tool, written in Java and developed at the University of Waikato, is used to
conduct this study. WEKA provides different classifiers whose performance can be examined. It also supports
other data mining tasks such as preprocessing, classification, regression, and many more. WEKA accepts the
.csv and .arff file formats, and the chosen datasets are already available in these formats.

3.3      Data Preprocessing

Real-life data contains redundant values and a lot of noise. The data needs to be cleaned, and the missing
values need to be filled, before the data is used to generate a model [18]. The preprocessing step takes care
of these issues so that the prediction can be made accurately. Once the data is cleaned, i.e., the noise is
removed and the missing values are filled, we need to transform it. Many supervised learning algorithms work
on nominal or cardinal data, so data transformation is applied to the dataset obtained from UCI in the present
work. Reduction is then applied to convert the complex dataset into a more straightforward form, which
improves the accuracy of the model.

3.4      Classification Algorithms

After going through an intensive literature review, we have selected two classification algorithms, naive
Bayes tree and naive Bayes classification, based on their dependency on attributes.

Naive Bayes Tree [12]: It is a hybrid approach in which the model is generated using both the Naive Bayes and
the decision tree approach. Naive Bayes classification assumes that the features are independent of each
other, whereas the decision tree assumes that the features are dependent on each other, so the hybrid approach
takes advantage of both. The decision tree is built by considering only one feature at a time, and its output
is fed to the node. Based on the outcome of each node, other features are selected. In this hybrid approach,
the split is done in the same manner, by considering only one feature at every node, but with Naive Bayes
classifiers at the leaves. In large datasets, data splitting is regarded as a vital task for classification;
using the features, we have implemented the naive Bayes tree classification.

Naive Bayes Classification [13]–[15]: This classification technique is based on Bayes' theorem and works on
the assumption that the existence of one feature is independent of the existence of any other feature. The
advantage of Naive Bayes classification is that it requires only a small amount of data to train the model.
Bayes' theorem provides a way of calculating the posterior probability P(c|x), i.e., the probability of class c
under the condition that x has been observed, from P(c), P(x), and P(x|c). The following is the formula to
calculate the posterior probability:

           P(c|x) = P(x|c) * P(c) / P(x)

Where:
P(c|x) is the posterior probability of class c given that x has already occurred.
P(c) is the prior probability of the class.
P(x|c) is the conditional probability of x given that c has occurred.
P(x) is the prior probability of the predictor x.

Dataset Description

Two datasets were used in this study. The first one was obtained from the "Cleveland Clinic Foundation" and
comprises 303 instances. The second dataset, named Heart Disease Dataset (Comprehensive), is taken from a
publicly available platform and is a combination of five other datasets. The original heart disease datasets
have a total of 76 attributes, and each dataset uses its own subset of these features. Initially, both datasets
were selected for the study with 76 attributes, but they were preprocessed to produce 14 and 11 attributes,
respectively, to reduce redundant variables. Consequently, we used these specific attributes (listed in Table 1
and Table 3) to compare.

The first dataset is the Cleveland database, which is publicly available at [16]. It contains 303 instances;
its attributes are described in Table 1, and the results obtained with the WEKA tool are given in Table 2.
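As a small illustration of the posterior formula in Section 3.4, the following Python sketch estimates P(c) and P(x|c) by counting and then applies Bayes' theorem under the naive independence assumption. It is not the WEKA implementation used in this study; the records in ROWS and LABELS are hypothetical toy values, and the names train and posterior are introduced here only for illustration. No smoothing is applied, so a feature value never seen with any class would make P(x) zero.

```python
from collections import Counter, defaultdict

# Hypothetical toy records (NOT taken from the paper's datasets):
# each row is (chest pain type, fasting blood sugar level).
ROWS = [
    ("asymptomatic", "high"),
    ("asymptomatic", "normal"),
    ("asymptomatic", "high"),
    ("typical", "normal"),
    ("typical", "normal"),
    ("typical", "normal"),
    ("asymptomatic", "normal"),
    ("typical", "high"),
]
LABELS = ["disease"] * 4 + ["healthy"] * 4


def train(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            value_counts[(label, index)][value] += 1
    return class_counts, value_counts


def posterior(row, class_counts, value_counts):
    """Return P(c|x) for every class c, assuming independent features."""
    total = sum(class_counts.values())
    joint = {}
    for label, n_label in class_counts.items():
        prob = n_label / total  # P(c)
        for index, value in enumerate(row):
            # P(x_i|c); the product over features gives P(x|c)
            prob *= value_counts[(label, index)][value] / n_label
        joint[label] = prob  # P(x|c) * P(c)
    evidence = sum(joint.values())  # P(x), summed over all classes
    return {label: prob / evidence for label, prob in joint.items()}


class_counts, value_counts = train(ROWS, LABELS)
result = posterior(("asymptomatic", "high"), class_counts, value_counts)
print(result)
```

For this toy data, the posterior of the class "disease" for the record ("asymptomatic", "high") is 6/7, about 0.857, because P(x|c) * P(c) is 0.1875 for "disease" and 0.03125 for "healthy".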

                      Table 1: Cleveland Dataset Attribute Information

Attribute Used                                     Attribute Information

Age                   Age of Patient. The value ranges from 29 years to 77 years
Sex                   Gender of the patient represented in binary form
                      1 = male.
                      0 = female
Chest Pain            Chest pain type. Its value ranges from 1 to 4.
                      1 represents typical angina, 2 represents atypical angina, 3 represents
                      non-anginal pain, and 4 represents asymptomatic.
Resting Blood         The attribute is used to represent the patient's resting BP, and the unit to
Pressure              measure it is mm Hg.
Cholesterol           The attribute is used to represent the patient's serum cholesterol, and its unit
                      of measurement is mg/dl.
Fasting Blood Sugar   The attribute represents the fasting blood sugar of the patient. Two values are
                      used in the dataset: if the recorded value is > 120 mg/dl, it is shown by 1
                      (true); otherwise, it is shown by 0 (false).
                      1 = True.
                      0 = False.
Resting ECG           The attribute is used to represent the resting electro-cardiographic records of
                      the patient. The value ranges from 0 to 2
                      0 is representing the Normal range.
                      1 is representing the ST-T wave abnormality of the patient.
                      2 is used to show probable or definite left ventricular hypertrophy by Estes'
                      criteria.
Heart Rate            The attribute is used to represent the maximum heart rate achieved by the
                      patient.
Exercise Induced      Exercise-induced angina, represented in binary form.
Angina                1 is used to represent yes.
                      0 is used to represent no.
Old Peak              The attribute is used to represent ST depression induced by exercise, which is
                      relative to rest.
Slope                 The attribute is used to measure the slope for peak exercise. The range of the
                      recorded values is from 1 to 3.
                      Up sloping is represented by 1, flat is shown through value 2, and 3 is used to
                      represent downsloping.
Major Vessels             The attribute is used to represent the number of major vessels colored by
                          fluoroscopy. Recorded values range from 0 to 3, and the value is related to
                          the darkness of the color.
Thallium Scan             The attribute is used to record the Thallium Scan of the patient. It takes
                          the values 3, 6, or 7: 3 represents a normal range, 6 represents a fixed
                          defect, and 7 represents a reversible defect.


                                  Table 2: Cleveland Dataset Results
 Algorithms       Accuracy (%)     F-Measure (%)        Precision (%)     Recall (%)    Time (In Seconds)

 NB Tree          84.46            84.5                 84.5              84.5          0
 Naive Bayes      80.58            80.6                 80.6              80.6          1.57



The second dataset is taken from [17] and combines five heart disease databases. It contains 1190 instances
in total: 303 from the Cleveland heart disease dataset, 294 from the Hungarian heart disease dataset, 123 from
the Switzerland heart disease dataset, 200 from the Long Beach VA heart disease dataset, and 270 from the
Statlog heart disease dataset. The combined dataset keeps the 11 features common to all five datasets. The
features used are described in Table 3, and the results obtained with the WEKA tool are given in Table 4.

               Table 3: Heart Disease Dataset (Comprehensive) Attribute Information

 Attribute Used            Attribute Information
 Age                       Age of Patient. The value ranges from 28 years to 77 years
 Sex                       Gender of the patient represented in binary form
                           1 = male.
                           0 = female
 Chest Pain                Chest pain type. Its value ranges from 1 to 4.
                           1 represents typical angina, 2 represents atypical angina, 3 represents
                           non-anginal pain, and 4 represents asymptomatic.
 Resting BP                The attribute is used to represent the patient's resting BP, and the unit to
                           measure it is mm Hg.
 Cholesterol               The attribute is used to represent the patient's serum cholesterol, and its unit
                           of measurement is mg/dl.
Fasting Blood Sugar     The attribute represents the fasting blood sugar of the patient. Two values are
                        used in the dataset: if the recorded value is > 120 mg/dl, it is shown by 1
                        (true); otherwise, it is shown by 0 (false).
                        1 = True.
                        0 = False.
Resting ECG             The attribute is used to represent the resting electro-cardiographic records of
                        the patient. The value ranges from 0 to 2
                        0 is representing the Normal range.
                        1 is representing the ST-T wave abnormality of the patient.
                        2 is used to show probable or definite left ventricular hypertrophy by Estes'
                        criteria.
Maximum Heart Rate      The attribute is used to represent the maximum heart rate achieved by the
                        patient.
Exercise Angina         Exercise-induced angina, represented in binary form.
                        1 is used to represent yes.
                        0 is used to represent no.
Old Peak                The attribute is used to represent ST depression induced by exercise, which is
                        relative to rest.
ST Slope                The attribute is used to measure the slope for peak exercise. The range of the
                        recorded values is from 1 to 3.
                        Up sloping is represented by 1, flat is shown through value 2, and 3 is used to
                        represent downsloping.
Target                  Used for the prediction


                  Table 4: Heart Disease Dataset (Comprehensive) Results
Algorithms        Accuracy (%)       F-Measure (%)        Precision (%)   Recall (%)   Time (In Seconds)

NB Tree               88.39                 88.4               88.4         88.4             5.54
Naive Bayes           83.70                 83.7               83.7         83.7               0


3.5      Evaluation Metrics



We have considered four parameters for our paper. In the present work, the prediction class indicates whether
or not a person with specific attributes has died because of heart disease, so class C in the confusion matrix
(Figure 2) stands for the instances belonging to that class. TP is the number of people who actually died
because of heart disease and for whom the model predicted the same. Similarly, TN is the number of people who
didn't die of a heart ailment and for whom the model predicted the same. A False Positive (FP) is a Type I
error: the model predicted that the person died of the ailment, but actually, the patient didn't. A False
Negative (FN) is a Type II error: the model predicted that the person didn't die of the ailment, but actually,
he/she did.

The accuracy of the model is calculated through the formula given below:
Accuracy = (TP+TN)/Total no. of instances                (1)

Recall is the fraction of the actual positive instances that the model correctly predicts as positive. The
formula is as follows:
Recall = TP/(TP+FN)                                      (2)

Precision is the fraction of the instances predicted as positive that are actually positive. The formula for
precision is as follows:
Precision = TP/(TP+FP)                                   (3)

Comparing two models becomes problematic when the precision is low and the recall is high, or vice versa. In
such cases, the two parameters on their own are not of much use for comparing the models, and the F-score is
used instead. The F-score is the harmonic mean of the two values, so it measures recall and precision at the
same time. The harmonic mean is used instead of the arithmetic mean because the arithmetic mean can stay high
when one of the two values is very low, whereas the harmonic mean is pulled towards the smaller value.
F-score = (2*Recall*Precision) / (Recall + Precision)    (4)

                 Actual class \ Predicted class      C                        Not in C
                 C                                   True Positives (TP)      False Negatives (FN)
                 Not in C                            False Positives (FP)     True Negatives (TN)
                                         Figure 2: Confusion Matrix
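Formulas (1) to (4) can be sketched in a few lines of Python. This is a minimal illustration rather than the WEKA evaluation used in the study: the class name ConfusionMatrix and the counts in the example are hypothetical, and the sketch assumes that none of the denominators is zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix with the cell names used in Figure 2."""
    tp: int  # actual C, predicted C
    fn: int  # actual C, predicted not in C
    fp: int  # actual not in C, predicted C
    tn: int  # actual not in C, predicted not in C

    def accuracy(self) -> float:
        # Formula (1): correct predictions over all instances.
        return (self.tp + self.tn) / (self.tp + self.fn + self.fp + self.tn)

    def recall(self) -> float:
        # Formula (2): correctly predicted positives over actual positives.
        return self.tp / (self.tp + self.fn)

    def precision(self) -> float:
        # Formula (3): correctly predicted positives over predicted positives.
        return self.tp / (self.tp + self.fp)

    def f_score(self) -> float:
        # Formula (4): harmonic mean of recall and precision.
        r, p = self.recall(), self.precision()
        return 2 * r * p / (r + p)


# Hypothetical counts: perfect precision but very low recall.
cm = ConfusionMatrix(tp=10, fn=90, fp=0, tn=100)
arithmetic_mean = (cm.recall() + cm.precision()) / 2

print(f"accuracy={cm.accuracy():.3f} recall={cm.recall():.3f} "
      f"precision={cm.precision():.3f}")
print(f"F-score={cm.f_score():.3f} arithmetic mean={arithmetic_mean:.3f}")
```

With these counts, the recall is 0.1 and the precision is 1.0. Their arithmetic mean is 0.55, while the F-score is only about 0.18, which shows how the harmonic mean is pulled towards the smaller of the two values.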


4.      Results and Discussion

We have used two datasets in the present work, with 303 instances in the first and 1190 in the second. The
Naive Bayes and Naive Bayes tree algorithms have been applied to both datasets. We find that the NB tree
performs better on both datasets, which differ in size and attributes. The accuracy and other measures are
better for the NB tree, which is a hybrid of Naive Bayes and the decision tree. We have applied these two
algorithms because the Naive Bayes algorithm works on the hypothesis that the features are independent of each
other, while the decision tree assumes that the features are dependent on each other. The present work tries
to determine whether parameters such as age, gender, and cholesterol contribute towards heart disease, and
whether a machine learning algorithm can predict the ailment based on these parameters; the NB tree does so
with an accuracy of about 88% (Table 4).

5.       Conclusion

The two datasets used in the present work show a similar accuracy, which leads us to conclude that machine
learning algorithms can predict heart diseases in patients with specific existing ailments like high BP,
cholesterol, etc. We find a difference in the accuracy of the two methods applied to the two datasets, namely
Naive Bayes and NB tree. The difference in accuracy arises because Naive Bayes assumes the independence of
features, whereas the NB tree (a hybrid with the decision tree) assumes that the features are dependent on each
other. The higher accuracy of the NB tree leads us to conclude that parameters like age, gender, cholesterol,
and high BP are dependent on each other, leading to a heart ailment in patients.

References:

[1]      World Health Organization (WHO), "Cardiovascular Diseases."
https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1 (accessed Nov. 15, 2020).
[2]      A. L. Bui, T. B. Horwich, and G. C. Fonarow, "Epidemiology and risk profile of heart failure," Nat.
Rev. Cardiol., vol. 8, no. 1, p. 30, 2011.
[3]      P. A. Heidenreich et al., "Forecasting the future of cardiovascular disease in the United States: a
policy statement from the American Heart Association," Circulation, vol. 123, no. 8, pp. 933–944, 2011.
[4]      M. Durairaj and N. Ramasamy, "A comparison of the perceptive approaches for preprocessing the data set
for predicting fertility success rate," Int. J. Control theory Appl., vol. 9, no. 27, 2016.
[5]      J. Mourao-Miranda, A. L. W. Bokde, C. Born, H. Hampel, and M. Stetter, "Classifying brain states and
determining the discriminating activation patterns: support vector machine on functional MRI data,"
Neuroimage, vol. 28, no. 4, pp. 980–995, 2005.
[6]      S. Ghwanmeh, A. Mohammad, and A. Al-Ibrahim, "Innovative artificial neural networks-based decision
support system for heart diseases diagnosis," 2013.
[7]      F. Amato, A. López, E. M. Peña-Méndez, P. Vaňhara, A. Hampl, and J. Havel, "Artificial neural networks
in medical diagnosis." Elsevier, 2013.
[8]      S. K. Gomath, "Heart Disease Prediction Using Data Mining Classification," Int. J. Res. Appl. Sci.
Eng. Technol., vol. 4, no. 2, 2016, doi: 10.18775/ijmsba.1849-5664-5419.2014.43.1004.
[9]      R. V. Sarangam Kodati, "A Comparative Study on Open Source Data Mining Tool for Heart Disease," Int.
J. Innov. Adv. Comput. Sci., vol. 7, no. 3, 2018, [Online]. Available:
http://www.diva-portal.org/smash/get/diva2:1080911/FULLTEXT01.pdf.
[10]     N. Singh, P. Firozpur, and S. Jindal, "Heart disease prediction system using hybrid technique of data
mining algorithms," Int. J. Adv. Res. Ideas Innov. Technol., vol. 4, no. 2, pp. 982–987, 2018.
[11]     S. Krishnan and S. Geetha, "Prediction of Heart Disease Using Machine Learning Algorithms," in 2019
1st International Conference on Innovations in Information and Communication Technology (ICIICT), 2019, pp.
1–5.
[12]     S. Wang, L. Jiang, and C. Li, "Adapting naive Bayes tree for text classification," Knowl. Inf. Syst.,
vol. 44, no. 1, pp. 77–89, 2015.
[13]     L. Li, Y. Wu, and M. Ye, "Experimental comparisons of multi-class classifiers," Informatica, vol. 39,
no. 1, 2015.
[14]     P. Ahmad, S. Qamar, and S. Q. A. Rizvi, "Techniques of data mining in healthcare: a review," Int. J.
Comput. Appl., vol. 120, no. 15, 2015.
[15]     S. S. Nikam, "A comparative study of classification techniques in data mining algorithms," Orient. J.
Comput. Sci. Technol., vol. 8, no. 1, pp. 13–19, 2015.
[16]     Ronit, "Heart Disease UCI," 2018. https://www.kaggle.com/ronitf/heart-disease-uci (accessed Nov. 12,
2020).
[17]     M. Siddhartha, "Heart Disease Dataset (Comprehensive)," 2019.
https://www.kaggle.com/sid321axn/heart-statlog-cleveland-hungary-final (accessed Nov. 12, 2020).
[18]     V. Madaan and A. Goyal, "Predicting Ayurveda-Based Constituent Balancing in Human Body Using Machine
Learning Methods," in IEEE Access, vol. 8, pp. 65060-65070, 2020, doi: 10.1109/ACCESS.2020.2985717.
[19]     Vishu Madaan and Anjali Goyal, "Analysis and Synthesis of a Human Prakriti Identification System Based
on Soft Computing Techniques", Recent Patents on Computer Science, 12(1), pp 1-10, 2019. DOI:
10.2174/2213275912666190207144831
[20]     Prateek Agrawal, Vishu Madaan, Vikas Kumar, "Fuzzy Rule Based Medical Expert System to Identify the
Disorders of Eyes, ENT and Liver", International Journal of Advanced Intelligence Paradigm (IJAIP), vol 7,
issue 3-4, pp. 352-367, Inderscience Publications, 2015.