Features Contributing Towards Heart Disease Prediction Using Machine Learning

Chetan Sharma (a), Shankar Shambhu (b), Prasenjit Das (b), Shaily Jain (c), Sakshi (d)

(a) Chitkara University, Himachal Pradesh, India
(b) Chitkara University School of Computer Applications, Chitkara University, Himachal Pradesh, India
(c) Chitkara University Institute of Engineering and Technology, Chitkara University, Himachal Pradesh, India
(d) Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India

ACI'21: Workshop on Advances in Computational Intelligence at ISIC 2021, February 25-27, 2021, Delhi, India
EMAIL: chetan.sharma@chitkarauniversity.edu.in (C. Sharma); shankar.shambhu@chitkarauniversity.edu.in (S. Shambhu); prasenjit.das@chitkarauniversity.edu.in (P. Das); shaily.jain@chitkarauniversity.edu.in (S. Jain); sakshi@chitkara.edu.in (Sakshi)
ORCID: 0000-0001-5401-8503 (C. Sharma); 0000-0002-2348-1041 (S. Shambhu); 0000-0002-7988-2418 (P. Das); 0000-0001-6078-3607 (S. Jain); 0000-0002-8757-4001 (Sakshi)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

Abstract

The WHO and other health organizations report that cardiovascular disease accounts for about one-third of deaths worldwide. Many researchers have worked in this direction to help medical professionals diagnose the disease at an early stage. This paper applies data mining algorithms to predict the occurrence of heart disease in patients based on features such as diabetes and blood pressure. We have implemented two data mining algorithms, Naive Bayes and NB tree, on two different datasets from the UCI repository and evaluated their accuracy, F-measure, precision, and recall. Our results show that NB tree outperforms Naive Bayes, with 84.46% accuracy compared to 80.58%.

Keywords: Machine Learning, Classification, Heart, Disease, WEKA

1. Introduction

The heart is the essential central part of the human body, which provides purified blood to each part of the body. Without a healthy working heart, a person cannot live a single second. Nowadays, however, heart diseases are increasing at a rapid speed. As per the WHO, over 17.9 million people die every year because of heart disease, and 80% of these deaths are due to a heart attack [1]. Heart disease has been recognized as one of the world's most complex and life-threatening human diseases. Typically, the heart becomes unable to push the necessary amount of blood to other areas of the body to sustain the body's normal functioning. Because of this, heart failure eventually occurs [2]. In the United States, the incidence of heart illness is very high [3].

Swelling in the feet, chest pain, shortness of breath, body tiredness, and pain in the neck and shoulders are some significant symptoms of heart disease [4]. Techniques used to diagnose heart diseases at an early stage have been complicated, and the resulting difficulty is one of the critical factors affecting the standard of living [5]. Because of the low availability of instruments and the lack of physicians, the diagnosis and treatment of heart diseases is very difficult in developing countries [6]. This affects the prediction results and the treatment of heart patients, and it is the main reason for the high mortality rate of heart patients. Hence, to reduce the mortality rate of heart patients and provide the best treatment of heart diseases, appropriate and accurate heart disease diagnosis techniques are required [7]. These techniques should be capable of detecting heart disease at an early stage [19-20].

The rest of the paper is organized as follows: Section 2 discusses the background and history of this work. Section 3 explains the methodology, along with the tools, datasets, algorithms, and evaluation metrics used. Section 4 discusses the results, and Section 5 concludes the paper.

2. Literature Review and Related Work

In the last decade, many researchers have worked on heart disease datasets to predict heart diseases. They used multiple machine learning and data mining algorithms and achieved different results. Even so, heart disease prediction still faces many open issues. The following is a review of recent research.

In one study, the authors implemented three different algorithms, Naive Bayes (NB), Artificial Neural Network, and J48, to find the best heart disease prediction results. The researchers used a dataset of 8 additional attributes and 210 instances of male persons. The WEKA tool was used for the implementations of the algorithms. The achieved results showed that the Naive Bayes algorithm provided the best results compared to Artificial Neural Network and J48. Naive Bayes achieved an accuracy of 79.90% and took 0.01 second to build the model, whereas J48 attained an accuracy of 77.03% and took 0.01 second to build the model. Artificial Neural Network achieved an [...]

Singh et al. developed a new hybrid model named "Hybrid Genetic Naive Bayes Model". This model combines two different techniques (Naive Bayes and Genetic Algorithm) for the correct prediction of heart diseases. To develop this model, the researchers used a dataset taken from the UCI repository with 303 instances and 14 important attributes. The implementation gave an accuracy of 97.14%, with a precision of 98% and a recall of 97.14% [10].

Krishnan et al. used two machine learning algorithms, Decision Tree and Naive Bayes, to predict heart diseases. They used a dataset of 300 instances and 14 attributes taken from the UCI repository. The researchers implemented the model in the Python programming language and achieved the highest accuracy of 91% with a Decision Tree and 87% with Naive Bayes [11].

3. Methodology

3.1 Proposed Work

Figure 1: Proposed Methodology of Study

3.2 Tool Used

The WEKA 3.8.4 machine learning tool, written in Java and developed at the University of Waikato, is used to conduct this study. WEKA provides different classifiers whose performance can be examined. It also supports other data mining tasks such as preprocessing, classification, regression, and many more. WEKA accepts the .csv and .arff file formats, and the chosen datasets are already available in these formats.

3.3 Data Preprocessing

Real-life data contains redundant values and a lot of noise. The data needs to be cleaned, and the missing values need to be filled, before the data is used to generate a model [18]. The preprocessing step takes care of these issues so that the prediction can be made accurately. Once the data is cleaned, i.e., the noise is removed and the missing values are filled, it needs to be transformed. Many supervised learning algorithms work on nominal or cardinal data, so data transformation is applied to the dataset obtained from UCI in the present work. Reduction of the dataset is applied to convert the complex dataset into a more straightforward form, which improves the accuracy of the model.

3.4 Classification Algorithms

After an intensive literature review, we selected two classification algorithms, naive Bayes tree and naive Bayes classification, because they differ in how they treat dependency among attributes.

Naive Bayes Tree [12]: This is a hybrid approach in which the model is generated using both the Naive Bayes and the Decision Tree approach. Naive Bayes classification assumes that the features are independent of each other, whereas the decision tree assumes that the features are dependent on each other. The hybrid approach therefore takes advantage of both. A decision tree is built by considering only one feature at a time, and its output is fed to the node. Based on the outcome of each node, further features are selected. In the hybrid approach, the split is done in the same manner, considering only one feature at every node, but with Naive Bayes classifiers at the leaves. In large datasets, splitting the data on features is regarded as a vital task for classification, which is why we have implemented the naive Bayes tree classification.

Naive Bayes Classification [13]–[15]: This classification technique is based on Bayes' theorem and works on the assumption that the existence of one feature is independent of the other features. The advantage of Naive Bayes classification is that it requires only a small amount of data to train the model. Bayes' theorem provides a way of calculating the posterior probability P(c|x), i.e., the conditional probability of the class given an observed condition, from P(c), P(x), and P(x|c). The following is the formula to calculate the posterior probability:

P(c|x) = P(x|c) * P(c) / P(x)

Where:
P(c|x) is the posterior probability of class c given that x has occurred.
P(c) is the known (prior) probability of the class.
P(x|c) is the conditional probability of x given that c has occurred.
P(x) is the known (prior) probability of the predictor x.

Dataset Description

Two datasets were used in this study. The first one was obtained from the "Cleveland Clinic Foundation" and comprises 303 instances. The second dataset, named Heart Disease Dataset (Comprehensive), is taken from a publicly available platform and is a combination of five other datasets. All of the heart disease datasets have a total of 76 attributes, and each dataset selects its features accordingly. Initially, both datasets were selected for the study with 76 attributes, but they were preprocessed to produce 14 and 11 attributes, respectively, to reduce redundant variables. Consequently, we used these specific attributes (listed in Table 1 and Table 3) for the comparison.

The first dataset is the Cleveland database, which is publicly available at [16]. There are 303 instances in the dataset. Their description is given in Table 1, and the results obtained using the WEKA tool are given in Table 2.

Table 1: Cleveland Dataset Attribute Information
Attribute Used | Attribute Information
Age | Age of the patient. The value ranges from 29 years to 77 years.
Sex | Gender of the patient, represented in binary form. 1 = male. 0 = female.
Chest Pain | Chest pain type. Its value ranges from 1 to 4. 1 represents typical angina, 2 represents atypical angina, 3 represents non-anginal pain, and 4 represents asymptomatic.
Resting Blood Pressure | The patient's resting BP, measured in mm Hg.
Cholesterol | The patient's serum cholesterol, measured in mg/dl.
Fasting Blood Sugar | The fasting blood sugar of the patient. If the recorded value is > 120 mg/dl, it is shown by 1 (true), else by 0 (false). 1 = True. 0 = False.
Resting ECG | The resting electrocardiographic record of the patient. The value ranges from 0 to 2. 0 represents the normal range, 1 represents ST-T wave abnormality, and 2 shows probable or definite left ventricular hypertrophy by Estes' criteria.
Heart Rate | The maximum heart rate achieved by the patient.
Exercise Induced Angina | Exercise-induced angina, represented in binary form. 1 = yes. 0 = no.
Old Peak | ST depression induced by exercise, relative to rest.
Slope | The slope of the peak exercise ST segment. The recorded values range from 1 to 3. 1 represents upsloping, 2 represents flat, and 3 represents downsloping.
Major Vessels | The number of major vessels colored by fluoroscopy. Recorded values range from 0 to 3, and the value is related to the darkness of the color.
Thallium Scan | The thallium scan result of the patient. It takes the values 3, 6, or 7. 3 represents a normal range, 6 represents a fixed defect, and 7 represents a reversible defect.

Table 2: Cleveland Dataset Results
Algorithms | Accuracy (%) | F-Measure (%) | Precision (%) | Recall (%) | Time (In Seconds)
NB Tree | 84.46 | 84.5 | 84.5 | 84.5 | 0
Naive Bayes | 80.58 | 80.6 | 80.6 | 80.6 | 1.57

The second dataset is taken from [17] and is collected from five other heart disease databases. There is a total of 1190 instances in the dataset: 303 from the Cleveland heart disease dataset, 294 from the Hungarian heart disease dataset, 123 from the Switzerland heart disease dataset, 200 from the Long Beach VA heart disease dataset, and 270 from the Statlog heart disease dataset. The dataset combines the 11 features common to all five datasets. The description of all features used in the dataset is given in Table 3, and the results obtained using the WEKA tool are given in Table 4.

Table 3: Heart Disease Dataset (Comprehensive) Attribute Information
Attribute Used | Attribute Information
Age | Age of the patient. The value ranges from 28 years to 77 years.
Sex | Gender of the patient, represented in binary form. 1 = male. 0 = female.
Chest Pain | Chest pain type. Its value ranges from 1 to 4. 1 represents typical angina, 2 represents atypical angina, 3 represents non-anginal pain, and 4 represents asymptomatic.
Resting BP | The patient's resting BP, measured in mm Hg.
Cholesterol | The patient's serum cholesterol, measured in mg/dl.
Fasting Blood Sugar | The fasting blood sugar of the patient. If the recorded value is > 120 mg/dl, it is shown by 1 (true), else by 0 (false). 1 = True. 0 = False.
Resting ECG | The resting electrocardiographic record of the patient. The value ranges from 0 to 2. 0 represents the normal range, 1 represents ST-T wave abnormality, and 2 shows probable or definite left ventricular hypertrophy by Estes' criteria.
Maximum Heart Rate | The maximum heart rate achieved by the patient.
Exercise Angina | Exercise-induced angina, represented in binary form. 1 = yes. 0 = no.
Old Peak | ST depression induced by exercise, relative to rest.
ST Slope | The slope of the peak exercise ST segment. The recorded values range from 1 to 3. 1 represents upsloping, 2 represents flat, and 3 represents downsloping.
Target | Used for the prediction.

Table 4: Heart Disease Dataset (Comprehensive) Results
Algorithms | Accuracy (%) | F-Measure (%) | Precision (%) | Recall (%) | Time (In Seconds)
NB Tree | 88.39 | 88.4 | 88.4 | 88.4 | 5.54
Naive Bayes | 83.70 | 83.7 | 83.7 | 83.7 | 0

3.5 Evaluation Metrics

We have considered four evaluation parameters in this paper: accuracy, recall, precision, and F-score. In the present work, the prediction class is whether or not a person with specific attributes has died because of heart disease, so class C in the confusion matrix (Figure 2) denotes the number of instances belonging to that class.

TP is the number of people who actually died because of heart disease and for whom the model predicted the same. Similarly, TN is the number of people who did not die of a heart ailment and for whom the model predicted the same. A False Positive (FP) is a Type I error: the model predicted that the person died of the ailment, but actually the patient did not. A False Negative (FN) is a Type II error: the model predicted that the person did not die of the ailment, but actually he/she did.

The accuracy of the model is calculated through the formula given below:

Accuracy = (TP + TN) / Total no. of instances (1)

Recall is the proportion of correctly predicted positive instances out of all actual positive instances. The formula is as follows:

Recall = TP / (TP + FN) (2)

Precision is the proportion of actual positive instances out of all instances predicted as positive. The formula for precision is as follows:

Precision = TP / (TP + FP) (3)

Comparing two models becomes problematic when the precision is low and the recall is high, or vice versa. In such cases, the two parameters on their own are not of much use for comparing the models, and the F-score is used instead. The F-score is the harmonic mean of the two values, which measures recall and precision at the same time. The harmonic mean is used instead of the arithmetic mean because the arithmetic mean is sensitive to extreme values and can hide a low precision or a low recall.

F-score = (2 * Recall * Precision) / (Recall + Precision) (4)

Actual class \ Predicted class | C | Not in C
C | True Positives (TP) | False Negatives (FN)
Not in C | False Positives (FP) | True Negatives (TN)

Figure 2: Confusion Matrix

4. Results and Discussion

We have used two datasets in the present work, with 303 instances in the first and 1190 in the second. The Naive Bayes and Naive Bayes tree algorithms have been applied to both datasets. We find that the NB tree performs better on both datasets, which are of different sizes and have different attributes. The accuracy and the other measures are better for the NB tree, which is a hybrid of Naive Bayes and Decision Tree. We applied these two algorithms because the Naive Bayes algorithm works on the hypothesis that the features are independent of each other, whereas the decision tree assumes that the features are dependent on each other. The present work tries to determine whether parameters such as age, gender, and cholesterol contribute towards heart disease, and whether a machine learning algorithm can be used to predict the ailment based on these parameters. We find that it can, with an accuracy of about 88%.

5. Conclusion

The two datasets used in the present work show a similar accuracy, which leads us to conclude that machine learning algorithms can predict heart diseases in patients with specific existing ailments like high BP, cholesterol, etc. We find a difference in the accuracy of the two methods applied to the two datasets, namely Naive Bayes and NB tree. This difference can be attributed to the fact that Naive Bayes assumes the independence of features, whereas NB Tree (a hybrid with the Decision Tree) assumes that the features are dependent on each other. The higher accuracy of the NB tree leads us to conclude that parameters like age, gender, cholesterol, and high BP are dependent on each other, leading to a heart ailment in patients.

References:

[1] World Health Organization (WHO), "Cardiovascular Diseases." https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1 (accessed Nov. 15, 2020).
[2] A. L. Bui, T. B. Horwich, and G. C. Fonarow, "Epidemiology and risk profile of heart failure," Nat. Rev. Cardiol., vol. 8, no. 1, p. 30, 2011.
[3] P. A. Heidenreich et al., "Forecasting the future of cardiovascular disease in the United States: a policy statement from the American Heart Association," Circulation, vol. 123, no. 8, pp. 933–944, 2011.
[4] M. Durairaj and N. Ramasamy, "A comparison of the perceptive approaches for preprocessing the data set for predicting fertility success rate," Int. J. Control Theory Appl., vol. 9, no. 27, 2016.
[5] J. Mourao-Miranda, A. L. W. Bokde, C. Born, H. Hampel, and M. Stetter, "Classifying brain states and determining the discriminating activation patterns: support vector machine on functional MRI data," Neuroimage, vol. 28, no. 4, pp. 980–995, 2005.
[6] S. Ghwanmeh, A. Mohammad, and A. Al-Ibrahim, "Innovative artificial neural networks-based decision support system for heart diseases diagnosis," 2013.
[7] F. Amato, A. López, E. M. Peña-Méndez, P. Vaňhara, A. Hampl, and J. Havel, "Artificial neural networks in medical diagnosis." Elsevier, 2013.
[8] S. K. Gomath, "Heart Disease Prediction Using Data Mining Classification," Int. J. Res. Appl. Sci. Eng. Technol., vol. 4, no. 2, 2016, doi: 10.18775/ijmsba.1849-5664-5419.2014.43.1004.
[9] R. V. Sarangam Kodati, "A Comparative Study on Open Source Data Mining Tool for Heart Disease," Int. J. Innov. Adv. Comput. Sci., vol. 7, no. 3, 2018, [Online]. Available: http://www.diva-portal.org/smash/get/diva2:1080911/FULLTEXT01.pdf.
[10] N. Singh, P. Firozpur, and S. Jindal, "Heart disease prediction system using hybrid technique of data mining algorithms," Int. J. Adv. Res. Ideas Innov. Technol., vol. 4, no. 2, pp. 982–987, 2018.
[11] S. Krishnan and S. Geetha, "Prediction of Heart Disease Using Machine Learning Algorithms," in 2019 1st International Conference on Innovations in Information and Communication Technology (ICIICT), 2019, pp. 1–5.
[12] S. Wang, L. Jiang, and C. Li, "Adapting naive Bayes tree for text classification," Knowl. Inf. Syst., vol. 44, no. 1, pp. 77–89, 2015.
[13] L. Li, Y. Wu, and M. Ye, "Experimental comparisons of multi-class classifiers," Informatica, vol. 39, no. 1, 2015.
[14] P. Ahmad, S. Qamar, and S. Q. A. Rizvi, "Techniques of data mining in healthcare: a review," Int. J. Comput. Appl., vol. 120, no. 15, 2015.
[15] S. S. Nikam, "A comparative study of classification techniques in data mining algorithms," Orient. J. Comput. Sci. Technol., vol. 8, no. 1, pp. 13–19, 2015.
[16] Ronit, "Heart Disease UCI," 2018. https://www.kaggle.com/ronitf/heart-disease-uci (accessed Nov. 12, 2020).
[17] M. Siddhartha, "Heart Disease Dataset (Comprehensive)," 2019. https://www.kaggle.com/sid321axn/heart-statlog-cleveland-hungary-final (accessed Nov. 12, 2020).
[18] V. Madaan and A. Goyal, "Predicting Ayurveda-Based Constituent Balancing in Human Body Using Machine Learning Methods," IEEE Access, vol. 8, pp. 65060–65070, 2020, doi: 10.1109/ACCESS.2020.2985717.
[19] V. Madaan and A. Goyal, "Analysis and Synthesis of a Human Prakriti Identification System Based on Soft Computing Techniques," Recent Patents on Computer Science, vol. 12, no. 1, pp. 1–10, 2019, doi: 10.2174/2213275912666190207144831.
[20] P. Agrawal, V. Madaan, and V. Kumar, "Fuzzy Rule Based Medical Expert System to Identify the Disorders of Eyes, ENT and Liver," International Journal of Advanced Intelligence Paradigm (IJAIP), vol. 7, issue 3-4, pp. 352–367, Inderscience Publications, 2015.
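
Illustrative sketch

The posterior probability used by Naive Bayes (Section 3.4) and the evaluation measures of Section 3.5 (Equations 1 to 4) can be sketched in a few lines of Python. This is a minimal illustration and not the WEKA implementation used in the study. It assumes the standard form of Bayes' theorem, P(c|x) = P(x|c) * P(c) / P(x). The function names `posterior` and `evaluation_metrics` and all numbers in the usage section are hypothetical and are not taken from the paper's datasets or results.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    if evidence_x <= 0:
        raise ValueError("P(x) must be positive")
    return likelihood_x_given_c * prior_c / evidence_x


def evaluation_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-score from confusion-matrix counts."""
    total = tp + fn + fp + tn
    if total == 0 or (tp + fn) == 0 or (tp + fp) == 0:
        raise ValueError("counts must give non-zero denominators")
    accuracy = (tp + tn) / total          # Equation (1)
    recall = tp / (tp + fn)               # Equation (2)
    precision = tp / (tp + fp)            # Equation (3)
    if recall + precision == 0:
        raise ValueError("F-score is undefined when recall and precision are 0")
    # Equation (4): harmonic mean of recall and precision
    f_score = (2 * recall * precision) / (recall + precision)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_score": f_score,
    }


if __name__ == "__main__":
    # Hypothetical probabilities, chosen only for illustration.
    print(posterior(prior_c=0.5, likelihood_x_given_c=0.6, evidence_x=0.4))

    # Hypothetical confusion-matrix counts, not results from the paper.
    for name, value in evaluation_metrics(tp=40, fn=10, fp=5, tn=45).items():
        print(f"{name}: {value:.4f}")
```

Because the F-score is a harmonic mean, it stays low whenever either recall or precision is low, which is why it is convenient for comparing two classifiers with a single number.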