Prediction of Heart Disease Mortality Rate Using Data Mining

Prasenjit Das (a), Shaily Jain (b), Chetan Sharma (c), Shankar Shambhu (a), Sakshi (d)

a Chitkara University School of Computer Applications, Chitkara University, Himachal Pradesh, India
b Chitkara University Institute of Engineering and Technology, Chitkara University, Himachal Pradesh, India
c Chitkara University, Himachal Pradesh, India
d Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India

ACI’21: Workshop on Advances in Computational Intelligence at ISIC 2021, February 25-27, 2021, Delhi, India
EMAIL: prasenjit.das@chitkarauniversity.edu.in (P. Das); shaily.jain@chitkarauniversity.edu.in (S. Jain); chetan.sharma@chitkarauniversity.edu.in (C. Sharma); shankar.shambhu@chitkarauniversity.edu.in (S. Shambhu); sakshi@chitkara.edu.in (Sakshi)
ORCID: 0000-0002-7988-2418 (P. Das); 0000-0001-6078-3607 (S. Jain); 0000-0001-5401-8503 (C. Sharma); 0000-0002-2348-1041 (S. Shambhu); 0000-0002-8757-4001 (Sakshi)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

Abstract

Heart disease is among the most acute diseases and has the highest mortality rate in the world. Only prediction and timely treatment of this deadly disease can reduce its impact. Our paper aims to predict death from heart disease using different data mining algorithms with the utmost accuracy. In this context, we have used five data mining algorithms, Naive Bayes, LibLINEAR, Naive Bayes tree, classification via regression, and Bayesian network, implemented in WEKA on a dataset from the UCI repository. According to the results obtained, all five algorithms predict with good accuracy. We have evaluated accuracy, F-measure, recall, and precision to compare the algorithms under consideration. The Bayes network outperforms the rest with a maximum accuracy of 79.26%. Its values for the other parameters are also the highest among the five algorithms.

Keywords: Classification, Prediction, Algorithms, Heart Disease, Data Mining, WEKA

1. INTRODUCTION

Data mining is a branch of computer science that is widely used in many fields. Data mining means digging out knowledge or useful information from a vast amount of data. Through data mining, we can explore small to large datasets to uncover useful information that was previously hidden or unknown and to detect relationships between parameters that could not be found with statistical methods alone. In the health care industry, data mining techniques can be applied to diagnose and predict the occurrence of disease and the probability of death. Early prediction and diagnosis of the disease can further decrease the death rate.

Cardiovascular disease is the most commonly occurring disease, leading to the maximum number of deaths around the globe [1]. According to WHO, more than 19 million people died from cardiovascular diseases in 2018, and around 4 million of these deaths were of non-senior citizens.

Large amounts of data are available in the health care industry, which can be mined to uncover hidden information about diseases and used for effective decision making beforehand [2]. Many researchers have been motivated by the increasing mortality rate of cardiovascular diseases and have started working on extracting useful information using various data mining techniques [3]. Hence, if we can design a prediction system for diseases such as heart disease using machine learning or deep learning methods, medical professionals can foresee symptoms or problems related to the heart based on the available patient data and the various attributes that contribute to the occurrence of heart disease. One major challenge is that assisting doctors in diagnosing the world's most deadly disease demands the utmost accuracy [4]. Hence, most research aims to improve diagnostic accuracy.

This paper uses different classification algorithms and compares them on accuracy in predicting the death rate, F-measure, precision, and recall. Section 2 covers the related work in this domain, and the proposed methodology is discussed in Section 3. Section 4 presents our experimental setup along with our results and a discussion of them. Finally, the paper is concluded in Section 5.

2. Background and Motivation

Intensive research has been going on for the past few decades to predict heart disease using data mining techniques. Various data mining algorithms such as Naive Bayes, Decision Tree, Neural Network, Support Vector Machine (SVM), Logistic Regression, k-Nearest Neighbour, Artificial Neural Network, Random Forest, and J48 have already been used by researchers, reaching different levels of accuracy on multiple datasets around the globe [5].

Guidi et al. [6] designed a clinical decision support system (CDSS) for heart failure analysis. Their paper compares the performance of various machine learning classifiers, namely artificial neural network (ANN), support vector machine (SVM), a CART system with fuzzy rules, and random forests; the CART model and random forest performed best, achieving an accuracy of 87.6%. In [7], the authors proposed a logistic regression classifier, applied after feature selection, as the basis of a decision support system for the classification of heart disease and achieved an accuracy of 77%. The authors of [8] used two approaches, multilayer perceptron (MLP) and support vector machine, to classify heart disease and reached an accuracy of 80.41%. In [9], the authors proposed and evaluated a hybrid classification system for heart disease that combines fuzzy and artificial neural network techniques and achieved an accuracy of 87.4%. Palaniappan and Awang [10] applied Naive Bayes, ANN, and Decision Tree algorithms to diagnose the existence of heart disease. According to their results, ANN comes out as the best predictive model with an accuracy of 88.12%, compared to Naive Bayes with an accuracy of 86.12% and Decision Tree with only 80.4%. The authors of [11] proposed a three-phase model for heart disease diagnosis and achieved an accuracy of 88.89%.

Accuracy is the most important factor in prediction, but it is not the only one. Some researchers have also considered other parameters such as precision, recall, F-measure, and R² values in heart disease prediction. In [12], the authors used a dimensionality reduction technique to first process the raw data of 74 features and then divided the features into three groups. They achieved the highest accuracy of 99.4% for CH, with 100% precision and 97.1% recall, using CHI-PCA with an RF classifier. Shamsollahi et al. [13] combined predictive and descriptive approaches for predicting coronary artery disease. They selected the k-means method for clustering (descriptive) and various classification methods (predictive), including CHAID, Quest, C5.0, C&RT decision tree, and ANN. They compared the results on precision, accuracy, specificity, sensitivity, and error rate. As per the results, C&RT comes out as the best method for the entire dataset, with an error rate of only 0.074. In [14], the authors applied decision tree classification using the J48, random forest, and logistic model tree algorithms on data from the UCI repository. Their results indicate that J48 is the best of these classifiers for heart disease prediction because it achieves the highest accuracy and the smallest total time to build; moreover, the effect of pruning is clearly visible. They achieved an accuracy of only 56.76% with a build time of 0.04 seconds for J48, while logistic model trees reached an accuracy of only 55.77% with a build time of 0.39 seconds.

The authors of [15] implemented five different classification algorithms, Naïve Bayes, Decision Tree, Discriminant, Random Forest, and Support Vector Machine, on big datasets and compared their performance in terms of accuracy, precision, specificity, recall, and F-measure. Among the five classifiers, the decision tree ranks first with an accuracy of 99.0%, and random forest stands second with an accuracy of 93.4%.

3. Proposed Methodology

The process flow of the experiment is shown in Figure 1, and the following subsections explain the proposed methodology.

Figure 1: Methodology used

3.1 Dataset

We have taken the UCI repository dataset named Heart Failure Prediction from Kaggle [16]. The dataset has 13 attributes and 299 records in total. The patient data covers 194 male patients and 105 female patients. However, we have used only 12 attributes for this experimentation. The attributes are listed in Table 1. We have not used the Time attribute, which records the follow-up time with the patient, because we do not consider it relevant to our study.

Table 1: Dataset Information

| Attribute ID | Attribute Used | Attribute Information |
|---|---|---|
| A1 | Age | Age of the patient. The value ranges from 40 years to 95 years |
| A2 | Sex | Gender of the patient in binary form. 1 = male. 0 = female |
| A3 | Anemia | Reduction in hemoglobin. 1 = yes. 0 = no |
| A4 | Creatinine Phosphokinase | Level of the creatinine phosphokinase (CPK) enzyme in the blood, measured in micrograms per liter |
| A5 | Diabetes | Fasting blood sugar of the patient. If greater than 120 mg/dl the value is 1 (true), otherwise the value is 0 (false) |
| A6 | Ejection Fraction | Percentage of blood leaving the heart; ranges from 14 to 80 |
| A7 | High Blood Pressure | Whether the patient has high blood pressure (BP > 120/80). 1 = yes. 0 = no |
| A8 | Platelets | Platelet count in the blood, in kiloplatelets/ml |
| A9 | Serum Creatinine | Creatinine level in the blood, in mg/dl |
| A10 | Serum Sodium | Sodium level in the patient's blood, in milliequivalents per liter |
| A11 | Smoking | Whether the patient smokes. 1 = yes. 0 = no |
| A12 | Time | Follow-up time with the patient |
| A13 | DEATH_EVENT | Occurrence of death due to heart disease. 1 = yes. 0 = no |

3.2 Data Pre-Processing

Real-life data contains redundant values and a lot of noise. The data needs to be cleaned, and the missing values need to be filled, before the data is used to generate a model. Pre-processing takes care of these issues so that the prediction can be made accurately. Once the data has been cleaned, i.e., the noise has been removed and the missing values have been filled, it needs to be transformed. Many supervised learning algorithms work on nominal or cardinal data, so data transformation is applied to the dataset obtained from UCI in the present work. Reduction is then applied to convert the complex dataset into a simpler form, which improves the model's accuracy.

3.3 Tool Used

The WEKA 3.8.4 machine learning tool, written in Java and developed at the University of Waikato, is used to conduct this study. WEKA provides different classifiers whose performance can be examined, and it supports data mining tasks such as pre-processing, classification, regression, and many more. WEKA accepts the .csv and .arff file formats, and the chosen dataset is already available in the required format.

3.4 Classification Algorithms

After an intensive literature review, we have selected five classification algorithms: regression, naive Bayes tree, naive Bayes classification, Bayes network, and LibLINEAR.

Regression [17][18]: Regression is a supervised learning technique used to predict the class of an instance when the target values of the training data are known [19]. The current study uses regression to generate a model from parameters such as age and gender and then to predict the unknown class. The technique works as follows. The parameters used to make the prediction are continuous variables (θ1, θ2, ..., θn). Based on these parameters, the model tries to find the best fit to predict the target variable Y and to improve the accuracy. Using a function F of one or more predictors (x1, x2, ..., xn) and an error term e, the value of the target variable Y is calculated as

$$Y = F(x, \theta) + e \tag{1}$$

The target variable Y depends on the predictor variables, which are independent of each other. The model is generated from the relation between the predictors and the target class during the training process. The model thus built is then fed unseen data for which the target value is predicted. The proportion of correctly predicted classes constitutes the accuracy and establishes the effectiveness of the model.

Naive Bayes Tree: This is a hybrid approach in which the model is generated using both the naive Bayes and the decision tree approaches. The naive Bayes classification assumes that the features are independent of each other, whereas the decision tree assumes that the features depend on each other. The hybrid approach takes advantage of both. A decision tree is built by considering only one feature at a node, and the output is passed on to the next node. Based on the outcome at each node, further features are selected. In the hybrid approach, the split is done in the same manner, considering only one feature at every node, but with naive Bayes classifiers at the leaves. In large datasets, data splitting is considered a vital task for classification, so we have implemented the naive Bayes tree classification.

Naive Bayes Classification [20]–[22]: This classification technique is based on Bayes' theorem and works on the assumption that the presence of one feature is independent of any other feature. The advantage of the naive Bayes classification is that it requires only a small amount of data to train the model. Bayes' theorem provides a way of calculating the posterior probability P(c|x), i.e., the probability of c under the condition that x is true, from P(c), P(x), and P(x|c):

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \tag{2}$$

where:
P(c|x) is the posterior probability of class c given that x has been observed.
P(c) is the prior probability of the class.
P(x|c) is the conditional probability of x given that c has occurred.
P(x) is the prior probability of the predictor x.

Bayes Network: The naive Bayes algorithm assumes the independence of features. This assumption hampers the performance of the NB classifier. To improve performance, the Bayes network algorithm is applied. The network is a directed acyclic graph that encodes the joint probability distribution of the random variables (features). Each node of the graph represents a feature, and each edge represents a dependence between features. This relaxes the assumption that the features are independent of each other. The independencies that remain are then used to reduce the number of parameters needed to specify the probability distribution and to compute the posterior probabilities. Mathematically, let $U = \{X_1, X_2, \ldots, X_n\}$ be the set of random variables. A Bayesian network over $U$ is an ordered pair $B = (G, \Theta)$. The first component, $G$, is the acyclic graph, whose vertices represent the random variables $X_1, X_2, \ldots, X_n$ and whose edges represent the relationships between these variables. The second component, $\Theta$, is the set of parameters that quantify the network. It contains a parameter $\theta_{x_i \mid \Pi_{x_i}} = P_B(x_i \mid \Pi_{x_i})$ for each possible value $x_i$ of $X_i$ and $\Pi_{x_i}$ of $\Pi_{X_i}$, where $\Pi_{X_i}$ denotes the set of parents of $X_i$ in $G$. A Bayesian network $B$ defines a unique joint probability distribution over $U$:

$$P_B(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P_B(X_i \mid \Pi_{X_i}) \tag{3}$$

LibLINEAR: LIBLINEAR is an open-source library for linear classification. It supports two linear classifiers, logistic regression and the linear support vector machine. Given a set of instance-label pairs $(x_i, y_i)$, $i = 1, \ldots, l$, $x_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$, both methods solve the following unconstrained optimization problem with different loss functions $\xi(w; x_i, y_i)$:

$$\min_{w} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi(w; x_i, y_i) \tag{4}$$

where $C > 0$ is a penalty parameter.

3.5 Evaluation Metrics

We have considered four parameters in this paper: accuracy, recall, precision, and F-score. In the present work, the class to be predicted is whether or not a person with certain attributes has died because of heart disease, so C in the confusion matrix of Figure 2 denotes the class of patients who died. TP is the number of people who died because of heart disease and for whom the model predicted the same. Similarly, TN is the number of people who did not die of a heart ailment and for whom the model predicted the same. A false positive (FP) is a Type I error: the model predicted that the person died of the ailment, but the patient actually did not. A false negative (FN) is a Type II error: the model predicted that the person did not die of the ailment, but he or she did.

The accuracy of the model is calculated as

$$\text{Accuracy} = \frac{TP + TN}{\text{Total no. of instances}} \tag{5}$$

Recall is the fraction of the actual positive instances that are correctly predicted as positive:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{6}$$

Precision is the fraction of the instances predicted as positive that are actually positive:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{7}$$

Comparing two models becomes difficult when one has low precision and high recall and the other has the reverse; in that case neither parameter alone is of much use for comparison. The F-score is used to compare the models in such cases. It is the harmonic mean of the two values, which measures recall and precision at the same time. The harmonic mean is used instead of the arithmetic mean because the arithmetic mean can be inflated by a single extreme value.

$$\text{F-score} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \tag{8}$$

| Actual class \ Predicted class | C | Not in C |
|---|---|---|
| C | True Positives (TP) | False Negatives (FN) |
| Not in C | False Positives (FP) | True Negatives (TN) |

Figure 2: Confusion Matrix

3.6 k-Fold Cross-Validation

In k-fold cross-validation, the dataset is divided into k parts of equal size; k-1 parts are used for training and the remaining part is used for evaluation [23]. For instance, with 10-fold cross-validation, 90 percent of the data is used for training the classifier, and the remaining 10 percent is used for evaluation.

4. Results and Discussion

The five chosen classification algorithms were implemented on the heart disease dataset of the UCI repository. The experimental results have been obtained with WEKA 3.8.4. We used k = 5, 10, and 20 for cross-validation and evaluated the four parameters mentioned above for the five classification algorithms. Table 2 tabulates the accuracy, F-measure, precision, and recall obtained with 5-fold cross-validation. Similarly, Table 3 and Table 4 show the results with 10-fold and 20-fold cross-validation. Table 5 tabulates the results when 66% of the data is used for training the system and the remaining 34% for evaluating the results.

From the results, the Bayesian network performs best overall. It achieves the highest accuracy, precision, F-measure, and recall with 5-fold and 10-fold cross-validation. With 20-fold cross-validation it ties with classification via regression on accuracy (75.91%), and with the percentage split it ties with LibLINEAR on accuracy (74.50%), although LibLINEAR has a slightly higher precision there. The Bayes network uses an acyclic graph in which each node represents a feature and each edge represents its relation with other features. In the present work, features such as age, gender, blood pressure, and diabetes contribute towards heart disease [24], which explains why this classifier outperforms the others. This supports our hypothesis that when features such as age and gender are modeled in the form of a graph (where they depend on each other), the heart-related ailment depends on these factors. We can therefore use this technique for the prediction of heart disease [25].
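As a reading aid for the tables that follow, here is a minimal Python sketch of how the four reported parameters can be computed from a confusion matrix, following Eqs. (5) to (7) and the F-score formula of Section 3.5. The counts are hypothetical and are not taken from our experiments, and the names `BinaryCounts`, `flip`, and `weighted_average` are illustrative only. The class-weighted averaging step is an assumption about how a single precision, recall, and F-measure value per classifier could be summarized across both classes; it is not a description of the exact procedure used to produce the tables.

```python
from dataclasses import dataclass


@dataclass
class BinaryCounts:
    """Confusion-matrix cells with one class (here: death) treated as positive."""
    tp: int
    fn: int
    fp: int
    tn: int


def accuracy(c: BinaryCounts) -> float:
    # Eq. (5): correct predictions over all instances.
    return (c.tp + c.tn) / (c.tp + c.fn + c.fp + c.tn)


def recall(c: BinaryCounts) -> float:
    # Eq. (6): correctly predicted positives over all actual positives.
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def precision(c: BinaryCounts) -> float:
    # Eq. (7): correctly predicted positives over all predicted positives.
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def f_score(c: BinaryCounts) -> float:
    # Harmonic mean of recall and precision.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def flip(c: BinaryCounts) -> BinaryCounts:
    """Same matrix, but with the other class treated as positive."""
    return BinaryCounts(tp=c.tn, fn=c.fp, fp=c.fn, tn=c.tp)


def weighted_average(metric, c: BinaryCounts) -> float:
    """Average a per-class metric, weighting each class by its actual size.

    Assumption: this mirrors a class-weighted summary row; it is not
    confirmed as the procedure behind the reported tables.
    """
    pos, neg = c, flip(c)
    n_pos = pos.tp + pos.fn
    n_neg = neg.tp + neg.fn
    return (n_pos * metric(pos) + n_neg * metric(neg)) / (n_pos + n_neg)


# Hypothetical counts for illustration only (200 instances, 70 positives).
counts = BinaryCounts(tp=50, fn=20, fp=10, tn=120)

for name, fn in [("precision", precision), ("recall", recall), ("F-score", f_score)]:
    print(f"{name}: positive class {fn(counts):.3f}, weighted {weighted_average(fn, counts):.3f}")
print(f"accuracy: {accuracy(counts):.3f}")
```

Under class-weighted averaging, the weighted recall is always equal to the accuracy, because each class's recall is multiplied by that class's size and the products sum to the number of correct predictions. The recall and accuracy columns of Tables 2 to 5 agree to rounding, which is consistent with this reading.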
Table 2: Performance Comparison of classifiers (k=5)

| Algorithms | Accuracy (%) | F-Measure (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| LibLINEAR | 74.24 | 73.1 | 73.1 | 74.2 |
| Naïve Bayes | 77.92 | 77.6 | 77.4 | 77.9 |
| NB Tree | 73.91 | 73.1 | 72.9 | 73.9 |
| Bayes Net | 78.59 | 78.7 | 78.8 | 78.6 |
| Classification via Regression | 71.90 | 70.1 | 70.2 | 71.9 |

Table 3: Performance Comparison of classifiers (k=10)

| Algorithms | Accuracy (%) | F-Measure (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| LibLINEAR | 76.58 | 75.6 | 75.7 | 76.6 |
| Naïve Bayes | 77.92 | 77.8 | 77.7 | 77.9 |
| NB Tree | 77.59 | 77.4 | 77.3 | 77.6 |
| Bayes Net | 79.26 | 79.5 | 79.8 | 79.3 |
| Classification via Regression | 75.25 | 73.4 | 74.2 | 75.3 |

Table 4: Performance Comparison of classifiers (k=20)

| Algorithms | Accuracy (%) | F-Measure (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| LibLINEAR | 72.90 | 71.9 | 71.7 | 72.9 |
| Naïve Bayes | 75.25 | 75.2 | 75.1 | 75.3 |
| NB Tree | 74.91 | 74.7 | 74.5 | 74.9 |
| Bayes Net | 75.91 | 76.2 | 76.7 | 75.9 |
| Classification via Regression | 75.91 | 74.3 | 75 | 75.9 |

Table 5: Performance Comparison of classifiers (percentage split = 66%)

| Algorithms | Accuracy (%) | F-Measure (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| LibLINEAR | 74.50 | 73 | 74.8 | 74.5 |
| Naïve Bayes | 72.54 | 71.8 | 72 | 72.5 |
| NB Tree | 72.54 | 71.8 | 72 | 72.5 |
| Bayes Net | 74.50 | 74.4 | 74.3 | 74.5 |
| Classification via Regression | 73.52 | 73.7 | 74 | 73.5 |

5. Conclusion and Future Scope

In this paper, five data mining classifiers (LibLINEAR, Naive Bayes, Naive Bayes tree, Bayes network, and classification via regression) have been implemented on heart disease data taken from the UCI repository. The goal of this experimentation is to measure the accuracy with which heart disease in patients can be predicted. We achieved the highest accuracy of 79.26% with the Bayesian network classifier, followed by naive Bayes. The reason behind the strong performance of the Bayesian network is its use of graphs, as graphs can better reflect the relationships between dependent variables such as those in our dataset, for example smoking habit, diabetes, and high BP. Hence, we get better accuracy, which supports the view that these factors contribute to the occurrence of heart disease.

In the future, we could use these results to design an effective prediction system that could help medical practitioners diagnose and treat heart disease. We could also apply these data mining techniques to other diseases such as diabetes.

References

[1] S. Gupta, D. Kumar, and A. Sharma, “Performance analysis of various data mining classification techniques on healthcare data,” Int. J. Comput. Sci. Inf. Technol., vol. 3, no. 4, pp. 155–169, 2011.
[2] J. Soni, U. Ansari, D. Sharma, and S. Soni, “Predictive data mining for medical diagnosis: An overview of heart disease prediction,” Int. J. Comput. Appl., vol. 17, no. 8, pp. 43–48, 2011.
[3] C. S. Dangare and S. S. Apte, “Improved study of heart disease prediction system using data mining classification techniques,” Int. J. Comput. Appl., vol. 47, no. 10, pp. 44–48, 2012.
[4] S. Sa, “Intelligent heart disease prediction system using data mining techniques,” Int. J. Healthc. Biomed. Res., vol. 1, pp. 94–101, 2013.
[5] S. Nazir, S. Shahzad, S. Mahfooz, and M. Nazir, “Fuzzy logic based decision support system for component security evaluation,” Int. Arab J. Inf. Technol., vol. 15, no. 2, pp. 224–231, 2018.
[6] G. Guidi, M. C. Pettenati, P. Melillo, and E. Iadanza, “A machine learning system to improve heart failure patient assistance,” IEEE J. Biomed. Heal. informatics, vol. 18, no. 6, pp. 1750–1756, 2014.
[7] R. Detrano et al., “International application of a new probability algorithm for the diagnosis of coronary artery disease,” Am. J. Cardiol., vol. 64, no. 5, pp. 304–310, 1989.
[8] M. Gudadhe, K. Wankhade, and S. Dongre, “Decision support system for heart disease based on support vector machine and artificial neural network,” in 2010 International Conference on Computer and Communication Technology (ICCCT), 2010, pp. 741–745.
[9] H. Kahramanli and N. Allahverdi, “Design of a hybrid system for the diabetes and heart diseases,” Expert Syst. Appl., vol. 35, no. 1–2, pp. 82–89, 2008.
[10] S. Palaniappan and R. Awang, “Intelligent heart disease prediction system using data mining techniques,” in 2008 IEEE/ACS international conference on computer systems and applications, 2008, pp. 108–115.
[11] E. O. Olaniyi, O. K. Oyedotun, and K. Adnan, “Heart diseases diagnosis using neural networks arbitration,” Int. J. Intell. Syst. Appl., vol. 7, no. 12, p. 72, 2015.
[12] A. K. Garate-Escamilla, A. H. E. L. Hassani, and E. Andres, “Classification models for heart disease prediction using feature selection and PCA,” Informatics Med. Unlocked, p. 100330, 2020.
[13] M. Shamsollahi, A. Badiee, and M. Ghazanfari, “Using combined descriptive and predictive methods of data mining for coronary artery disease prediction: a case study approach,” J. AI Data Min., vol. 7, no. 1, pp. 47–58, 2019.
[14] J. Patel, D. TejalUpadhyay, and S. Patel, “Heart disease prediction using machine learning and data mining technique,” Hear. Dis., vol. 7, no. 1, pp. 129–137, 2015.
[15] I. A. Zriqat, A. M. Altamimi, and M. Azzeh, “A comparative study for predicting heart diseases using data mining classification methods,” arXiv Prepr. arXiv1704.02799, 2017.
[16] G. J. Davide Chicco, “Heart Failure Prediction,” 2015. https://www.kaggle.com/andrewmvd/heart-failure-clinical-data (accessed Nov. 10, 2020).
[17] F. E. Harrell, “Ordinal logistic regression,” in Regression modeling strategies, Springer, 2015, pp. 311–325.
[18] V. Vapnik, The nature of statistical learning theory. Springer science & business media, 2013.
[19] K. Larsen, J. H. Petersen, E. Budtz-Jørgensen, and L. Endahl, “Interpreting parameters in the logistic regression model with random effects,” Biometrics, vol. 56, no. 3, pp. 909–914, 2000.
[20] L. Li, Y. Wu, and M. Ye, “Experimental comparisons of multi-class classifiers,” Informatica, vol. 39, no. 1, 2015.
[21] P. Ahmad, S. Qamar, and S. Q. A. Rizvi, “Techniques of data mining in healthcare: a review,” Int. J. Comput. Appl., vol. 120, no. 15, 2015.
[22] S. S. Nikam, “A comparative study of classification techniques in data mining algorithms,” Orient. J. Comput. Sci. Technol., vol. 8, no. 1, pp. 13–19, 2015.
[23] V. Madaan and A. Goyal, “Predicting Ayurveda-Based Constituent Balancing in Human Body Using Machine Learning Methods,” IEEE Access, vol. 8, pp. 65060–65070, 2020, doi: 10.1109/ACCESS.2020.2985717.
[24] V. Madaan and A. Goyal, “Analysis and Synthesis of a Human Prakriti Identification System Based on Soft Computing Techniques,” Recent Patents on Computer Science, vol. 12, no. 1, pp. 1–10, 2019, doi: 10.2174/2213275912666190207144831.
[25] P. Agrawal, V. Madaan, and V. Kumar, “Fuzzy Rule Based Medical Expert System to Identify the Disorders of Eyes, ENT and Liver,” International Journal of Advanced Intelligence Paradigm (IJAIP), vol. 7, no. 3-4, pp. 352–367, Inderscience Publications, 2015.