Comparison of Classifiers for Predicting Heart Attack in Patients

Oliwia Cimała1,∗,†, Maria Bocheńska1,†

1 Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, Poland

Abstract
Heart attack prediction plays a pivotal role in patients' health. Clinicians have two options for responding quickly to a health issue: run many tests on a patient to determine what is wrong, or compare the patient's data with that of other patients to classify the case and narrow the search to the right field. This study presents a comprehensive comparison of three classification algorithms, the Soft Set Classifier, Naive Bayes, and K-Nearest Neighbors (KNN), for predicting heart attack in patients. Through experimentation with different variations of these algorithms, including custom implementations, the project evaluates their effectiveness in recognizing a high or low chance of heart attack. Methodologically, the project explores the nuances of each algorithm, discussing their underlying principles and implementation details. Experimental results reveal insights into the performance of each algorithm, providing valuable considerations for practical applications. Additionally, the project discusses the significance of the precision, recall, F1-score, and accuracy metrics in assessing algorithm performance. Overall, this study contributes to advancing heart attack prediction technology, offering valuable insights into algorithmic approaches.

Keywords
Soft Set Classifier, Naive Bayes, K-Nearest Neighbors, Heart Attack Prediction, Machine Learning

1. Introduction

The heart is vital to the body's function, acting as a powerful pump that circulates blood, oxygen, and essential nutrients throughout the body. This cardiovascular system ensures that all bodily tissues receive the resources they need to operate effectively. Consequently, any issues with the heart can disrupt the normal functioning of other organs and systems, leading to widespread health problems [1].
Heart disease is responsible for about one-third of all human deaths worldwide [2], making accurate and timely diagnosis critical for effective treatment. Traditional diagnostic methods often rely on various tests and clinical evaluations, which can be time-consuming and costly. With the advancement of machine learning, there is increasing interest in developing automated systems for predicting heart disease from patient data [3, 4, 5].

*IVUS2024: Information Society and University Studies 2024, May 17, Kaunas, Lithuania
∗ Corresponding author. † These authors contributed equally.
oc307854@student.polsl.pl (O. Cimała); mb307847@student.polsl.pl (M. Bocheńska)
ORCID: 0009-0002-1923-0781 (O. Cimała); 0009-0001-3285-9229 (M. Bocheńska)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings, ISSN 1613-0073, ceur-ws.org.

Existing solutions leverage different algorithms to achieve this goal, including logistic regression, decision trees, random forests, voting ensembles, and neural networks [6]. However, our study focuses on comparing three distinct classifiers: the Soft Set Classifier [7], Naive Bayes [8], and K-Nearest Neighbors (KNN) [9]. Each of these algorithms offers unique advantages and challenges, which we explore in the context of heart disease prediction. To take a closer look at the applied classifiers, the following paragraphs briefly describe them to illustrate the differences between these calculation methods.

The Soft Set classifier is a flexible and general mathematical tool for handling uncertainty in data. It does not rely on predefined probabilities or distances, making it particularly useful in situations where traditional probabilistic or distance-based models such as Naive Bayes or K-Nearest Neighbors (KNN) may not perform well.
The classifier iteratively adjusts the membership values based on the training data, enabling it to handle imprecise and vague information effectively. The model's adaptability to various forms of uncertainty makes it a valuable tool in fields where data ambiguity is prevalent.

The Naive Bayes classifier is a probabilistic machine learning model based on Bayes' theorem, which calculates the probability of a certain class given a set of features. It assumes that the features are conditionally independent, hence "naive."

K-Nearest Neighbors (KNN) is a non-parametric supervised learning algorithm used for classification and regression tasks. In KNN, the class of a new data point is determined by the majority class among its k nearest neighbors in the feature space. It is simple to implement and understand, but it can be computationally expensive for large datasets, as it requires storing all training data and computing distances for each prediction.

The three algorithms have different time requirements, with K-Nearest Neighbors (KNN) being the most computationally expensive because it must calculate distances for each prediction. When implementing the algorithms, we follow the same structure for each class. Each class contains two functions, fit and predict, and, where needed, additional functions such as a distance function or the score of a given sample.

Let us now briefly explain each of the applied algorithms and the reasoning behind their selection. The first classifier is the Soft Set classifier, which we implemented independently. Next, the Naive Bayes classifier comes from a library, slightly adapted to match the structure of the others (it also has fit and predict functions in the Bayes class). The third classifier is a K-Nearest Neighbours algorithm, in this instance written by us. It was created following open-access models with the aim of achieving as high an accuracy as possible.
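To illustrate the shared fit/predict class structure described above, the following is a minimal sketch of the independently written Soft Set classifier as we read its pseudocode (Section 2): a perceptron-style update with a regularization term, assuming classify(x) is the dot product of the weight vector with x. The class and parameter names are our own illustration, not the authors' exact code.

```python
import numpy as np

class SoftSetClassifier:
    """Perceptron-style sketch of the Soft Set classifier (illustrative reading)."""

    def __init__(self, n_iters=100, lambda_param=0.01):
        self.n_iters = n_iters
        self.lambda_param = lambda_param

    def _classify(self, x):
        # Membership score: its sign gives the predicted class.
        return np.dot(self.Y, x)

    def fit(self, X_train, y_train):
        X = np.asarray(X_train, dtype=float)
        y = np.asarray(y_train, dtype=float)  # labels assumed to be -1 / +1
        self.Y = np.zeros(X.shape[1])         # one weight per feature
        for _ in range(self.n_iters):
            for x_i, y_i in zip(X, y):
                # Update only on margin violations, shrinking Y via the
                # regularization parameter.
                if y_i * self._classify(x_i) <= 1:
                    self.Y = self.Y + y_i * x_i - 2 * self.lambda_param * self.Y
        return self

    def predict(self, X_test):
        # Predicted label is the sign of the classification score.
        return np.sign([self._classify(x) for x in np.asarray(X_test, dtype=float)])
```

Any of the three classifiers can then be used interchangeably, since all expose the same fit and predict methods.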
After performing the calculations, each algorithm displays a confusion matrix and a table with the results of its effectiveness in identifying a low or high probability of heart attack.

2. Methodology

This section details the methodologies used for each classifier, including their mathematical foundations and implementation specifics.

2.1. Soft Set Classifier

The Soft Set Classifier, from a mathematical perspective, assigns to each element of the set X a value from the interval [-1, 1], representing the degree of membership of that element to the set X. A membership value of 1 indicates assignment to the negative class, while a membership value of -1 indicates assignment to the positive class.

Algorithm 1: Soft Set Classifier
Input: Training set X_train, training labels y_train, number of iterations n_iters, regularization parameter λ_param
Output: Fitted model Y
1  Initialize weight vector Y to zeros of length equal to the number of features;
2  for iteration in range n_iters do
3      for each sample (x_i, y_i) in (X_train, y_train) do
4          if y_i * classify(x_i) ≤ 1 then
5              Update Y ← Y + y_i * x_i − 2 * λ_param * Y;
6  return fitted weight vector Y

Algorithm 2: Soft Set Prediction
Input: Test set X_test, fitted weight vector Y
Output: Predicted labels y_pred
1  for each sample x_i in X_test do
2      Compute the classification score: classification ← classify(x_i);
3      Assign the label: y_pred ← sign(classification);
4  return predicted labels y_pred

2.2. Naive Bayes Classifier

The Naive Bayes classifier is based on Bayes' theorem and assumes that the features are conditionally independent given the class label. The implementation follows this rule:

P(y|X) = P(X|y) P(y) / P(X),

where P(y|X) is the posterior probability of class y given feature vector X.
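In practice, this posterior rule is applied through the library's Gaussian Naive Bayes model, wrapped (as described in the introduction) in a class exposing fit and predict. A minimal sketch using scikit-learn follows; the wrapper class name Bayes is illustrative:

```python
from sklearn.naive_bayes import GaussianNB

class Bayes:
    """Library Gaussian Naive Bayes, wrapped to match the common class structure."""

    def __init__(self):
        # Step 1: initialize the Gaussian Naive Bayes model.
        self.model = GaussianNB()

    def fit(self, X_train, y_train):
        # Step 2: fit the model with the training data.
        self.model.fit(X_train, y_train)
        return self

    def predict(self, X_test):
        # Step 3: predict the labels for the test set using the trained model.
        return self.model.predict(X_test)
```

GaussianNB estimates a per-class mean and variance for each feature and applies the posterior rule above under the conditional-independence assumption.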
Algorithm 3: Naive Bayes
Input: Training set X_train, training labels y_train, test set X_test
Output: Predicted labels y_pred
1  Step 1: Initialize the Gaussian Naive Bayes model;
2  Step 2: Fit the model with the training data X_train and y_train;
3  Step 3: Predict the labels for X_test using the trained model;

2.3. K-Nearest Neighbors (KNN) Classifier

The KNN classifier classifies a sample based on the majority label among its k nearest neighbors in the training set. The distance metric used is typically the Euclidean distance:

d(x, x') = √( Σ_j (x_j − x'_j)² ),

where the sum runs over all features j.

Algorithm 4: KNN Algorithm
Input: Training set X_train, training labels y_train, test set X_test, number of neighbors k
Output: Predicted labels y_pred
1  for each sample x in X_test do
2      Compute the distances between x and all samples in X_train;
3      Identify the k nearest neighbors;
4      Assign the label based on the majority vote of the neighbors;

3. Experiments

3.1. Dataset Description

The dataset includes records of patients along with their medical attributes and the presence or absence of heart disease. It contains 13 columns with different attributes: age, sex, number of major vessels, chest pain type, resting blood pressure, cholesterol, maximum heart rate achieved, fasting blood sugar, resting electrocardiographic results, exercise, slope, and thal rate, together with the final column that we compare against (the target variable). All records were first normalized and then subjected to further tests. The normalization function was based on the standard min-max algorithm [10].

3.2. Data Splitting and Testing

To evaluate the performance of our classifiers, we split the dataset into a training set and a test set. This is a crucial step to ensure that the model can generalize well to unseen data. We used the `train_test_split` function from the `sklearn.model_selection` library for this purpose:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

This function performs the following tasks:

• Input parameters:
  – X: the feature matrix containing the input data for all samples.
  – y: the target vector containing the labels for all samples.
  – test_size=0.35: specifies the proportion of the dataset to include in the test split. Here, 35% of the data is allocated for testing, and the remaining 65% is used for training.
  – random_state=42: ensures reproducibility of the results. By setting a specific random state, we guarantee that the same split is generated every time the code is run.

• Outputs:
  – X_train: the feature matrix for the training set.
  – X_test: the feature matrix for the test set.
  – y_train: the target vector for the training set.
  – y_test: the target vector for the test set.

By splitting the data into training and testing sets, we can train the model on one subset of the data and evaluate its performance on another, independent subset. This approach helps in assessing how well the model can generalize to new, unseen data and is an essential part of model validation in machine learning.

3.3. Results Analysis

To compare the performance parameters of the applied algorithms, we used the metrics module from the `sklearn` library. The dataset, containing numerical values for 13 different patient attributes with a total of 303 records, was divided into training and testing sets in a 65:35 ratio.
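Algorithm 4 above can be implemented in a few lines. The following sketch uses Euclidean distance and majority voting; the class name and defaults are illustrative, not necessarily the exact code used in the experiments:

```python
import numpy as np
from collections import Counter

class KNN:
    """Minimal K-Nearest Neighbors classifier following Algorithm 4."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X_train, y_train):
        # KNN is a lazy learner: fitting simply stores the training data.
        self.X = np.asarray(X_train, dtype=float)
        self.y = np.asarray(y_train)
        return self

    def predict(self, X_test):
        preds = []
        for x in np.asarray(X_test, dtype=float):
            # Euclidean distances between x and all training samples.
            dists = np.sqrt(((self.X - x) ** 2).sum(axis=1))
            # Indices of the k nearest neighbors.
            nearest = np.argsort(dists)[: self.k]
            # Majority vote among the neighbors' labels.
            preds.append(Counter(self.y[nearest]).most_common(1)[0][0])
        return np.array(preds)
```

Because every prediction scans the whole training set, this illustrates why KNN is the most computationally expensive of the three classifiers.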
For each algorithm, we compared the following parameters:

• precision - the ratio of correctly predicted elements of a class to all elements marked as that class
• recall - a measure of how many elements of a given class were correctly recognized
• f1-score - the harmonic mean of precision and recall
• support - the number of occurrences of each class in the dataset
• accuracy - the ratio of correctly classified samples to all cases in the test set

Meaning of labels:

• TP - true positive - cases correctly classified as positive by the classifier
• TN - true negative - cases correctly classified as negative by the classifier
• FP - false positive - an error where the result incorrectly indicates the presence of a condition that is not present
• FN - false negative - an error where the result incorrectly indicates the absence of a condition that is actually present

3.4. Results

In the confusion matrices (Fig. 1), the outputs are 0 and 1, where 0 denotes a low chance of heart attack and 1 a higher chance. In the classification reports, produced with the `sklearn` library, the 0 value is changed to -1 (Tab. 1, 2, 3). Analyzing the results shown in the matrices and tables, we can observe that all three algorithms have lower precision when identifying a low chance of heart attack.

Figure 1: Comparison of Different Classifiers

As observed, the Soft Set algorithm struggles the most, with the lowest accuracy of 70% (see Tab. 3). With only a 1% advantage in accuracy, the K-Nearest Neighbors classifier performs better than the Naive Bayes algorithm, whose accuracy is 83% (see Tab. 2).
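The metrics listed above follow directly from the TP, TN, FP, and FN counts. A small sketch makes the definitions concrete; the counts in the usage example are illustrative, not the experiment's actual confusion matrix:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1-score, and accuracy from confusion counts."""
    precision = tp / (tp + fp)   # of samples predicted positive, how many truly are
    recall = tp / (tp + fn)      # of truly positive samples, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # fraction correct overall
    return precision, recall, f1, accuracy
```

For illustration, counts of tp=50, tn=40, fp=10, fn=7 over 107 samples give an accuracy of 90/107 ≈ 0.84.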
Table 1
Accuracy when the model is trained with KNN: 84.11%

Class         Precision  Recall  F1-score  Support
-1.0          0.78       0.86    0.82      44
1.0           0.90       0.83    0.86      63
Accuracy                         0.84      107
Macro avg     0.84       0.84    0.84      107
Weighted avg  0.85       0.84    0.84      107

Table 2
Accuracy when the model is trained with Bayes: 83.18%

Class         Precision  Recall  F1-score  Support
-1.0          0.76       0.86    0.81      44
1.0           0.89       0.81    0.85      63
Accuracy                         0.83      107
Macro avg     0.83       0.84    0.83      107
Weighted avg  0.84       0.83    0.83      107

Table 3
Accuracy when the model is trained with Soft Set: 70.09%

Class         Precision  Recall  F1-score  Support
-1.0          0.60       0.82    0.69      44
1.0           0.83       0.62    0.71      63
Accuracy                         0.70      107
Macro avg     0.71       0.72    0.70      107
Weighted avg  0.74       0.70    0.70      107

4. Conclusion

This study presented a comparative analysis of three different classifiers for heart disease prediction. The Soft Set Classifier, while effective in handling uncertainty, showed moderate accuracy of 70%. The Naive Bayes classifier demonstrated high accuracy of 83%, making it a strong candidate for medical diagnostics. The K-Nearest Neighbors classifier also performed well, with an accuracy of 84%. These results provide valuable insights into the strengths and limitations of each classifier, guiding future research and application in medical diagnostics.

In interpreting these results, we must remember that the Naive Bayes classifier was not written by us. We can only speculate about the results an independently written Naive Bayes algorithm would give, and about the results that library versions of the K-Nearest Neighbors and Soft Set classifiers would bring. Improvements we can make in the future are to write our own Naive Bayes algorithm and check its accuracy, and to rework the Soft Set algorithm so that it reaches higher accuracy. In addition, to boost accuracy we can compare all three algorithms against their library counterparts and eliminate the weak points that keep the accuracy lower than needed.

References

[1] H. Arghandabi, P.
Shams, A comparative study of machine learning algorithms for the prediction of heart disease, International Journal for Research in Applied Science and Engineering Technology 8 (2020) 677–683. doi:10.22214/ijraset.2020.32591.
[2] K. Uyar, A. Ilhan, Diagnosis of heart disease using genetic algorithm based trained recurrent fuzzy neural networks, Procedia Computer Science 120 (2017) 588–593. doi:10.1016/j.procs.2017.11.283.
[3] I. Rojek, P. Kotlarz, M. Kozielski, M. Jagodziński, Z. Królikowski, Development of AI-based prediction of heart attack risk as an element of preventive medicine, Electronics 13 (2024). doi:10.3390/electronics13020272.
[4] R. J. A. Laxamana, J. M. M. Vale, Heart attack prediction using machine learning algorithms, Journal of Electrical Systems 20 (2024) 1428–1436. doi:10.52783/jes.2474.
[5] S. K. Gupta, A. Shrivastava, S. P. Upadhyay, P. Chaurasia, A machine learning approach for heart attack prediction, International Journal of Engineering and Advanced Technology 10 (2021) 124–134. doi:10.35940/ijeat.F3043.0810621.
[6] K. Oliullah, A. Barros, M. Whaiduzzaman, Analyzing the effectiveness of several machine learning methods for heart attack prediction, 2023, pp. 225–236. doi:10.1007/978-981-19-9483-8_19.
[7] P. Majeed, H. A. Shareef, H. M. Darwesh, Three classes of soft functions via soft-open sets and soft-closed sets, Wasit Journal of Pure Sciences 3 (2024) 1–17. doi:10.31185/wjps.288.
[8] P. Langley, W. Iba, K. Thompson, et al., An analysis of Bayesian classifiers 90 (1992) 223–228.
[9] K. Prokop, Grey wolf optimizer combined with k-NN algorithm for clustering problem, in: IVUS 2022: 27th International Conference on Information Technology, 2022.
[10] M. Shantal, Z. Othman, A novel approach for data feature weighting using correlation coefficients and min–max normalization, Symmetry 15 (2023) 2185. doi:10.3390/sym15122185.