<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IDDM-</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Algorithms for Classification and Prediction of Heart Disease</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nataliya Boyko</string-name>
          <email>nataliya.i.boyko@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Dosiak</string-name>
          <email>iryna.dosiak.knm.2018@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Profesorska Street 1, Lviv, 79013</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>4</volume>
      <fpage>19</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>The study aims to improve the effectiveness of health care in various ways. The paper considers ML algorithms that allow health professionals to allocate resources optimally and physicians to choose the best treatment options for patients. This approach will reduce the burden on doctors, increase and accelerate patients' access to health care, save resources, and reduce costs. The paper presents the results of research that will allow the use of smaller data sets to develop transparent models. The paper uses a naive Bayes classifier to predict heart disease. The advantage of this approach is that the sample size requirements are reduced from exponential to linear, which is very important. There is an overview of the classification model, its advantages and disadvantages. Materials and methods are also analyzed.</p>
        <p>Keywords: model, classification, machine learning, algorithm, Bayes classifier</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Machine Learning (ML) algorithms allow healthcare professionals to allocate resources optimally
and physicians to choose the best treatment options for patients. This approach reduces the burden on
doctors, increases and accelerates patients' access to health care, saves resources, and reduces costs.
However, despite the achievements of ML research in medicine, its role is currently limited. Creating
and testing a model may require large amounts of high-quality data. Besides, diagnostic models must
be built individually for each disease. It is a lengthy process. In addition, the psychological aspect of
trusting black box algorithms can also be difficult to perceive. However, continuing ML research may
allow using smaller data sets and developing more transparent models [
        <xref ref-type="bibr" rid="ref4">4, 13</xref>
        ].
      </p>
      <p>
        The nature of heart disease is complex. In addition, the diagnosis of heart disease in most cases
depends on a complex combination of clinical and pathological data. The relationship between the
real cause of the disorder and the effects of spontaneous symptoms in patients can often be hidden and
not obvious [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>That is why the analysis of medical data in health care is considered an important but complex task
that must be performed accurately and effectively. In addition, the study of medical data is necessary
to avoid medical error.</p>
      <p>The basis of medical diagnosis is the problem of classification: the diagnosis comes down to the
problem of mapping the data to one of N different results.</p>
      <p>The study aims to apply and compare the Naive Bayes classifier using two existing models:
the Gaussian model and the Multinomial model. This study will focus on the comparative analysis,
differences, capabilities, and effectiveness of the classifier with these different models.</p>
      <p>The purpose of classifying heart disease is to diagnose a disease in a patient based on specific
diagnostic measurements included in the data set. In addition, the work will consist of searching for
significant features and patterns between the various factors influencing the diagnosis.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Review of literature sources</title>
      <p>For a detailed study of these tasks, you need to read and analyze the experience of scientists in this
field. Since the problem is relevant, numerous studies have been conducted that have focused on
diagnosing heart disease in combination with or without another condition.</p>
      <p>
        • G. Parthiban, A. Rajesh, S.K. Srivatsa predicted the chances of people with diabetes having
heart disease and highlighted the results in their article "Diagnosis of Heart Disease for
Diabetic Patients using Naive Bayes Method," published in the International Journal of
Computer Applications [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The accuracy was 74%.
• G. Subbalakshmi, K. Ramesh, and M. Chinna Rao developed a
system that extracts hidden knowledge from a historical heart disease database using Naive
Bayes classification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The article "Decision Support in Heart Disease Prediction System
using Naive Bayes" was published in the Indian Journal of Computer Science and
Engineering.
• Jyoti Soni, Ujma Ansari, Dipesh Sharma, Sunita Soni conducted a study and compared KNN
and the Naive Bayes classifier to predict heart disease [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The accuracy of the
results reached 45.6% for KNN and 52.33% for the Naive Bayes classifier. Their
article "Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease
Prediction" was published in the International Journal of Computer Applications. In conclusion,
they noted the need to improve the proposed study.
• Vincy Cherian and Bindu M.S developed a heart disease prediction system using a Naive
Bayes classifier and a Laplace smoothing technique [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. They reported this in their article
"Improved Study of Heart Disease Prediction System using Data Mining Classification
Techniques." They achieved high accuracy. However, the system has a limit on the number of
attributes (symptoms).
      </p>
      <p>Unfortunately, searches for such studies among Ukrainian sources did not yield any results.</p>
      <p>Thus, various studies only represent the effectiveness of predicting heart disease using ML
methods. This study aims to find features and patterns between different factors that affect the
diagnosis using a Naive Bayes classifier.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods overview</title>
      <p>Classification solves the following problem: let there be a set of objects divided into classes
according to one or more criteria. Moreover, a finite set of objects is given for which it is known to which classes
they belong. Such a set is considered the training sample. It is unknown to which class the other
objects belong. We need to build an algorithm that can classify any object of the source set, that is, specify
the number or name of the class to which it belongs [9, 11].</p>
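      <p>As an illustration only (not part of the paper), the setting above can be sketched in Python: a labeled training sample is used to build an algorithm a that assigns a class to any new object. The nearest-class-mean rule and the class names used here are invented for the sketch:</p>

```python
# Sketch: build a classifier a(x) from a labeled training sample.
# The nearest-class-mean rule is only an illustration of the setting.

def fit_nearest_mean(training_sample):
    """training_sample: list of (object, class) pairs; objects are numbers."""
    sums, counts = {}, {}
    for x, y in training_sample:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    means = {y: sums[y] / counts[y] for y in sums}
    # The algorithm a: name the class whose mean is closest to x.
    return lambda x: min(means, key=lambda y: abs(x - means[y]))

a = fit_nearest_mean([(1.0, "healthy"), (2.0, "healthy"),
                      (8.0, "sick"), (9.0, "sick")])
print(a(1.5))  # -> healthy
print(a(8.5))  # -> sick
```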
    </sec>
    <sec id="sec-4">
      <title>3.1. A mathematical formulation of the classification problem</title>
      <p>Let X be a set of object descriptions, and Y a set of class numbers or names. There is an unknown target
relationship, a mapping y*: X → Y, whose values are known only on the elements of the finite
training sample X^m = {(x_1, y_1), ..., (x_m, y_m)}. We need to build an algorithm a: X → Y that can classify
an arbitrary object x ∈ X [12].</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Bayes classifier</title>
      <p>A Bayes classifier provides a classification with a degree of confidence rather than simply issuing
the most plausible class. Bayes' theorem is used to determine the degree of certainty.</p>
      <p>Bayes' theorem describes the probability of an event given the circumstances that may affect the
event. Thus, you can more accurately calculate the probability, considering both already known
information and data from new observations [14].</p>
      <p>A Naive Bayes classifier adds an assumption about the independence of features. In other words,
it assumes that the presence of any feature in a class is unrelated to the presence of any other feature.</p>
    </sec>
    <sec id="sec-6">
      <title>3.3. Method overview</title>
      <p>As mentioned, the Bayes classifier is based on the Bayes theorem, which describes the probability
of an event, given the circumstances that may affect the event [14].</p>
      <p>Suppose there is a symptom S. In addition, there are classes (diseases) C to which the
symptom may belong. It is necessary to find the class (disease) C for which the probability of this observation is
maximum. The mathematical notation is given in Formula 1:</p>
      <p>C = argmax_C P(C|S). (1)</p>
      <p>It is hard to calculate P(C|S) directly. However, you can use Bayes' theorem and go to Formula 2:</p>
      <p>P(C|S) = P(S|C) P(C) / P(S), (2)</p>
      <p>where P(C) is the a priori probability, the probability of meeting the class among all the data;
P(S|C) is the conditional probability, the probability of the symptom in each class;
P(S) is the total probability, the probability of the symptom.</p>
      <p>Usually, it makes no sense to work with one symptom. It is much more effective to detect the
disease on several grounds. Thus, for symptoms S_1, ..., S_n, Formula 2 takes the form of Formula 3:</p>
      <p>P(C|S_1, ..., S_n) = P(S_1, ..., S_n|C) P(C) / P(S_1, ..., S_n). (3)</p>
      <p>Since you need to find the maximum of the function, the denominator can be ignored (it is a
constant). It is also necessary to include the "naive" assumption that the symptoms S_i depend only on
class C and do not depend on each other. Then the numerator takes the form of Formula 4:</p>
      <p>P(S_1, ..., S_n|C) P(C) = P(C) P(S_1|C) ... P(S_n|C). (4)</p>
      <p>So, the final formula looks like Formula 5:</p>
      <p>C = argmax_C P(C) P(S_1|C) ... P(S_n|C). (5)</p>
      <p>So it all comes down to calculating the probabilities P(C) and P(S|C). Calculating these parameters
is called classifier training.</p>
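      <p>A minimal sketch (not the paper's code) of this training step: P(C) and P(S|C) are estimated from counts, and prediction picks the class that maximizes Formula 5. The toy symptom records are invented for illustration:</p>

```python
from collections import Counter, defaultdict

# Toy training data (invented): each record is (set of symptoms, class).
records = [
    ({"chest_pain", "fatigue"}, "disease"),
    ({"chest_pain", "high_bp"}, "disease"),
    ({"fatigue"}, "healthy"),
    ({"high_bp"}, "healthy"),
    ({"chest_pain"}, "disease"),
]

# Training = estimating P(C) and P(S|C) from frequency counts.
class_counts = Counter(c for _, c in records)
symptom_counts = defaultdict(Counter)
for symptoms, c in records:
    for s in symptoms:
        symptom_counts[c][s] += 1

def predict(symptoms):
    """Return argmax over classes of P(C) * product of P(S_i|C)."""
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / len(records)                  # P(C)
        for s in symptoms:
            score *= symptom_counts[c][s] / n_c     # P(S_i|C), naive assumption
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict({"chest_pain"}))  # -> disease
```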
    </sec>
    <sec id="sec-7">
      <title>3.4. Multinomial Naive Bayes</title>
      <p>
        Multinomial Naive Bayes implements a Naive Bayes algorithm for multinomial distributed data
and is one of two classic variants of Naive Bayes [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>This algorithm puts forward a second independence assumption - the assumption of positional
independence: the conditional probability of a symptom's occurrence is independent of its position in
the data sample [9].</p>
      <p>The data is usually presented as a vector. The basic idea is that each unique feature (symptom) that
occurs is assigned a unique integer. Therefore the data can be represented as a sequence of numbers.</p>
      <p>The distribution is parameterized by a vector θ_C = (θ_C1, ..., θ_Cn) for each
class C, where n is the number of features (symptoms), and θ_Ci is the probability of feature i appearing in
a sample belonging to class C.</p>
      <p>The parameter θ_Ci is estimated by a smoothed version of maximum likelihood, i.e., the relative
frequency count (Formula 6):</p>
      <p>θ̂_Ci = (N_Ci + α) / (N_C + αn), (6)</p>
      <p>where N_Ci is the number of times feature i appears in a class C sample in the training set;
N_C is the total count of all features (symptoms) for class C;
α is the Laplace smoothing parameter.</p>
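      <p>Formula 6 maps directly to a few lines of code. This is an illustrative sketch (not the paper's code); the argument names n_ci, n_c, n_features and alpha follow the notation above:</p>

```python
def smoothed_theta(n_ci, n_c, n_features, alpha=1.0):
    """Formula 6: Laplace-smoothed estimate of the probability theta_Ci.

    n_ci       -- times feature i appears in class C samples
    n_c        -- total count of all features for class C
    n_features -- number of distinct features n
    alpha      -- Laplace smoothing parameter
    """
    return (n_ci + alpha) / (n_c + alpha * n_features)

# An unseen feature (n_ci = 0) still gets a small nonzero probability:
print(smoothed_theta(0, 10, 5))  # 1/15
print(smoothed_theta(3, 10, 5))  # 4/15
```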
    </sec>
    <sec id="sec-8">
      <title>4. Review and analysis of data</title>
      <p>
        The data set about heart disease "heart.csv" is used for research [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It was taken from Kaggle. This
database contains 76 attributes, but all published experiments use a subset of 14 of them, as
the rest of the information identifies individuals. The total is 303 rows and 14
columns, of which 165 rows correspond to patients with heart disease [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Attribute information:
1. age;
2. sex (1 = male; 0 = female);
3. cp: chest pain type (4 values);
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital);
5. chol: serum cholesterol in mg/dl;
6. fbs: fasting blood sugar &gt; 120 mg/dl (1 = true; 0 = false);
7. restecg: resting electrocardiography results (values 0, 1, 2);
8. thalach: the maximum heart rate achieved;
9. exang: exercise-induced angina (1 = yes; 0 = no);
10. oldpeak: ST depression induced by exercise relative to rest;
11. slope: the slope of the peak exercise ST segment;
12. ca: the number of major vessels (0-3) colored by fluoroscopy;
13. thal: thalassemia (1 = normal; 2 = fixed defect; 3 = reversible defect);
14. target: 1 = heart disease; 0 = no heart disease.</p>
      <p>Fig. 1, Fig. 2, and Fig. 3 show the data set.</p>
    </sec>
    <sec id="sec-9">
      <title>4.1. Search for the correlation of heart disease with different parameters</title>
      <p>To find the links of heart disease with different parameters, we need to build a correlation matrix
(Fig. 7). Among the observations:</p>
      <p>• Trestbps (resting blood pressure) and fbs (fasting blood sugar) are negatively correlated.
Moreover, the correlation is lower for women compared to men. For these observations, the accuracy
of the conclusions should be checked, taking into account the distribution of data between men and
women (Fig. 9).</p>
      <p>• A value of 0, the probable presence of hypertrophy, does not in itself
indicate the presence of heart disease.</p>
      <p>• By itself, the blood sugar level feature does not give confidence in the presence or absence
of heart disease. However, we will not abandon this feature, as it can be helpful together with other
variables.</p>
      <p>• Chest pain also does not give an unambiguous answer. It is challenging to tell whether a patient has
heart disease based on this symptom alone.</p>
      <p>To verify the accuracy of the conclusions, you should use PCA (principal component analysis), which helps extract a smaller set of
variables from an existing large set of variables. These extracted variables are called principal
components.</p>
      <p>Because the data set is small and does not have many features, only two components are used to see
how much variance they cover.</p>
      <p>The study can explain approximately 90% of the variance in the data set using only two
components. Fig. 14 presents each of these decomposed components:</p>
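      <p>As a sketch of this step (with synthetic correlated data standing in for the heart features, so the numbers will not match the paper's 90%), sklearn's PCA reports the fraction of variance each component explains:</p>

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: five strongly correlated columns (stand-in for heart.csv).
base = rng.normal(size=(303, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(303, 1)) for _ in range(5)])

pca = PCA(n_components=2)
pca.fit(X)
explained = pca.explained_variance_ratio_
# Sum of the two ratios = share of the total variance the components cover.
print(explained.sum())
```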
    </sec>
    <sec id="sec-10">
      <title>4.2. Application of the Naive Bayes classifier</title>
      <p>The next step is to divide the data into training and test sets in an 80% to 20% ratio. You should also encode and scale
the data with OneHotEncoder and MinMaxScaler [10].</p>
      <p>OneHotEncoder is a strategy in which each value of a category is converted into a new column,
which is assigned a value of 1 or 0 (notation for true/false). Fig. 15 shows an example of the strategy.</p>
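      <p>The strategy can be sketched in plain Python (an illustration; in the experiments, sklearn's OneHotEncoder performs this step). The column prefix cp_ is chosen here just as an example:</p>

```python
def one_hot(values):
    """Convert a list of category values into 1/0 columns, one per category."""
    categories = sorted(set(values))
    # Each row becomes: new column -> 1 if the row had that value, else 0.
    return [{f"cp_{c}": int(v == c) for c in categories} for v in values]

rows = one_hot([0, 2, 1, 2])       # e.g. the 'cp' (chest pain type) column
print(rows[1])                     # {'cp_0': 0, 'cp_1': 0, 'cp_2': 1}
```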
      <p>From the confusion matrix, you can determine the positive predictive value
(precision), the probability of detection (recall), and the combined measure (f1_score), based on:
• TP - true-positive decisions;
• TN - true-negative decisions;
• FP - false-positive decisions;
• FN - false-negative decisions.</p>
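      <p>From these four counts the metrics follow directly. A small sketch with invented counts (not the paper's results):</p>

```python
def metrics(tp, tn, fp, fn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)       # positive predictive value
    recall = tp / (tp + fn)          # probability of detection
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Hypothetical counts for illustration only:
p, r, f1, acc = metrics(tp=28, tn=22, fp=5, fn=6)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))
```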
      <p>The next step is to use the metrics for this method. The results are shown in Fig. 18.</p>
    </sec>
    <sec id="sec-11">
      <title>4.3. Application of the Multinomial Naive Bayes classifier</title>
      <p>To implement the classification, you should use MultinomialNB from the sklearn library with
different random states when splitting the data.</p>
      <p>The score function from the sklearn library is used to evaluate the results.</p>
      <p>The obtained results are presented in Fig. 21.</p>
      <p>Thus, the average estimate of the Multinomial Naive Bayes classifier for random states from 0 to
200 is 0.82. The highest score is 0.849039016334426.</p>
      <p>The next step is to reduce the number of attributes to 7. The results of the experiments are shown
in Fig. 25.</p>
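      <p>The experiment can be sketched with sklearn. Synthetic non-negative integer data stands in for heart.csv here, so the scores will not match the paper's 0.82 average; the loop over random states mirrors the procedure described above:</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# Synthetic stand-in for the heart data: non-negative integer features.
X = rng.integers(0, 4, size=(303, 13))
y = (X.sum(axis=1) > 19).astype(int)   # arbitrary rule for a binary target

scores = []
for state in range(0, 201):            # random states 0..200
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=state)
    clf = MultinomialNB().fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))   # accuracy on the 20% test split

print(round(float(np.mean(scores)), 3), round(float(max(scores)), 3))
```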
    </sec>
    <sec id="sec-12">
      <title>5. Discussion of experimental results</title>
      <p>To study the accuracy of the two classification models, we use a set of data on heart disease.</p>
      <p>Table 1 summarizes the characteristics of the data set used in the experiments.</p>
      <p>Figure 27: Comparison of estimates of two methods of the Naive Bayes classifier:</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th/><th>GaussianNB</th><th>MultinomialNB</th></tr>
          </thead>
          <tbody>
            <tr><td>Accuracy, 14 features</td><td>0.84426229</td><td>0.85012901</td></tr>
            <tr><td>Accuracy, 7 features</td><td>0.765245901</td><td>0.830100260</td></tr>
            <tr><td>Precision</td><td>0.828571</td><td>0.852941</td></tr>
            <tr><td>F1_score</td><td>0.852941</td><td>0.865672</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-12-6">
        <title>Comparison of the two Naive Bayes models</title>
        <p>Table 5 shows the execution time of the classification of different Naive Bayes models. On the
same data set, MultinomialNB performs training faster, which again emphasizes its advantage for the
selected data set.</p>
        <p>It is also noticeable that as the number of features decreases, the time decreases (Fig. 28).</p>
        <p>Analyzing Fig. 28, we can conclude that the Multinomial Bayes classifier is more accurate and
faster for the selected data set.</p>
        <p>So, the choice of the Naive Bayes model depends on the data. Multinomial Naive
Bayes is appropriate if the data consists of counts, so that observations can only take non-negative
integer values. It is better to use Gaussian NB for continuous (decimal) features, since GNB assumes features that
follow a normal distribution.</p>
        <p>For the selected data set, which contains features for diagnosing heart disease, the Multinomial
Naive Bayes showed better results. Using this method, we can achieve greater accuracy and reduce
the time to perform training.</p>
        <p>Analyzing the study results, it is worth emphasizing the importance of choosing the correct method
of the naive classifier. It helps achieve better classification results, which is critical in the medical
field.</p>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>6. Conclusion</title>
      <p>The paper considered the relevance of the topic: the use of data mining methods for diagnosing the
disease in a patient on a set of indicators, such as symptoms, test results, and other indicators.</p>
      <p>We used the Heart data set for the study, which we cleaned of outliers and Null values and
normalized. We also performed a search for and analysis of significant features and patterns between
different factors influencing heart disease.</p>
      <p>In addition, we used two algorithms in this work, which objectively showed the classification
results on the selected dataset.</p>
      <p>The parameters used for the analysis were feature selection and removal. We first
tested a classifier with all the features and then gradually reduced the set to determine which
algorithm classifies best with fewer features.</p>
      <p>The simulation results show that the Multinomial Naive Bayes classifier has better accuracy than
the Gaussian method with the same data set and parameters. In addition, it reduces training time,
which is very important because the annual growth of data in medicine is increasing very rapidly.</p>
      <p>In future work, it is worth considering two aspects. Namely, we can compare more algorithms to
achieve better results and potentially introduce a better Naive Bayes variant. Moreover, we can
try to evaluate the effectiveness of these algorithms to justify their use in the health care system.</p>
    </sec>
    <sec id="sec-14">
      <title>7. References</title>
      <p>[9] S. Kharya, S. Soni, Weighted naive Bayes classifier: A predictive model for breast cancer
detection, in: International Journal of Computer Applications, 133(9) (2016): 32-37.
[10] A. Ashari, P. Iman, A. Min Tjoa, "Performance comparison between Naïve Bayes, decision
tree and k-nearest neighbor in searching alternative design in an energy simulation tool", in:
International Journal of Advanced Computer Science and Applications (IJACSA) (2013).
[11] N.D. Uma, Extraction of action rules for chronic kidney disease using Naive Bayes classifier, in:
IEEE International Conference on Computational Intelligence and Computing Research (2016).
[12] W. P. Castelli, Lipids, risk factors and ischaemic heart disease, Atherosclerosis (1996). doi:
10.1016/0021-9150(96)05851-0.
[13] W. F. Wilson, W. B. Kannel, H. Silbershatz, Clustering of Metabolic Factors and Heart Disease,
159(10) (1999): 1104. doi: 10.1001/archinte.159.10.1104.
[14] StatQuest with Josh Starmer - Naive Bayes. URL:
https://www.youtube.com/watch?v=O2L2Uv9pdDA&amp;ab_channel=StatQuestwithJoshStarmer.
[15] N. Boyko, Kh. Shakhovska, L. Mochurad, J. Campos, "Information System of Catering
Selection by Using Clustering Analysis", in: Proceedings of the 1st International Workshop on
Digital Content &amp; Smart Multimedia (DCSMart 2019), Lviv, Ukraine, December 23-25, (2019):
94-106.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Parthiban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.K.</given-names>
            <surname>Srivatsa</surname>
          </string-name>
          ,
          <article-title>Diagnosis of Heart Disease for Diabetic Patients using Naive Bayes Method</article-title>
          , in:
          <source>International Journal of Computer Applications</source>
          ,
          <volume>24</volume>
          (
          <issue>3</issue>
          ) (
          <year>2011</year>
          ). doi: 10.5120/2933-3887.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Subbalakshmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Chinna</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <article-title>Decision Support in Heart Disease Prediction System using Naive Bayes</article-title>
          , in:
          <source>Indian Journal of Computer Science and Engineering</source>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ) (
          <year>2011</year>
          ):
          <fpage>170</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Soni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Ansari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Soni</surname>
          </string-name>
          ,
          <article-title>Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction</article-title>
          , in:
          <source>International Journal of Computer Applications</source>
          , Vol.
          <volume>17</volume>
          (
          <issue>8</issue>
          ) (
          <year>2011</year>
          ):
          <fpage>43</fpage>
          -
          <lpage>48</lpage>
          . doi: 10.5120/2237-2860.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Cherian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.S.</given-names>
            <surname>Bindu</surname>
          </string-name>
          ,
          <source>Prediction Analysis of Cardiac Disease using Classification</source>
          (
          <year>2019</year>
          ). doi: 10.22214/ijraset.2019.6295.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kunanets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vasiuta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Boyko</surname>
          </string-name>
          ,
          <article-title>Advanced Technologies of Big Data Research in Distributed Information Systems</article-title>
          , in:
          <source>Proceedings of the 14th International Conference "Computer Sciences and Information Technologies" (CSIT 2019)</source>
          , Lviv, Ukraine, September 17-20 (2019): 71-76. doi: 10.1109/STC-CSIT.2019.8929756.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          Heart Database. URL: https://www.kaggle.com/zhaoyingzhu/heartcsv.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          Clinic Manufactory - Cardiovascular Diseases. URL: https://manufacturaclinica.com/blog/sertsevo-sudinni-zahvoryuvannya.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.J.</given-names>
            <surname>Loesche</surname>
          </string-name>
          ,
          <article-title>Periodontal disease as a risk factor for heart disease</article-title>
          , in:
          <source>Compendium</source>
          ,
          <volume>15</volume>
          (
          <issue>8</issue>
          ):
          <fpage>978</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>