<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Analysis of Machine Learning Algorithms for Classification and Prediction of Heart Disease</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Nataliya</forename><surname>Boyko</surname></persName>
							<email>nataliya.i.boyko@lpnu.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Lviv Polytechnic National University</orgName>
								<address>
									<addrLine>Profesorska Street 1</addrLine>
									<postCode>79013</postCode>
									<settlement>Lviv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Iryna</forename><surname>Dosiak</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Lviv Polytechnic National University</orgName>
								<address>
									<addrLine>Profesorska Street 1</addrLine>
									<postCode>79013</postCode>
									<settlement>Lviv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Analysis of Machine Learning Algorithms for Classification and Prediction of Heart Disease</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B52CA246F5797631B7BCB96FD6D4B772</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T01:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Model</term>
					<term>classification</term>
					<term>machine learning</term>
					<term>algorithm</term>
					<term>Bayes classifier</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The study aims to improve the effectiveness of health care in various ways. The paper considers ML algorithms that allow health professionals to allocate resources optimally and physicians to choose the best treatment options for patients. This approach will reduce the burden on doctors and increase and accelerate patients' access to health care, save resources and reduce costs. The paper presents the results of research that will allow the use of smaller data sets to develop transparent models. The report uses a naive Bayes classifier to predict heart disease. The advantage of this approach is that the sample size requirements are reduced from exponential to linear, which is very important. There is an overview of the classification model, its advantages and disadvantages. Materials and methods are also analyzed.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Machine Learning (ML) algorithms allow healthcare professionals to allocate resources optimally and physicians to choose the best treatment options for patients. This approach reduces the burden on doctors, increases and accelerates patients' access to health care, saves resources, and reduces costs. However, despite the achievements of ML research in medicine, its role is currently limited. Creating and testing a model may require large amounts of high-quality data. Besides, diagnostic models must be built individually for each disease. It is a lengthy process. In addition, the psychological aspect of trusting black box algorithms can also be difficult to perceive. However, continuing ML research may allow using smaller data sets and developing more transparent models <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b12">13]</ref>.</p><p>The nature of heart disease is complex. In addition, the diagnosis of heart disease in most cases depends on a complex combination of clinical and pathological data. The relationship between the real cause of the disorder and the effects of spontaneous symptoms in patients can often be hidden and not obvious <ref type="bibr" target="#b5">[6]</ref>.</p><p>That is why the analysis of medical data in health care is considered an important but complex task that must be performed accurately and effectively. In addition, the study of medical data is necessary to avoid medical error.</p><p>The basis of medical diagnosis is the problem of classification. The diagnosis comes down to the problem of displaying data to one of N different results.</p><p>The study aims to apply and implement the original Naive Bayes model with two existing models: the Gaussian model and the Multinomial model. 
This study will focus on a comparative analysis of the differences, capabilities, and effectiveness of the classifier with these models.</p><p>The purpose of classifying heart disease is to diagnose a disease in a patient based on specific diagnostic measurements included in the data set. In addition, the work involves searching for significant features and patterns among the various factors influencing the diagnosis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Review of literature sources</head><p>For a detailed study of these tasks, you need to read and analyze the experience of scientists in this field. Since the problem is relevant, numerous studies have been conducted that have focused on diagnosing heart disease in combination with or without another condition.</p><p>• G. Parthiban, A. Rajesh, S.K. Srivatsa predicted the chances of people with diabetes having heart disease and highlighted the results in their article "Diagnosis of Heart Disease for Diabetic Patients using Naive Bayes Method," published in the International Journal of Computer Applications <ref type="bibr" target="#b0">[1 ]</ref>. The accuracy was 74%. • Mrs. Mr. Subbalakshmi, Mr. K. Ramesh M. Tech, Mr. M. Chinna Rao M.Tech developed a system that extracts hidden knowledge from a historical heart disease database using a Naive Bayes classification <ref type="bibr" target="#b1">[2]</ref>. The article "Decision Support in Heart Disease Prediction System using Naive Bayes" was published in the Indian Journal of Computer Science and Engineering». • Jyoti Soni, Ujma Ansari, Dipesh Sharma, Sunita Soni conducted a study and compared KNN and the Naive Bayes classifier to predict heart disease <ref type="bibr" target="#b2">[3]</ref>. However, the accuracy of the results reached 45.6% for KNN and 52.33% in the case of the Naive Bayes classifier. Their article "Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction" was published in the International Journal of Computer Applications. In the end, they added the need to improve the proposed study. • Vincy Cherian and Bindu M.S developed a heart disease prediction system using a Naive Bayes classifier and a Laplace smoothing technique <ref type="bibr" target="#b3">[4]</ref>. They reported this in their article "Improved Study of Heart Disease Prediction System using Data Mining Classification Techniques." 
They achieved high accuracy; however, the system is limited in the number of attributes (symptoms). Unfortunately, searches for such studies among Ukrainian sources did not yield any results. Thus, existing studies only demonstrate the effectiveness of predicting heart disease using ML methods. This study aims to find features and patterns among the different factors that affect the diagnosis using a Naive Bayes classifier.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods overview</head><p>Classification solves the following problem: let there be a set of objects divided into classes on one or more grounds. Moreover, a finite set of objects is given, for which it is known to which classes they belong. Such a set is considered to be a training sample. It is unknown to which class the other objects belong. We need to build an algorithm that can classify any object of the source set -specify the number or name of the class to which it belongs <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b10">11]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">A mathematical formulation of the classification problem</head><p>Let X be a set of object descriptions, and Y be class numbers or names. There is an unknown target relationship -mapping , the values of which are known only on the elements of the finished training sample . We need to build an algorithm a , that can classify an arbitrary object <ref type="bibr" target="#b11">[12]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Bayes classifier</head><p>Bayes classifier -provides a classification with a degree of confidence rather than simply issuing the most plausible class. Bayes' theorem is used to determine the degree of certainty.</p><p>Bayes' theorem describes the probability of an event, given the circumstances that may affect the event. Thus, you can more accurately calculate the probability, considering both already known information and data from new observations <ref type="bibr" target="#b13">[14]</ref>.</p><p>A Naive Bayes classifier is an assumption about the independence of traits. In other words, the NCB assumes that any attribute in the class is not related to the presence of any other feature.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Method overview</head><p>As mentioned, the Bayes classifier is based on the Bayes theorem, which describes the probability of an event, given the circumstances that may affect the event <ref type="bibr" target="#b13">[14]</ref>.</p><p>Suppose there is a symptom S. In addition, there are classes (diseases) C, which should include the symptom. It is necessary to find a class (disease) C in which the probability for this line would be maximum. The mathematical notation is given in Formula 1.</p><p>(1)</p><p>It is hard to calculate P(C|O). However, you can use Bayes' theorem and go to (Formula 2):</p><formula xml:id="formula_0">, (<label>2</label></formula><formula xml:id="formula_1">)</formula><p>where P(С) -an a priori probability, the probability of meeting a class among all the data; P(O|C) -conditional probability, the probability of symptoms in each class; P(O) -total probability, probability of symptoms. Usually, it makes no sense to work with one symptom. It is much more effective to detect the disease on several grounds. Thus Formula 2 will take the form (Formula 3):</p><p>(3)</p><p>Since you need to find the function's maximum, the denominator can be ignored (this is a constant). It is also necessary to include a "naive" assumption that the symptoms of S depend only on class C and do not depend on each other. Then the numerator will take the form (Formula 4):</p><p>(4) So, the final formula will look like (Formula 5):</p><p>(5</p><formula xml:id="formula_2">)</formula><p>So it all comes down to calculating the probability P(C) and P(S|C). Calculating these parameters is called classifier training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Multinomial Naive Bayes</head><p>Multinomial Naive Bayes implements a Naive Bayes algorithm for multinomial distributed data and is one of two classic variants of Naive Bayes <ref type="bibr" target="#b7">[8]</ref>.</p><p>This algorithm puts forward a second assumption of independence -the assumption of positional independence. Conditional probabilities of symptom onset are equally independent of its position in the data sample <ref type="bibr" target="#b8">[9]</ref>.</p><p>The data is usually presented as a vector. The basic idea is that each unique feature (symptom) that occurs is assigned a unique integer. Therefore the data can be represented as a sequence of numbers.</p><p>The distribution of the number of vectors is parameterized by vectors for each class, where n -number of features (symptom), and the probability of the appearance in the sample of features belonging to class C.</p><p>The parameter is estimated by the smoothed version of the maximum probability. The relative frequency calculation (Formula 6):</p><formula xml:id="formula_3">, (<label>6</label></formula><formula xml:id="formula_4">)</formula><p>where -the number of times the 𝑖 character appears in a class C sample in the training set.;</p><p>-the total number of all features (symptoms) for class C; A -Laplace smoothing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Review and analysis of data</head><p>The data set about heart disease "heart.csv" is used for research <ref type="bibr" target="#b5">[6]</ref>. It was taken from Kaggle. This database contains 76 attributes, but all published experiments involve using a subset of 14 of them, as the rest of the information is the identification of individuals. The total number is 303 rows and 14 columns, of which 165 have heart disease <ref type="bibr" target="#b6">[7]</ref>.</p><p>Attribute information: 1. age;    As can be seen from this section, most values are usually categorized. All columns have no spaces, contain 303 rows of data.</p><p>An analysis of atypical emissions should also be conducted. To do this, use a standardized Z-Score score, which shows how many standard deviations is the scatter of the value relative to the observed average value. If the Z-Score value is greater than or less than 3 or -3, respectively, this data point will be defined as non-standard (Fig. <ref type="figure">4</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 4: Z-Score for atypical data</head><p>Fig. <ref type="figure">5</ref> shows that this data set contains two emissions. Let's try to visualize them. For this purpose, it is necessary to construct the box diagram to visualize atypical data (Fig. <ref type="figure">5</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 5: Visualization of atypical data in a dataset</head><p>Because only two sets of data that differed from the others were identified, so they were removed from the sample. This will help achieve better results in predicting heart disease.</p><p>The next step is to review the number of existing or absent diseases. To do this, determine the average number of different values for prediction by columns (Fig. <ref type="figure" target="#fig_4">6</ref>). Target variable: whether the patient has heart disease or not (value 0 -yes; value 1 -no). Fig. <ref type="figure" target="#fig_4">6</ref> shows that the distribution is balanced.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Search for the correlation of heart disease with different parameters</head><p>To find the links of heart disease with different parameters, we need to build a correlation matrix (Fig. <ref type="figure">7</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure7: Correlation matrix</head><p>Fig. <ref type="figure">7</ref> shows certain relationships between the features. It is first necessary to determine the difference between the correlation coefficients in men and women. The results are shown in Fig. <ref type="figure">8</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 8: Difference of correlation coefficients for different sexes</head><p>Figure <ref type="figure">8</ref> shows that all coefficients, except for the target variable, differ between men and women. The most noticeable difference for trestbps. This is the resting blood pressure in millimeters of mercury.</p><p>Most people have normal blood pressure in certain groups (these can be healthy adults, adults taking medication, the elderly). It also appears that very high blood pressure may indicate heart disease <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>.</p><p>Observations follow from the obtained results: 1. Age is negatively correlated with heart disease. Because older people are more likely to have heart disease, they are more likely to have a health check-up, even if they have mild or no symptoms. Young people go for a health check only when they have apparent symptoms. That is why they are more often diagnosed with heart disease. 2. Cholesterol and fasting blood glucose levels have little correlation with heart disease. 3. Chest pain (cp), maximal pulse (thalach), a tilt of the ST segment in the ECG are positively correlated with heart disease. 4. Exercise angina (exang), ST depression caused by exercise (oldpeak), the number of major vessels (0-3) stained with fluoroscopy (ca) are negatively correlated with heart disease. Moreover, in all these ratios, the correlation is lower for men than for women.</p><p>5. Trestbps (resting blood pressure) and fbs (fasting blood sugar) are negatively correlated. Moreover, the correlation is lower for women compared to men. For these observations, the accuracy of the conclusions should be checked, taking into account the distribution of data between men and women (Fig. <ref type="figure">9</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 9: Data distribution between women and men</head><p>Fig. <ref type="figure">9</ref> shows that women account for about half of the observations than men. You can also see that gender is a risk factor. Also, to verify the above statements, you should visualize the presence or absence of the disease depending on the age range (Fig. <ref type="figure" target="#fig_5">10</ref>).   In Fig. <ref type="figure" target="#fig_7">12</ref>, the x-axis represents the resting blood pressure in millimeters of mercury. The y-axis represents the density estimate. Yellow indicates the absence of the disease, red -the presence. The relationship between blood pressure and the female sex is on the left, on the right -between blood pressure and male.</p><p>In Fig. <ref type="figure" target="#fig_8">13</ref>. presents the presence or absence of the disease, taking into account only one featureone attribute. From Fig. <ref type="figure" target="#fig_8">13</ref>, the following observations follow: 6. The number of major vessels stained with fluoroscopy refers to the number of narrow vessels seen, so the higher the value of this feature, the greater the likelihood of heart disease. 7. A very invasive process for patients obtains the results of blood flow observed through the radioactive dye. But in themselves, they are excellent evidence of heart disease or not. 8. The slope of the ST segment can help determine if you have heart disease or not if it is flat or growing. 9. Angina is a good indicator of heart disease. However, we can also see that knowing what angina is and what it is not an easy task can be confused with other pains or atypical angina. 10. When someone has heart disease, the first symptom is usually stable angina (angina during exercise). When angina occurs even at rest, the condition worsens (typically narrowing the coronary arteries). 
That is why so few patients show abnormal heart rhythms at rest, and seeing this anomaly is highly indicative of the presence of heart disease.</p><p>11. On the other hand, a value of 0, the probable presence of hypertrophy, does not in itself indicate the presence of heart disease. 12. The blood sugar feature, by itself, does not give confidence in the presence or absence of heart disease. However, we will not abandon this feature, as it can be helpful together with other variables. 13. Chest pain also does not give an unambiguous answer. It is challenging to tell whether a patient has heart disease based only on these symptoms. To verify the accuracy of the conclusions, PCA should be used, which helps extract a small set of variables from an existing large set of variables. These extracted variables are called principal components.</p><p>Because the data set is small and does not have many features, only two components are used to see how much variance they cover.</p><p>The study can explain approximately 90% of the variance in the data set using only two components. Fig. <ref type="figure" target="#fig_9">14</ref> presents each of these decomposed components. Component 1: Fig. <ref type="figure" target="#fig_9">14</ref> shows that the weight is considerable and positive for chol and slightly positive for sex and cp. This means that patients with a high value of this component will have a meager chance of being diagnosed with heart disease. At the same time, people with more elevated serum cholesterol are more likely to be diagnosed with heart disease.</p><p>Component 2: Fig. <ref type="figure" target="#fig_9">14</ref> shows that the weight is considerable and negative for thalach (maximum heart rate reached) and slightly negative for cp (type of chest pain), chol (serum cholesterol), and slope (slope of the peak exercise ST segment).</p><p>Thus, high values of thalach, cp, slope, and chol mostly do not indicate heart disease.
People with high values of these components are much less likely to have heart disease. In contrast, age and high resting blood pressure (trestbps) may be among the first features of heart disease; in Fig. <ref type="figure" target="#fig_9">14</ref>, their weights are positive.</p></div>
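The two-component variance check described above can be sketched with scikit-learn's PCA. The feature matrix below is a synthetic stand-in for the heart data (five columns sharing one underlying factor), so the exact explained-variance figure is illustrative, not the paper's 90%.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the feature matrix: five columns that share one
# underlying factor, so two components should cover most of the variance.
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(100, 1)) for _ in range(5)])

# Keep only two principal components and measure the variance they explain,
# as done in the paper before inspecting the component weights (Fig. 14).
pca = PCA(n_components=2)
pca.fit(X)
explained = float(pca.explained_variance_ratio_.sum())
```

The component weights the paper discusses correspond to `pca.components_`, one row per extracted component.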
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Application of the Naive Bayes classifier</head><p>The next step is to divide the data into training and test in 80% to 20%. You should also normalize the data with OneHotEncoder and MinMaxScaler <ref type="bibr" target="#b9">[10]</ref>.</p><p>OneHotEncoder -a strategy in which each value of the category is converted into a new column, and it is assigned a value of 1 or 0 (notation for true/false). Fig. <ref type="figure" target="#fig_10">15</ref> shows an example of the strategy. For each value in the object, MinMaxScaler subtracts the minimum value and then divides it by range. The range is the difference between the initial maximum and the initial minimum. MinMaxScaler retains the shape of the original distribution.</p><p>After normalization, the classification should be performed. To implement the classification, you need to use GuassianNB from the sklearn library with different types of states when sharing data.</p><p>The score function from the sklearn library is used to evaluate the results, which returns the average accuracy of the given test data and labels. The results obtained are presented in Fig. <ref type="figure" target="#fig_4">16</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 16: Average classification scores using all attributes</head><p>Fig. <ref type="figure" target="#fig_4">16</ref> x-axis indicates the number of random states, y-axis -the average score for this method. Fig. <ref type="figure" target="#fig_4">16</ref> shows that the estimate ranges from 0.5 to 1.</p><p>Thus, the average estimate of the Naive Bayes classifier for random states from 0 to 200 is 0.844262295081968.</p><p>To illustrate the performance of the algorithm should build a matrix of inconsistencies (confusion matrix).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 17: Matrix of discrepancies</head><p>Figure <ref type="figure" target="#fig_1">17</ref> shows four different results: true positive, false positive, true negative, and false negative.</p><p>From the correlation matrix, you can determine the accuracy or positive predictive value (precision), the probability of detection (recall), and the completeness of the definition (f1_score).</p><p>• TP -true-positive decision;</p><p>• TN -true-negative decision;</p><p>• FP -false-positive decision;</p><p>• FN -false-negative decision. The next step is to use the metrics for this method. The results are shown in Fig. <ref type="figure" target="#fig_1">18</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 18: Measures of accuracy</head><p>We need to reduce the number of attributes to 10. To do this, we need to remove the parameters that have the most negligible impact on heart disease and apply the Naive Bayes classifier again. The results of the experiment are shown in Fig. <ref type="figure" target="#fig_11">19</ref>.  Thus, the average estimate of the Naive Bayes classifier for random states from 0 to 200 is 0.765245901639342.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Application of the Multinomial Naive Bayes classifier</head><p>To implement the classification, you should use MultinomialNB from the sklearn library with different states when sharing data.</p><p>The score function from the sklearn library is used to evaluate the results.</p><p>The obtained results are presented in Fig. <ref type="figure" target="#fig_2">21</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 21: Average classification scores using all attributes</head><p>In Fig. <ref type="figure" target="#fig_2">21</ref> x-axis indicates the number of random states, y-axis -the average score for this method. The figure shows that the score ranges from 0.7 to 1.</p><p>Thus, the average score of the Multinomial Naive Bayes classifier for random states from 0 to 200 is 0.850129016334426.</p><p>To illustrate the algorithm's performance, you need to build a matrix of inconsistencies (confusion matrix) (Fig. <ref type="figure" target="#fig_2">22</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 22: Matrix of discrepancies</head><p>It is also necessary to determine the accuracy of Fig. <ref type="figure" target="#fig_13">23</ref>.   Therefore, the average estimate of the Multinomial Naive Bayes classifier for random states from 0 to 200 is 0.83010.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion of experimental results</head><p>To study the accuracy of the two classification models, we use a set of data on heart disease. Table <ref type="table" target="#tab_0">1</ref> summarizes the characteristics of the data set used in the experiments. Table <ref type="table" target="#tab_1">.</ref> 1 mentions the features that have been tested to training algorithms. The data set was initially studied using 14 features. Subsequently, ten features were selected and rejected those that had the least impact on heart disease. And finally, seven features. The training of the three classification models are given in Table <ref type="table" target="#tab_1">.</ref> 2. We chose an 80:20 ratio because the Naive Bayes classifier could not benefit from retraining the data.   From Fig. <ref type="figure" target="#fig_17">26</ref>, it is noticeable that both methods show worse results when the number of features decreases. This is because we first rejected the symptoms, which had little effect on heart disease. Therefore, the accuracy is similar.</p><p>In addition, we can notice that the Multinomial classifier shows much better results when reducing the number of features. This advantage is because this method makes the second assumption of positional independence. Conditional probabilities of symptom onset are equally independent of its position in the data sample.</p><p>Let's take into account the nature of the chosen problem in the study, namely the values of "0" and "1" in the answers of the classifiers. We can conclude that the correct classification of first-class objects is, in our case, more critical. After all, it is better to do all the tests once again for a healthy person than not to recognize the disease in a sick person.</p><p>That is why it is worth emphasizing the recall score when comparing models with each other and choosing the best one. 
It estimates the proportion of correctly classified first-class objects. In addition, it is necessary to report the positive predictive value (precision) and the f1_score (Table <ref type="table" target="#tab_5">4</ref>). Analyzing the results, we can conclude that with the Multinomial Bayes classifier, the number of sick patients in whom the disease is detected is greater. Using this classification method, more people will receive a correct diagnosis and, therefore, will have a chance for treatment and recovery.</p><p>We also compare the running time of the two classification methods, namely the training time (Table <ref type="table">5</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5 Division of data into test and training</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Method</head><p>Time, s 14 features 10 7 features GaussianNB 0,01301 0,00902 0,00603 MultinomialNB 0,01196 0,00881 0.00399 Table <ref type="table">5</ref> shows the execution time of the classification of different Naive Bayes models. On the same data set, MultinomialNB performs training faster, which again emphasizes its advantage for the selected data set.</p><p>It is also noticeable that as the number of features decreases, the time decreases (Fig. <ref type="figure" target="#fig_19">28</ref>). Analyzing Fig. <ref type="figure" target="#fig_19">28</ref>, we can conclude that the Multinomial Bayes classifier is more accurate and faster for the selected data set.</p><p>So, the choice of using the Naive Bayes method depends on the data. The Multinomial Naive Bayes is appropriate if the data consists of calculations, and observations can only take non-negative integers. It is better to use the Gaussian NB for decimal features. GNB accepts features that correspond to the normal distribution.</p><p>For the selected data set, which contains features for diagnosing heart disease, the Multinomial Naive Bayes showed better results. Using this method, we can achieve greater accuracy and reduce the time to perform training.</p><p>Analyzing the study results, it is worth emphasizing the importance of choosing the correct method of the naive classifier. It helps achieve better classification results, which is critical in the medical field.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>The paper considered the relevance of using data mining methods for diagnosing a patient's disease from a set of indicators such as symptoms, test results, and other measurements.</p><p>We used the Heart data set for the study, which we cleaned of outliers and Null values and then normalized. We also searched for and analyzed significant features and patterns among the different factors influencing heart disease.</p><p>In addition, we used two algorithms in this work, which objectively showed the classification results on the selected data set.</p><p>The parameters varied in the analysis were feature selection and removal. We first tested each classifier with all the features and then gradually reduced the set to determine which algorithm classifies best with fewer features.</p><p>The simulation results show that the Multinomial Naive Bayes classifier achieves better accuracy than the Gaussian variant on the same data set with the same parameters. In addition, it reduces training time, which matters because the volume of medical data grows rapidly every year.</p><p>In future work, two aspects are worth considering. First, we can compare more algorithms to achieve better results and potentially introduce an improved Naive Bayes variant. Second, we can evaluate the effectiveness of these methods to justify their use in the health care system.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>2. sex: (1 = male; 0 = female); 3. cp: chest pain type (4 values); 4. trestbps: resting blood pressure (in mm Hg on admission to the hospital); 5. chol: serum cholesterol in mg/dl; 6. fbs: fasting blood sugar (1 = &gt;120 mg/dl; 0 = &lt;120 mg/dl); 7. restecg: resting electrocardiographic results (values 0, 1, 2); 8. thalach: the maximum heart rate achieved; 9. exang: exercise-induced angina (1 = yes; 0 = no); 10. oldpeak: ST depression induced by exercise relative to rest; 11. slope: the slope of the peak exercise ST segment; 12. ca: the number of major vessels (0-3) stained by fluoroscopy; 13. thal: thalassemia (1 = normal; 2 = fixed defect; 3 = reversible defect); 14. target: (1 = heart disease; 0 = no heart disease). Fig. 1, Fig. 2, and Fig. 3 show the dataset.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Image of the first five rows of data</figDesc><graphic coords="4,140.40,422.40,328.20,79.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Attributes overview. From Fig. 2, we can see that categorical data types are absent: all attributes are numeric, of type int and float.</figDesc><graphic coords="4,225.00,541.00,159.00,114.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The main data characteristics. As can be seen from this section, most values are essentially categorical. No column has missing values, and each contains 303 rows of data. An analysis of atypical outliers should also be conducted. To do this, we use the standardized Z-score, which shows how many standard deviations a value lies from the observed mean. If the Z-score of a data point is greater than 3 or less than -3, the point is flagged as an outlier (Fig. 4).</figDesc><graphic coords="5,89.80,72.00,429.60,103.20" type="bitmap" /></figure>
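The Z-score rule described above can be sketched as follows; the data here are hypothetical blood-pressure-like values with one planted outlier, not the real Heart columns:

```python
import numpy as np

def zscore_filter(values, threshold=3.0):
    """Keep points whose Z-score magnitude stays below the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) < threshold]

# Hypothetical sample: 100 values around 130 mm Hg plus one gross outlier.
rng = np.random.default_rng(1)
data = np.append(rng.normal(130.0, 10.0, size=100), 500.0)

clean = zscore_filter(data)  # the 500.0 point exceeds |Z| = 3 and is dropped
```

Note that with very small samples the maximum attainable |Z| is bounded, so the 3-standard-deviation cutoff only works reliably once the sample is reasonably large, as it is here (303 rows).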
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: An image of the mean values for the column that determines the presence or absence of the disease. Target variable: whether the patient has heart disease (value 0 = yes; value 1 = no). Fig. 6 shows that the distribution is balanced.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 10 :</head><label>10</label><figDesc>Figure 10: The amount of data corresponding to each age. In Fig. 10, the x-axis indicates the age of patients and the y-axis the number of patients of a given age. The graph shows that the youngest patient is 22 and the oldest is 77; the most common patient age is 58. There are few patients under 40 or over 70, so the age distribution is grouped as shown in Fig. 11.</figDesc><graphic coords="7,135.40,324.00,338.40,115.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 11 :</head><label>11</label><figDesc>Figure 11: The presence or absence of the disease in different age categories. Fig. 11 shows the age distribution in the ranges 0-40, 40-50, 50-60, 60-70, and 70-100 years. Green shows the presence of the disease, red the absence. The age ranges are arranged along the x-axis and the number of patients along the y-axis. In Fig. 12, the x-axis represents the age of the patients and the y-axis the density estimate. Yellow indicates the absence of the disease, red the presence. The relationship between age and the female sex is on the left, between age and the male sex on the right.</figDesc><graphic coords="7,202.00,528.80,205.20,137.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 12 :</head><label>12</label><figDesc>Figure 12: Relationship between blood pressure and sex. In Fig. 12, the x-axis represents the resting blood pressure in millimeters of mercury and the y-axis the density estimate. Yellow indicates the absence of the disease, red the presence. The relationship between blood pressure and the female sex is on the left, between blood pressure and the male sex on the right. Fig. 13 presents the presence or absence of the disease, taking into account only one feature (one attribute) at a time.</figDesc><graphic coords="8,151.80,72.00,305.40,145.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 13 :</head><label>13</label><figDesc>Figure 13: Relationship between the presence of the disease and other attributes. In Fig. 13, the x-axis shows the features: gender, chest pain, blood sugar, electrocardiogram results, angina, the ST-segment slope during the most difficult part of the exercise, the number of major vessels stained by fluoroscopy, and thalassemia. The y-axis shows the number of patients. Yellow indicates the presence of the disease, red the absence. From Fig. 13, the following observations follow: 6. The number of major vessels stained by fluoroscopy refers to the number of narrowed vessels seen, so the higher the value of this feature, the greater the likelihood of heart disease. 7. The blood-flow results observed through the radioactive dye are obtained by a procedure that is very invasive for patients, but in themselves they are excellent evidence of the presence or absence of heart disease. 8. The slope of the ST segment, whether flat or rising, can help determine whether heart disease is present. 9. Angina is a good indicator of heart disease. However, recognizing angina is not an easy task: it can be confused with other pains or with atypical angina. 10. When someone has heart disease, the first symptom is usually stable angina (angina during exercise). When angina occurs even at rest, the condition worsens (typically a narrowing of the coronary arteries). That is why so few patients show an abnormal heart rate at rest, and seeing this anomaly is very indicative of the presence of heart disease.</figDesc><graphic coords="8,152.40,332.40,304.20,206.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 14 :</head><label>14</label><figDesc>Figure 14: Analysis of the principal components. Component 1: Fig. 14 shows that the weight is considerable and positive for the chol feature, and slightly positive for sex and cp. This means that patients with a high value of this component will have a meager chance of being diagnosed with heart disease. At the same time, people with higher serum cholesterol are more likely to be diagnosed with heart disease. Component 2: Fig. 14 shows that the weight is considerable and negative for thalach (maximum heart rate reached) and slightly negative for cp (type of chest pain), chol (serum cholesterol), and slope (slope of the peak exercise ST segment). Thus, high values of thalach, cp, slope, and chol mainly do not indicate heart disease: people with high levels of these components are much less likely to have it. In contrast, age and high resting blood pressure (trestbps) may be among the first features of heart disease; in Fig. 14, their weights are positive.</figDesc><graphic coords="9,158.60,274.40,291.80,211.80" type="bitmap" /></figure>
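Reading component weights as in Fig. 14 can be sketched with scikit-learn's PCA; the feature matrix below is random stand-in data (only the column names follow the dataset), so the dominant features it reports are illustrative, not the paper's findings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for a few Heart columns (names from the paper).
feature_names = ["age", "sex", "cp", "trestbps", "chol", "thalach", "slope"]
rng = np.random.default_rng(2)
X = rng.normal(size=(303, len(feature_names)))

# Standardize first so no feature dominates a component through scale alone.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

# Each row of components_ holds one component's feature weights (loadings);
# large-magnitude weights mark the features that drive that component.
for i, comp in enumerate(pca.components_, start=1):
    top = feature_names[int(np.argmax(np.abs(comp)))]
    print(f"Component {i}: largest-magnitude weight on '{top}'")
```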
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Figure 15 :</head><label>15</label><figDesc>Figure 15: Example of the OneHotEncoder operation. Fig. 15 shows an example of the OneHotEncoder operation. The pain column, which contained three classes (medium, strong, and weak), was divided into three new columns: severe pain, moderate pain, and mild pain. All columns contain only two values: 1 if the information is confirmed, 0 if not. For each value in a feature, MinMaxScaler subtracts the minimum value and then divides by the range, i.e. the difference between the original maximum and the original minimum. MinMaxScaler retains the shape of the original distribution. After normalization, the classification is performed. To implement it, we use GaussianNB from the sklearn library with different random states when splitting the data. The score function from the sklearn library is used to evaluate the results; it returns the mean accuracy on the given test data and labels. The results obtained are presented in Fig. 16.</figDesc><graphic coords="10,135.20,72.00,338.60,73.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_11"><head>Figure 19 :</head><label>19</label><figDesc>Figure 19: Average classification scores using 10 attributes. In Fig. 19, the x-axis indicates the random state number and the y-axis the average score for this method. The figure shows that the score ranges from 0.4 to 1. Thus, the average score of the Naive Bayes classifier for random states from 0 to 200 is 0.830327868852459. Next, we reduce the number of attributes to 7. The results of the experiment are shown in Fig. 20.</figDesc><graphic coords="11,157.20,296.00,294.60,99.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_12"><head>Figure 20 :</head><label>20</label><figDesc>Figure 20: Average classification scores using 7 attributes. In Fig. 20, the x-axis indicates the random state number and the y-axis the average score for this method. The figure shows that the score ranges from 0.5 to 1. Thus, the average score of the Naive Bayes classifier for random states from 0 to 200 is 0.765245901639342.</figDesc><graphic coords="11,166.40,510.20,276.20,109.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_13"><head>Figure 23 :</head><label>23</label><figDesc>Figure 23: Measures of accuracy. Fig. 23 shows the positive predictive value (precision), the probability of detection (recall), and the completeness of determination (f1_score). The next step is to reduce the number of attributes to 10 by removing the parameters that have the most negligible impact on heart disease and applying the Multinomial Naive Bayes classifier again. The results of the experiment are shown in Fig. 24.</figDesc><graphic coords="12,156.60,615.40,296.00,112.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_14"><head>Figure 24 :</head><label>24</label><figDesc>Figure 24: Average classification scores using 10 attributes. In Fig. 24, the x-axis indicates the random state number and the y-axis the average score for this method. The figure shows that the score ranges from 0.7 to 1.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_15"><head>Figure 25 :</head><label>25</label><figDesc>Figure 25: Average classification scores using 7 attributes. In Fig. 25, the x-axis indicates the random state number and the y-axis the average score for this method. The figure shows that the score ranges from 0.7 to 1. Therefore, the average score of the Multinomial Naive Bayes classifier for random states from 0 to 200 is 0.83010.</figDesc><graphic coords="13,145.60,147.80,317.80,111.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_16"><head></head><label></label><figDesc>Table <ref type="bibr" target="#b2">3</ref> shows the classification results: the accuracy of the different Naive Bayes models. In the study, Multinomial Naive Bayes achieved the highest average accuracy, 0.85 (85%). This shows that the multinomial classifier surpassed the Gaussian model. Fig. 26 shows a comparison of the accuracy of the two methods.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_17"><head>Figure 26 :</head><label>26</label><figDesc>Figure 26: Comparison of the accuracy of the two Naive Bayes classifier methods. In Fig. 26, orange depicts the Naive Bayes classifier with a Gaussian distribution, and the Multinomial Naive Bayes classifier is shown in yellow. The x-axis indicates the number of features used for the experiments; the y-axis shows the achieved accuracy. From Fig. 26, it is noticeable that both methods show worse results as the number of features decreases. This is because we first removed the features that had little effect on heart disease, which is why the accuracies are similar. In addition, we can notice that the Multinomial classifier holds up much better as the number of features is reduced. This advantage arises because the method makes the additional assumption of positional independence: the conditional probability of a feature occurring is independent of its position in the data sample. Let us also take into account the nature of the chosen problem, namely the values "0" and "1" in the classifiers' answers. We can conclude that the correct classification of first-class objects is, in our case, more critical: it is better to run all the tests once more for a healthy person than to fail to recognize the disease in a sick one. That is why it is worth emphasizing the recall score when comparing the models and choosing the best one, since it estimates the proportion of correctly classified first-class objects. In addition, the positive predictive value (precision) and the completeness of the definition (f1_score) should be considered (Table 4).</figDesc><graphic coords="14,194.20,221.80,220.80,160.40" type="bitmap" /></figure>
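The recall-first comparison described above can be sketched with scikit-learn's metric functions; the label vectors below are hypothetical predictions, not the paper's outputs:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = heart disease present, 0 = absent.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Recall is the share of truly sick patients the model flags; in this
# setting a false negative (a missed patient) is the costly error, so
# recall is the score to emphasize when comparing the two classifiers.
recall = recall_score(y_true, y_pred)       # 5 of 6 sick patients found
precision = precision_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```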
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_18"><head>Figure 27 :</head><label>27</label><figDesc>Figure 27: Comparison of the scores of the two Naive Bayes classifier methods. In Fig. 27, orange depicts the Naive Bayes classifier with a Gaussian distribution, and the Multinomial classifier is shown in yellow. The x-axis indicates the selected scores used for the experiments; the y-axis shows the achieved value. Analyzing the figure, we can conclude that with the Multinomial Bayes classifier, the number of sick patients in whom the disease is detected is larger. Using this classification method, more people will receive a correct diagnosis and, therefore, will have a chance of treatment and recovery. We also compare the running time of the two classification methods, namely the training time (Table 5).</figDesc><graphic coords="15,194.20,72.00,220.80,160.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_19"><head>Figure 28 :</head><label>28</label><figDesc>Figure 28: Comparison of learning time of two methods of the Naive Bayes classifier In Fig. 28, orange depicts a Naive Bayes classifier with a Gaussian distribution. The Multinomial classifier is shown in yellow. The x-axis indicates the number of features that were used for the experiments. The y-axis indicates the training time.</figDesc><graphic coords="15,194.20,548.80,220.80,160.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="6,147.60,153.60,314.00,200.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc></figDesc><table><row><cell>Data set characteristics</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>Dataset</cell><cell>Examples</cell><cell>Train data</cell><cell>Class</cell><cell>No. of features</cell></row><row><cell>Heart</cell><cell>303</cell><cell>240</cell><cell>2</cell><cell>14</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Table 1 shows the Heart data set, which contains 303 examples, of which 240 are used for training. The data set includes two classes and 14 features.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Dividing the data into training and test sets</figDesc><table><row><cell>Method</cell><cell>Train</cell><cell>Dataset (Heart)</cell><cell>Test</cell></row><row><cell>GaussianNB</cell><cell>80%</cell><cell></cell><cell>20%</cell></row><row><cell>MultinomialNB</cell><cell>80%</cell><cell></cell><cell>20%</cell></row><row><cell cols="4">Table 2 shows how the data were split for training and testing. Both algorithms obtained identically split data.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Comparison of accuracy of two classifier models</figDesc><table><row><cell>Method</cell><cell>14 features</cell><cell>10 features</cell><cell>7 features</cell></row><row><cell>GaussianNB</cell><cell>0.84426229</cell><cell>0.83032786</cell><cell>0.765245901</cell></row><row><cell>MultinomialNB</cell><cell>0.85012901</cell><cell>0.849039016</cell><cell>0.830100260</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4</head><label>4</label><figDesc>Comparison of evaluations of two classifier models</figDesc><table><row><cell>Method</cell><cell>Recall</cell><cell>Precision</cell><cell>F1_score</cell></row><row><cell>GaussianNB</cell><cell>0.868788</cell><cell>0.828571</cell><cell>0.852941</cell></row><row><cell>MultinomialNB</cell><cell>0.878788</cell><cell>0.852941</cell><cell>0.865672</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Diagnosis of Heart Disease for Diabetic Patients using Naive Bayes Method</title>
		<author>
			<persName><forename type="first">G</forename><surname>Parthiban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rajesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Srivatsa</surname></persName>
		</author>
		<idno type="DOI">10.5120/2933-3887</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Applications</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">3</biblScope>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Decision Support in Heart Disease Prediction System using Naive Bayes</title>
		<author>
			<persName><forename type="first">G</forename><surname>Subbalakshmi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Indian Journal of Computer Science and Engineering</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="170" to="176" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction</title>
		<author>
			<persName><forename type="first">J</forename><surname>Soni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Ansari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Soni</surname></persName>
		</author>
		<idno type="DOI">10.5120/2237-2860</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Applications</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="43" to="48" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Prediction Analysis of Cardiac Disease using Classification</title>
		<author>
			<persName><surname>Ch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Vincy</surname></persName>
		</author>
		<author>
			<persName><surname>Bindu</surname></persName>
		</author>
		<idno type="DOI">10.22214/ijraset.2019.6295</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Advanced Technologies of Big Data Research in Distributed Information Systems</title>
		<author>
			<persName><forename type="first">N</forename><surname>Kunanets</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vasiuta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Boiko</surname></persName>
		</author>
		<idno type="DOI">10.1109/STC-CSIT.2019.8929756</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th International conference &quot;Computer sciences and Information technologies</title>
				<meeting>the 14th International conference &quot;Computer sciences and Information technologies<address><addrLine>Lviv, Ukraine</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">September 17-20 (2019</date>
			<biblScope unit="page" from="71" to="76" />
		</imprint>
	</monogr>
	<note>CSIT 2019)</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<ptr target="https://www.kaggle.com/zhaoyingzhu/heartcsv" />
		<title level="m">Heart Database</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<ptr target="https://manufacturaclinica.com/blog/sertsevo-sudinni-zahvoryuvannya" />
		<title level="m">Cardiovascular Diseases</title>
				<imprint/>
		<respStmt>
			<orgName>Clinic Manufactory</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Periodontal disease as a risk factor for heart disease</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">J</forename><surname>Loesche</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Compendium</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page">978</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Weighted naive bayes classifier: A predictive model for breast cancer detection</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kharya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Soni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Applications</title>
		<imprint>
			<biblScope unit="volume">133</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page" from="32" to="37" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Performance comparison between Naïve Bayes, decision tree and k-nearest neighbor in searching alternative design in an energy simulation tool</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ashari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Iman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Min</forename><surname>Tjoa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Advanced Computer Science and Applications</title>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Extraction of action rules for chronic kidney disease using Naive Bayes classifier</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">D</forename><surname>Uma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE International Conference on Computational Intelligence and Computing Research</title>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Lipids, risk factors and ischaemic heart disease</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">P</forename><surname>Castelli</surname></persName>
		</author>
		<idno type="DOI">10.1016/0021-9150(96)05851-0</idno>
	</analytic>
	<monogr>
		<title level="j">Atherosclerosis</title>
		<imprint>
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Clustering of Metabolic Factors and Coronary Heart Disease</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">F</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Kannel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Silbershatz</surname></persName>
		</author>
		<idno type="DOI">10.1001/archinte.159.10.1104</idno>
	</analytic>
	<monogr>
		<title level="j">Archives of Internal Medicine</title>
		<imprint>
			<biblScope unit="volume">159</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page">1104</biblScope>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<ptr target="https://www.youtube.com/watch?v=O2L2Uv9pdDA" />
		<title level="m">StatQuest with Josh Starmer: Naive Bayes</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Information System of Catering Selection by Using Clustering Analysis</title>
		<author>
			<persName><forename type="first">N</forename><surname>Boyko</surname></persName>
		</author>
		<author>
			<persName><surname>Kh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Shakhovska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mochurad</surname></persName>
		</author>
		<author>
			<persName><surname>Campos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st International Workshop on Digital Content &amp; Smart Multimedia (DCSMart 2019)</title>
				<meeting>the 1st International Workshop on Digital Content &amp; Smart Multimedia (DCSMart 2019)<address><addrLine>Lviv, Ukraine</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">December 23-25. 2019</date>
			<biblScope unit="page" from="94" to="106" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
