
Tree-based Algorithms for Cardiovascular Disease Prediction

Mateusz Filipek

Silesian University of Technology, Faculty of Applied Mathematics, Kaszubska 23, 44-100 Gliwice, Poland


Classification is one of the problems data scientists run into most frequently. With classification, the available data can be assigned to discrete classes. Numerous algorithms exist that solve this problem effectively. This article focuses on tree-based algorithms: the Decision Tree algorithm and the Random Forest algorithm. The problem approached with these algorithms is cardiovascular disease prediction, using the Kaggle dataset containing patient records [1].

Keywords: Decision Trees, CART, Random Forest, Bagging algorithms

1. Introduction

Artificial intelligence methods [2, 3, 4] play an increasingly important role in various types of information systems. The numerous applications of artificial intelligence methods are based on several of its most important branches. Among the most important are methods based on fuzzy sets [5, 6]. In the papers [7, 8, 9] the authors proposed a system based on type-2 fuzzy inference for detecting anomalies on the roads. In the work [10], artificial intelligence methods based on fuzzy systems are responsible for the proper airing of rooms. The second very important branch of artificial intelligence algorithms is heuristic algorithms [11, 12], which are applicable wherever we strive to minimize or maximize functionals with different interpretations resulting from the specificity of the issue under consideration. At this point, it is worth paying attention to the work on reducing energy consumption [13, 14].

A very important group is the third branch of artificial intelligence methods, based on neural networks [15, 16, 17, 18, 19]. They are used in many areas of life, including the detection of certain desirable features [20, 21], care for the elderly [22, 23, 24], and diagnostics [25, 26, 27].

1.1. Cardiovascular Diseases

These numbers might go even higher in the coming years. The ongoing COVID pandemic, widespread lockdowns and the increase in people working from home can lead to increased numbers of people living sedentary lifestyles, and these can increase the likelihood of suffering from heart diseases.

Poor diet, inactivity, and dangerous alcohol and tobacco use are just a few examples of lifestyle choices that can increase a person’s chance of developing heart disease. Adult obesity is on the rise, and it’s getting worse than ever. According to the CDC, US obesity prevalence increased from 30% to roughly 42%. Because cardiovascular diseases are responsible for roughly a third of global deaths, it is of the utmost importance to find a way to cure and help people who are suffering from heart diseases. However, before such diseases can be treated properly, we need a way to detect them, hopefully long before they can do great harm.

Cardiovascular disease detection is a categorization problem. The results can often be split into two groups: healthy patients and patients with heart problems. Because there are only two major result classes, which can be simply encoded in a binary system (1 for a sick patient and 0 for a healthy patient), this particular classification task is known as binary classification.

1.2. Corrado Gini

1.3. Leo Breiman

Leo Breiman was an American statistician, born in New York, USA, in 1928. He pursued his education at the University of California. Breiman is primarily recognized for his work on CART and on Bootstrap Aggregation, which owes its name to him and is now known as Bagging. He is also the inventor of the Random Forest method.

1.4. Binary Classification

The process of classifying the elements of a set into exactly two classes is known as binary classification.

The main applications of binary classification are in medical testing, to check whether a patient is ill or not, and in quality control, to assess whether a manufactured item fulfills its specification.

Some of the most common methods used for solving binary classification problems are Decision Trees, Random Forests and Logistic Regression.

2. Proposed Classifiers

2.1. Decision Tree

Decision Trees are among the most commonly used models for classification and regression tasks. They can be described as a model that asks a list of if/else questions about a sample and then makes a decision based on the answers.

Decision Trees are often divided into two categories: classification trees and regression trees. Regression trees produce numeric output, and classification trees produce categorical output. The latter are the main interest of this article, more specifically the CART implementation of decision trees.

The CART algorithm builds decision trees using Gini’s Impurity Index to create the best possible splits of the data.

2.1.1. Gini’s Impurity Index

Gini’s Ratio is a statistical dispersion metric that is frequently used to assess income inequality. In decision trees, Gini’s Impurity measures the likelihood that a randomly chosen element would be incorrectly classified if it were labeled at random according to the class distribution in the dataset. If all elements in a dataset belong to a single class, Gini’s Index takes the value 0, meaning that the dataset is pure. Conversely, if every element of the dataset belongs to a different class, Gini’s Index approaches 1, which indicates that the dataset is fully impure. For two classes, a Gini’s Index of 0.5 means that the elements are equally distributed between the classes.

The Gini Index can be represented as:

$$\text{Gini Index} = 1 - \sum_{i=1}^{n} (p_i)^2,$$

where $p_i$ is the probability of an element being classified into class $i$, and $n$ is the number of classes.

In CART decision trees, Gini’s Index is used to calculate the best possible split at each level of the tree.

2.1.2. Algorithm

The Decision Tree algorithm makes use of a binary tree data structure, where each node is either a decision node, which splits the data based on the best achievable Gini Index, or a terminal (leaf) node, which does not split further and determines the prediction made by the decision tree. Decision Trees are often described using flowcharts.

The best possible split is calculated for each node individually by checking all the possible values of each available feature. The feature and value pair that achieves the best Gini’s Index gain is used to split the dataset further into two parts.

A Decision Tree is built and read recursively, thanks to the underlying data structure. Building the entire tree for a classifier involves using the provided training data to determine the appropriate splits. It is important to adequately adjust the classifier parameters, such as the maximum depth or the minimum number of samples required for performing a split.

The maximum depth parameter specifies how deep the decision tree can get: it is the number of nodes from the root down to the furthest leaf node, that is, the height of the underlying tree structure.

Theoretically, the maximum depth of the decision tree could be almost as high as the number of training samples; however, it is not recommended to let the Decision Tree classifier grow that deep, because it will then overfit.

Data scientists use the term "overfitting" to indicate that the results of an analysis fit a set of data too closely. Such an algorithm can perform very well on training data, but when exposed to an unknown sample, it will attempt to categorize it using highly specific criteria that may not be appropriate for classifying unknown samples. The depth parameter should not be set too high, because a model that has been trained too precisely on a given dataset will be fitted to that dataset exactly: it will learn not only how to make decisions based on the important features and their values, but also how to take into account the existing "noise", that is, irrelevant information.

If more than a single sample is present at a leaf node, the outcome is predicted using the Majority Voting technique, where the class represented by the highest number of samples is chosen.

Algorithm: Building Decision Tree
Input: TR: training samples, MaxDepth: maximum depth of the Decision Tree
Output: Decision Tree built based on the provided training samples
    if stopping conditions are met then
        return a leaf node with the adequate class assigned to it
    else
        compute the best possible Gini Gain
        if the Gini Gain > 0 then
            recursively build the left side of the current node
            recursively build the right side of the current node
        return the current node with both sides assigned to it

Algorithm: Predicting sample labels
Input: xTest: testing samples
Output: Predicted labels
    for each testing sample do
        if the currently checked node has a class value assigned to it then
            return the assigned class value
        else
            if the currently tested node has a key feature value greater than that of the checked sample then
                recursively check the left child of the node
            else
                recursively check the right child of the node

2.2. Random Forest

Random Forest is an ensemble classification algorithm that performs classification using a predetermined number of decision trees. Random Forests rely on many relatively uncorrelated trees (uncorrelated thanks to the random choice of samples), classifying the provided sample with each one and performing a majority vote to get the best possible result.

2.2.1. Ensemble Algorithms

Evaluating a sample with an ensemble algorithm typically requires more computing resources than evaluating it with a single model, but the ’Wisdom of Crowds’ obtained by using multiple models leads to increased accuracy.

2.2.2. Bagging

Bagging, also known as bootstrap aggregation, is an ensemble learning technique that is used to improve stability and accuracy and to reduce variance, which decreases the chance of overfitting. Using bagging, Random Forests produce a variety of trees by letting each one randomly select a sample from a given dataset. Creating a large number of decision trees helps in reducing overfitting.

2.2.3. Algorithm

Each tree in a Random Forest is built using randomly chosen data from the dataset. When predicting the outcome for a provided sample, the sample is passed to every available tree, and the results of all the classifications are then subjected to the Majority Voting technique, where the class that receives the most votes is chosen.

Algorithm: Building a Random Forest
Input: training samples, number of trees
Output: Random Forest built based on the provided samples
    for each of the given number of trees do
        choose a random subsample
        create a Decision Tree using the chosen subsample

Algorithm: Classifying using Random Forest
Input: test samples
Output: classified sample labels
    for every test sample do
        for every built tree do
            classify the sample using the currently checked Decision Tree
        perform majority voting based on the results of classifying the chosen sample using all decision trees
    return the classified samples
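The procedures of Section 2 (Gini impurity, the search for the best split, recursive tree building, bagging and majority voting) can be illustrated with a minimal Python sketch. This is not the implementation used in the experiments. The function names, the tuple-based node layout, the convention that values below the threshold go to the left child, and the use of bootstrap samples of the same size as the training set are all assumptions made for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(X, y):
    """Check every (feature, value) pair and return (gain, feature, threshold)."""
    parent, n = gini(y), len(y)
    best = (0.0, None, None)
    for f in range(len(X[0])):
        for t in {row[f] for row in X}:
            left = [lab for row, lab in zip(X, y) if row[f] < t]
            right = [lab for row, lab in zip(X, y) if row[f] >= t]
            if not left or not right:
                continue
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
            if parent - weighted > best[0]:
                best = (parent - weighted, f, t)
    return best


def majority(labels):
    """Majority voting: the most frequent label wins."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(X, y, max_depth, depth=0):
    """Return a leaf (a class label) or a node (feature, threshold, left, right).

    Assumption: class labels are never tuples, so tuples always denote nodes.
    """
    if depth >= max_depth or len(set(y)) == 1:
        return majority(y)
    gain, f, t = best_split(X, y)
    if f is None:  # no split with a positive Gini gain exists
        return majority(y)
    left = [(row, lab) for row, lab in zip(X, y) if row[f] < t]
    right = [(row, lab) for row, lab in zip(X, y) if row[f] >= t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [l for _, l in left], max_depth, depth + 1),
        build_tree([r for r, _ in right], [l for _, l in right], max_depth, depth + 1),
    )


def predict_tree(node, row):
    """Walk from the root to a leaf and return the class stored there."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] < t else right
    return node


def build_forest(X, y, n_trees, max_depth, seed=0):
    """Bagging: train each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], max_depth))
    return forest


def predict_forest(forest, row):
    """Classify with every tree, then take the majority vote."""
    return majority([predict_tree(tree, row) for tree in forest])
```

The exhaustive split search is quadratic in the number of samples per feature, which keeps the sketch short but would be slow on a dataset of tens of thousands of records.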

3. The Cardiovascular Disease Dataset

The Cardiovascular Disease Dataset consists of 70000 patient records with 11 features each:

1. Age
2. Height
3. Weight
4. Gender
5. Systolic blood pressure
6. Diastolic blood pressure
7. Cholesterol
8. Glucose
9. Smoking
10. Alcohol intake
11. Physical activity

3.1. Data Cleaning

The process of detecting and fixing, or even removing, corrupted, duplicate, or incomplete data is known as data cleaning.

The dataset used contains a number of invalid records, such as records of patients with a systolic blood pressure that is negative or exceeds 16000. Removing such records reduced the total number of samples by 1413.
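A filter of this kind could look like the following sketch. The field names `systolic` and `diastolic` and the plausibility ranges are hypothetical, chosen only for illustration. The exact criteria used to remove the 1413 records are not stated here.

```python
def clean_records(records, sys_range=(60, 250), dia_range=(30, 200)):
    """Keep only records whose blood pressure values lie in the given ranges.

    The default ranges are illustrative assumptions, not the paper's criteria.
    """
    cleaned = []
    for rec in records:
        sys_ok = sys_range[0] <= rec["systolic"] <= sys_range[1]
        dia_ok = dia_range[0] <= rec["diastolic"] <= dia_range[1]
        if sys_ok and dia_ok:
            cleaned.append(rec)
    return cleaned
```

Keeping the bounds as parameters makes it easy to tighten or relax the filter and to see how many records each choice removes.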

3.2. Dimensionality reduction

Dimensionality reduction is the process of reducing the number of dimensions, that is, the features present in a dataset, while preserving as much of the variation in the data as possible. Reducing the number of available features can increase performance, eliminate redundancy, and reduce overfitting.

Dimensionality reduction works by identifying and removing features that have little to no impact on the outcomes.

4. Classification

4.1. Splitting data

The dataset needs to be split into training data and testing data. Most of the data should be used for training purposes.

After a model has been trained using the training set, its accuracy can be validated using data from the testing set. Because the class values in the testing set are already known, the accuracy of the classifiers can be calculated correctly.
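A random split of this kind can be sketched as follows. The function name `split_dataset`, the default 80/20 ratio and the fixed seed are assumptions for illustration. The text above only requires that most of the data go to training.

```python
import random


def split_dataset(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle indices once, then cut them into a test part and a training part."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(idx) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_train = [samples[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [samples[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, y_train, X_test, y_test
```

Shuffling indices rather than the samples themselves keeps each sample paired with its label.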

5. Classifier Evaluation

5.1. Correctness measures

5.1.1. P

P is the number of real positive conditions.

5.1.2. N

N is the number of real negative conditions.

5.1.3. TP

TP is the number of correctly predicted presences of a condition.

5.1.4. TN

TN is the number of correctly predicted absences of a condition.

5.1.5. FP

FP is the number of wrongly predicted presences of a condition.

5.1.6. FN

FN is the number of wrongly predicted absences of a condition.

The last four correctness measures (TP, TN, FP and FN) are the entries of the confusion matrix; they are used to evaluate the specificity, sensitivity and accuracy of classifiers.

5.1.7. Accuracy

The simplest evaluation metric is accuracy. It measures the fraction of the whole testing dataset for which the predicted class matches the actual class, that is, the proportion of labels that were correctly assigned.

$$\mathrm{ACC} = \frac{TP + TN}{P + N}$$

5.1.8. Precision

Precision is defined as the ratio of true positives to the sum of true positives and false positives. Precision describes how effectively the model predicts the positive cases out of all the cases it predicts as positive.

$$\mathrm{PPV} = \frac{TP}{TP + FP}$$

5.1.9. Recall

Recall is the ratio of true positives to the sum of true positives and false negatives. Recall shows how well the model identifies the positive cases out of all the positive cases in the dataset.

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

5.1.10. F1

The F1 score is the harmonic mean of recall and precision.

$$F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}}$$
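The correctness measures defined above can be computed directly from the true and predicted labels. The sketch below assumes binary labels encoded as 1 (sick) and 0 (healthy). The function names, and the choice to return 0.0 when a denominator is zero, are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(tp, tn, fp, fn):
    """Return accuracy, precision (PPV), recall (TPR) and F1 as a dict."""
    total = tp + tn + fp + fn  # equals P + N
    acc = (tp + tn) / total if total else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    return {"ACC": acc, "PPV": ppv, "TPR": tpr, "F1": f1}
```

Note that P = TP + FN and N = TN + FP, so the denominator of the accuracy is simply the size of the testing set.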

5.1.11. Confusion Matrix

A Confusion Matrix is a performance measurement technique. As the name suggests, it is a matrix, representing the four different combinations of predicted and actual values. Its name comes from the fact that the matrix makes it easier to determine whether the model is confusing one class with another.

$$\begin{pmatrix} TP & FP \\ FN & TN \end{pmatrix}$$

6. Testing the Decision Tree Classifier

6.0.1. Conclusions

Single Decision Trees quickly began to overfit: increasing the maximum depth not only decreased the correctness of their predictions, but also increased the total time needed to build the tree. The problem of overfitting can be fixed by utilizing the bagging technique.

7. Testing the Random Forest Classifier

The main parameters of the Random Forest classifier that need to be adjusted are the total number of trees and the maximum depth of each tree.

7.0.2. Conclusions

Increasing the total number of trees increased the correctness of the Random Forest classifier. Even with a low number of trees, the Random Forest classifier has better correctness than a single Decision Tree: bagging helps with overfitting, and the random choice of samples helps the classifier learn the training dataset better.

7.0.3. Testing various maximum depths

The correctness measures for the Random Forest classifier with 25 trees and a maximum tree depth of 2 were: = 0.66 = 0.74 = 0.64

The correctness measures for the Random Forest classifier with 25 trees and a maximum tree depth of 4 were: = 0.67 = 0.79 = 0.63

The correctness measures for the Random Forest classifier with 25 trees and a maximum tree depth of 16 were: = 0.6 = 0.61 = 0.59

The correctness measures for the Random Forest classifier with 25 trees and a maximum tree depth of 128 were: = 0.62 = 0.61 = 0.61

7.0.4. Conclusions

Changes in the maximum depth of individual trees did not affect the Random Forest classifier as much as the total number of trees.

8. Conclusions

The nature of the disease prediction problem makes tree-based algorithms suitable for patient classification. The proposed algorithms show good correctness. The presented tests have shown that the proper selection of classifiers has a great effect on the classification results. Decision Trees alone can predict reasonably well; however, utilizing bagging algorithms such as Random Forest can increase the correctness of the acquired results at the cost of a little execution speed.

References

[1] Ulianova, Cardiovascular disease dataset, 2019. URL: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset.

[2] Q.-b. Zhang, Wang, Z.-h. Chen, An improved particle filter for mobile robot localization based on particle swarm optimization, Expert Systems with Applications 135 (2019) 181-193.

[3] M. A. Sanchez, Castillo, J. R. Castro, Generalized type-2 fuzzy systems for controlling a mobile robot and a performance comparison with interval type-2 and type-1 fuzzy systems, Expert Systems with Applications 42 (2015) 5904-5914.

[4] Ponzi, Russo, Bianco, Napoli, Wajda, Psychoeducative social robots for an healthier lifestyle using artificial intelligence: a case-study, in: CEUR Workshop Proceedings, volume 3118, 2021, pp. 26-33.

[5] Li, Dong, Yang, Jiang, Ni, J. Liu, Automatic impedance matching method with adaptive network based fuzzy inference system for wpt, IEEE Transactions on Industrial Informatics 16 (2019) 1076-1085.

[6] Sun, Qiang, Xu, G. Lin, Internet of things-based online condition monitor and improved adaptive fuzzy control for a medium-low-speed maglev train system, IEEE Transactions on Industrial Informatics 16 (2020) 2629-2639. doi:10.1109/TII.2019.2938145.

[7] Woźniak, Zielonka, Sikora, Driving support by type-2 fuzzy logic control model, Expert Systems with Applications 207 (2022) 117798.

[8] Brandizzi, Russo, Brociek, Wajda, First studies to apply the theory of mind theory to green and smart mobility by using gaussian area clustering, in: CEUR Workshop Proceedings, volume 3118, 2021, pp. 71-76.

[9] Brandizzi, Russo, G. Galati, Napoli, Addressing vehicle sharing through behavioral analysis: A solution to user clustering using recency-frequency-monetary and vehicle relocation based on neighborhood splits, Information (Switzerland) 13 (2022). doi:10.3390/info13110511.

[10] Woźniak, Zielonka, Sikora, M. J. Piran, Alamri, 6g-enabled iot home environment control using fuzzy rules, IEEE Internet of Things Journal 8 (2020) 5442-5452.

[11] Qiu, Li, Zhou, Song, Lee, Lloret, A novel shortcut addition algorithm with particle swarm for multisink internet of things, IEEE Transactions on Industrial Informatics 16 (2019) 3566-3577.

[12] Yu, C. P. Chen, Smooth transition in communication for swarm control with formation change, IEEE Transactions on Industrial Informatics 16 (2020) 6962-6971.

[13] Woźniak, Sikora, Zielonka, Kaur, M. S. Hossain, Shorfuzzaman, Heuristic optimization of multipulse rectifier for reduced energy consumption, IEEE Transactions on Industrial Informatics 18 (2021) 5515-5526.

[14] Ponzi, Russo, Wajda, Brociek, C. Napoli, Analysis pre and post covid-19 pandemic rorschach test data of using em algorithms and gmm models, in: CEUR Workshop Proceedings, volume 3360, 2022, pp. 55-63.

[15] V. S. Dhaka, S. V. Meena, Rani, Sinwar, M. F. Ijaz, M. Woźniak, A survey of deep convolutional neural networks applied for prediction of plant leaf diseases, Sensors 21 (2021) 4749.

[16] Pepe, Tedeschi, Brandizzi, Russo, L. Iocchi, C. Napoli, Human attention assessment using a machine learning approach with gan-based data augmentation technique trained using a custom dataset, OBM Neurobiology 6 (2022). doi:10.21926/obm.neurobiol.2204139.

[17] Bonanno, G. Capizzi, G. Lo Sciuto, A neuro wavelet-based approach for short-term load forecasting in integrated generation systems, in: 2013 International Conference on Clean Electrical Power (ICCEP), IEEE, 2013, pp. 772-776.

[18] Capizzi, Napoli, Paternò, An innovative hybrid neuro-wavelet method for reconstruction of missing data in astronomical photometric surveys, in: Artificial Intelligence and Soft Computing: 11th International Conference, ICAISC 2012, Zakopane, Poland, April 29-May 3, 2012, Proceedings, Part I 11, Springer, 2012, pp. 21-29.

[19] Lo Sciuto, G. Susi, G. Cammarata, Capizzi, A spiking neural network-based model for anaerobic digestion process, in: 2016 International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), IEEE, 2016, pp. 996-1003.

[20] Dehzangi, Taherisadr, R. ChangalVala, Imu-based gait recognition using convolutional neural networks and multi-sensor fusion, Sensors 17 (2017) 2735.

[21] Aureli, Brandizzi, G. Magistris, Brociek, A customized approach to anomalies detection by using autoencoders, in: CEUR Workshop Proceedings, volume 3092, 2021, pp. 53-59.

[22] Russo, Illari, Avanzato, C. Napoli, Reducing the psychological burden of isolated oncological patients by means of decision trees, in: CEUR Workshop Proceedings, volume 2768, 2020, pp. 46-53.

[23] Russo, Napoli, A comprehensive solution for psychological treatment and therapeutic path planning based on knowledge base and expertise sharing, in: CEUR Workshop Proceedings, volume 2472, 2019, pp. 41-47.

[24] Woźniak, Wieczorek, Siłka, Połap, Body pose prediction based on motion sensor data and recurrent neural network, IEEE Transactions on Industrial Informatics 17 (2020) 2101-2111.

[25] Lo Sciuto, Russo, Napoli, A cloud-based flexible solution for psychometric tests validation, administration and evaluation, in: CEUR Workshop Proceedings, volume 2468, 2019, pp. 16-21.

[26] H. G. Hong, M. B. Lee, K. R. Park, Convolutional neural network-based finger-vein recognition using nir image sensors, Sensors 17 (2017) 1297.

[27] S. Illari, S. Russo, R. Avanzato, C. Napoli, A cloud-oriented architecture for the remote assessment and follow-up of hospitalized patients, in: CEUR Workshop Proceedings, volume 2694, 2020, pp. 29-35.