<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Random Forest as a Method of Predicting the Presence of Cardiovasculars Diseases</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yurii</forename><surname>Kryvenchuk</surname></persName>
							<email>yurkokryvenchuk@gmail.com</email>
						</author>
						<author>
							<persName><forename type="first">Alina</forename><surname>Yamniuk</surname></persName>
							<email>alinayamniuk@gmail.com</email>
						</author>
						<author>
							<persName><forename type="first">Iryna</forename><surname>Protsyk</surname></persName>
							<email>iryna.s.protsyk@lpnu.ua</email>
						</author>
						<author>
							<persName><forename type="first">Lesia</forename><surname>Sai</surname></persName>
							<email>lesia.p.sai@lpnu.ua</email>
						</author>
						<author>
							<persName><forename type="first">Andriana</forename><surname>Mazur</surname></persName>
							<email>andriana.v.mazur@lpnu.ua</email>
						</author>
						<author>
							<persName><forename type="first">Olena</forename><surname>Sydorchuk</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">Lviv Polytechnic National University</orgName>
								<address>
									<addrLine>Stepana Bandery Street 12</addrLine>
									<postCode>79013</postCode>
									<settlement>Lviv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">International Conference on Computational Linguistics and Intelligent Systems</orgName>
								<address>
									<addrLine>May 12-13</addrLine>
									<postCode>2022</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Random Forest as a Method of Predicting the Presence of Cardiovasculars Diseases</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">0484AA27C5C4002A09C9027579B6B72D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T12:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Machine Learning, Classification, Random Forest, Decision Tree, Bagging, Boosting, Diagnosis of Disease</term>
					<term>0000-0002-2504-5833 (Yu. Kryvenchuk)</term>
					<term>0000-0003-1886-5902 (A. Yamniuk)</term>
					<term>0000-0002-6270-1344 (I. Protsyk)</term>
					<term>0000-0002-5081-4235 (L.Sai)</term>
					<term>0000-0002-5985-5674 (A.Mazur)</term>
					<term>0000-0002-9357-8690 (O.Sydorchuk)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Heart diseases are considered to be one of the main reasons that lead to the death. The correct prediction of heart disease can prevent life threats, however incorrect prediction can be fatal at the same time. The paper considers the ways to solve the problem of prediction cardiovasculars diseases. For this purpose, Random Forest method is considered. Both the advantages and disadvantages of using this method of predicting are investigated. The main problems that shape the work are defined. The paper provides step-by-step creation of a system and identifies the main requirements that it must meet. For training the model it is used classifier RandomForestClassifier. The results of the training are given as well. For improving the precision of the training model Grid Search is used. The metricserror matrices and ROC-AUC curves are used in order to visualize the results of the research. To compare the results Gradient boosting algorithm is considered. Such a model can be very useful in hospitals as an additional check of the diagnosis before prescribing treatment. The object of the research is medical indicators and their importance for successful diagnosis of the disease. The data is taken from the open source. The subject of the research is the method of Random Forest for classification based on statistical data.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Medicine is thought to be one of the most investigated areas for all the time. Due to the development of the information and computer technologies, this area and diagnostic process are rapidly being modernized <ref type="bibr" target="#b0">[1]</ref>. Early detection of the disease in humans is an extremely important procedure, as it can prevent serious consequences. And the success or failure of treatment directly depends on timely and accurate diagnosis.</p><p>The scope of Machine Learning algorithms are increasing in predicting various diseases. Machine learning algorithms are often used for cardio disease prediction systems. Machine learning techniques help in identifying the data and automatically make the predictions.</p><p>Integrating Machine Learning into the healthcare ecosystem allows for a multitude of benefits, including automating tasks and analyzing big patient data sets to deliver better healthcare faster and at a lower cost. Quickly obtaining patient insights helps the healthcare ecosystem discover key areas of patient care that require improvement.</p><p>There is a growing awareness of the importance of machine learning as a platform that can gather information from multiple sources into an integrated system. It significantly facilitates decisionmaking processes for highly skilled employees. Improvements in computing resources, as well as the storage and exchange of data over the last decade, have been a significant factor in harnessing the potential of machine learning systems in medicine.</p><p>Machine learning is a fast-growing trend in the health care industry, thanks to the advent of wearable devices and sensors that can use data to assess a patient's health in real time. 
The technology can also help medical experts analyse data to identify trends or red flags that may lead to improved diagnoses and treatment.</p><p>According to the World Health Organization, as of 2020 the leading cause of death is heart disease, responsible for 16% of the world's total deaths. Since 2000, the largest increase in deaths has been for this disease, rising by more than 2 million to 8.9 million deaths in 2020. The number of deaths is projected to reach 24.5 million in 2030 because of the growth of cardiovascular risk factors such as high blood pressure, diabetes, obesity, and smoking <ref type="bibr">[2]</ref>.</p><p>Machine learning algorithms can help doctors diagnose, analyse X-rays, predict a patient's health, and so on. Accurate analysis is normally a prerequisite for successful treatment. When doctors fail to make accurate decisions while examining a patient's disease, disease prediction systems that use ML algorithms can help.</p><p>An important aspect that distinguishes medical data from most other data is that the objectivity, accuracy, quality, and timeliness of results are critical and must be constantly questioned. Thus, the problem of medical diagnosis can be solved with the help of the classification problem.</p><p>One of the classification methods is Random Forest <ref type="bibr">[3]</ref>. It is used to create a classifier model that can predict disease with higher rates and accuracy.</p><p>By using classification methods, it is possible not only to cure people, but also to prevent the deterioration of their health in time. Appropriate and accurate prediction of cardiovascular disease is of quite significant value.</p><p>The cost of error is quite high, so research in this area is always needed. 
This research proposes a model that predicts whether a patient has heart disease, based on the entered indicators, and raises awareness of heart disease.</p><p>The article includes the following sections: 1. Introduction: the actual problem, the motivation for doing the research, and possible improvements in this area. 2. Related work: common works and their results related to the theme. 3. Materials and methods: the methods used in the experiment, namely Random Forest, Decision Tree, Bagging (Bootstrap Aggregating), and Gradient Boosting. 4. Experiments: investigation of the data using Random Forest in order to predict the presence of cardiovascular disease for people with different health conditions. 5. Results: the analysed data presented in figures and the results of the study. 6. Conclusion: a summary of the overall findings. 7. References: source materials.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>To do the research on this specific topic, it is quite important to gain the results of the works that have been done earlier. For this it is necessary to comprehensively analyse the literary different type of sources.</p><p>Many research articles have been carefully studied in order to investigate the problem and systematize knowledge in this area.</p><p>In article <ref type="bibr" target="#b1">[4]</ref> was proposed a model for prediction of cardiovascular disease using machine learning algorithm hybrid random forest with linear mode. Authors obtained 88.7% accuracy for prediction. The dataset was collected from UCI repository site. Authors have chosen Cleveland dataset for this proposed study.</p><p>In article <ref type="bibr" target="#b2">[5]</ref> authors used knn, decision tree, linear regression, support vector machine algorithms for prediction of heart disease and compared their accuracy. From the experimental result authors obtained best accuracy of 87% by using k-nearest neighbor algorithm followed by support vector machine 83%, decision tree 79% and linear regression of 78% accuracy among all these algorithms for prediction of heart disease.</p><p>In article <ref type="bibr" target="#b3">[6]</ref> a study conducted to compare statistical, ML and data mining methods in terms of their ability to assist in predicting heart failure risks. The researchers compared the performance of statistical evaluation, Decision Trees, Random Forest, and convolutional neural networks, and they obtained prediction accuracy results of 85%, 80.1%, 85.38%, and 93%, respectively.</p><p>In article <ref type="bibr" target="#b4">[7]</ref> was used different varieties of unsupervised clustering algorithms to determine their accuracy in terms of cardiac disease search and diagnosis. The algorithms were applied to the Cleveland dataset. 
The study results highlighted k-means as the most appropriate algorithm for cardiac disease diagnosis.</p><p>In article <ref type="bibr" target="#b5">[8]</ref>, a model using ensemble approaches (boosting and bagging) with feature extraction algorithms (LDA and PCA) was suggested for predicting heart disease. The authors compared ensemble techniques (bagging and boosting) with five classifiers (SVM, KNN, RF, NB, and DT) on selected features from the Cleveland heart disease dataset. The results of the experiments indicated that the bagging ensemble learning method with DT and PCA feature extraction obtained the most outstanding performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Materials and methods</head><p>Random Forest is an ensemble machine learning method based on decision trees that involves creating multiple trees and then combining their results to improve model generalization capabilities.</p><p>Main features of Random Forest Algorithm:  It's more accurate than the decision tree algorithm.</p><p> It provides an effective way of handling missing data.</p><p> It can produce a reasonable prediction without hyper-parameter tuning.  It solves the issue of overfitting in decision trees.  In every random forest tree, a subset of features is selected randomly at the node's splitting point. Decision trees are the building blocks of a Random Forest algorithm. Decision trees <ref type="bibr" target="#b6">[9]</ref> are a decision-making tool that uses a tree-like graph or decision model and their possible consequences. Decision trees seek to find the best distribution for a subset of data, and they are usually learned using a classification tree algorithm.</p><p>Decision trees are very easy to overfit. In the process of building a tree, so that its size does not become too large, is used special procedures that allow you to create optimal trees. The creation process continues until the stop criteria are met.</p><p>The most well-known measures of entropy and Gini index are used in the work <ref type="bibr" target="#b7">[10]</ref>.</p><p>During the study, the best criterion was chosen -the measure of entropy <ref type="bibr" target="#b8">[11]</ref>.</p><p>The measure of entropy in the construction of decision trees is a measure of the diversity of classes in the node. As a result of breakdown nodes with smaller variety of states of an initial variable are formed. Thus, the entropy decreases and the amount of internal information in the node increases. 
Formally, the entropy of a certain node T of the decision tree is determined by the formula:</p><formula xml:id="formula_0">Info(T) = − ∑_{j=1}^{n} p_j · log(p_j), p_j = N_j / N</formula><p>where p_j is the share of class j in the node T (the probability of the system being in state j), T is the current node, n is the number of classes,</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Nnumber of objects in the node</head><p>The entropy of the whole breakdown is the sum of the entropies of all the nodes multiplied by possibility of records of all nodes in the total number of records:</p><formula xml:id="formula_1">𝐼𝑛𝑓𝑜(𝑆) = ∑ 𝑁 𝑗 𝑁 ⋅ 𝐼𝑛𝑓𝑜(𝑇 𝑗 ) 𝑛 𝑗=1</formula><p>To select the split attribute, a criterion called information gain or entropy decrease is used:</p><formula xml:id="formula_2">𝐺𝑎𝑖𝑛(𝑆) = 𝐼𝑛𝑓𝑜(𝑇) − 𝐼𝑛𝑓𝑜 𝑠 (𝑇)</formula><p>The best attribute to be used in the S breakdown is the one that provides the largest increase in Gain (S) information.</p><p>To reduce deviations, one of the main methods of aggregation in machine learning is used -Bagging or Bootstrap Aggregating <ref type="bibr" target="#b9">[12]</ref>.</p><p>Begging is a simple technique in which we build independent models and combine them using some model of averaging. Because, classification is carried out, aggregation occurs by majority voting. For this test observation, we can record the class predicted by each of the trees and take the majority of votes: the overall forecast is the most common class.</p><p>Begging has been particularly useful for decision trees. This is because begging avoids the high correlation between decision trees that occurs when training them using the same data.</p><p>One of the most popular boosting algorithms Gradient Boosting is used in the research <ref type="bibr" target="#b10">[13]</ref>. Like begging, the main task of boosting is to transform a set of weak classifiers (that is, those that make many mistakes in the test sample) into a stronger one. Gradient Boosting works consistently, adding new ones to past models to correct mistakes made by previous predictors. This algorithm tries to teach new models on the residual error of the past (moving to a minimum loss function).</p><p>Let's evaluate the use of Random Forest and Gradient Boosting. 
Like a Random Forest, Boosting Trees are an ensemble of Decision Trees. The main differences between the algorithms are the way the trees are built and the way the results are combined.</p><p>How trees are built: a random forest builds each tree independently, while Gradient Boosting builds one tree at a time. This additive model (ensemble) works in stages, introducing a weak learner to improve on the shortcomings of the existing weak learners.</p><p>Combining results: random forests combine results at the end of the process (by averaging or "majority rules"), while Gradient Boosting combines results along the way.</p><p>If the parameters are tuned carefully, Gradient Boosting can lead to better performance than random forests. However, Gradient Boosting may not be the best choice if noise is present, as it can lead to overfitting.</p></div>
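The entropy and information-gain formulas above can be sketched in Python. This is a minimal illustration, not the paper's code; the paper does not state a logarithm base, so base-2 logarithms are assumed:

```python
import numpy as np

def entropy(labels):
    """Info(T): entropy of the class distribution in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # p_j = N_j / N for each class j
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent_labels, splits):
    """Gain(S) = Info(T) - Info_s(T): entropy decrease after a split,
    where Info_s(T) weights each child node by its share of records."""
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent_labels) - weighted

# A pure node has zero entropy; a perfectly balanced binary node has entropy 1.
print(entropy([1, 1, 1, 1]))                                  # 0.0
print(information_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))       # 1.0 (perfect split)
```

A split that perfectly separates the classes yields the maximum possible gain, which is why the tree-building algorithm greedily picks the attribute with the largest Gain(S).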
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>For the practical part of this work, it was decided to investigate a dataset that contains data about patients and whether they have been diagnosed with cardiovascular disease. So that, the problem of classification will be solved.</p><p>This dataset consists of 70,000 patient records, which include (Figure <ref type="figure">1</ref>): 1. age; 2. height; 3. weight; 4. gender; 5. blood pressure; 6. cholesterol; 7. blood glucose; 8. whether the patient smokes; 9. whether the patient drinks alcohol; 10. whether the patient is physically active.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 1. Medical indicators</head><p>The first stage is to check the dataset for duplicate rows. Rejection of duplicates is necessary, because during training, the model will learn from the original data, and then re-study their duplicate. Therefore, the model will relearn the same sample of data. As a result, the model may be poorly generalized.</p><p>The next stage is to check the relationships between the target variable and other variables. Figure <ref type="figure">2</ref>. shows the number of patients who were diagnosed (yellow column) and were not diagnosed (green column) cardiovascular disease relative to their age (in years).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2: Dependence of cardiovascular disease on age</head><p>As a result, the ratio of patients to healthy patients increases with age. The next stage is to check whether there are linear relationships between variables. To do this, we construct a correlation matrix that will contain the correlation values for all pairs of variables. If there is a high correlation between the variable and the target variable, it will be possible to find a linear relationship between these variables. If non-target variables have high correlation values, it means that they contain very similar information, and therefore one of these variables can be neglected, and thus reduce the complexity of the model.</p><p>In the Figure <ref type="figure">3</ref>. the correlation matrix for the dataset is shown.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 3: Correlation matrix.</head><p>There is no direct linear relationship between variables. However, it can be noted that the highest values of correlation relative to the target variable (cardio) have indicators that correspond to blood pressure (ap_hi, ap_lo).</p><p>The next stage is to analyse the data in order to find possible emissions (not typical data, extreme). To do this, statistic data about the columns in the dataset are displayed. To be more precise -the mean value, standard deviation, minimum and maximum values and quantiles for each indicator.</p><p>From this data, it can be analysed that blood pressure indicators (both ap_hi and ap_lo) have extreme points because the maximum value is very different, which means that the mean and median values are different as well.</p><p>To see the distribution of data visually, the box charts for this data are built. (Figure <ref type="figure" target="#fig_0">4</ref>). As a result, these indicators have extreme data, which are indicated on the graph by points that go beyond the interquartile range.</p><p>The next step is to delete the lines that contain the extreme points. On the graph you can see the emissions, which are indicated by points beyond the quarterly interval. According to the research, emissions are non-standard indicators of a patient's health. To minimize the risks of incorrect prediction, it is necessary to clean the dataset from emissions.</p><p>The categorical data is encoded with one hot encoding. Such indicators in dataset 2: "cholesterol" and "gluc". Each of them has three unique values, so after unary coding, each of them will turn into three different indicators with boolean values.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Training Random Forest Model</head><p>Before training, the dataset need to be divided into a train set and a test set. A ratio of 80% -train, 20% -test.</p><p>The RandomForestClassifier class from the scikit-learn library is chosen to train the random forest model. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting.</p><p>Advantages of using the Random Forest Classifier <ref type="bibr" target="#b11">[14]</ref>:</p><p>1. The random forest algorithm is significantly more accurate than most of the non-linear classifiers. 2. Ability to efficiently process data with a large number of attributes and classes. 3. The random forest classifier doesn't face the overfitting issue because it takes the average of all predictions, canceling out the biases and thus, fixing the overfitting problem. 4. You can use this algorithm for both regression and classification problems, making it a highly versatile algorithm. 5. Both continuous and discrete features are treated equally well. There are methods of constructing trees according to data with omitted values of features. 6. This algorithm offers you relative feature importance that allows you to select the most contributing features for your classifier easily.</p><p>7. Ability to work in parallel in many threads. 8. Internal assessment of the model's ability to generalize (out-of-bag test). 9. Scalability.</p><p>Disadvantages of using Random Forest Classifier <ref type="bibr" target="#b11">[14]</ref>:</p><p>1. This algorithm is slower than other classification algorithms because it uses multiple decision trees to make predictions. When a random forest classifier makes a prediction, every tree in the forest has to make a prediction for the same input and vote on the same. This process can be very time-consuming. 2. 
The algorithm tends to overfit on some tasks, especially those with a lot of noise. 3. The received models are large: O(NK) memory is required to store the model, where K is the number of trees.</p><p>In order to improve the accuracy of the model, Grid Search is used to find the optimal hyperparameters. It iterates over combinations of the given hyperparameters and chooses the best one.</p><p>The hyperparameters used in the search are: 1. n_estimators, the number of trees used to build the random forest; 2. criterion, the splitting criterion of a tree; 3. max_depth, the maximum depth the tree can reach, after which construction stops. As a result of the search over hyperparameters, the model that showed the best predictions on the given metric is returned.</p></div>
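The split-train-tune pipeline described above can be sketched with scikit-learn. A synthetic dataset stands in for the 70,000 patient records, and the grid values are illustrative (the paper only names the three hyperparameters, not the full grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the patient dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 80% train / 20% test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Grid over the hyperparameters named in the text; the values are assumptions.
param_grid = {
    "n_estimators": [50, 100],
    "criterion": ["entropy"],       # the entropy criterion chosen in Section 3
    "max_depth": [5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```

`GridSearchCV` returns the refit best estimator directly, which is then evaluated once on the held-out 20% test set.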
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Work results</head><p>Since the problem of binary classification has been solving, the most objective metric is roc-auc (area under the error curve) <ref type="bibr" target="#b12">[15]</ref>. The ROC curve provides detailed information about the behavior of the classifier. The curve is the result of the True Positive Rate (TPR) and False Positive Rate (FPR) depending on the threshold.</p><p>As a result of Grid Search, the best combination of hyperparameters is: 1. the number of trees -1000 2. the maximum depth of the tree -10</p><p>To see the number of true and false predictions of the classifier, the error matrices for the train and test sets, accordingly are shown on the plot (Figure <ref type="figure">5</ref>, Figure <ref type="figure">6</ref>). As a result, the relative distribution of errors has been preserved from the train to test sets, and therefore the model is well generated.</p><p>The model makes fewer FN errors (when the correct prediction is 1 and the model gives 0). False Negative errors -the algorithm did not recognize the disease and recognized the sick person healthy. The cost of error is very important, especially in medicine.</p><p>Obviously, ideally, we aim for the classification algorithm is to give zero errors of the FP and FN classes, but in real life this is rare, but each model should minimize the number of errors.</p><p>In the Figure <ref type="figure">7</ref>. and Figure <ref type="figure">8</ref>. the error curves (ROC) and the calculated area under them (roc-auc) are shown for the training and testing set, accordingly. 
The ROC curve <ref type="bibr" target="#b13">[16]</ref> illustrates the sensitivity of the classifier, showing how many correctly classified objects can be obtained while allowing more and more FP cases.</p><p>This metric shows the dependence of the recall of predictions (the proportion of class 1 objects, out of all class 1 objects, that were correctly predicted) on the proportion of class 0 objects that were incorrectly predicted. The metric takes values in [0; 1]. The metric values are similar on the training and testing datasets, so the model generalizes well.</p><p>The study confirms the following advantages of Random Forest: 1. It reduces overfitting in decision trees and helps to increase accuracy. 2. It is flexible for both classification and regression problems. 3. It works well with both categorical and continuous values. 4. Data normalization is not required because a rule-based approach is used.</p></div>
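The two evaluation tools used above, the error (confusion) matrix and ROC-AUC, can be sketched on toy labels; the values below are invented for illustration, not the paper's data:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.2, 0.4, 0.8, 0.7, 0.3, 0.1]
y_pred = [int(p >= 0.5) for p in y_prob]   # threshold at 0.5

# Rows are actual class, columns predicted class: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)                                  # [[3 0] [1 2]]: one FN, no FP

# ROC-AUC uses the probabilities directly, integrating over all thresholds.
auc = roc_auc_score(y_true, y_prob)
print("ROC-AUC:", auc)
```

Reading the matrix the same way as Figures 5 and 6, the off-diagonal cells are the FP and FN counts whose cost the text discusses; the AUC summarizes performance across every threshold rather than just the 0.5 cut-off.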
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions</head><p>To summarize the research, it should be said the goal of the study has successfully been achieved. Namely, with the help of one of the most popular methods -the method of Random Forest to predict the presence of cardiovascular disease in people with different health conditions. The research was conducted on the basis of an open dataset cardio.csv, taken from the Internet.</p><p>Using the best classification model obtained through Grid Search from the scikit-learn library, the Random Forest training was committed. Then the trained model was used to classify patients. The accuracy of predictions is 80%. The same operations were performed for another ensemble algorithm -Gradient Boosting. When using this algorithm, the correct predictions were 73%.</p><p>The AUC metric for the test data showed a high prediction score of 0.79. So, the classifier worked quite well.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Box charts.</figDesc><graphic coords="7,154.05,72.00,287.16,268.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 5 : 6 :</head><label>56</label><figDesc>Figure 5: Error matrix for the train set Figure 6: Error matrix for the test set</figDesc><graphic coords="8,77.40,548.53,203.27,204.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 7 : 8 :</head><label>78</label><figDesc>Figure 7: ROC-AUC for training dataset Figure 8: ROC-AUC for testing dataset</figDesc><graphic coords="9,77.40,211.14,237.79,176.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="5,140.65,357.16,313.72,289.95" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="6,143.65,97.30,307.81,292.50" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The science of clinical practice: Disease diagnosis or patient prognosis? Evidence about &quot;what is likely to happen&quot; should shape clinical practice</title>
		<author>
			<persName><forename type="first">P</forename><surname>Croft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">G</forename><surname>Altman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Deeks</surname></persName>
		</author>
		<idno type="DOI">10.1186/s12916-014-0265-4</idno>
	</analytic>
	<monogr>
		<title level="j">BMC Medicine</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Heart disease prediction using data mining techniques</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rairikar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kulkarni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sabale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lamgunde</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2017 International Conference on Intelligent Computing and Control(I2C2)</title>
				<imprint>
			<date type="published" when="2017-06">2017. June</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Heart Disease Prediction Using Machine Learning Algorithms</title>
		<author>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kumar</surname></persName>
		</author>
		<idno type="DOI">10.1109/ice348803.2020.9122958</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Electrical and Electronics Engineering</title>
				<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Comparison of Prediction Models for Heart Failure Risk: A Clinical Perspective</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Tiwaskar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gosavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Dubey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jadhav</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Iyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA)</title>
				<meeting>the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA)<address><addrLine>Pune, India</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-08">August 2018</date>
			<biblScope unit="page" from="16" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Comparative analysis of clustering algorithms with heart disease datasets using data mining weka tool</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kodati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vivekanandam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ravi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Adv. Intell. Syst. Comput</title>
		<imprint>
			<biblScope unit="volume">900</biblScope>
			<biblScope unit="page" from="111" to="117" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Improving the accuracy for analyzing heart diseases prediction based on the ensemble method</title>
		<author>
			<persName><forename type="first">X.-Y</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">S</forename><surname>Hassan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Anwar</surname></persName>
		</author>
		<idno>Article ID 6663455</idno>
	</analytic>
	<monogr>
		<title level="j">Complexity</title>
		<imprint>
			<biblScope unit="page">10</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">What are decision trees?</title>
		<author>
			<persName><forename type="first">C</forename><surname>Kingsford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Salzberg</surname></persName>
		</author>
		<idno type="DOI">10.1038/nbt0908-1011</idno>
		<ptr target="https://doi.org/10.1038/nbt0908-1011" />
	</analytic>
	<monogr>
		<title level="j">Nat Biotechnol</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="page" from="1011" to="1013" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Meta-Tree Random Forest: Probabilistic Data-Generative Model and Bayes Optimal Prediction</title>
		<author>
			<persName><forename type="first">N</forename><surname>Dobashi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Saito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Nakahara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Matsushima</surname></persName>
		</author>
		<idno type="DOI">10.3390/e23060768</idno>
		<ptr target="https://doi.org/10.3390/e23060768" />
	</analytic>
	<monogr>
		<title level="j">Entropy</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page">768</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Applications of python to evaluate the performance of bagging methods</title>
		<author>
			<persName><forename type="first">Akhil</forename><surname>Kadiyala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ashok</forename><surname>Kumar</surname></persName>
		</author>
		<idno type="DOI">10.1002/ep.13018</idno>
		<ptr target="https://doi.org/10.1002/ep.13018" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Accelerated gradient boosting</title>
		<author>
			<persName><forename type="first">G</forename><surname>Biau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Cadre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rouvière</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10994-019-05787-1</idno>
		<ptr target="https://doi.org/10.1007/s10994-019-05787-1" />
	</analytic>
	<monogr>
		<title level="j">Mach Learn</title>
		<imprint>
			<biblScope unit="volume">108</biblScope>
			<biblScope unit="page" from="971" to="992" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A novel hybrid intelligent method based on C4.5 decision tree classifier and one-against-all approach for multi-class classification problems</title>
		<author>
			<persName><forename type="first">K</forename><surname>Polat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gunes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="1587" to="1592" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">Pavan</forename><surname>Vadapalli</surname></persName>
		</author>
		<ptr target="https://www.upgrad.com/blog/random-forest-classifier/" />
		<title level="m">Random Forest Classifier: Overview, How Does it Work, Pros &amp; Cons</title>
				<imprint>
			<date type="published" when="2021-06-18">June 18, 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Rough Sets</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Pawlak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer and Information Science</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="341" to="356" />
			<date type="published" when="1982">1982</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Understanding AUC - ROC Curve</title>
		<author>
			<persName><forename type="first">Sarang</forename><surname>Narkhede</surname></persName>
		</author>
		<ptr target="https://medium.com/@narkhedesarang" />
		<imprint>
			<date type="published" when="2018-06">Jun 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Machine learning classification and structure-functional analysis of cancer mutations reveal unique dynamic and network signatures of driver sites in oncogenes and tumor suppressor genes</title>
		<author>
			<persName><forename type="first">S</forename><surname>Agajanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Odeyemi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bischoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ratra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">M</forename><surname>Verkhivker</surname></persName>
		</author>
		<idno type="DOI">10.1021/acs.jcim.8b00414</idno>
	</analytic>
	<monogr>
		<title level="j">J. Chem. Inf. Model</title>
		<imprint>
			<biblScope unit="volume">58</biblScope>
			<biblScope unit="page" from="2131" to="2150" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Distributed decision tree v.2.0</title>
		<author>
			<persName><forename type="first">S</forename><surname>Desai</surname></persName>
		</author>
		<author>
			<persName><surname>Chaudhary</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2017 IEEE International Conference on Big Data (Big Data)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="929" to="934" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Unbiased split selection for classification trees based on the gini index</title>
		<author>
			<persName><forename type="first">C</forename><surname>Strobl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Boulesteix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Augustin</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.csda.2006.12.030</idno>
	</analytic>
	<monogr>
		<title level="j">Comput. Stat. Data Anal</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page" from="483" to="501" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">IntelliHealth: A medical decision support application using a novel weighted multilayer classifier ensemble framework</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bashir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Qamar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Khan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Biomedical Informatics</title>
		<imprint>
			<biblScope unit="volume">59</biblScope>
			<biblScope unit="page" from="185" to="200" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Receiver operating characteristic (roc) curves: review of methods with applications in diagnostic medicine</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Obuchowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Bullen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Phys Med Biol</title>
		<imprint>
			<biblScope unit="volume">63</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="7" to="8" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Effective Heart Disease Prediction using Hybrid Machine Learning Techniques</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Thirumalai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Srivastava</surname></persName>
		</author>
		<idno type="DOI">10.1109/access.2019.2923707</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="81542" to="81554" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Diabetes disease prediction using data mining</title>
		<author>
			<persName><forename type="first">Deeraj</forename><surname>Shetty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kishor</forename><surname>Rit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sohail</forename><surname>Shaikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikita</forename><surname>Patil</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">A review on prediction and diagnosis of heart failure</title>
		<author>
			<persName><forename type="first">B</forename><surname>Gnaneswar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jebarani</forename><surname>Ebenezar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="3" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Heart Disease Prediction using Machine Learning Techniques</title>
		<author>
			<persName><forename type="first">D</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Bharti</surname></persName>
		</author>
		<idno type="DOI">10.1007/s42979-020-00365-y</idno>
		<ptr target="https://doi.org/10.1007/s42979-020-00365-y" />
	</analytic>
	<monogr>
		<title level="j">SN Computer Science</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">6</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Heart disease prediction using machine learning techniques: a survey</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">V</forename><surname>Ramalingam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dandapath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Raja</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int J Eng Technol</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="684" to="687" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">A comprehensive investigation and comparison of machine learning techniques in the domain of heart disease</title>
		<author>
			<persName><forename type="first">S</forename><surname>Pouriyeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vahid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sannino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>De Pietro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Arabnia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gutierrez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2017 IEEE Symposium on Computers and Communications (ISCC)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="204" to="207" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Data mining approach to detect heart diseases</title>
		<author>
			<persName><forename type="first">V</forename><surname>Chaurasia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int J Adv Comput Sci Inf Technol (IJACSIT)</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="56" to="66" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
