<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Conference on Digital Technologies in Education, Science and
Industry, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Comprehensive Study on Machine Learning Applications for Heart Disease Risk Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lashyn Adiat</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aizhan Altaibek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marat Nurtas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aigerim Altayeva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Al-Farabi Kazakh National University</institution>
          ,
          <addr-line>al-Farabi Avenue 71, Almaty, 050040</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Ionosphere</institution>
          ,
          <addr-line>Gardening community IONOSPHERE 117, Almaty, 050020</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>International Information Technology University</institution>
          ,
          <addr-line>Manas St. 34/1, Almaty, 050000</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>0</volume>
      <fpage>6</fpage>
      <lpage>07</lpage>
      <abstract>
        <p>This research presents an analytical study of machine learning applications for cardiovascular disease (CVD) risk prediction on a publicly accessible dataset. The dataset contains key information about patients' individual characteristics, such as age, blood pressure, ECG at rest, heart rate, and four types of chest pain. The purpose of this research is to choose the most suitable model for heart attack analysis. Descriptive analytics and exploratory data analysis of these factors were carried out, and risk was predicted with machine learning algorithms including k-nearest neighbours, logistic regression, support vector machines (SVM) and random forests. The research involves thorough data analysis and rigorous model training processes.</p>
      </abstract>
      <kwd-group>
        <kwd>Heart disease prediction</kwd>
        <kwd>comprehensive study</kwd>
        <kwd>machine learning</kwd>
        <kwd>cardiovascular risks</kwd>
        <kwd>exploratory data analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The World Health Organization (WHO) classifies and reports on various causes of death
worldwide through its International Classification of Diseases (ICD) system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        According to the WHO, cardiovascular diseases (CVDs) are the leading cause of death globally,
taking an estimated 17.9 million lives each year. More than four out of five CVD deaths are due to
heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70
years of age [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Cardiovascular diseases (diseases of the heart or blood vessels) have become a significant
public health concern in economically advanced countries. This is primarily due to the difficulties
in making an early diagnosis and patients' unwillingness to seek medical assistance when the first
symptoms occur. A fast-paced lifestyle, an unhealthy diet, a lack of physical activity, alcohol and
tobacco addictions, and insufficient sleep all contribute to the harmful influence on the
cardiovascular system [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        According to epidemiological data, cardiovascular disease was the leading cause of death in
China in 2018. China has 330 million patients with cardiovascular disease, including
11 million with stroke and more than 270 million with heart-related diseases [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In the United States, heart disease causes more than 600,000 deaths annually, accounting for
approximately one in every four deaths [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Predicting and diagnosing heart disease is a major challenge in medicine, and diagnosis
is based on factors such as physical examination, symptoms, and signs of the patient [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Body
cholesterol levels, smoking habits, obesity, family history of diseases, blood pressure, and
working environment are all factors that influence heart disease [10].
      </p>
      <p>Early detection and prevention of heart attacks can improve patient outcomes and reduce the
strain on healthcare systems.</p>
      <p>
        Traditional risk assessment methods, such as the Framingham Risk Score [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], focus on clinical
and demographic data that may not fully capture the complexity of the underlying causes of
cardiovascular risk.
      </p>
      <p>
        Machine learning has transformed disease detection by enabling the creation of predictive
models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that analyze massive datasets to uncover subtle trends and anomalies, thereby
assisting in early diagnosis and intervention. Machine learning techniques have been extensively
used in recent years to forecast the likelihood of heart attacks based on such risk factors [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Although traditional risk assessment models are effective, their scope and predictive power
are frequently constrained. In contrast, machine learning offers a data-driven method that can
more accurately forecast cardiac disease and find complex correlations between numerous risk
factors.</p>
      <p>As healthcare organizations strive to acquire patient records, it is estimated that one trillion
bytes of data are generated every day. This information is of the utmost importance and must be
properly extracted to yield valuable insights [13]. Patients may not always accurately describe
their medical conditions, and laboratory test results can be subject to errors. Healthcare
specialists may struggle to make informed decisions about a patient's illness because no
specialist has expertise in every area [12]. To address this challenge, a comprehensive disease
prediction system that integrates medical knowledge is necessary
to produce the most effective results and benefit society [14]. Previous investigations have
attempted to use patient laboratory tests [15-17] and medication [18] to predict disease onset.
Some prototypes have also been used to identify unknown risk factors while simultaneously
improving the sensitivity and specificity of detection. Recent studies have demonstrated success
in predicting diseases through several methods, including support vector machines [19-21],
logistic regression [22], random forests [23], neural networks [17], and time series modelling
techniques [24].</p>
      <p>Machine learning models can be flexible and tailored to fit the range of data sources that are
becoming increasingly accessible in healthcare. The advancement in technology has greatly
enhanced the capacity to forecast a wide range of diseases, such as cancer, cardiovascular
conditions, and infectious outbreaks. This development has a two-fold impact: it empowers
healthcare providers to identify high-risk individuals for early intervention, which could
potentially save lives, and it supports public health agencies in proactive surveillance and
resource allocation, helping to curb the spread of diseases on a larger scale. As machine learning
continues to advance, its contribution to disease prediction is expected to improve healthcare
outcomes and minimize the avoidable economic and human costs associated with preventable
illnesses [9].</p>
      <p>The goal of this research is to build and test a machine-learning model for forecasting the risk
of heart disease. This will be accomplished using a dataset containing patient information and
clinical measurements. A comparative analysis evaluates the efficacy of several machine learning
algorithms used for heart attack prediction, including logistic regression, more advanced models,
and feature selection strategies. Selecting the most appropriate algorithm is critical,
as it has a direct impact on the model's ability to process the data, recognize complex
relationships, and produce trustworthy predictions.</p>
      <p>Considering the study's findings, it is critical to provide relevant insights and evidence-based
recommendations to healthcare providers and policymakers. The scope of the investigation is
limited to the assessment of a single dataset containing patient information and clinical
measurements. The model developed in this study was built primarily for research purposes and
should not be used in place of clinical diagnosis or therapy.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>The exploratory data analysis was performed using publicly accessible data on heart disease. The
dataset comprised 303 records with 14 attributes, including age, blood pressure, blood glucose
level, ECG at rest, heart rate, and four types of chest pain.</p>
      <p>The classes are nearly balanced: 54% of the records belong to people experiencing heart attacks,
so no further balancing is needed.</p>
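      <p>The balance check described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the study: the label list is made up, and the function names and the 10-percentage-point tolerance are assumptions introduced here.</p>
      <preformat>
```python
from collections import Counter

# Hypothetical labels: 1 = heart attack class, 0 = no heart attack.
# These 20 values are invented for illustration; they are not the study data.
labels = [1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0]


def positive_share(y):
    """Return the fraction of records that carry the positive label (1)."""
    counts = Counter(y)
    return counts[1] / len(y)


def is_nearly_balanced(y, tolerance=0.10):
    """Treat the classes as balanced when the positive share lies within
    `tolerance` of an even 50/50 split (the tolerance is an assumption)."""
    return tolerance >= abs(positive_share(y) - 0.5)


share = positive_share(labels)
print(f"positive share: {share:.2f}")
print("nearly balanced:", is_nearly_balanced(labels))
```
      </preformat>
      <p>With a positive share close to one half, accuracy is a reasonable first metric; a strongly skewed share would call for resampling or class weights instead.</p>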
      <p>The majority of individuals are male, fall within the age range of 50-60 years, report relatively
low chest pain, and have blood pressure of 120-140 mm Hg, cholesterol levels of 200-300 mg/dl, and
blood sugar levels below 120 mg/dl. Most of these individuals had a heart rate within the range of
150-175 beats per minute.</p>
      <p>Individuals aged 40-60 are more likely to have heart disease, and those with a higher
resting heart rate are at a higher risk of experiencing a heart attack.</p>
      <p>People with cholesterol levels of 120-250 and blood pressure between 110 and 140 are more likely to have a
heart attack.</p>
      <p>Men are more likely to experience heart attacks than women: 73% of men and 45% of women
in the dataset suffered heart attacks.</p>
      <p>If someone experiences chest pain, it is highly probable that they will suffer from a heart
attack.</p>
      <p>The impact of blood sugar level on the likelihood of a heart attack is relatively small. In other
words, whether a person has high blood sugar levels does not necessarily determine whether
they will have a heart attack.</p>
      <p>People who do not regularly exercise their cardiovascular system are highly likely to suffer
a heart attack.</p>
      <p>The higher the chest pain and the higher the person's heart rate, the more likely they are to
suffer a heart attack.</p>
      <p>Individuals with a low level of exercise-induced angina are more likely to experience heart
disease, whereas age does not significantly contribute to the risk of heart attack.</p>
      <p>The graph and table below show a positive correlation between heart attack and
chest pain, heart rate, and slope, and a negative correlation between heart attack
and age, induced angina, and major vessels.</p>
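      <p>As a rough sketch of how such feature-to-target correlations can be computed, the following example calculates the Pearson coefficient between each feature and a binary outcome. The records, the feature names, and the outcome vector are hypothetical toy values chosen for illustration; they are not taken from the study dataset, and the paper does not state which correlation measure was used.</p>
      <preformat>
```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy records (assumed names and values, not the study data).
records = {
    "chest_pain": [0, 2, 1, 3, 0, 2, 0, 1],
    "heart_rate": [120, 170, 150, 175, 130, 160, 125, 155],
    "age": [63, 41, 50, 45, 67, 48, 60, 52],
}
target = [0, 1, 1, 1, 0, 1, 0, 1]  # 1 = heart attack class

correlations = {name: pearson(values, target) for name, values in records.items()}

# List features from the most positive to the most negative correlation.
for name, r in sorted(correlations.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {r:+.2f}")
```
      </preformat>
      <p>A positive coefficient means the feature tends to be higher in the heart attack class, and a negative one means it tends to be lower; the coefficient says nothing about causation.</p>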
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>Among the aforementioned machine learning algorithms, logistic regression,
k-nearest neighbours (KNN) [11], support vector classifiers, and random forest classifiers are
recommended because they demonstrate accuracy comparable to that of traditional methods. Our
preliminary examination suggests that the search for optimal coefficients can be handled
through iterative coefficient selection, specifically with logistic regression. However, the accuracy of
the models can only be ascertained after a thorough evaluation.</p>
      <p>
        Logistic regression is a statistical method for binary classification. It assumes that
the data follow a Bernoulli
distribution and solves for the optimal parameters through maximum likelihood estimation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
The first logistic regression model achieves an accuracy of around 89 per cent, which is quite
good for this model.
      </p>
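      <p>To make the maximum likelihood idea concrete, the sketch below fits a one-feature logistic model by gradient ascent on the Bernoulli log-likelihood. It is an illustration under stated assumptions rather than the model trained in this study: the data points, the learning rate, the number of epochs, and all function names are hypothetical.</p>
      <preformat>
```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(xs, ys, w, b):
    """Bernoulli log-likelihood of labels ys under p = sigmoid(w * x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Gradient ascent on the mean log-likelihood (assumed hyperparameters)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(w * x + b)  # gradient of the log-likelihood w.r.t. the score
            grad_w += error * x
            grad_b += error
        w += learning_rate * grad_w / n
        b += learning_rate * grad_b / n
    return w, b


# Hypothetical standardized feature values and labels (not the study data).
# Two points near zero carry the "wrong" label, so the classes overlap and
# the maximum likelihood estimate stays finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, -0.25, 0.25]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

w, b = fit_logistic(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
print(f"log-likelihood at start: {log_likelihood(xs, ys, 0.0, 0.0):.3f}")
print(f"log-likelihood after fit: {log_likelihood(xs, ys, w, b):.3f}")
```
      </preformat>
      <p>The fitted parameters raise the log-likelihood relative to the zero initialization, which is the sense in which the parameters are "optimal" under the Bernoulli assumption.</p>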
      <p>KNN achieves an accuracy of around 70%, so finding the optimal K requires additional
calculations, with K ranging up to 100 neighbours. The support vector classifier (SVC) approaches
76%, but it is still not the best model. Finally, random forest reaches around 82%, which is clearly
better than KNN and SVC but still falls short of the best model.</p>
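      <p>The search for a suitable K can be sketched as a scan over candidate values, scoring each by leave-one-out accuracy. The example below is a minimal pure-Python illustration: the two-feature points, the function names, and the choice of leave-one-out scoring are assumptions made here, and the paper does not specify how K was evaluated.</p>
      <preformat>
```python
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(xs, ys, k):
    """Classify each point from all the others and return the hit rate."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            hits += 1
    return hits / len(xs)


def best_k(xs, ys, max_k):
    """Scan odd K values up to max_k; odd K avoids tied votes with two classes."""
    scores = {k: leave_one_out_accuracy(xs, ys, k) for k in range(1, max_k + 1, 2)}
    best = max(scores, key=scores.get)  # the smallest K wins when scores tie
    return best, scores


# Hypothetical two-feature points forming two separated groups (not the study data).
xs = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (0.4, 0.1),
      (1.0, 1.1), (1.2, 0.9), (0.9, 1.3), (1.1, 1.0), (1.3, 1.2), (0.8, 0.9)]
ys = [0] * 6 + [1] * 6

best, scores = best_k(xs, ys, max_k=11)
print("best K:", best)
for k, accuracy in scores.items():
    print(f"K={k}: accuracy={accuracy:.2f}")
```
      </preformat>
      <p>On real data the upper bound would be set by the analyst, for example 100 as mentioned above; very large K values tend to blur the classes because distant points start to dominate the vote.</p>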
      <p>We can conclude that the logistic regression model performs well in comparison with the
other machine learning algorithms. This suggests that a neural network may demonstrate
an even better result, since logistic regression is likewise based on gradient descent in finding the
necessary parameters.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Acknowledgements</title>
      <p>We would like to express our sincere gratitude to Rashik Rahman, the data provider on Kaggle,
for providing valuable data resources that were essential for the successful completion of this
study. We extend our appreciation to the Kaggle community for their efforts to make diverse
datasets accessible and to promote collaboration among data enthusiasts and researchers.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>During our study, we conducted an exploratory data analysis and comparative analysis of
machine learning algorithms' accuracies. After conducting a thorough analysis and applying
several widely used machine learning algorithms for forecasting heart disease, we discovered
that logistic regression demonstrated outstanding performance. In fact, it was the algorithm that
attained the highest level of accuracy, allowing us to confidently classify patients with heart
disease. The application of machine learning algorithms in predicting heart disease is an ongoing
research area with significant promise. Integrating advanced machine learning methods in this
domain is likely to significantly alleviate the burden on health care and improve the prognosis of
diseases, leading to improved overall patient health.</p>
      <p>Therefore, in future research, we aim to develop personalized forecasting methods that
consider additional risk factors and model adaptations to changing conditions and patient needs.
This approach will enhance the accuracy and effectiveness of heart disease forecasts, leading to
earlier diagnosis and treatment, and ultimately improving patient health and quality of life.</p>
    </sec>
    <sec id="sec-6">
      <title>6. References</title>
      <p>[9] Alghuraibawi, A. H. B., Manickam, S., Abdullah, R. (2023). Feature Selection for Detecting ICMPv6-Based DDoS Attacks Using Binary Flower Pollination Algorithm. Computer Systems Science &amp; Engineering, 47(1). DOI: 10.32604/csse.2023.037948.</p>
      <p>[10] X. Su. (2020). Prediction for cardiovascular diseases based on laboratory data: An analysis of random forest model. J. Clin. Lab. Anal., 34(9): 1–10. DOI: 10.1002/jcla.23421.</p>
      <p>[11] Walid Sherif. (2018). Optimization of K-NN algorithm by clustering and reliability coefficients: application to breast-cancer diagnosis. Procedia Computer Science, 127: 293–299. DOI: 10.1016/j.procs.2018.01.125.</p>
      <p>[12] G. Saranya, A. Pravin. (2020). A comprehensive study on disease risk predictions in machine learning. International Journal of Electrical and Computer Engineering (IJECE), 10(4): 4217–4225. DOI: 10.11591/ijece.v10i4.pp4217-4225.</p>
      <p>[13] R. Snyderman. (2012). Personalized health care: from theory to practice. Biotechnology Journal, 7(8): 973–979. DOI: 10.1002/biot.201100297.</p>
      <p>[14] M. Jiang, Y. Chen, M. Liu, S. T. Rosenbloom, S. Mani, J. C. Denny, and H. Xu. (2011). A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. J. Am Med Inform Assoc, 18(5): 601–606. DOI: 10.1136/amiajnl-2011-000163.</p>
      <p>[15] N. Razavian and D. Sontag. (2015). Temporal convolutional neural networks for diagnosis from lab tests. arXiv:1511.07938v4.</p>
      <p>[16] R. Ranganath, J. S. Hirsch, D. Blei, and N. Elhadad. (2015). Risk prediction for chronic kidney disease progression using heterogeneous electronic health record data and time series analysis. Journal of the American Medical Informatics Association: JAMIA, 22(4): 872–880. DOI: 10.1093/jamia/ocv024.</p>
      <p>[17] N. Tangri, L. A. Stevens, J. Griffith, H. Tighiouart, O. Djurdjev, D. Naimark, A. Levin, and A. S. Levey. (2011). A predictive model for progression of chronic kidney disease to kidney failure. JAMA, 305(15): 1553–1559. DOI: 10.1001/jama.2011.451.</p>
      <p>[18] E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, and J. Sun. (2016). Doctor AI: Predicting clinical events via recurrent neural networks. Proceedings of the 1st Machine Learning for Healthcare Conference, ser. Proceedings of Machine Learning Research, pp. 301–318.</p>
      <p>[19] N. Barakat, A. P. Bradley, and M. N. H. Barakat. (2010). Intelligible Support Vector Machines for Diagnosis of Diabetes Mellitus. IEEE Transactions on Information Technology in Biomedicine, 14(4): 1114–1120. DOI: 10.1109/TITB.2009.2039485.</p>
      <p>[20] Wu Jionglin M. S., et al. (2010). Prediction Modeling Using EHR Data: Challenges, and a Comparison of Machine Learning Approaches. Journal of the Medical Care section, American Public Health Association, 48(6): S106–S113.</p>
      <p>[21] W. Yu, T. Liu, R. Valdez, M. Gwinn, and M. J. Khoury. (2010). Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes. BMC Medical Informatics and Decision Making, 10(16): 1–7. DOI: 10.1186/1472-6947-10-16.</p>
      <p>[22] N. Razavian, S. Blecker, A. M. Schmidt, A. Smith-McLallen, S. Nigam, and D. Sontag. (2015). Population-Level Prediction of Type 2 Diabetes from Claims Data and Analysis of Risk Factors. Big Data, 3(4): 277–287. DOI: 10.1089/big.2015.0020.</p>
      <p>[23] A. V. Lebedev, E. Westman, G. J. P. Van Westen, M. G. Kramberger, A. Lundervold, D. Aarsland, H. Soininen, I. Kłoszewska, P. Mecocci, M. Tsolaki, B. Vellas, S. Lovestone, and A. Simmons. (2014). Random Forest ensembles for detection and prediction of Alzheimer’s disease with a good between-cohort robustness. NeuroImage: Clinical, 6: 115–125. DOI: 10.1016/j.nicl.2014.08.023.</p>
      <p>[24] A. Perotte, R. Ranganath, J. S. Hirsch, D. Blei, and N. Elhadad. (2015). Risk prediction for chronic kidney disease progression using heterogeneous electronic health record data and time series analysis. Journal of the American Medical Informatics Association: JAMIA, 22(4): 872–880. DOI: 10.1093/jamia/ocv024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>A. U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saeed</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saeed</surname>
            ,
            <given-names>M. H.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>A framework for susceptibility analysis of brain tumours based on uncertain analytical cum algorithmic modeling</article-title>
          .
          <source>Bioengineering</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ),
          <fpage>147</fpage>
          . DOI: 10.3390/bioengineering10020147.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <collab>World Health Organization</collab>
          ,
          <article-title>Cardiovascular diseases (CVDs)</article-title>
          , URL: https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Marat</given-names>
            <surname>Nurtas</surname>
          </string-name>
          , Baishemirov Zharasbek, Zhanabekov Zhandos, (
          <year>2020</year>
          ),
          <article-title>Applying Neural Network for predicting cardiovascular disease risk</article-title>
          .
          <source>News of the National Academy of sciences of the Republic of Kazakhstan</source>
          ,
          <volume>4</volume>
          (
          <issue>332</issue>
          ):
          <fpage>28</fpage>
          -
          <lpage>34</lpage>
          . https://doi.org/10.32014/2020.2518-1726.62.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Siyi</given-names>
            <surname>Wang</surname>
          </string-name>
          . (
          <year>2023</year>
          ).
          <article-title>Research on the heart attack prediction based on logistic regression</article-title>
          .
          <source>Highlights in Science, Engineering and Technology</source>
          , Volume
          <volume>65</volume>
          . DOI: 10.1016/j.tele.2018.11.007.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          D. for H. D. and S. P., National Center for Chronic Disease Prevention and Health Promotion,
          <article-title>Heart Disease Facts</article-title>
          ,
          <year>2021</year>
          . URL: https://www.cdc.gov/heartdisease/facts.htm.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Manikantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Latha</surname>
          </string-name>
          . (
          <year>2013</year>
          ).
          <article-title>Predicting the Analysis of Heart Disease Symptoms Using Medicinal Data Mining Methods</article-title>
          ,
          <source>International Journal on Advanced Computer Theory and Engineering</source>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tommy</given-names>
            <surname>Pocana</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Fuchs</surname>
          </string-name>
          .
          (
          <year>2012</year>
          ).
          <article-title>The Cardiovascular Link to Nonalcoholic Fatty Liver Disease: A Critical Analysis</article-title>
          .
          <source>Clinics in Liver Disease</source>
          ,
          <volume>16</volume>
          (
          <issue>3</issue>
          ):
          <fpage>599</fpage>
          -
          <lpage>613</lpage>
          . DOI: 10.1016/j.cld.2012.05.008.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>B. Marqas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mousa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Özyurt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Salih</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>A Machine Learning Model for the Prediction of Heart Attack Risk in High-Risk Patients Utilizing Real-world Data</article-title>
          .
          <source>Academic Journal of Nawroz University</source>
          ,
          <volume>12</volume>
          (
          <issue>4</issue>
          ),
          <fpage>286</fpage>
          -
          <lpage>301</lpage>
          . DOI: 10.25007/ajnu.v12n4a1974.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>