A Machine Learning Model for the Atherosclerosis Prediction
Based on Clinical Data
Kateryna Kolesnikova, Dariya Mochalova and Vladyslav Lavrynovych
Taras Shevchenko National University of Kyiv, Volodymyrska str., 60, Kyiv, 01033, Ukraine

                Abstract
                The paper outlines the relevance of diagnostics of atherosclerotic disease from the standpoint
                of the integrity of the research. It determines that the ahterosclerotic disease is the fist step to
                more sirious cardiovascular diseases, so it is verry importatnt to diagnose it in early stages. It
                proposes the solution - the development of technology to diagnose atherosclerotic disease. The
                definition of such technology has been given. It has been established that in terms of
                technology it is important to develop an effective and optimized model for the prediction of
                atherosclerosis from the standpoint of all stages of the research. It has been discovered that
                there are more than one efficient algorithm that can be used for such purpose. The outcome
                technology of atherosclerotic disease has been compiled and researched based on a dataset of
                1000 patient. The solution is implemented using machine learning methods, using Python
                programming language as a base for the software product. The research resulted in a
                technology based on models with an accuracy between 98.75% and 100%. The prospects of
                further research - the implementation of the diagnistic system itself, which can be integrated
                with overviewed techniques along with computer vision and other technologies that may
                improve diagnosis and treatment of atherosclerosis. The paper identifies the challenges and
                perspectives of the research.

                Keywords 1
                data science, machine learning, deep learning, binary classification, atherosclerosis, heart
                disease

1. Introduction
    Nowadays, cardiovascular diseases (CVDs) are one of the leading causes of death all over the world.
According to the World Health Organization’s data [1, 2], 32% of all global deaths are caused by CVDs,
which is around 17.9 million lives every year. Atherosclerosis, which is the subject of this study, tends
to be one of the main underlying causes of CVDs, also playing a key role in heart stroke and peripheral
artery disease (PAD). Atherosclerosis is very common, and usually followed by a set of risk factors like
high cholesterol, obesity, inactivity, diabetes, etc. Atherosclerosis is a complex process, usually slow
and progressing in the long-term perspective. Even today it’s not completely clear what exactly causes
this process and why. Atherosclerosis is often characterized by narrowing and hardening arteries, and
the symptoms depend on what artery is narrowed or blocked. Atherosclerosis starts with damage to the
endothelium of blood vessels, frequently caused by high cholesterol, blood pressure, inflammation and
smoking. Entering the damaged area of the artery, cholesterol and other cell parts become plaque in the
artery wall. As long as atherosclerosis progresses, plaque gets bigger and may create a blockage when
it’s big enough, causing severe consequences [3].
    There were several studies for different computer-aided approaches to this issue in recent years, but
for all that, the problem remains highly challenging today. Data mining and machine learning is a state-
of-art technology, which allows us to discover connections between attributes of large scaled data and
train models to make predictions more accurately. Machine learning has already found its application

1
 Information Technology and Implementation (IT&I-2021), December 01–03, 2021, Kyiv, Ukraine
amberk4@gmail.com (Kateryna Kolesnikova); daria.mochalova.02@gmail.com (Dariya Mochalova); vlad.lavrynovych@icloud.com
(Vladyslav Lavrynovych)
          ©️ 2022 Copyright for this paper by its authors.
          Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
          CEUR Workshop Proceedings (CEUR-WS.org)


                                                                                                                 134
in different medicine realms for disease prediction [4 – 6] and proved to be quite efficient, depending
on tasks, algorithms and data. It is worth mentioning that a lot of studies have been performed doing
classification for prediction of diagnosis of heart disease using various models and methods.
Nonetheless, most of them were limited with data and were relying on a 13-feature Cleveland dataset
[7], that includes only 303 records. Such an approach may not only narrow the research field, but also
result in a row of inaccurate conclusions [8], caused by the lack of data needed for model training. For
our model we will use a real dataset, provided by Amosov National Institute of Cardiovascular Surgery
which will help us to talk not only about theoretical results but find different approaches for real cases.
Unlike the other studies, we will not limit to prediction of cardiovascular disease, but will try to predict
atherosclerosis, which is a precondition of CAD and is not as well researched due to the reasons
described above. Therefore, the study carries both scientific and practical interest for the audience and
may bring some light to crucial medicine problems.

2. Related studies and algorithms overview
    Currently, a lot of available atherosclerosis prediction studies using machine learning approach rely
on Cleveland heart disease dataset [7] and thus are not completely representative, as this dataset contains
data only about coronary heart disease which is a bit different from the subject of our study. However,
such studies claim to be describing prediction of atherosclerosis [9], which is actually a wrong
statement. Indeed, a lot of studies of coronary heart disease prediction using machine and deep learning
techniques were performed, but prediction of atherosclerosis is relatively poorly researched due to the
lack of data. However, there are several interesting studies [10, 11] in this field and most of the
techniques applied for CAD prediction may also be applied here.

   2.1. Support vector machine
    SVM is a set of supervised machine learning algorithms used for classification and regression
analysis. This method relies on a hyperplane or a set of hyperplanes in multidimensional space which
separate the data into classes. Algorithm finds points closest to the hyperplane from both classes as
illustrated on Figure 1.


Figure 1: SVM algorithm visualization
    These points are called support vectors and computing maximized distance between support vectors
and hyperplane we find the optimal hyperplane (the more distance between classes, the better result).
This algorithm was already used for the prediction of CAD [12] and reached up to 96.67% accuracy,
thus in this paper we will also try to apply it to predict atherosclerosis.

   2.2. Naive Bayes
   Naive Bayes classifier is a supervised machine learning algorithm based on Bayes theorem, which
“naively” assumes that all the features independently contribute to the probability of belonging to some
class. Formulas (1, 2) express the calculation of posterior probability of class.

                                                                                                        135
                                                           𝑃(𝑥|𝑐)𝑃(𝑐)
                                            𝑃(𝑐|𝑥) =                                                       (1)
                                                               𝑃(𝑥)
                 𝑃(𝑐|𝑋) = 𝑃(𝑥1 |𝑐) × 𝑃(𝑥2 |𝑐) × . . .× 𝑃(𝑥𝑛 |𝑐) × 𝑃(𝑐)                            (2)
   Where P(c|x) - posterior probability of class, c - target, x - attributes, P(c) - prior probability of class,
P(x|c) - probability of predictor given class, P(x) - prior probability of predictor [13]. Naive Bayes
algorithm is known for well performance on large scaled data and sometimes outperforms even more
sophisticated methods. Along with its simplicity, the algorithm found its way into prediction of CAD
and atherosclerosis [14] reaching 98.60% of accuracy, and we will also test this method against our
dataset for comparison purposes.

   2.3. Decision tree
    The next algorithm we will consider in this study is a decision tree. This algorithm is widely used
for classification and regression purposes and represents a tree-structured classifier. Leaves represent
target classes, each node - a test case for a particular attribute of data and edges are the result of a test
case. Decision trees form nested if-else statements and the deeper the tree - the fitter the model. A brief
illustration of the algorithm is shown on Figure 2.


   Figure 2: Schematic visualization of decision tree algorithm
        Decision tree was successfully applied for atherosclerosis prediction with obtained accuracy
82.6% [15]. In this study we will analyze the accuracy of the algorithm applied to our dataset and
evaluate the reasonableness of its use in this area.

  2.4. Random Forest
   Random forest represents ensemble learning, which means combining many classifiers to obtain a
solution (classification trees). This is achieved by averaging the prediction of each classifier. Random
Forest technique allows the model to learn complex relations and increase accuracy for predictions.
Due to flexibility of the algorithm, it produces good results even without hyper-parameter tuning.
Random forest was not yet widely applied to atherosclerosis prediction, but has shown relatively good
results in CAD prediction (87.64%) [16].

   2.5. XGBoost
  XGBoost is a relatively new machine learning algorithm, which stands for eXtreme Gradient
Boosting. This algorithm, as well as the random forest, is based on decision-tree ensemble learning, but

                                                                                                           136
the difference is that it uses a gradient boosting framework and is highly optimized and uses less
resources than plain gradient boosting. The method was introduced in 2016 and showed good
performance on different tasks and also recently was applied to atherosclerosis prediction based on
electronic health records [17] showing accuracy 74%, and CAD prediction with accuracy 91.8% [18].

   2.6. Deep neural network
    Deep learning is a machine learning technique that uses multiple layers to extract high-level
relationships and features from data. Nowadays deep learning is recognized as a state-of-art technology,
which is flexible and provides good results allowing to optimize the accuracy during the train process.
However, deep learning requires large and well prepared datasets as well as computational resources
for the training process. Neural network takes inputs and modifies the neuron weights in accordance
with the error rate calculated between actual and predicted value. Hidden layers allow the network to
learn nonlinear relations in the dataset. The output layer represents only one output for binary
classification (which is our case), where 0 is absence of atherosclerosis and 1 is its presence. The scheme
for such a neural network is represented on Figure 3.


   Figure 3: Visualization of binary classification ANN architecture
    Different neural network architectures were applied to cardiovascular disease prediction, in
particular, study of cardiovascular disease prediction using deep learning techniques [19] introduced
ANN with prediction accuracy 85%. In this study we will create and apply our own ANN architecture
and try to improve the precision score.

3. Methodology
  3.1. Data description
   The dataset used for this study was provided by Amosov National Institute of Cardiovascular
Surgery. The dataset contains 14 columns and 1000 records of patients. Most of the latest research of
heart diseases refer to the UCI dataset which dates back to 1988. Having such a new and accurate dataset
provides a unique opportunity for atherosclerosis prediction based on already existing methods and
applying new, which opens new doors to the application of machine learning algorithms in medicine.
The sample of this dataset is shown on Figure 4.
   The dataset does not contain empty cells, all attributes are filled and have normal distribution. The
dataset includes next attributes:
        ● Progress - display presence (1) or absence (0) of atherosclerosis.
        ● OP - surgical intrusion, 1 - present, 0 - absent;
        ● Shunt - cardiac shunt, 1 - present, 0 - absent
        ● age - age, years;
        ● height - height in cm;
        ● weight - weight in kilos;
        ● IMT - BMI, body index mass;
        ● sex - 0 - male, 1 - female;
        ● ChSS - heart rate, beats in one minute;

                                                                                                       137
       ● AD sist. - systolic blood pressure;
       ● AD diast - diastolic blood pressure
       ● AG therapia - antihypertensive therapy, 1- present, 0 - absent
       ● cholesterin - total cholesterol levels;
       ● diabetus melitus - diabetes, 1 - present, 0 - absent;


Figure 4: The sample of dataset records
    The Progress attribute is a target value of this study and the algorithms’ predictions will be compared
to it. The dataset contains the data of 591 patients with present atherosclerosis, and 409 rows of data
from healthy people. Also it is important to check sex distribution in the dataset, which is presented on
Figure 5.


   Figure 5: Distribution of data by sex, age mean and atherosclerosis presence.
   As we can see on Figure 5, the dataset contains a little bit more records of females than males. Also
the age mean of healthy people is 30.96, while the mean of age of patients with atherosclerosis is 54.78,
and this tendency is true for both males and females. Given that information we can conclude that
atherosclerosis is more common for older people, and to prove that, we will check the age distribution
of patients in our dataset. As we can see on Figure 6, atherosclerosis mostly is not present in young
people, and on the contrary, was diagnosed in the majority of middle-aged and older people. This
conclusion is also proved by the result of a recent study [20] which states that atherosclerosis rapidly
develops between ages 40 to 50. The Figure shows a large gap in disease occurrences between 37 and
38 years , and after that, the tendency of disease increasing occurs, which represents the general heart
disease statistics and risk factors impact.


                                                                                                       138
   Also the number of healthy patients constantly decreases after the age around 30, which means that
people are often diagnosed with disease when it is too late. As was mentioned in the introduction,
atherosclerosis is usually a long-term progressing disease, however sometimes it might progress more
aggressively. The disease usually shows its symptoms in the later stages, when arteries are narrowed or
blocked which is followed by pain in chest or other manifestations. Early atherosclerosis diagnosis may
help to avoid a set of heart diseases and save a lot of lives as a consequence.
                                 Variation of Age for each target class


Figure 6: Disease distribution by age
   It is important to define relations between the attributes and their impact on the target value before
we start application of machine learning algorithms. For that purpose we will calculate correlation
between all attributes and build a heat map, which is displayed at Figure 7.


Figure 7: Attribute correlation heatmap
   Based on the heatmap, we can define correlation between attributes of our data set. The first thing
we should look at is the first row of the diagram, which represents correlation of each separate attribute
with our target data. The next thing is to define which attributes impact the target via other attributes
indirectly. Thus, cholesterin, age and AG therapia have strong positive correlation with target. OP,
Shunt, weight and AD sist. have moderate positive relationships, no strong indirect correlated attributes
were found. Now, let’s take the most correlated parameters and display data distribution in this 3D
plane. Because AG thrapia has binary values we will replace it with weight which has moderate

                                                                                                      139
correlation, but more diverse value distribution. The visualisation of the data for such a plane is
displayed on Figure 8. In Figure 8, we can see that most patients with atherosclerosis have high
cholesterol level, are middle-aged and older, and overweight, while healthy patients mostly have weight
under 80 kilos, low cholesterol level and are under 40. This conclusion reflects the general idea of
atherosclerosis and corresponds to the risk factors: atherosclerosis is more common for older people,
people with overweight and high cholesterol levels. Based on the plot above we can see that
atherosclerosis and non-atherosclerosis records can be distinguished by these three parameters, however
some atherosclerosis cases occur even in people with normal weight and low cholesterol levels, but
those are minor.


Figure 8: Data distribution in 3d plane by most correlated attributes and target class

    1.1. Applied software technology
    For this study we used Python programming language of 3.8 version and its ecosystem. Nowadays
Python is the most popular programming language for data analysis and machine learning, and offers a
lot of libraries and solutions for solving such issues. Python provides a lot of utilities which reduces
development time and provides highly efficient results.
   We used the next set of python libraries for this study:
      ● pandas - the library which provides functionality for creating and operating with datasets;
      ● numpy - allows to perform sophisticated calculations on high-performance
          multidimensional arrays, and operate with them;
      ● matplotlib - offers a software interface different visualizations of data;
      ● seaborn - data visualization library based on matplotlib, which provides high-level interface
          and a lot of presets for drawing more user-friendly plots as well as a variety of diagrams,
          heatmaps, color themes, etc;
      ● scikit-learn - offers various unsupervised and supervised ready-to-use machine learning
          algorithms, built upon numpy, pandas and plotlib;
      ● xgboost - optimized distributed library that provides gradient boosting algorithm
          implementations;
      ● keras - high-level neural network API that provides functionality for developing and
          evaluating deep learning models;
      ● tensorflow - open-source platform that provides a backend engine for keras
      ● ann_visualizer - visualization library that is used to work with keras, uses graphviz library
          to create a graph of the neural network;
      ● graphviz - an open source graph visualization software that provides functionality to
          represent structural information;


                                                                                                    140
   1.2. Application of algorithms
    The goal of this research is to predict whether patients have atherosclerosis or not. The research was
done using supervised machine learning techniques: naive bayes, decision tree, random forest, XGBoost
and neural network as a deep learning technique. We will elaborate on the neural network, as it is more
complicated in configuration and tuning. For the neural network we used a set of dense layers with
dropout to avoid overfitting and ReLU as an activation function. For the output layer we used sigmoid
function, binary cross entropy loss function, because the task of the model is binary classification, and
adam optimizer. The dataset was divided into test and training sets, 20% and 80% accordingly. The
training process consisted of 500 epochs to reach better accuracy of the result. The visualizations of
model loss and model accuracy improvement are shown on Figure 9. As we can see, the neural network
training process was balanced without overfitting and the model reached good accuracy. One more
thing that should be mentioned before comparison of results is the decision tree structure. Decision tree
classifier creates rules based on parameters that allow it to classify data. Thus, this structure may help
figure out what parameters affect the classification result most. The structure of the received decision
tree is displayed on Figure 10. Based on the Figure 10 we can conclude that most important parameters
for classification result are cholesterin, AD sist. and weight. The general performance results
comparison are reflected on Table 1. To evaluate the precision score of all algorithms, a confusion
matrix was used. Among all applied algorithms, Random Forest and Neural Network showed best
performance for both training and test process. Also we should notice that all applied models reached a
very high accuracy score, which makes them applicable for atherosclerosis prediction in medical
institutions.


                         a                                                       b
   Figure 9: Neural network training plot for loss (a) and accuracy (b)
Table 1
Algorithms’ accuracy scores

           Algorithm                     Training accuracy                    Test accuracy

             SVM                                0.9975                                1.0
          Naive Bayes                           0.9875                                1.0
         Decision Tree                            1.0                                0.995
        Random Forest                             1.0                                 1.0
           XGBoost                                1.0                                0.995
        Neural network                         0.99875                                1.0


2. Conclusion
    In this paper 5 machine learning methods had been analyzed for atherosclerosis prediction. Our team
trained and tested all the algorithms against the clinical data. It achieved promising results after what
the accuracy of models have been compared. All the models showed extremely high performance
scores, and performed better in this study in comparison with overviewed application cases with CAD
disease dataset.

                                                                                                      141
   Figure 10: Decision tree structure
    We used a confusion matrix for comparison of ML algorithms’ performance for training and testing
sets. Many researchers note that ML algorithms show better performance for not large datasets, whereas
deep learning neural networks are better for large scaled data. However with right hyperparameter
tuning and architecture can be reached good results even for small-sized datasets which was proved in
this research.
    Considering limitations of the research, there are a lot of broad opportunities for applying mentioned
methods to the data of larger size, which however may lead to more technical challenges such as
complex data preprocessing and algorithms tuning. Also a lot of other neural network architectures may
be applied as well as ML methods for achieving better results. Though there is a very limited number
of datasets that are available for atherosclerosis analysis nowadays (which makes the field attractive for
many researchers), there are a lot of possible integrations of overviewed techniques along with
computer vision and other technologies that may improve diagnosis and treatment of atherosclerosis.


                                                                                                      142
3. References
  [1]  Fact sheets Cardiovascular diseases (CVDs) / World Health Organization, 2021. Mode of
       access: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds).
  [2] Cardiovascular diseases Overview / World Health Organization, 2021. Mode of access:
       https://www.who.int/health-topics/cardiovascular-diseases.
  [3] Rafieian-Kopaei M, Setorki M, Doudi M, Baradaran A, Nasri H. Atherosclerosis: Process,
       Indicators, Risk Factors and New Hopes. Int J Prev Med (2014) 5(8):927–46.
  [4] Konstantina Kourou, Themis P. Exarchos, Konstantinos P. Exarchos, Michalis V.
       Karamouzis, Dimitrios I. Fotiadis “Machine learning applications in cancer prognosis and
       prediction.” Computational and Structural Biotechnology Journal, Volume 13, 8-17, 2015.
  [5] D. P. Yadav and S. Rathor, “Bone Fracture Detection and Classification using Deep Learning
       Approach,” 2020 International Conference on Power Electronics & IoT Applications in
       Renewable Energy and its Control (PARC), 2020, pp. 282-285.
  [6] Verma AK, Pal S, Kumar S. Classification of Skin Disease using Ensemble Data Mining
       Techniques. Asian Pac J Cancer Prev. 2019 Jun 1;20(6):1887-1894.
  [7] UCI Machine Learning Repository, “Heart disease data set,” 2021, Mode of access:
       http://archive.ics.uci.edu/ml/datasets/heart+disease.
  [8] Singh P, Singh S, Pandi-Jain GS. Effective heart disease prediction system using data mining
       techniques. Int J Nanomedicine. 2018 Mar 15;13(T-NANO 2014 Abstracts):121-124.
  [9] O. Terrada, B. Cherradi, A. Raihani and O. Bouattane, "Atherosclerosis disease prediction using
       Supervised Machine Learning Techniques," 2020 1st International Conference on Innovative
       Research in Applied Science, Engineering and Technology (IRASET), 2020, pp. 1-5.
  [10] Munger E, Hickey JW, Dey AK, Jafri MS, Kinser JM, Mehta NN. Application of machine
       learning in understanding atherosclerosis: Emerging insights. APL Bioeng. 2021 Feb
       16;5(1):011505.
  [11] Y.Khlevna, D.Mochalova. Prediction of atherosclerosis disease with artificial neural
       network. Sciences of Europe. Technical sciences. VOL 1, No 50 (2020) pp. 53 –58.
  [12] Zhu Y, Wu J, Fang Y. [Study on application of SVM in prediction of coronary heart disease].
       Sheng Wu Yi Xue Gong Cheng Xue Za Zhi. 2013 Dec 30(6):1180-5. Chinese.
  [13] 6 Easy Steps to Learn Naive Bayes Algorithm with codes in Python and R / Sunil Ray, 2017.
       Mode of access: https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained
  [14] Oumaima Terrada, Bouchaib Cherradi, Abdelhadi Raihani, Omar Bouattane, A novel medical
       diagnosis support system for predicting patients with atherosclerosis diseases, Informatics in
       Medicine Unlocked, Volume 21, 2020.
  [15] Qawqzeh, Y.K.; Otoom, M.M.; Al-Fayez, F.; Almarashdeh, I.; Alsmadi, M. and Jaradat, G.
       A Proposed Decision Tree Classifier forAtherosclerosis Prediction and Classification.
       IJCSNS, 2019,19(12), p.197.
  [16] Akella A, Akella S. Machine learning algorithms for predicting coronary artery disease:
       efforts toward an open source solution. Future Sci OA. 7(6):FSO698. March 2021.
  [17] Fan, J., Chen, M., Luo, J. et al. The prediction of asymptomatic carotid atherosclerosis with
       electronic health records: a comparative study of six machine learning models. BMC Med
       Inform Decis Mak 21, 115 (2021).
  [18] Kartik Budholiya, Shailendra Kumar Shrivastava, Vivek Sharma, An optimized XGBoost
       based diagnostic system for effective prediction of heart disease, Journal of King Saud
       University - Computer and Information Sciences, 2020.
  [19] Syed Nawaz Pasha, Dadi Ramesh, Sallauddin Mohmmad, A. Harshavardhan and Shabana /
       Cardiovascular disease prediction using deep learning techniques, Mode of access:
       https://iopscience.iop.org/article/10.1088/1757-899X/981/2/022006
  [20] Journal Article, Beatriz López-Melgar, Leticia Fernández-Friera, Belén Oliva, José Manuel
       García-Ruiz, Fátima Sánchez-Cabo, Héctor Bueno, José María Mendiguren, Enrique Lara-
       Pezzi, Vicente Andrés, Borja Ibáñez, Antonio Fernández-Ortiz, Javier Sanz, Valentín Fuster
       Short-Term Progression of Multiterritorial Subclinical Atherosclerosis, 2020, Journal of the
       American College of Cardiology, 1617-1627


                                                                                                 143