Determining the Probability of Heart Disease using Data
                    Mining Methods

Kseniia Bazilevych[0000-0001-5332-9545], Ievgen Meniailov[0000-0002-9440-8378],
Kirill Fedulov[0000-0001-9619-0299], Sergey Goranina[0000-0001-8988-3935],
     Dmytro Chumachenko[0000-0003-2623-3294] and Pavlo Pyrohov[0000-0002-6100-4406]
 National Aerospace University "Kharkiv Aviation Institute", Chkalova str., 17, Kharkiv, 61070,
                                          Ukraine
               k.bazilevych@khai.edu, j.menyailov@khai.edu,
            fedulov.kirill172@gmail.com, sgoranin@gmail.com,
                              dichumachenko@gmail.com


         Abstract. The article suggests methods for estimating the parameters of logistic
         regression for different conditions. In the case of a single polytomic input varia-
         ble with a minimum number of categories - a method for assessing chances and
         probabilities. In this case, the quality of classification can be evaluated separately
         for each input variable: the assessment does not depend on the connectedness of
         the input variables, which allows not to check the correlation and preliminary
         selection of significant variables. For several variables, it is proposed to use a
         Bayesian classifier, which, if there is no correlation between the attributes, as-
         signs specific individuals of the population to a certain class for health reasons.
         If there is a correlation of factor attributes and complex dependencies between
         input variables, it is proposed to use the maximum likelihood estimation. As a
         result of the analysis, a ready-made mathematical apparatus will be obtained,
         which makes it possible in practice to obtain the values of the the probabilities of
         diseases under various initial data..


         Keywords: Classification, Probability Assessment, Logistic Regression, Bayes-
         ian Classifier, Odds Assessment Method, Maximum Likelihood Method


1        Introduction

Diagnostic methods [1-2] in medicine play a crucial role. The accuracy of the diagnosis
and the speed with which it can be made depends on many factors: the condition of the
patient, the available data on the symptoms and signs of the disease, the results of la-
boratory tests, but most importantly, the qualifications of the doctor himself [3-6]. An
accurately diagnosed diagnosis as soon as possible allows increasing the chance of cur-
ing the patient [7]. Based on all these considerations, it is natural to try to determine the
conditions under which the diagnosis can be made as quickly and accurately as possible.
    For many centuries, doctors have been trying to solve this problem with varying
degrees of success. However, in recent years, thanks to the use of modern methods of

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution
4.0 International (CC BY 4.0)
2019 IDDM Workshops.
2


treatment and diagnostics based on the latest achievements of science and technology,
the chances of obtaining successful results have increased significantly. Therefore, it is
important to find the exact methods [8-9] for description, research, evaluation and mon-
itoring of the diagnosis process, which makes the task of determining the likelihood of
disease based on existing data on the patient’s condition relevant.
    If the study is associated with a large number of interdependent factors that exhibit
significant natural variability, then for a sufficiently effective description of the com-
plex pattern of their influence, there is only one way - using the appropriate statistical
method [10]. If there is a need to determine the probability of falling into one of two
classes of the disease, one of the simplest and most effective methods is the binary
classifier. The quality of the classification can be evaluated for each input variable sep-
arately. If the number of factors or the number of data categories is very large, it is
necessary to use the computing power of the computer [11] so that the desired results
can be obtained in a fairly short time, which will reduce the likelihood of errors in the
diagnosis, and will also make it as quick and efficient as possible.
    Thus, the aim of the study is to determine the likelihood of a patient's disease [12-
13] with specified diagnostic characteristics based on Data Mining methods, which will
improve the accuracy of diagnosis.
   The health status of each individual is influenced by a number of factors, such as:
age, gender, illness, place of residence, temperature, blood condition, etc. [14]. The
objective of the study is to identify and analyze methods that allow us to assess the
likelihood of illness [15] of a patient with specified diagnostic characteristics.
    This task is referred to the classification tasks “with the teacher”, during which the
test system is trained using the “stimulus-reaction” examples. It is required to find de-
pendency that shows which patients belong to the "Healthy" class and which patients
belong to the “Sick” class. For such a task, it is rational to use logistic regression, which
is widely used to find the probabilities of an event with given characteristics [16].


2      Estimation of logistic regression parameters based on the
       method of assessing chances and probabilities

Consider a sample of patients based on data from the source [17]. For each patient,
health information is known. The explanatory variable in this case is the result of an
electrocardiogram (ECG) at rest. This variable is polytomic. Each patient can belong to
the three classes “Normal”, “Hyp” and “Abnormal” according to ECG results. Two
events are also considered: the patient is sick (y = 1) and healthy (y = 0).
    It is necessary to evaluate the parameters of the logistic equation for this problem
and determine the probability with which it will belong to the “Sick” class, i.e. evaluate
his state of health. Probability that the output variable y  1 for the given value of the
explanatory variable x will be P ( y  1| x)    x  , and the probability that y  0 at
a given value x will be equal to P ( y  0 | x)  1    x  .
   The conditional average for logistic regression in this case is determined as in for-
mula (1):
                                                                                           3


                                             e g ( x)
                                  ( x)                                                 (1)
                                            1 e   g ( x)

where g  x   0  1C1  2C2 ; C1, C2 is variables for quantizing values in three in-
tervals; x is explanatory variable; 0 , 1, 2 is desired parameters; c  x  is event prob-
ability.
    The function is defined on an infinite interval and takes values in a range [0, 1].
Required to find the best estimates of parameters 0 , 1, 2 . We will organize the in-
formation about patients based on the data [17] in the form of a Table 1.

                                 Table 1. ECG Patient Data

 Outcome            Normal            Hyp                   Abnormal       Total
 y=0                96                68                    1              165
 y=1                56                79                    3              138
 Total              152               147                   4              303

In the Table 1 in the line with the “Normal” class, the quantization variables will be
equal to: С1  С2  0 . In line with class “Hyp” С1  1, С2  0 . In the line with the
class “Abnormal” С1  С2  1 .
    The chances of being a patient with a sick heart for all categories of conditions of
the electrocardiogram are estimated by the formulas (2-4):

                                             56
                               Chy 1,C1        0.58                                   (2)
                                             96
                                             79
                               Chy 1,C2        1.16                                   (3)
                                             68
                                     Chy 1,C3  3                                       (4)

The odds ratio for the “Hyp” categories to the “Normal” category is estimated by the
formula (5):

                               С  Chy 1,C2                                            (5)
                           OR  2             2.01
                               С1  Chy 1,C1
The odds ratio for the “Abnormal” to the “Normal” categories is estimated by the for-
mula (6):

                                С  Chy 1,C3                                           (6)
                            OR  3             5.17
                                С1  Chy 1,C1
4


The experimental probability of a disease for the “Normal” category can be found (7)
by dividing the number of positive outcomes by the total number of outcomes:

                               сexp  56 / 152  0.37                                  (7)

From here the coefficient 0 can be found as (8)

                                    сexp                                             (8)
                           0  ln             0.532
                                    1  сexp 
                                             
For the “Hyp” category, the experimental probability of the disease can be estimated
by the formula (9):

                                сexp  79 / 147  0.54                                 (9)

From here, the coefficient 1 can be found as (10):

                                  сexp                                             (10)
                         1  ln               0.692
                                  1  сexp  0
                                           
For the category "Abnormal" the experimental probability of the disease can be esti-
mated by the formula (11):

                                 сexp  3 / 4  0.75                                 (11)

From here the coefficient 2 can be found as (12)

                             сexp 
                    2  ln                 0.939                             (12)
                             1  сexp  0 1
                                      
The probability that the output variable y will be equal to one (that is, the patient will
be ill) for the category "Normal" is calculated by the formula (13):

                                     e0           e0.532
                   P( y  1| x)                                0.37               (13)
                                    1  e0       1  e0.532
The probability that the output variable y = 1 for the “Hyp” category is calculated by
the formula (14):

                                e0 1      e0.5320.692
              P( y  1| x)                                   0.54                 (14)
                               1  e0 1 1  e0.532 0.692
                                                                                            5


The probability that the output variable y = 1 for the “Abnormal” category is calculated
by the formula (15):

                           e0 1 2           e0.532 0.692 0.939
        P( y  1| x)                                                      0.75         (15)
                         1  e0 1 2       1  e0.532 0.692 0.939
It can be concluded that if the result of the ECG is “Abnormal”, then the probability of
the disease is highest, if “Hyp”, then less, and the probability of being healthy is highest
if the result is “Normal”.


3      Estimating the likelihood of a disease using a Bayesian
       classifier

Consider a sample of 30 patients with input variables defined in the nominal scale (Ta-
ble 2) based on data from the source [17]. For analysis, we use the following signs: age
(in years), blood sugar, patient gender, ECG result. According to the Table 2, the pair
correlation coefficients were calculated, the values of which are in the interval
[-0.303; 0.078], which indicates a low correlation between the input variables.

                      Table 2. Patient data for selected characteristics

 №              Age             Blood             Gender         ECG            Disease
                                sugar      <                     result         state
                                120
 1              60-69           No                Man            Hyp            Yes
 2              40-49           No                Man            Normal         No
 3              60-69           No                Man            Hyp            No
 4              60-69           No                Man            Normal         Yes
 5              50-59           Yes               Man            Hyp            Yes
 6              50-59           Yes               Woman          Normal         No
 7              50-59           No                Man            Normal         Yes
 8              50-59           No                Man            Normal         Yes
 9              60-69           Yes               Man            Normal         No
 10             60-69           No                Man            Hyp            Yes
 11             60-69           No                Man            Hyp            Yes
 12             30-39           No                Man            Normal         No
 13             40-49           No                Woman          Normal         No
 14             50-59           No                Man            Normal         No
 15             60-69           No                Woman          Normal         Yes
 16             50-59           No                Woman          Hyp            No
 17             50-59           No                Man            Normal         No
 18             50-59           No                Woman          Normal         No
 19             50-59           Yes               Man            Hyp            Yes
6


    20           40-49          No                Man           No             No
    21           50-59          No                Woman         No             No
    22           50-59          No                Woman         Normal         No
    23           60-69          No                Woman         Normal         No
    24           40-49          No                Man           Normal         No
    25           40-49          No                Man           Hyp            Yes
    26           60-69          No                Woman         Normal         No
    27           60-69          Yes               Man           Hyp            Yes
    28           60-69          No                Man           Normal         Yes
    29           50-59          No                Man           Normal         No
    30           40-49          No                Man           Hyp            No

Thus, in this case, you can use the Bayesian classifier, the application of which for this
case is considered in detail in [18].
    We denote by C1 the class “Sick” for whom the state of the disease is present (the
value of the resulting variable is “yes”). Through C2, we can designate the class of
patients “Healthy”, which have no signs of illness (the value of the resulting variable is
“no”). The use of the Bayesian classifier does not make it possible to obtain the form
of a statistical dependence based on the training sample, however, it makes it possible
to determine the probability that a patient with given characteristics will fall into one
or another class. For example, we define that a patient aged 50 to 59 years, with blood
sugar less than 120 units, a man and with the result of ECG “Hyp” will fall into the
class “Sick”.
    It is necessary to maximize the product of probabilities P ( X | Ck ) P  Ck  for k  2 ,
because there are only two classes in this problem. The prior probability of the appear-
ance of class C1 is calculated by the formula (16):

                                                12
                                   P  C1         0.4                                 (16)
                                                30
The prior probability of the appearance of a class C2 is calculated by the formula (17):

                                                18
                                  P  C2          0.6                                 (17)
                                                30
There are 30 observed examples, 18 of them are “Healthy”, 12 are “Sick”. Conditional
probabilities for determining P ( X | Ck ) calculated in Table 3. Calculate the general-
ized probabilities P ( X | Ck ) for events of the formula (18-19):

                     P( X | C1)  0.33  0.25  0.92  0.58  0.044                      (18)

                    P( X | C2 )  0.44  0.11  0.55  0.17  0.0045                     (19)

Than probabilities P( X | Ck ) P  Ck  will be respectively equal (20-21):
                                                                                          7


                        P( X | C1) P  C1   0.044  0.6  0.0264                     (20)

                        P( X | C2 ) P  C2   0.0045  0.4  0.0018                   (21)

                     Table 3. Conditional Probabilities for Patient Data

             Probability description                                    Estimation
              P(Age 50–59|C2)                                            8/18=0.44
              P(Age 50–59|C1)                                            4/12=0.33
              P(Blood Sugar<120|C2)                                      2/18=0.11
              P(Blood Sugar<120|C1)                                      3/12=0.25
              P(Man|C2)                                                 10/18=0.55
              P(Man|C1)                                                 11/12=0.92
              P(Hyp|C2)                                                  3/18=0.17
              P(Hyp|C1)                                                  7/12=0.58

The class whose probability is greater is selected, i.e. the patient in question belongs to
the class “Sick”.
    The normalization of probabilities is as follows of formula (22-23):

                                                   0.0264
                     P '( X | C1) P  C1                      0.94                  (22)
                                               0.0264  0.0018
                                                   0.0018
                    P '( X | C2 ) P  C2                      0.06                  (23)
                                               0.0018  0.0264
Thus, a patient with the described characteristics will be sick with a probability of 0.94
(will fall into the “Sick” class), and with a probability of 0.06 will be healthy (will fall
into the “Healthy” class).


4      Estimation of logistic regression parameters based on the
       maximum likelihood estimation

Consider a sample of 303 patients with input characteristics shown in Table 4 based on
data from the source [17]. The resulting trait is measured in a dichotomous scale, and
factor traits in metric and other types of scales. It is necessary to determine the likeli-
hood of a patient's disease with this many characteristics.
    Since the maximum likelihood estimation (MLE) is quite resource-intensive, we
will use the software from IBM – SPSS Statistics for the demonstration. This software
allows us not only to find the parameters of logistic regression, but also to evaluate the
parameters of the model and probability, and also analyze the quality of the model.
8


                               Table 4. Patient sampling data

                age                         metric scale                      29…77
                sex                        nominal scale                  Male/Female
          chest pain type                  nominal scale              Asymptomatic, Abnor-
                                                                    mal Angina, Angina, No-
                                                                             Tang
        blood pressure                  metric scale                         94…200
           cholesterol                  metric scale                        126…564
      fasting blood sugar              nominal scale                        true/false
            <120
           resting ecg                 nominal scale                  Normal/Hyp
     maximum heart rate                 metric scale                     71…202
             angina                    nominal scale                     true/false
              peak                      metric scale                       0…6,2
              slope                    nominal scale                 Flat, Down, Up
       #colored vessels                 metric scale                       0,1,2,3
               thal                    nominal scale                Normal, Rev, Fix
             class*                    nominal scale                  Sick, Healthy
*Row “Class” is necessary for analysis of simulation results performing.

The most significant results are visible in the tables below. In the Table 5 presents the
quality factors of the model.

                              Table 5. Model Summary Table

          Step                  -2 Log             R-squared               R-squared
                            probability          Cox & Snell              Nagelkerke
           1                   348.461               0.210                   0.281

Criterion -2 Log probability corresponds to the correspondence between the models and
the source data. The smaller this indicator, the more adequate the model.
    R-squared Cox & Snell and R-squared of the Nagelkerke are stably statistically con-
sistent, which are used in the logit. The value of an equal object is achievable. In the
second sign, this drawback is eliminated. These criteria shows the share of all factor
characteristics. More detailed information can be taken from the source [17]. In the
table 6 presents the values of the Chi-square test.

                    Table 6. Universal criterion for model coefficients

           Step                Chi-squared           Degrees of            Relevance
                                                     freedom
      1          Step             71,586                  3                   ,000
                 Block            71,586                  3                   ,000
                 Model            71,586                  3                   ,000
                                                                                               9


Table 7-8 presents the Hosmer-Lemeshov criterion. In our case, part of the variance is
0.7%. This indicates a high degree of consistency in the model.
    The Hosmer-Lemeshov criterion - shows an assessment of the agreement between
the frequencies in the sample and the model [19-20]. It shows whether there is "gar-
bage", which leads to a decrease in the quality of the model in the model.

                                  Table 7. Hosmer-Lemeshov criteria

              Step              Chi-squared             Degrees of            Relevance
                                                        freedom
                  1                 9.800                    2                   0.007


                   Table 8. Conjugation table for checking Hosmer-Lemeshov consent

                            Illness = yes                    Illness = no


                                                                                     Overall
                      Observed     Expected          Observed      Expected


            1                23        22,839                 0           0,161      23
            2                27        29,294                 3           0,706      30
  Step 1


            3                29        28,613                 1           1,387      30
            4                27        26,486                 3           3,514      30
            5                27        23,015                 3           6,985      30
            6                16        17,586                14          12,414      30
            7                10        11,150                20          18,850      30
            8                5         4,997                 24          24,003      29
            9                0         1,847                 30          28,153      30
            10               1         0,495                 40          40,505      41

Table 9 shows percentages representing different levels of classification of the model.
Quite high indicators were obtained, i.e. 92.2% of cases were classified correctly.

                                     Table 9. Table classification

                       Observed                                  Predicted
                                                           Class           Correctness
                                                     Healthy      Sick
           Step        Class       Healthy            147          18         93.1
           1                         Sick              23         115         91.3
                         Total percentage                                     92.2

Table 10 shows the parameters of the logistic regression equation.
10


                     Table 10. Parameters of logistic regression equation

       Influencing variable         Regression          Stand       Wald          Signifi
                                    equation            ard         statistics    -cance
                                    cofficients β       error                     level
    A – sex(1)                       -1,464            0,490        8,932        0,003
    B - chestpaintype                                               30,864       0,000
    B1 - chestpaintype(1)            2,286             0,444        26,474       0,000
    B2 - chestpaintype(2)            0,971             0,590        2,705        0,100
    B3 - chestpaintype(3)            0,170             0,652        0,068        0,794
    C - angina(1)                    -0,763            0,380        4,044        0,044
    D - slope                                                       24,588       0,000
    D1 - slope(1)                    1,724             0,700        6,068        0,014
    D2 - slope(2)                    2,018             0,415        23,700       0,000
    E - @#coloredvessels                                            36,481       0,000
    E1                  -            -1,763            0,505        12,204       0,000
  @#coloredvessels(1)
    E2                  -            0,495             0,547        0,818        0,366
  @#coloredvessels(2)
    E3                  -            1,498             0,793        3,566        0,059
  @#coloredvessels(3)
    F - thal                                                        14,056        0,001
    F1 - thal(1)                     -1,492             0,733       4,137         0,042
    F2 - thal(2)                     -1,452             0,411       12,472        0,000

The remaining variables were excluded from the formula due to data redundancy.
   Based on this table, you can determine the most significant factors by which you
can get the smallest errors with a high probability. The general form of the regression
equation for the patient will have the form similar to formula (24):

           g  x   1,464  A  2,286  B1  0,971  B2 
                                                                                         (24)
          0,170  B3  0,763  C  1,724  D1  2,018  D2 
          1,763  E1  0,495  E2  1,498  E3  1,492  F1  1,452  F2

Then, for a male patient with a second type of chest pain that did not have a sore throat,
with a bias of the first type, with vessels of the third type, as well as a thal of the first
type, it will be true (25):

              g  x   1,464  0,971  1,724  1,498  1,492  1,237                   (25)

Then the probability that such a patient will be healthy is calculated by the formula
(26):
                                                                                           11


                                    g x
                                                g x
                                                       
                            x   e   / 1  e    0,77                             (26)


Moreover, as can be seen from the Table 10, according to Wald's statistics, the most
significant are the following factors: chestpaintype (value 30.864), slope (value
24.588), coloredvessels (value 36.481).
     Wald test – a statistical test used to check the restrictions on the parameters of sta-
tistical models estimated on the basis of sample data. It is the most appropriate of the
three basic constraint checking tests such as the likelihood ratio test and the Lagrange
multiplier test. The test is asymptotic, that is, a sufficiently large sample size is required
for the reliability of the conclusions. The confidence interval (CI) of the test is also a
closed form. The higher the statistics, the better.
     The significance of the factors is confirmed using the appropriate level of signifi-
cance. It is defined as the p-level, which is calculated during the test. The lower this
level, the better.
     Based on the data in Table 4, the probabilities of getting into the Healthy group
were calculated for all data.
     In the “Expected” and “Group” columns, you can see the probabilities of getting
into the “Healthy” or “Sick” group.
     The simulation results (Table 8) show high accuracy of the classification results in
comparison with the classes previously known for the experimental sample (Table 4).
     Based on the results, we can say that the model adequately describes this population.


5      Conclusions

In this work we identified, analyzed and implemented methods that allow us to assess
the likelihood of a patient's disease with specified diagnostic characteristics.
    It is shown in which cases it is advisable to use certain methods to determine the
probability and estimate the parameters of models. These models are not static. Calcu-
lation of parameters can be carried out every time when the amount of data about pa-
tients changes, and the use of SPSS software tools will allow calculations to be made
quite quickly. The data obtained will allow a more accurate assessment of the state of
health in the face of constantly changing diagnostic parameters.


       References
  1. Baldi, P., Brunak, S.: Bioinformatics: The Machine Learning Approach (2nd ed.). MIT
Press, 400 p. (2001).
  2. Meniailov I., et. al.: Using the K-means Method for Diagnosing Cancer Stage Using the
Pandas Library. In CEUR Workshop Proceedings, vol. 2386, pp. 107-116 (2019).
  3. Chumachenko, D.: On Intelligent Multiagent Approach to Viral Hepatitis B Epidemic
Processes Simulation, in Proceedings of the 2018 IEEE 2nd International Conference on Data
Stream Mining and Processing, DSMP 2018, pp. 415-419 (2018).
12

   4. Chumachenko, D., Chumachenko, K., Yakovlev, S.: Intelligent simulation of network
worm propagation using the code red as an example. In Telecommunications               and    Radio
Engineering, vol. 78, iss. 5, pp. 443-463. (2019).
   5. Polyvianna, Yu., Chumachenko, D., Chumachenko T.: Computer Aided System of Time
Series Analysis Methods for Forecasting the Epidemics Outbreaks, 2019 15th International
Conference on the Experience of Designing and Application of CAD Systems (CADSM), pp. 1-
4 (2019).
   6. Chumachenko, D., Chumachenko, T.: Intelligent Agent-Based Simulation of HIV
Epidemic Process. In Advances in Intelligent Systems and Computing, vol. 1020, pp. 175-188
(2019).
   7. Berry, M.W.: Survey of Text Mining: Clustering, Classification, and Retrieval. Springer,
244 p. (2003).
   8. MacQueen, J.B.: Some Methods for classification and Analysis of Multivariate Observa-
tions, In: Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability,
Berkeley, University of California Press, pp. 281-297 (1967).
   9. Deshpande, M., Kuramochi, M., Karypis, G.: Automated approaches for classifying
structures. In Proc. 2002 Workshop on Data Mining in Bioinformatics (BIOKDD’02), Canada,
11 – 18 (2002).
   10. Frakes, W., Baeza-Yates, R.: Information Retrieval: Data Structures and Algorithms
(1992).
   11. Bazilevych, K. et al.: Stochastic modelling of cash flow for personal insurance fund using
the cloud data storage. In: International Journal of Computing, Vol. 17, Iss. 3, pp. 153-162 (2018).
   12. Cancer control: early detection. WHO Guide for effective programmes. Geneva: World
Health                                      Organization;                                      2007,
http://apps.who.int/iris/bitstream/10665/43743/1/9241547338_eng.pdf,              last     accessed
2019/10/28
   13. Rubin, G, et. al.: The expanding role of primary care in cancer control Lancet Oncol, pp.
31 – 72 (2015).
   14. Chumachenko, D., et. al.: Intelligent Expert System of Knowledge Examination of
Medical Staff Regarding Infections Associated with the Provision of Medical Care, in CEUR
Workshop Proceedings, vol. 2386, pp. 321-330 (2019).
   15. Bowers, N.L., et. al.: Actuarial mathematic, Illinois, USA by Society Of Actuaries, 621 p.
(1997).
   16. Norman, T. J.: The mathematical approach to biology and medicine norman, Wiley, 296
p. (1967).
   17. Sample Dataset, https://github.com/DorianDrain/Excel-Data-Sets/tree/master, last accessed
2019/10/28.
  18. Cox, D.R., Snell, E.J.: Analysis of Binary Data, Chapman and Hall, CRC, 240 p. (1989).
  19. Hosmer-Lemeshow Test, http://www.real-statistics.com/logistic-regression/hosmer-
lemeshow-test/, last accessed 2019/10/28.
  20. Bartlett, J.: The Hosmer-Lemeshow goodness of fit test for logistic regression. In The Stats
Geek (2014).