=Paper= {{Paper |id=Vol-3348/short3 |storemode=property |title=Cardiac Studies Diagnostic Data Informative Features Investigation based on Cumulative Frequency Analysis |pdfUrl=https://ceur-ws.org/Vol-3348/short3.pdf |volume=Vol-3348 |authors=Kseniia Bazilevych,Mykola Butkevych,Nataliia Dotsenko |dblpUrl=https://dblp.org/rec/conf/profitai/BazilevychBD22 }} ==Cardiac Studies Diagnostic Data Informative Features Investigation based on Cumulative Frequency Analysis== https://ceur-ws.org/Vol-3348/short3.pdf
Cardiac Studies Diagnostic Data Informative Features
Investigation based on Cumulative Frequency Analysis
Kseniia Bazilevych¹, Mykola Butkevych¹ and Nataliia Dotsenko²
¹ National Aerospace University “Kharkiv Aviation Institute”, Chkalow str., 17, Kharkiv, 61070, Ukraine
² O.M. Beketov National University of Urban Economy in Kharkiv, Marshal Bazhanov str., 17, Kharkiv, 61002, Ukraine


                Abstract
                The development of information technology in the modern world affects the public health
                sector on the one hand and accumulates enormous amounts of data on the other hand. The
                global COVID-19 pandemic has contributed to the digitalization of healthcare. Heart disease
                is a global problem that causes death worldwide. Therefore, this study proposes a model for
                determining the information content of signs of diagnostic data of heart diseases based on the
                cumulative frequency method. The software implementation of the model has been
                completed. A database of 303 patients, consisting of 14 attributes, was used for the
                experiments. The model identified the features with the most significant
                information content. The approach is promising and can support the use of diagnostic
                models in public health practice.

                Keywords
                Features informativeness, cumulative frequency analysis, medical diagnostics, heart disease,
                data-driven medicine

1. Introduction

   Cardiovascular disease is the leading cause of adult death worldwide. It accounts for up to 30% of
all deaths [1]. Cardiovascular diseases may be congenital or acquired. The following are
distinguished among cardiovascular diseases [2]:
   •    Arterial hypertension.
   •    Cardiac ischemia.
   •    Acute coronary syndrome.
   •    Heart disease.
   •    Heart failure.
   •    Arrhythmia.
   •    Venous thrombosis.
   •    Atherosclerosis.
   The main danger of cardiovascular disease is disability or sudden death. The likelihood of such
consequences increases when the signs of the disease are ignored. The main risk factors include [3]:
   •    Smoking.
   •    Alcohol abuse.
   •    Lack of physical activity.
   •    Unbalanced nutrition.
   •    Stress.

2nd International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2022), December 2-4, 2022, Łódź, Poland
EMAIL: ksenia.bazilevich@gmail.com (K. Bazilevych); nikolai.butkevych@gmail.com (M. Butkevych); nvdotsenko@gmail.com
(N. Dotsenko)
ORCID: 0000-0001-5332-9545 (K. Bazilevych); 0000-0001-8189-631x (M. Butkevych); 0000-0003-3570-5900 (N. Dotsenko)
           ©️ 2022 Copyright for this paper by its authors.
           Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
           CEUR Workshop Proceedings (CEUR-WS.org)
   Also, the causes of cardiovascular diseases include high blood pressure and diabetes. Therefore,
early diagnosis is one of the most effective methods of preventing cardiovascular diseases.
   The COVID-19 pandemic has stimulated research in the field of data-driven medicine to solve
various problems. These areas include modeling the epidemic process of infectious diseases [4, 5], the
study of molecular structures [6], the study of social factors affecting the spread of disease [7], the
study of the behavior of viruses [8], medical diagnostics [9], etc.
   However, the available data on the disease does not always allow the construction of high-quality
models of automated medical diagnostics.
   This study aims to determine the information content of signs for the diagnosis of cardiovascular
diseases using the cumulative frequency method.
   This research is part of a complex intelligent information system for epidemiological diagnostics,
the concept of which is discussed in [10].

2. Materials and Methods
2.1. Features informativeness
    Data sets to be processed often contain a large number of features. When building machine
learning models, it is not always clear which features are important and which are
redundant [11]. Removing redundant features gives a better understanding of the
data, reduces the time needed to set up the model, improves its accuracy, and makes it easier to
interpret. Feature selection is therefore often the most important task. Feature selection methods are
divided into three types:
         • Filter methods.
         • Embedded methods.
         • Wrapper methods.
    The choice of the appropriate method is not obvious and depends on the data.
    In the field of data-driven medicine, the presence or absence of a disease can be recognized
only after certain signs inherent in the patient have been obtained and analyzed. Such signs are called
informative [12]. However, informative features do not contribute equally to a specific goal, so
determining their informativeness is an important task.
    The informativeness of a sign is the degree to which the sign characterizes the state of the object,
that is, how much the diagnosis (the result of recognition) depends on it. Two approaches to
determining the information content can be distinguished: the energy approach and the informational
approach.
    The energy approach estimates the information content from the magnitude of
the feature. However, this approach may be poorly suited for object recognition. If an attribute is
large in absolute value but almost the same for objects of different classes, then it is difficult to
assign an object to a class by the value of this attribute. Conversely, if an attribute is relatively
small in magnitude but differs greatly between objects of different classes, then an object can be easily
classified by its value.
    According to the informational approach, the informativeness of a feature is treated as a reliable
difference between classes of patterns in the feature space. When classifying objects, such a significant
difference can be the difference between the probability distributions of a feature built on samples from
the classes being compared.
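
    The contrast between the two approaches can be illustrated with a minimal Python sketch. The function names `energy_score` and `best_threshold_accuracy`, the single-threshold rule used as a stand-in for "how easily an object is classified by the feature value", and all numeric values are assumptions introduced here for illustration; they are not part of the study.

```python
def energy_score(values):
    """Energy approach: rate a feature by its magnitude (mean absolute value)."""
    return sum(abs(v) for v in values) / len(values)


def best_threshold_accuracy(class_a, class_b):
    """Share of objects classified correctly by the best single threshold."""
    n = len(class_a) + len(class_b)
    best = 0.0
    for t in sorted(set(class_a) | set(class_b)):
        # Rule "value <= t -> class A, otherwise class B"; (n - correct)
        # is the score of the mirrored rule.
        correct = sum(v <= t for v in class_a) + sum(v > t for v in class_b)
        best = max(best, correct / n, (n - correct) / n)
    return best


# Synthetic values, for illustration only (not taken from the dataset).
large_a = [1000.0, 1002.0, 998.0, 1001.0]   # large values, class A
large_b = [1001.0, 999.0, 1002.0, 998.0]    # large values, class B
small_a = [0.1, 0.2, 0.3, 0.2]              # small values, class A
small_b = [0.8, 0.9, 0.7, 0.9]              # small values, class B

print("energy:", energy_score(large_a + large_b), energy_score(small_a + small_b))
print("separability:", best_threshold_accuracy(large_a, large_b),
      best_threshold_accuracy(small_a, small_b))
```

    In this toy case the large feature wins on magnitude but separates the classes poorly, whereas the small feature separates them completely.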

2.2.    Cumulative frequency method
   The essence of the cumulative frequency method is as follows: given two samples of a feature x
belonging to two different classes, the empirical distributions of x are built for both samples on the
same coordinate axes [13]. The cumulative frequencies are then calculated, i.e. the sum of
frequencies from the first to the current distribution interval. The maximum absolute difference of the
cumulative frequencies serves as an estimate of information content:
   $$I(x) = \max_{j = 0, \dots, q} \left| M_{1j} - M_{2j} \right|, \tag{1}$$
where M1j is the cumulative frequency of sample A1 up to the j-th interval;
  M2j is the cumulative frequency of sample A2 up to the j-th interval;
  q + 1 is the number of intervals.
  The cumulative frequency algorithm is shown in Figure 1.




Figure 1: The algorithm of the cumulative frequency method.
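
   Equation (1) can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: it assumes equal-width intervals over the joint range of the two samples and uses raw counts as frequencies. The names `cumulative_frequencies` and `informativeness` and the default `n_intervals=10` (the quantity q + 1) are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def cumulative_frequencies(sample, edges):
    """Cumulative counts of `sample` over the intervals bounded by `edges`."""
    counts = [0] * (len(edges) - 1)
    for v in sample:
        # Index of the interval containing v; the last interval is closed
        # on the right so that the maximum value is counted as well.
        j = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[j] += 1
    return list(accumulate(counts))


def informativeness(sample_1, sample_2, n_intervals=10):
    """Estimate I(x) = max_j |M1j - M2j| for one feature and two classes."""
    lo = min(min(sample_1), min(sample_2))
    hi = max(max(sample_1), max(sample_2))
    if lo == hi:
        return 0  # the feature is constant, so the samples cannot differ
    width = (hi - lo) / n_intervals
    edges = [lo + j * width for j in range(n_intervals)] + [hi]
    m1 = cumulative_frequencies(sample_1, edges)
    m2 = cumulative_frequencies(sample_2, edges)
    return max(abs(a - b) for a, b in zip(m1, m2))


# Synthetic samples, for illustration only.
print(informativeness([1, 2, 3, 4], [3, 4, 5, 6], n_intervals=5))
```

   With raw counts the estimate depends on the sample sizes; dividing each cumulative count by the size of its sample would give relative cumulative frequencies instead, which is a design choice the text does not settle.
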
3. Results
   Experimental studies were carried out using the Python programming language. The open Heart
Disease Cleveland dataset [14] was used for the analysis. The dataset contains data on 303 patients
with 14 attributes. Attribute data is shown in Table 1.

Table 1
Description of the data
   Attribute                    Description
   Age                          Age in years
   Sex                          Sex (1=male; 0=female)
   Chest pain type              1: typical angina; 2: atypical angina; 3: non-anginal pain; 4: asymptomatic
   Blood pressure               Resting blood pressure
   Cholesterol                  Serum cholesterol in mg/dl
   Fasting blood sugar < 120    1=true; 2=false
   Resting ECG                  0: normal; 1: having ST-T wave abnormality; 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
   Maximum heart rate           Maximum heart rate achieved
   Angina                       Exercise-induced angina (1=yes; 0=no)
   Peak                         ST depression induced by exercise relative to rest
   Slope                        The slope of the peak exercise ST segment
   Colored vessels              Number of major vessels (0-3) colored by fluoroscopy
   Thal                         3=normal; 6=fixed defect; 7=reversible defect
   Predicted attribute          0: <50% diameter narrowing; 1: >50% diameter narrowing

   The data were divided into two classes: “Healthy” and “Sick”.
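
   The split into classes, and the extraction of the two per-class samples of a feature, can be sketched as follows. The sketch assumes that each patient is a dictionary, that the predicted attribute is stored under the hypothetical key `target`, and that 0 denotes “Healthy” while any other value denotes “Sick”; the records below are made up for illustration.

```python
def split_by_class(records, target="target"):
    """Split patient records into the "Healthy" and "Sick" classes."""
    healthy = [r for r in records if r[target] == 0]
    sick = [r for r in records if r[target] != 0]
    return healthy, sick


def feature_samples(records, feature, target="target"):
    """Return the two samples of one feature, one per class."""
    healthy, sick = split_by_class(records, target)
    return [r[feature] for r in healthy], [r[feature] for r in sick]


# Made-up records for illustration only; they are not rows of the dataset.
records = [
    {"age": 41, "target": 0},
    {"age": 67, "target": 1},
    {"age": 52, "target": 0},
    {"age": 60, "target": 1},
]
healthy_age, sick_age = feature_samples(records, "age")
print(healthy_age, sick_age)
```
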
   The feature informativeness values obtained with the cumulative frequency method are presented in Table 2.
   As a result, the information content was calculated for different groups of cardiological data. It
was found that the following signs are the most informative: thal, chest pain type, colored vessels,
angina, age. The cumulative frequency method is used to determine the information content of a
feature involved in the recognition of two classes of objects.
   The automated software package developed in this study can be used at workplaces in medical
institutions to support decision-making when making a diagnosis. An
automated solution is especially relevant where resources are limited, as in low- and middle-
income countries, and during force majeure, such as war, natural disasters, and other conditions in
which access to medical care is limited.
Table 2
Feature informativeness computed by the cumulative frequency method
   Attribute                    Result
   Age                          19
   Sex                          -1
   Chest pain type              50
   Blood pressure               37
   Cholesterol                  6
   Fasting blood sugar < 120    45
   Resting ECG                  147
   Maximum heart rate           11
   Angina                       99
   Peak                         21
   Slope                        142
   Colored vessels              66
   Thal                         118



4. Conclusions
   In this study, an automated software package was used to determine the information
content of the diagnostic features of patients with suspected heart disease based on the cumulative
frequency method. An open dataset of patients with suspected heart disease, which included 303
patients and 14 attributes, was used for the experimental studies. It was found that the following signs
are the most informative: thal, chest pain type, colored vessels, angina, and age.
   The proposed software package is highly relevant in the context of Russia's war in Ukraine, as it
does not require high computing power. At the same time, automating diagnosis and decision-making
support for doctors in conditions of limited resources is an urgent task.

5. Acknowledgements
   The study was funded by the National Research Foundation of Ukraine in the framework of the
research project 2020.02/0404 on the topic “Development of intelligent technologies for assessing the
epidemic situation to support decision-making within the population biosafety management”.

6. References

[1] M. Alessandro, P.E. Puddu, Epidemiology of heart disease of uncertain etiology: a population
    study and review of the problem, Medicina 55 (10) (2019): 687. doi: 10.3390/medicina55100687
[2] S.M. Hollenberg, Valvular heart disease in adult: etiologies, classification, and diagnosis, FP
    essentials 457 (2017): 11-16.
[3] S.S. Virani, et al., Heart disease and stroke statistics – 2020 update: a report from the American
    heart association, Circulation 141 (9) (2020): e139-e596. doi: 10.1161/CIR.0000000000000757
[4] D. Chumachenko, et al., On agent-based approach to influenza and acute respiratory virus
    infection simulation, 14th International Conference on Advanced Trends in Radioelectronics,
    Telecommunications and Computer Engineering, TCSET 2018 – Proceedings (2018): 192-195.
    doi: 10.1109/TCSET.2018.8336184
[5] D. Chumachenko, On intelligent multiagent approach to viral hepatitis B epidemic processes
    simulation, Proceedings of the 2018 IEEE 2nd International Conference on Data Stream Mining
    and Processing, DSMP 2018 (2018): 415-419. doi: 10.1109/DSMP.2018.8478602
[6] A.S. Tkachenko, et al., Semi-refined carrageenan promotes generation of reactive oxygen
    species in leukocytes of rats upon oral exposure but not in vitro, Wiener Medizinische
    Wochenschrift 171 (3-4) (2021): 68-78. doi: 10.1007/s10354-020-00786-7
[7] N. Davidich, et al., Monitoring of urban freight flows distribution considering the human factor,
    Sustainable Cities and Society 75 (2021): 103168. doi: 10.1016/j.scs.2021.103168
[8] D. Chumachenko, K. Chumachenko, S. Yakovlev, Intelligent simulation of network worm
    propagation using the Code Red as an example, Telecommunications and Radio Engineering 78
    (5) (2019): 443-464. doi: 10.1615/TELECOMRADENG.V78.I5.60
[9] I. Izonin, R. Tkachenko, I. Dronyuk, R. Tkachenko, M. Gregus, M. Rashkevych, Predictive
    modeling based on small data in clinical medicine: RBF-based additive input-doubling method,
    Mathematical Biosciences and Engineering 18 (3) (2021): 2599-2613.
    doi: 10.3934/mbe.2021132
[10] S. Yakovlev, et al., The concept of development a decision support system for the epidemic
     morbidity control, CEUR Workshop Proceedings 2753 (2020): 265-274.
[11] W. Jitkrittum, et al., Informative features for model comparison, Advances in Neural
     Information Processing Systems 31 (2018): 1-12.
[12] T. Tran, et al., A framework for feature extraction from hospital medical data with applications
     in risk prediction, BMC Bioinformatics 15 (2014): 425. doi: 10.1186/s12859-014-0425-8
[13] M. Riachi, J. Himms-Hagen, M.E. Harper, Percent relative cumulative frequency analysis in
     indirect calorimetry: application to studies of transgenic mice, Canadian journal of physiology
     and pharmacology 82 (12) (2004): 1075-83. doi: 10.1139/y04-117
[14] R. Detrano, et al., International application of a new probability algorithm for the diagnosis of
     coronary artery disease, American Journal of Cardiology 64 (1989): 304-310.