<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Investigation based on Cumulative Frequency Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kseniia Bazilevych</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mykola Butkevych</string-name>
          <email>nikolai.butkevych@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nataliia Dotsenko</string-name>
          <email>nvdotsenko@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Aerospace University “Kharkiv Aviation Institute”</institution>
          ,
          <addr-line>Chkalow str., 17, Kharkiv, 61070</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>O.M. Beketov National University of Urban Economy in Kharkiv</institution>
          ,
          <addr-line>Marshal Bazhanov str., 17, Kharkiv, 61002</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The development of information technology affects the public health sector and, at the same time, leads to the accumulation of enormous amounts of data. The global COVID-19 pandemic has contributed to the digitalization of healthcare. Heart disease is a global problem and a leading cause of death worldwide. Therefore, this study proposes a model for determining the informativeness of diagnostic features of heart disease based on the cumulative frequency method. The software implementation of the model has been completed. A database of 303 patients, consisting of 14 attributes, was used for the experiments. As a result of applying the model, the features with the highest information content were identified. The approach is promising and can support the application of diagnostic models in public health practice.</p>
        <p>Keywords: feature informativeness, cumulative frequency analysis, medical diagnostics, heart disease.</p>
        <p>ORCID: 0000-0001-5332-9545 (K. Bazilevych); 0000-0001-8189-631x (M. Butkevych); 0000-0003-3570-5900 (N. Dotsenko)</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Cardiovascular disease is the leading cause of adult death worldwide, accounting for up to 30% of all deaths [1]. Cardiovascular diseases may be congenital or acquired. The following are distinguished among them [2]:
• Arterial hypertension.
• Cardiac ischemia.
• Acute coronary syndrome.
• Heart disease.
• Heart failure.
• Arrhythmia.
• Venous thrombosis.
• Atherosclerosis.</p>
      <p>The main danger of cardiovascular disease is disability or sudden death. The likelihood of such consequences increases when the signs of the disease are ignored. Among the main risk factors are [3]:
• Smoking.
• Alcohol abuse.
• Lack of physical activity.
• Unbalanced nutrition.
• Stress.</p>
      <p>The causes of cardiovascular diseases also include high blood pressure and diabetes. Therefore, early diagnosis is one of the most effective methods of preventing cardiovascular diseases.</p>
      <p>The COVID-19 pandemic has stimulated research in the field of data-driven medicine to solve various problems. These areas include modeling the epidemic process of infectious diseases [4, 5], the study of molecular structures [6], the study of social factors affecting the spread of disease [7], the study of the behavior of viruses [8], medical diagnostics [9], etc.</p>
      <p>However, the available data on a disease do not always allow the construction of high-quality models of automated medical diagnostics.</p>
      <p>This study aims to determine the information content of signs for the diagnosis of cardiovascular diseases using the cumulative frequency method.</p>
      <p>This research is part of a complex intelligent information system for epidemiological diagnostics, the concept of which is discussed in [10].</p>
      <p>2022 Copyright for this paper by its authors.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and Methods</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Features informativeness</title>
      <p>Data sets to be processed often contain a large number of features. When building machine learning models, it is not always clear which features are important for the model and which are redundant [11]. Removing redundant features gives a better understanding of the data, reduces the time needed to set up the model, improves its accuracy, and makes it easier to interpret. Feature selection is therefore often one of the most important tasks. Feature selection methods are divided into three types:
• Filter methods.
• Embedded methods.
• Wrapper methods.</p>
      <sec id="sec-3-1">
        <p>The choice of the appropriate method is not obvious and depends on the data.</p>
        <p>In data-driven medicine, the presence or absence of a disease can be recognized only after certain signs inherent in the patient have been obtained and analyzed. Such signs are called informative [12]. However, informative features do not contribute equally to achieving a specific goal, so determining their informativeness is an important task.</p>
        <p>The informativeness of a sign expresses how well the sign characterizes the state of the object, that is, how strongly the diagnosis (the recognition result) depends on it. Two approaches to determining the information content can be distinguished: the energy approach and the informational approach.</p>
        <p>The energy approach is based on the fact that the information content is estimated by the value of
the feature. However, this approach may be poorly suited for object recognition. If some attribute is
large in absolute value, but almost the same for objects of different classes, then it is difficult to
attribute the object to a certain class by the value of this attribute. And if the attribute is relatively
small in size, but differs greatly for objects of different classes, then the object can be easily classified
by its value.</p>
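        <p>The contrast described above can be illustrated with a minimal sketch. It is not the method used in this study: the two scoring functions, <monospace>energy_score</monospace> and <monospace>separation_score</monospace>, and all feature values are hypothetical and were chosen only to show that a feature with a large magnitude may separate the classes worse than a feature with a small magnitude.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from statistics import mean, pstdev


def energy_score(class_a, class_b):
    """Energy-style score: average magnitude of the feature, ignoring class labels."""
    return mean(abs(v) for v in class_a + class_b)


def separation_score(class_a, class_b):
    """Illustrative class-difference score: gap between the class means,
    measured in units of the pooled standard deviation."""
    spread = pstdev(class_a + class_b)
    if spread == 0:
        return 0.0  # a constant feature carries no class information
    return abs(mean(class_a) - mean(class_b)) / spread


# Hypothetical feature values for two classes (not taken from the dataset).
large_a, large_b = [1000.0, 1002.0, 998.0, 1001.0], [999.0, 1001.0, 1000.0, 1003.0]
small_a, small_b = [0.10, 0.20, 0.15, 0.12], [0.80, 0.90, 0.85, 0.95]

for name, a, b in [("large", large_a, large_b), ("small", small_a, small_b)]:
    print(name, round(energy_score(a, b), 3), round(separation_score(a, b), 3))
```
]]></preformat>
        <p>In this toy example the "large" feature has the higher energy-style score, while the "small" feature has the higher class-difference score, which is the situation the paragraph above warns about.</p>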
        <p>According to the informational approach, feature information is considered as a reliable difference
between classes of images in the feature space. When classifying objects, such a significant difference
can be the difference in the probability distributions of a feature built on samples from comparable
classes.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2. Cumulative frequency method</title>
      <p>The essence of the cumulative frequency method is as follows: given two samples of a feature x belonging to two different classes, the empirical distributions of x are built for both samples on the same coordinate axes [13]. The cumulative frequencies, i.e. the sums of frequencies from the initial to the current distribution interval, are then calculated. The absolute value of the maximum difference between the cumulative frequencies serves as an estimate of the information content:</p>
      <disp-formula id="eq-1"><label>(1)</label><tex-math>I(x) = \max_{j = 0, \ldots, q} \left| M_{1j} - M_{2j} \right|,</tex-math></disp-formula>
      <p>where M1j is the cumulative frequency for the j-th interval of sample A1; M2j is the cumulative frequency for the j-th interval of sample A2; q + 1 is the number of intervals.</p>
      <p>The cumulative frequency algorithm is shown in Figure 1.</p>
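      <p>The estimate (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the software developed in this study, and it rests on assumptions the text does not state: cumulative frequencies are taken as relative frequencies so that classes of different size are comparable, both samples share one grid of equal-width intervals over their pooled range, and the parameter <monospace>n_intervals</monospace> plays the role of q + 1. The function names and the sample values are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def cumulative_frequencies(sample, edges):
    """Relative cumulative frequency of `sample` at the upper edge of each interval."""
    n = len(sample)
    return [sum(1 for v in sample if v <= edge) / n for edge in edges]


def informativeness(sample_1, sample_2, n_intervals=10):
    """Maximum absolute difference of the cumulative frequencies of two samples."""
    lo = min(min(sample_1), min(sample_2))
    hi = max(max(sample_1), max(sample_2))
    if lo == hi:
        return 0.0  # a constant feature cannot separate the classes
    width = (hi - lo) / n_intervals
    # Upper edges of the shared intervals (n_intervals corresponds to q + 1).
    edges = [lo + width * (j + 1) for j in range(n_intervals)]
    edges[-1] = hi  # guard against floating-point rounding at the last edge
    m1 = cumulative_frequencies(sample_1, edges)
    m2 = cumulative_frequencies(sample_2, edges)
    return max(abs(a - b) for a, b in zip(m1, m2))


# Hypothetical samples of one feature for two classes.
healthy = [120, 125, 130, 135, 140]
sick = [138, 145, 150, 155, 160]
score = informativeness(healthy, sick)
print(round(score, 3))
```
]]></preformat>
      <p>With relative frequencies the score lies between 0 (identical empirical distributions) and 1 (samples that do not overlap), which makes scores of different features directly comparable.</p>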
    </sec>
    <sec id="sec-5">
      <title>3. Results</title>
      <p>Experimental studies were carried out using the Python programming language. The open Heart
Disease Cleveland dataset [14] was used for the analysis. The dataset contains data on 303 patients
with 14 attributes. Attribute data is shown in Table 1.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Attribute</th><th>Description</th></tr>
          </thead>
          <tbody>
            <tr><td>Age</td><td>Age in years</td></tr>
            <tr><td>Sex</td><td>Sex (1=male; 0=female)</td></tr>
            <tr><td>Chest pain type</td><td>1: typical angina; 2: atypical angina; 3: non-anginal pain; 4: asymptomatic</td></tr>
            <tr><td>Blood pressure</td><td>Resting blood pressure</td></tr>
            <tr><td>Cholesterol</td><td>Serum cholesterol in mg/dl</td></tr>
            <tr><td>Fasting blood sugar &gt; 120 mg/dl</td><td>1=true; 0=false</td></tr>
            <tr><td>Resting ECG</td><td>0: normal; 1: having ST-T wave abnormality; 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria</td></tr>
            <tr><td>Maximum heart rate</td><td>Maximum heart rate achieved</td></tr>
            <tr><td>Angina</td><td>Exercise-induced angina (1=yes; 0=no)</td></tr>
            <tr><td>Peak</td><td>ST depression induced by exercise relative to rest</td></tr>
            <tr><td>Slope</td><td>The slope of the peak exercise ST segment</td></tr>
            <tr><td>Colored vessels</td><td>Number of major vessels (0-3) colored by fluoroscopy</td></tr>
            <tr><td>Thal</td><td>3=normal; 6=fixed defect; 7=reversible defect</td></tr>
            <tr><td>Predicted attribute</td><td>0: &lt;50% diameter narrowing; 1: &gt;50% diameter narrowing</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The data were divided into two classes: “Healthy” and “Sick”.</p>
      <p>The informativeness of the features calculated by the cumulative frequency method is presented in Table 2.</p>
      <p>As a result, the information content was calculated for different groups of cardiological data. The following signs were found to be the most informative: thal, chest pain type, colored vessels, angina, and age. The cumulative frequency method determines the information content of a feature involved in the recognition of two classes of objects.</p>
      <p>The automated software package developed in this study can be used at workplaces in medical institutions to support decision-making in diagnosis. An automated solution is especially relevant where resources are limited, as in low- and middle-income countries, and during force majeure, such as war, natural disasters, and other conditions in which access to medical care is limited.</p>
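      <p>The workflow described in this section, dividing the patients into two classes and scoring every attribute, might be sketched as follows. This is an illustration under stated assumptions, not the software package of this study: the records are toy values, the column names <monospace>thal</monospace>, <monospace>chol</monospace>, and <monospace>target</monospace> are hypothetical, the class label is assumed to be 0 for “Healthy” and non-zero for “Sick”, and the score uses relative cumulative frequencies on a shared grid of equal-width intervals.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def informativeness(sample_1, sample_2, n_intervals=10):
    """Maximum absolute difference of relative cumulative frequencies on a shared grid."""
    lo, hi = min(sample_1 + sample_2), max(sample_1 + sample_2)
    if lo == hi:
        return 0.0  # a constant feature cannot separate the classes
    edges = [lo + (hi - lo) * (j + 1) / n_intervals for j in range(n_intervals)]
    edges[-1] = hi  # guard against floating-point rounding at the last edge

    def cumulative(sample):
        return [sum(v <= e for v in sample) / len(sample) for e in edges]

    return max(abs(a - b) for a, b in zip(cumulative(sample_1), cumulative(sample_2)))


def rank_features(records, target_key, n_intervals=10):
    """Score every feature and sort from the most to the least informative."""
    healthy = [r for r in records if r[target_key] == 0]
    sick = [r for r in records if r[target_key] != 0]
    features = [k for k in records[0] if k != target_key]
    scores = {
        f: informativeness([r[f] for r in healthy], [r[f] for r in sick], n_intervals)
        for f in features
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy records with hypothetical column names and values (not the real dataset).
records = [
    {"thal": 3, "chol": 200, "target": 0},
    {"thal": 3, "chol": 240, "target": 0},
    {"thal": 3, "chol": 220, "target": 0},
    {"thal": 7, "chol": 210, "target": 1},
    {"thal": 7, "chol": 250, "target": 1},
    {"thal": 6, "chol": 230, "target": 1},
]
ranking = rank_features(records, target_key="target", n_intervals=5)
for feature, score in ranking:
    print(feature, round(score, 2))
```
]]></preformat>
      <p>Sorting by score gives a ranking of the kind reported above; the number of intervals is a free parameter of this sketch and may change the ranking on real data.</p>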
    </sec>
    <sec id="sec-6">
      <title>4. Conclusions</title>
      <p>As a result of the study, an automated software package was developed to determine the information content of the diagnostic signs of patients with suspected heart disease based on the cumulative frequency method. An open dataset of patients with suspected heart disease, which included 303 patients and 14 attributes, was used for the experimental studies. The following signs were found to be the most informative: thal, chest pain type, colored vessels, angina, and age.</p>
      <p>The proposed software package is highly relevant in the conditions of Russia's war in Ukraine, as it does not require high computing power. At the same time, automating diagnosis and decision-making support for doctors in conditions of limited resources is an urgent task.</p>
    </sec>
    <sec id="sec-7">
      <title>5. Acknowledgements</title>
      <p>The study was funded by the National Research Foundation of Ukraine in the framework of the research project 2020.02/0404 on the topic “Development of intelligent technologies for assessing the epidemic situation to support decision-making within the population biosafety management”.</p>
    </sec>
    <sec id="sec-8">
      <title>6. References</title>
      <p>[1] M. Alessandro, P.E. Puddu, Epidemiology of heart disease of uncertain etiology: a population study and review of the problem, Medicina 55 (10) (2019): 687. doi: 10.3390/medicina55100687</p>
      <p>[2] S.M. Hollenberg, Valvular heart disease in adult: etiologies, classification, and diagnosis, FP essentials 457 (2017): 11-16.</p>
      <p>[3] S.S. Virani, et al., Heart disease and stroke statistics – 2020 update: a report from the American heart association, Circulation 141 (9) (2020): e139-e596. doi: 10.1161/CIR.0000000000000757</p>
      <p>[4] D. Chumachenko, et al., On agent-based approach to influenza and acute respiratory virus infection simulation, 14th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering, TCSET 2018 – Proceedings (2018): 192-195. doi: 10.1109/TCSET.2018.8336184</p>
      <p>[5] D. Chumachenko, On intelligent multiagent approach to viral hepatitis B epidemic processes simulation, Proceedings of the 2018 IEEE 2nd International Conference on Data Stream Mining and Processing, DSMP 2018 (2018): 415-419. doi: 10.1109/DSMP.2018.8478602</p>
      <p>[6] A.S. Tkachenko, et al., Semi-refined carrageenan promotes generation of reactive oxygen species in leukocytes of rats upon oral exposure but not in vitro, Wiener Medizinische Wochenschrift 171 (3-4) (2021): 68-78. doi: 10.1007/s10354-020-00786-7</p>
      <p>[7] N. Davidich, et al., Monitoring of urban freight flows distribution considering the human factor, Sustainable Cities and Society 75 (2021): 103168. doi: 10.1016/j.scs.2021.103168</p>
      <p>[8] D. Chumachenko, K. Chumachenko, S. Yakovlev, Intelligent simulation of network worm propagation using the Code Red as an example, Telecommunications and Radio Engineering 78 (5) (2019): 443-464. doi: 10.1615/TELECOMRADENG.V78.I5.60</p>
      <p>[9] I. Izonin, R. Tkachenko, I. Dronyuk, R. Tkachenko, M. Gregus, M. Rashkevych, Predictive modeling based on small data in clinical medicine: RBF-based additive input-doubling method, Mathematical Biosciences and Engineering 18 (3) (2021): 2599-2613. doi: 10.3934/mbe.2021132</p>
      <p>[10] S. Yakovlev, et al., The concept of development a decision support system for the epidemic morbidity control, CEUR Workshop Proceedings 2753 (2020): 265-274.</p>
      <p>[11] W. Jitkrittum, et al., Informative features for model comparison, Advances in Neural Information Processing Systems 31 (2018): 1-12.</p>
      <p>[12] T. Tran, et al., A framework for feature extraction from hospital medical data with applications in risk prediction, BMC Bioinformatics 15 (2014): 425. doi: 10.1186/s12859-014-0425-8</p>
      <p>[13] M. Riachi, J. Himms-Hagen, M.E. Harper, Percent relative cumulative frequency analysis in indirect calorimetry: application to studies of transgenic mice, Canadian journal of physiology and pharmacology 82 (12) (2004): 1075-83. doi: 10.1139/y04-117</p>
      <p>[14] R. Detrano, et al., International application of a new probability algorithm for the diagnosis of coronary artery disease, American Journal of Cardiology 64 (1989): 304-310.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>