Application of medical data classification methods for a
               medical decision support system*

    Ekaterina Yu. Zimina1[0000-0002-8625-1956], Maxim A. Novopashin2[0000-0002-8919-4002] and
                         Alexander V. Shmid 1[0000-0002-4672-1458]
      1
       National Research University Higher School of Economics, 11, Pokrovsky Boulevard,
                             Moscow, 101000, Russian Federation
    2
      EC-leasing Company, 125, Varshavskoe highway, Moscow, 117587, Russian Federation
                                     ezimina@hse.ru


          Abstract. Decision support systems (DSS) allow us to help the doctor in
          making diagnoses to the patient, also medical DSS help to assess the need for a
          particular examination of the patient. In this article methods of medical data
          classification are considered, these methods are the part of the medical DSS.
          The paper includes investigation of data classification methods as hierarchical
          cluster analysis, k-means analysis and discriminant analysis. The selected
          methods are implemented using the example of cardiological data. A hypothesis
          is put forward that it is possible to determine the presence or absence of
          tuberculosis in a person from cardiological data by using data classification
          methods. Such indicators as sensitivity and specificity evaluate the
          effectiveness of the methods. In addition, ROC and AUC are presented. Thus,
          the DSS will be able to determine a certain degree of probability to assume the
          presence of tuberculosis in a person. The doctor will decide on the need for
          additional examinations depending on the values obtained,

          Keywords: Decision Support System, Data Analysis, Telemedicine,
          Classification.


1         Introduction

Currently, the creation of decision support systems (DSS) is relevant, and this
direction is also developing in the field of medicine. DSS allow us to help the doctor
in making diagnoses to the patient. In addition, with the help of these systems, it is
possible to determine the need for various examinations for the patient [1]. The use of
the medical DSS for doctors will prevent patients from being sent to expensive
additional examinations, which are not always safe [2].
   The paper discusses methods of data analysis that will be implemented in the
medical DSS in order to help doctors. The paper implements such methods as
hierarchical cluster analysis, k-means analysis and discriminant analysis.

*
    Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
    The parameters of electrocardiogram (ECG) were used as experimental data. This
data is depersonalized.
    With implementing the methods of medical DSS a hypothesis is put forward about
the possibility of predicting the presence or absence of tuberculosis by ECG
parameters. The sample contains a nominal variable (tb), which reflects the presence
of diagnosed tuberculosis in a person (tb = 1) or its absence (tb = 0). The
experimental data collected ECGs recorded in people with a confirmed form of
tuberculosis in the second stage of the disease.
    Currently ETU “LETI” under the leadership of Professor Kalinichenko A. N. is
doing the similar studies. However, investigations of ETU “LETI” have a direction
different from this work. They research the detection of signs of cardiac disease in
ECG using machine learning [3].
    The sample used in the study is divided into training and test samples, where the
mathematical model is created on the training one, and the quality of the obtained
model is evaluated on the test one. As a result, a model and an accuracy value of the
correct prediction of belonging to the group are obtained for each of the considered
methods.
    An approach with training of DSS methods based on medical data will make it
possible to make an early diagnosis of the patient's health condition. This means that
it is possible to assume with a certain degree of probability that a person has signs of
tuberculosis or not according to the recorded ECG. If possible signs of the disease are
detected, this patient should be sent to get a more detailed examination together with a
pulmonologist.
    The purpose of this work is to implement classification methods to the medical
SPR for early diagnosis to determine the presence or absence of tuberculosis signs. In
accordance with this goal, the following tasks were identified: to compile descriptive
statistics of the initial experimental data, to investigate and apply methods for
classification on experimental data, and to formulate a conclusion.
    The performance of the methods is evaluated using sensitivity and specificity
indicators. Sensitivity is the percentage of correctly classified "ill" people, and
specificity is the percentage of correctly classified "healthy" people [4]. In addition, a
ROC is constructed for the results of the methods and the area under the curve (AUC)
is calculated. The ROC curve is a tool for assessing diagnostic ability, representing a
graph where the sensitivity and specificity values in the range from 0 to 1 are taken as
axes [5].


2      Materials and methods

2.1    Materials
General information of the person and parameters of his cardicycle were taken to
study the methods of data classification and subsequent verification of the proposed
hypothesis about the possibility of determining the presence or absence of
tuberculosis in a person from cardiological data.
   A cardiocycle (or cardiac cycle) is a period of blood circulation generated by the
cyclic activity of the heart. The measurement unit for this periodicity is one cardiac
cycle. The length of the cardiocycle is the period of cardiac contractions [6]. The
elements of the cardiocycle are presented below (Fig. 1).


                         Fig. 1. The elements of the cardiac cycle.

   The main elements for the ECG analysis are the start and end time of the elements
of the cardiocycle, as well as the PQ interval, the QRS complex, the ST segment, the
QT interval and the P wave stands out especially among the elements of the
cardiocycle.
   The data is presented in the form of a table, where each row corresponds to one
ECG. Also in the same row in columns contains non-personally-identifying
information about the person. Below is a table with parameters for analysis and
explanations to them (Table 1).
   The total sample consists of 5928 registered ECGs. Below is a table with general
information about people from the data sample (Table 2).
   The distribution of the number of people according to their age classification is
also presented (Fig. 2). In the sample by age, there is a bias towards people over 18
years old. This is explained by the fact that ECG registration of persons under 18
years old is possible in the presence of a parent and with his permission, so there were
few persons of younger groups in the collected data.
                           Table 1. Data parameters for analysis.

         No             The name of parameter                       Comments
 1                   pid                              Patient identification number
 2                   cid                              Cardiogram identification number
 3                   date1                            Date of registration of the ECG
 4                   gender                           Gender (1 – M, 0 – W)
 5-7                 age, weight, height              Age, weight, height
 8                   cardiostimulator                 Presence of pacemaker (1 – yes, 0 –
                                                      no)
 9                   smoking                          Smoking (1 - yes, 0 – no)
 10                  Tb                               Presence of diagnosed tuberculosis
                                                      (1 – yes, 0 – no)
 11-13               p_a, p_da, p_t                   Parameters of P wave
 14-15               p_left_slopes, p_right_slopes The length of the slopes of P wave
 16-18               q_a, q_b_t, q_e_t                Parameters of Q wave
 19-21               r_a, r_b_t, r_e_t                Parameters of R wave
 22-23               r_left_slopes, r_right_slopes    The length of the slopes of R wave
 24-27               s_a, s_da, s_b_t, s_e_t          Parameters of S wave
 28-30               t_a, t_da, t_t                   Parameters of T wave
 31-32               t_left_slopes, t_right_slopes    The length of the slopes of T wave
 33                  interval_pq                      The length of interval PQ
 34                  komplex_qrs                      The length of complex QRS
 35                  segment_st                       The length of segment ST
 36                  interval_qt                      The length of interval QT
 37                  zubets_p                         The length of P wave
   Where: X_a – the amplitude (height) of the figure X on the ECG, Х_da – the
amplitude (height) indicator X on the differentiated ECG (ECG is taken first
production), X_t – length index of X, Х_b_t start time of metric X on ECG, Х_e_t –
the end time of the index X on ECG.

                              Table 2. The number of people.

              All     With            Without With           Without         Smo      No
                      tuberculo       tuberculo pacemaker    pacemaker       king     smokin
                      sis             sis                                             g
Men           329     70              259       37           292             121      208
Women         262     66              196      25            237             59       203
All           591     136             455      62            529             180      411
                       Fig. 2. Distribution of the number of people by age.

   The following table includes the main values of the number of ECGs for different
groups (Table 3).

                                     Table 3. The number of ECG.


                                                                                               No smoking
                             tuberculosis


                                            tuberculosis


                                                           pacemaker


                                                                       pacemaker


                                                                                    Smoking
                                              Without


                                                                        Without
                                With


                                                             With
                 All


Men             3546          573           2973            776        2770        1945       1601
Women           2382          684           1698            115        2267        972        1410
All             5928         1257           4671            891        5037        2917       3011


   It should be noted that not all parameters of the cardiocycle were calculated for all
ECGs, so observations with partially uncalculated parameters were automatically
discarded methods implementation.


2.2    Methods
This section describes methods of data analysis and provides brief information on
them [7].
   Cluster analysis is used to separate the original data into groups (clusters) that are
amenable to interpretation so that the elements of one group were similar in the
parameters, while elements from different groups should differ from each other [8]
(Fig. 3).
         Fig. 3. An example of the points separation in the plane into similar clusters.

   Hierarchical cluster analysis is used for relatively small numbers of observations.
During the analysis, initially each observation is located in its own cluster, then
neighboring clusters are combined in pairs until there are only two clusters left.
   K-means analysis allows you to divide an arbitrary data set into a given number of
groups so that the objects of the same cluster are close enough to each other, and the
objects of different groups do not intersect [9]. In this case, the observation belongs to
the cluster to the center of which it is the closest.
   First, the center of the class is determined, then all objects within the specified
threshold value from the center are grouped.
   Discriminant analysis is a method of statistical analysis that allows you to divide
data into disjoint groups. This method allows us to identify the variables that affect
the separation, as well as their weight coefficients [10]. The result of performing a
discriminant analysis is a discriminant function that uses a nominal dependent
variable. Discriminant analysis is an alternative to multiple regression analysis.


3      Results

IBM SPSS Statistics 23 software was used in order to implement selected methods.
IBM SPSS Statistics is a statistical analysis platform with a set of functions [11].
   The use of the hierarchical cluster analysis method did not lead to significant
results. The number of observations in 5928 recorded ECGs was too large as a sample
for this method.
   Further, the number of observations in the sample was reduced to 50% of randomly
selected observations. As an assumption, a range was set for the number of classes:
there should be 2-3 clusters. This was done because a huge number of clusters are
obtained on this sample without this restriction. The values in two or three clusters
were chosen based on the following: we need to get two clusters with measurements
of people without tuberculosis and with tuberculosis. The possible number of three
clusters is taken to compare the results.
   The model was built, its variables were saved with the indication of belonging to
the cluster. A conjugacy table was constructed for two variables containing
information about the distribution of all variables into 2 and 3 clusters in order to
evaluate the performance of this method (Figure 4).


            Fig. 4. A distribution of observations by division into 2 and 3 clusters.

According to the table almost all observations fell into the first cluster, it also
observed when the data divided into two clusters and into three clusters. It is also
seen that the second and third clusters in both divisions are very small relative to the
first cluster. If we compare the decision to divide into two clusters or three, we can
conclude that the two-cluster solution is the most stable.
    Further, for the sake of clarity of the obtained solution an analysis of the averages
was carried out, the part of the resulting picture is presented below (Fig. 5).


                  Fig. 5. The average values when divided into two clusters.

   The target variable – the variable of the presence or absence of tuberculosis in this
dimension (tb). During the application of hierarchical clustering it was found that the
average values of clusters are 0.24 and 0.73, where 0 is “healthy” and 1 is “sick”. You
can also pay attention to other parameters, for example, taller people fell into the
“healthy” cluster, and almost all smokers fell into the “sick” cluster. It is worth noting
the average weight of the subjects in the second cluster – 138 kg, which is quite a lot.
   When analyzing this model in detail we conclude that hierarchical clustering is not
suitable for working with this sample of medical data.
   When implementing the k-means method two clusters are initially set: people
without a diagnosis of tuberculosis and people with diagnosed tuberculosis (parameter
tb=0 and tb=1, respectively). There are many observations in the medical dataset, so
10 iterations were set for the method to work.
   The centers of the two clusters obtained are presented below (Fig. 6).
                       Fig. 6. Cluster centers in k–means clustering.

   We also obtained an estimate based on Fischer statistics on the significance of the
parameter in the differentiation of clusters (Fig. 7).The figure below reveals an
example that shows that the target variable tb is significant, as is weight, height, and
smoking. The most significant parameters among the parameters of the cardiocycle
are the R wave, S wave and the QRS complex.
   In addition, the k-means method obtained results is similar to the hierarchical
clustering method: outputs data on the number of observations in clusters are 2331
observations in the first cluster and 30 observations in the second cluster. The
obtained values coincided with the values for the number of observations when
dividing into two clusters during hierarchical clustering. These calculations were
obtained by randomly selecting 50 % of all observations.
When using the k-means method with the same parameters on a full sample the
following division was obtained by the number of observations in clusters: 4707 and
56 observations, respectively. Thus, the result was obtained that one cluster is
dominated by data when clustering into two groups.
   Two additional parameters were created using of the k-means clustering method:
indicating the number of the membership cluster and the distance to its center.
   Next a graphical illustration of the results of this method was constructed: the
grouping variable is the cluster number, the differentiating variable is the distance to
the cluster center. The figure below, as well as the line on the cluster, shows the
median value (Fig. 8).
                            Fig. 7. Values of Fisher statistics.

   Based on the results of the analysis it can be concluded that the parameters of the R
wave were the most significant parameters in clustering by this method. Below is the
spread of the R wave indicator depending on the presence or absence of tuberculosis
(Fig. 9).
   The figure shows that the variations of this indicator differ depending on the
presence or absence of tuberculosis, but visually almost half of the values of the
indicator are the same both in the presence of tuberculosis and in its absence.
According to the results of the clustering analysis the k-means indicator is the most
significant when divided into groups. This leads to the conclusion that the model is
not sufficiently accurate using the k-means method.


Fig. 8. The vertical axis - distance from observation to cluster center, the horizontal axis is the
                                         cluster number.

    Then the discriminant method was implemented and investigated. The sample was
first divided into training and test samples: 60% and 40%, respectively [12-13]. For
implementation the method of forced inclusion of variables was used and grouping
was performed by the variable of the presence or absence of tuberculosis tb.
    A table "Group statistics" was obtained with an indication of the average values of
each parameter, its standard deviation by group. The inequality of the mean and
standard deviation does not prove that these variables are distinctive features of the
selected clusters.
    The figure below shows the calculated values of the variables. Parameters whose
values in the table are greater than 0.05 can be excluded later for analysis purposes.
    From the figure below it can be seen that there are parameters that are insignificant
when divided into groups, for example, p_da, t_da and others. Thus, they can be
excluded when composing the equation (Fig. 10).
    The coefficients of the canonical discriminant function were also obtained to create
the equation (Fig. 11).
    The accuracy of the division into clusters is determined by the distance between
the average values of the discriminant function in the studied clusters. The greater the
distance, the better the groups are separated. The values of the centroids of the groups
are as follows: -0.376 and 1.242.
                     Fig. 9. The scatter plot of figure R wave and tb.

   You can determine the quality of the model based on the results of the
classification at the following table (Table 4).

                Table 4. The results of classification discriminant analysis.

                                            Predicted group membership
                                                                                Total
                                   Tb       0             1
                                        0   1826          389                   2215
                      Quantity
Selected                                1   200           500                   700
observations                            0   82.4          17.6                  100
                      %
                                        1   28.6          71.4                  100
                                        0   1195          243                   1438
                      Quantity
Unselected                              1   117           293                   410
observations                            0   83.1          16.9                  100
                      %
                                        1   28.5          71.5                  100

   In the training sample the sensitivity is 71.4% and the specificity is 82.4%. In the
test sample the sensitivity is 71.5% and the specificity is 83.1%. This shows good
accuracy of this model.
   In addition, a ROC curve was constructed, the area under the curve of which was
0.853 (Fig. 12).
    Fig. 10. Evaluation of the significance of the parameter in the distribution into groups.

   Using the table with the coordinates of the curve points the threshold value for the
final discriminant equation 0.4511434 was selected. At this threshold the sensitivity is
76.4% and the specificity is 76.5%.
   The threshold value was selected from the points of the coordinate ROC. The
sensitivity and specificity values were selected so that the sum of sensitivity and
specificity was the maximum.
                    Fig. 11. Coefficients of the discriminant function.


                         Fig. 12. ROC in discriminant analysis.


4     Discussion

Three methods were implemented: hierarchical cluster analysis, k-means analysis, and
discriminant analysis.
    Analysis of the hierarchical cluster method showed that this method is not suitable
for analyzing large datasets. Even with the usage of reduction in the number of
observations in the sample it was not possible to obtain acceptable results.
    Analysis of the k-means method showed that this method can be used for
classification problems into two clusters "sick" and "healthy", but the accuracy of this
method did not show high results. The parameters that were identified by the method
as the most significant have not significant differences in the spread between
"healthy" and "sick". For more accurate operation of this method it is necessary to
filter out the least significant parameters and continue a more detailed study.
    The discriminant method analysis allowed us to obtain a discriminant equation
with sensitivity and specificity values of 76.4% and 76.5% respectively. Based on the
selected sensitivity and specificity values, a threshold was selected for working with
the discriminant equation.
    In order to implement the best method in terms of sensitivity and specificity in
medical DSS it should be tested on a larger sample size. Also for better accuracy in
predicting the probability of a person having second-stage tuberculosis, cross-
validation should be performed.


5      Conclusion

In this paper several classification methods for working in the medical DSS were
investigated. The idea of creating a medical DSS is as follows: according to the ECG
parameters the trained methods determine the degree of probability of the presence of
tuberculosis of the second type in the examined person, whose open symptoms are
practically not observed. Thus, the system will help in the early stages of the disease
to determine the presence of it on the ECG.
   Medical data were used as experimental data in order to train DSS methods.
Medical data includes parameters calculated from an electrocardiogram, as well as
general impersonal parameters about its owner (height, weight, age, etc.). The
experimental sample consisted of more than five thousand electrocardiograms. Also a
hypothesis was put forward and tested about the possibility of determining the
presence or absence of signs of tuberculosis in a person by the parameters of an
electrocardiogram. This hypothesis was confirmed.
   Three methods were implemented: hierarchical cluster analysis, k-means analysis,
and discriminant analysis. The program for statistical data processing IBM SPSS
Statistics was used to carry out the work.
   Of the methods considered in this paper the most suitable for the classification
problem with a nominal target variable on the example of the study of medical
experimental data was the method of discriminant analysis. This method is similar to
regression analysis, which will be studied further, as well as other classification
methods that are not included in this work. In the future the model should be refined
to obtain a higher accuracy of the medical DSS.
References
 1. Sim, I., et al.: Clinical decision support systems for the practice of evidence-based
    medicine. Journal of the American Medical Informatics Association, 8.6, 527-534 (2001).
 2. Golub, J., et al.: Delayed tuberculosis diagnosis and tuberculosis transmission. The
    international journal of tuberculosis and lung disease, 10.1, 24-30 (2006).
 3. Nemirko, A., Manilo, L., Kalinichenko, A.: Intellectual analysis of biomedical signals.
    Biotekhnosfera Journal 2(20), 30-37 (2012).
 4. Parikh, R., et al.: Understanding and using sensitivity, specificity and predictive values.
    Indian journal of ophthalmology, 56.1, 45-50 (2008).
 5. Fawcett, T.: An introduction to ROC analysis. Pattern recognition letters, 27.8, 861-874
    (2006).
 6. Lu, B. I. N., et al.: Coronary artery motion during the cardiac cycle and optimal ECG
    triggering for coronary artery imaging. Investigative radiology, 36.5, 250-256 (2001).
 7. Ott, R. Lyman, Longnecker, M. T.: An introduction to statistical methods and data
    analysis. 7th edn., Brooks/Cole, Boston, United States (2015).
 8. Kaufman, L., Rousseeuw P. J.: Finding groups in data: an introduction to cluster analysis.
    John Wiley & Sons, New York, United States (2009).
 9. Kanungo, T., et al.: An efficient k-means clustering algorithm: Analysis and
    implementation. IEEE transactions on pattern analysis and machine intelligence, 24.7,
    881-892 (2002).
10. McLachlan, G. J.: Discriminant analysis and statistical pattern recognition. John Wiley &
    Sons, New York, United States (2004).
11. Field A.: Discovering statistics using IBM SPSS statistics. 4th edn., SAGE Publications
    Ltd., London (2013).
12. Breiman L.: Bagging predictors. Machine Learning, 24, 123-140 (1996).
13. Fletcher, G., Ades, P., Kligfаild, P.: Exercise standards for testing and training: a scientific
    statement from the American Heart Association. Circulation Journal, 128, 873-934 (2013).