=Paper= {{Paper |id=Vol-3348/short7 |storemode=property |title=Dimensionality Reduction of Chronic Kidney Disease Data using Principal Components Analysis |pdfUrl=https://ceur-ws.org/Vol-3348/short7.pdf |volume=Vol-3348 |authors=Tetyana Chumachenko,Kseniia Bazilevych |dblpUrl=https://dblp.org/rec/conf/profitai/ChumachenkoB22 }} ==Dimensionality Reduction of Chronic Kidney Disease Data using Principal Components Analysis== https://ceur-ws.org/Vol-3348/short7.pdf
Dimensionality Reduction of Chronic Kidney Disease Data using
Principal Components Analysis
Tetyana Chumachenkoa and Kseniia Bazilevychb
a
    Kharkiv National Medical University, Nauky ave., 4, Kharkiv, 61000, Ukraine
b
    National Aerospace University “Kharkiv Aviation Institute”, Chkalow str., 17, Kharkiv, 61070, Ukraine


                 Abstract
                 Chronic kidney disease is a long-term progressive decline in kidney function. Chronic kidney
                 disease is widespread throughout the world. The disease is diagnosed in 10-13% of adults, 20%
                 older than 60. Early diagnosis allows for taking timely and effective measures to reduce the
                 risk of developing chronic kidney disease. Automated diagnostics using machine learning
                 methods allow for making a diagnosis at an early stage with high accuracy. However, medical
                 data requires pre-processing, and many attributes can negatively affect the model's
                 performance. Therefore, the study of intelligent methods for reducing the dimension of medical
                 data samples is relevant. In this article, we have developed a data dimensionality reduction
                 model for patients with suspected chronic kidney disease based on principal component
                 analysis. As a result, an application was implemented that made it possible to reduce the sample
                 size from 13 to 2 principal components using the principal component analysis method.

                 Keywords 1
                 Dimensionality reduction, kidney disease, principal component analysis

1. Introduction
   Chronic kidney disease is a long-term progressive decline in kidney function [1]. Any cause can
cause the disease due to a significant kidney function impairment. The most common causes are diabetic
nephropathy, hypertensive nephrosclerosis, and primary and secondary glomerulopathies [2]. Also, a
common cause of kidney damage is metabolic syndrome, characterized by arterial hypertension and
type 2 diabetes mellitus.
   Chronic kidney disease is widespread throughout the world. The disease is diagnosed in 10-13% of
the adult population, 20% of whom are older than 60 years [3].
   Chronic kidney disease in the early stages is described as a decrease in renal reserve or renal
insufficiency that may progress. The loss of renal tissue function has practically no obvious pathological
manifestations because, due to the functional adaptation of the kidneys, the remaining tissue works
hard.
   With a moderate decrease in renal reserve, the course is often asymptomatic. Symptoms of the
disease develop slowly and in later stages include [4]:
   •     anorexia;
   •     nausea;
   •     vomiting;
   •     stomatitis;
   •     apathy;
   •     chronic fatigue;
   •     decreased clarity of consciousness;

2nd International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2022), December 2-4, 2022, Łódź, Poland
EMAIL: tatalchum@gmail.com (T. Chumachenko); ksenia.bazilevich@gmail.com (K. Bazivelych)
ORCID: 0000-0002-4175-2941 (T. Chumachenko); 0000-0001-5332-9545 (K. Bazilevych)
              ©️ 2022 Copyright for this paper by its authors.
              Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)
    •    fluid retention;
    •    muscle convulsions and spasms;
    •    peripheral neuropathies;
    •    epileptic seizures;
    •    etc.
    The presence of chronic kidney disease is first suspected with an increase in serum creatine levels
[5]. At the beginning of the diagnosis, it is determined whether the kidney failure is acute, chronic, or
acute, which has passed into chronic. The examination includes a urinalysis with microscopy of the
urinary sediment, an assessment of the level of electrolytes, urea nitrogen, creatinine, phosphates,
calcium, and a complete blood count. A history of elevated creatinine or abnormal urinalysis is most
helpful in the differential diagnosis. Therefore, early diagnosis is an essential tool for detecting and
preventing acute and chronic kidney disease development.
    In recent years, data-driven medicine and intelligent technologies for healthcare have been widely
developed. Such approaches are used for automated diagnostics [6], forecasting the development of
infectious morbidity [7], studying epidemic processes [8], analyzing medical data [9], studying factors
affecting the dynamics of morbidity [10], developing medical decision support systems [11], etc.
    Machine learning methods are the most effective for automated diagnostics and the development of
physician decision support systems. However, data sets often require pre-processing, and many data
attributes can reduce the accuracy of models. Therefore, an urgent task is to study methods for reducing
data dimension for their application to medical data.
    Therefore, this work aims to develop a model for reducing the dimensionality of these patients with
suspected chronic kidney disease based on principal component analysis.
    Research is part of a complex, intelligent information system for epidemiological diagnostics, the
concept of which is discussed in [12].

2. Materials and Methods
   Principal components analysis is one of the main methods of data dimensionality reduction, losing
the least amount of information [13]. The method is used in many areas, including pattern recognition,
computer vision, and data compression. The calculation of principal components is reduced to the
calculation of eigenvectors and eigenvalues of the covariance matrix of the original data or the singular
value decomposition of the data matrix.
   Principal components analysis has several basic versions [14]:
   •     Approximate data by linear manifolds of lower dimension;
   •     Find subspaces of lower dimension in the orthogonal projection on which the data spread, is
   maximum;
   •     Find subspaces of lower dimension in the orthogonal projection onto which the root-mean-
   square distance between points is maximal;
   •     For a given multidimensional random variable, construct such an orthogonal transformation of
   coordinates that, as a result, the correlations between individual coordinates will vanish.
   The first three options operate on finite data sets. They are equivalent and do not use any hypothesis
about statistical data generation. The last option operates with random variables. Finite sets appear here
as samples from a given distribution, and the solution of the first three problems approximates the true
Karhunen-Loeve transformation [15].
   Let there be n numerical features fj(x), j=1,…,n. The objects of the training sample will be identified
with their indicative descriptions:

                             𝑥𝑖 ≡ (𝑓1 (𝑥𝑖 ), … , 𝑓𝑛 (𝑥𝑖 )), 𝑖 = 1, … , 𝑙.                           (1)

   Consider the matrix F, the rows of which correspond to the indicative descriptions of training
objects:
                                        𝑓1 (𝑥1 ) … 𝑓𝑛 (𝑥1 )  𝑥1                                   (2)
                                𝐹𝑙×𝑛 = ( …       …    … ) = ( … ).
                                        𝑓1 (𝑥1 ) … 𝑓𝑛 (𝑥1 )  𝑥1

   Denote by zi = (g1(xi), …, gm(xi)) feature descriptions of the same objects in the new space Z=Rm of
lower dimension, m < n:

                                       𝑔1 (𝑥1 ) … 𝑔𝑚 (𝑥1 ) 𝑧1                                     (3)
                               𝐺𝑙×𝑛 = ( …       …   … ) = ( … ).
                                       𝑔1 (𝑥1 ) … 𝑔𝑚 (𝑥1 ) 𝑧1

   We require that the original feature descriptions can be restored from new descriptions using some
linear transformation determined by the matrix U=(ujs)n x m:
                                         𝑚
                                                                                                  (4)
                               𝑓̂𝑗 (𝑥) = ∑ 𝑔𝑠 (𝑥)𝑢𝑗𝑠 , 𝑗 = 1, … , 𝑛, 𝑥 ∈ 𝑋.
                                        𝑠=1

or in vector notation.

                                                𝑥̂ = 𝑧𝑈 𝑇 .                                       (5)

   The reconstructed description of the vector form does not have to exactly match the original
description x, but their difference on the objects of the training sample should be as small as possible
for the chosen dimension m. We will search simultaneously for the matrix of new feature descriptions
G and the linear transformation matrix U for which the total discrepancy Δ2(G,U) of the restored
descriptions is minimal:
                           𝑙                    𝑙                                                 (6)
             2 (𝐺,                      ‖2              𝑇        ‖2       ‖𝐺𝑈 𝑇       2
           ∆         𝑈) = ∑‖𝑥̂𝑖 − 𝑥𝑖          = ∑‖𝑧𝑖 𝑈 − 𝑥𝑖           =           − 𝐹‖ → min,
                                                                                          𝐺,𝑈
                          𝑖=1                  𝑖=1

where all norms are Euclidean.
   Assume that the matrices G and U are non-degenerate: rank G = rank U = m. Otherwise there would
be a representation

                                               𝐺̅ 𝑈
                                                  ̅ 𝑇 = 𝐺𝑈 𝑇 ,                                    (7)

with the number of columns in the matrix 𝐺̅ less than m. Therefore, only cases where m ≤ rank F are of
interest.
   If m ≤ rank F, then the minimum of Δ2(G, U) is reached when the columns of the matrix U are the
FTF eigenvectors corresponding to the m maximum eigenvalues. Moreover, G = FU, the matrices U
and G are orthogonal.
   The main limitations of the principal component method are:
   •      The impossibility of semantic interpretation of the components, since they include dispersion
   from several initial variables;
   •      The method can only work with continuous data.

3. Results

    The Python language was used to build the dimensionality reduction model. For experimental
studies, a dataset of patients with suspected chronic kidney disease was used [16]. It contains measures
of 24 features for 400 people. 14 features are numerical and 10 are categorical. Description of features
is presented in Table 1.
Table 1
Dataset description
            Feature                            Scale type                         Range
         BloodPressure                           Metric                          50…180
         SpecificGravity                         Metric                           1…1.02
            Albumin                              Metric                             0…5
              Sugar                              Metric                             0…5
          RedBloodCell                          Boolean                             0,1
           BloodUrea                             Metric                          1.5…391
        SerumCreatinine                          Metric                           0.4…76
            Sodium                               Metric                          4.5…163
           Pottasium                             Metric                           2.5…47
          Hemoglobin                             Metric                         3.1…17.8
      WhiteBloodCellCount                        Metric                        2200…26400
       RedBloodCellCount                         Metric                            2.1…8
          Hypertension                          Boolean                             0,1
         PredictedClass                         Boolean                             0,1

   The dataset visualization is presented in Figure 1.




Figure 1: Data visualization

   Initially, the data set was divided into objects and the objective function, and data processing was
performed. Then an instance of the principal components analysis object was created and the data
dimension was reduced while maintaining the two principal components. Thus, the new dimension of
the dataset has the form (400, 2), principal component shape is (2, 13). Principal components analysis
coefficients are presented in Table 2.

Table 2
Principal components coefficients.
                           0.1785287                             -0.08201316
                           -0.2972433                             0.30440845
                           0.34606806                            -0.20975004
                           0.17004891                            -0.30040035
                          -0.19416257                             0.10444567
                           0.34101035                             0.29990523
                           0.28353996                             0.52222925
                          -0.25673523                            -0.38659614
                           0.10679924                             0.20726542
                          -0.39901579                             0.06977409
                           0.09928799                            -0.42844184
                           -0.3681989                             0.05511764
                           0.33874907                            -0.09441467

   Visualization of the obtained results is shown in Figure 2.




Figure 2: Results visualization


4. Conclusions

   Chronic kidney disease is a pressing problem worldwide. An early diagnosis is an effective tool for
reducing the development of chronic kidney disease. Machine learning methods make it possible to
build models for the early detection of a disease, which allows for taking timely, practical measures to
counteract the development of the disease. However, medical data requires preliminary preparation,
and datasets containing many attributes can reduce the accuracy of diagnostic models. Therefore, within
the framework of this study, a model for reducing the dimensionality of data from patients with
suspected chronic kidney disease was developed based on the principal component analysis method.
The constructed model made it possible to reduce the dimension of the dataset from 13 to 2.
   Further research will combine the developed model with machine learning models to classify
suspected chronic kidney disease patients.

5. Acknowledgements
   The study was funded by the National Research Foundation of Ukraine in the framework of the
research project 2020.02/0404 on the topic “Development of intelligent technologies for assessing the
epidemic situation to support decision-making within the population biosafety management”
6. References

[1] C. Charles, A.H. Ferris, Chronic kidney disease, Primary care 47 (4) (2020) 585-595.
     doi: 10.1016/j.pop.2020.08.001
[2] M. Provenzano, et. al., Epidemiology of cardiovascular risk in chronic kidney disease patients: the
     real silent killer, Reviews in cardiovascular medicine 20 (4) (2019) 209-220.
     doi: 10.31083/j.rcm.2019.04.548
[3] V. Jha, et. al., Chronic kidney disease: global dimension and perspectives, Lancet 382 (9888)
     (2013) 260-72. doi: 10.1016/S0140-6736(13)60687-X.
[4] M.G. Shlipak, et al., The case for early identification and intervention of chronic kidney disease:
     conclusions from a kidney disease: improving global outcomes controversies conference, Kidney
     international 99 (1) (2021) 34-47. doi: 10.1016/j.kint.2020.10.012
[5] T.K. Chen, D.H. Knicely, M.E. Grams, Chronic kidney disease diagnostics and management: a
     review, Journal of American Medical Association 322 (13) (2019) 1294-1304.
     doi: 10.1001/jama.2019.14745
[6] M.A. Myszczynska, et. al., Applications of machine learning to diagnosis and treatment of
     neurodegenerative diseases, Nature Reviews Neurology 16 (2020) 440-456. doi: 10.1038/s41582-
     020-0377-8
[7] D. Chumachenko, et. al., Investigation of statistical machine learning models for COVID-19
     epidemic process simulation: random forest, k-nearest neighbors, gradient boosting, Computation
     10 (6) (2022) 86. doi: 10.3390/computation10060086
[8] D. Chumachenko, On intelligent multiagent approach to viral Hepatitis B epidemic processes
     simulation, Proceedings of the 2018 IEEE 2nd International Conference on Data Stream Mining
     and Processing, DSMP 2018 (2018) 415-419. doi: 10.1109/DSMP.2018.8478602
[9] R. Tkachenko, et. al., Committee of the SGTM neural-like structures with extended inputs for
     predictive analytics in insurance, Communications in Computer and Information Science 1054
     (2019). doi: 10.1007/978-3-030-27355-2_9
[10] N. Davidich, et. al. Monitoring of urban freight flows distribution considering the human factor,
     Sustainable Cities and Society 75 (2021) 103168. doi: 10.1016/j.scs.2021.103168
[11] D. Chumachenko, et. al. Intelligent expert system of knowledge examination of medical staff
     regarding infections associated with the provision of medical care, CEUR Workshop Proceedings
     2386 (2019) 321-330.
[12] S. Yakovlev, et. al. The concept of developing a decision support system for the epidemic
     morbidity control, CEUR Workshop Proceedings 2753 (2020) 265-274.
[13] I.T. Jolliffe, J. Cadima, Principal component analysis: a review and recent developments,
     Philosophical Transactions of the Royal Society 374 (2065) (2016) 20150202.
     doi: 10.1098/rsta.2015.0202
[14] D. Groth, S. Hartmann, S. Klie, J. Selbig, Principal components analysis, Methods in molecular
     biology 930 (2013) 527-47. doi: 10.1007/978-1-62703-059-5_22
[15] Y. Zhou, X. Ai, M. Lv, B. Tian, Karhunen-Loève Expansion for the Second Order Detrended
     Brownian Motion, Abstract and Applied Analysis 2014 (2014) 457051. doi: 10.1155/2014/457051
[16] Chronic Kidney Disease Data Set, UCI Machine Learning Repository (2019). Available at:
     https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease (accessed on 30.09.2022)