Dimensionality Reduction of Data on Patients with Diabetes Mellitus by Multidimensional Scaling

Ievgen Meniailov a, Serhii Krivtsov b and Tetyana Chumachenko c

a V.N. Karazin Kharkiv National University, Kharkiv, Ukraine
b National Aerospace University “Kharkiv Aviation Institute”, Kharkiv, Ukraine
c Kharkiv National Medical University, Kharkiv, Ukraine

Abstract
Diabetes Mellitus is a global public health problem. According to the World Health Organization, more than 6% of the world's population suffers from diabetes. In the context of the Russian invasion, the problem of diabetes is especially relevant for Ukraine because of the difficulty of supplying medicines and obtaining medical care. In addition, the stress caused by the war is one of the factors in the onset and complications of diabetes. Automated models and information technologies for classifying patients with suspected diseases are practical decision support tools for making medical diagnoses in resource-limited settings. One of the problems with using such models is data redundancy. Therefore, this study uses multidimensional scaling to reduce the dimensionality of data on patients with suspected type II Diabetes Mellitus.

Keywords
Diabetes Mellitus, dimensionality reduction, multidimensional scaling

1. Introduction

Diabetes Mellitus is a disease characterized by elevated blood sugar levels, which lead to kidney damage, impaired vision, and disorders of the nervous and vascular systems [1]. There are different types of diabetes, and depending on the type, a patient requires specific treatment based on lifestyle changes, dietary choices, and medications. The disease can progress without symptoms for a long time, so many patients do not seek medical help promptly.
The risk factors for diabetes include [2]:
• cardiovascular diseases;
• a diet dominated by carbohydrates, which disrupts their metabolism;
• overweight and obesity;
• genetic predisposition;
• chronic stress;
• long-term use of drugs that contribute to the development of diabetes.

The main symptoms of the disease are:
• dry mouth and intense thirst;
• frequent and profuse urination;
• dry skin and mucous membranes;
• general weakness and fatigue;
• increased appetite;
• decreased vision;
• leg muscle cramps.

IDDM-2022: 5th International Conference on Informatics & Data-Driven Medicine, November 18-20, 2022, Lyon, France
EMAIL: evgenii.menyailov@gmail.com (IM); krivtsovpro@gmail.com (SK); tatalchum@gmail.com (TC)
ORCID: 0000-0002-9440-8378 (IM); 0000-0001-5214-0927 (SK); 0000-0002-4175-2941 (TC)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

The most common type is type II diabetes, which is characterized by high levels of insulin combined with low sensitivity of body cells to it [4]. This leads to damage to internal organs: the retina, small vessels, nerves, and kidneys are affected, and impaired nourishment of the skin on the ankles leads to trophic ulcers.

More than 400 million adults live with diabetes worldwide, and the number is growing yearly [5]. More than 60 million people have diabetes in the European Region [6]. To date, there are no official statistics on the incidence of diabetes in Ukraine. In 2017, 1.27 million people with diabetes were registered in Ukraine [7]. Among them, 200,000 patients need daily insulin. One of the most effective tools to combat diabetes is its prevention and early detection.

With the spread of the COVID-19 pandemic in the world, the number of studies aimed at applying information technology in healthcare has increased.
Such studies were aimed at modeling the epidemic process [8, 9], analysis of medical data [10], analysis of medical images [11], analysis of factors in the spread of morbidity [12], analysis of the behavior of the virus [13], a study of the information content of factors affecting the dynamics of morbidity [14], etc.

Using mathematical modeling and information technologies to support doctors' decision-making when making medical diagnoses is practical. A problem in building models of medical diagnostics is the redundancy of data; therefore, reducing the dimensionality of data on patients with a suspected disease is an urgent task.

This study aims to develop a model to reduce the dimensionality of patient data on the incidence of diabetes based on the multidimensional scaling method. The research is part of a complex intelligent information system for epidemiological diagnostics, the concept of which is discussed in [15].

2. Materials and Methods

In principle, the more information about the objects of study, in the form of a set of characterizing features, is used to create a model, the better. However, too much information can reduce the efficiency of data analysis. Non-informative features are a source of additional noise and affect the accuracy of model parameter estimation. In addition, datasets with a large number of features may contain groups of correlated variables. The presence of such groups means duplication of information, which can distort the model's specification and affect the quality of the estimation of its parameters. The higher the data dimension, the greater the volume of calculations during algorithmic processing [16]. High dimensionality can mean hundreds, thousands, or even millions of input variables. When dealing with high-dimensional data, it is often helpful to reduce the dimensionality by projecting the data onto a lower-dimensional subspace that retains the "essence" of the data.
This is called dimensionality reduction [17]. Fewer input dimensions often mean fewer parameters, or a simpler structure, in a machine learning model; these are referred to as degrees of freedom. A model with too many degrees of freedom is likely to overfit the training dataset and, therefore, may perform poorly on new data.

The multidimensional scaling method is one of the well-known non-linear dimensionality reduction methods used to analyze the proximity (similarity or dissimilarity) of data by reducing the data to a low-dimensional space [18]. It is also one of the first fundamental learning methods of this kind.

Multidimensional scaling (MDS) is a set of statistical methods dealing with the problem of constructing a configuration of n points in Euclidean space using dissimilarity information between n objects. The dissimilarities need not be Euclidean distances between the objects; they can represent many types of dissimilarity. MDS aims to map the objects to a configuration (or embedding) of points in such a way that the given dissimilarities are well approximated by the Euclidean distances between the points [19].

MDS generally attempts to model data such as distances between points in a geometric space. The main reason for this is that a graphical representation of the data structure is required, which is much easier to understand than an array of numbers and, in addition, reflects the essential information in the data while smoothing out the noise [20].

In MDS analysis, the data is typically embedded in a 2D or 3D map such that the distances between points match the given similarities or dissimilarities as closely as possible. Objects of interest, such as items, attributes, stimuli, or respondents, correspond to points in such a way that points close together are empirically similar and points far apart are considered different.

To evaluate the simulation result, two metrics were applied: Euclidean Distance [21] and Manhattan Distance [22].
The Euclidean Distance can be calculated from the Cartesian coordinates of points using the Pythagorean theorem, which is why it is sometimes called the Pythagorean distance. For observations a and b measured in N dimensions, it is $d_E(a, b) = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}$. It should be noted that even if the dimensions are rescaled, normalized, or weighted, the distance is still computed in the same way. This is a good default distance measure if it makes sense to compare the dimensions with one another.

Manhattan or city-block distance is a distance introduced by Hermann Minkowski. According to this metric, the distance between two points equals the sum of the absolute differences of their N coordinates: $d_M(a, b) = \sum_{i=1}^{N} |a_i - b_i|$. It is important to note that the Manhattan distance depends on the rotation of the coordinate system but does not depend on its reflection about a coordinate axis or on translation.

3. Results

For the experimental investigation, the Pima Diabetes dataset [23] was used. Table 1 shows the parameters of the dataset. The distribution of values for each parameter is presented in Figure 1.

Table 1
Parameters of the dataset
Name               Scale type   Data range
Pregnancies        Metric       0…13
PG Concentration   Metric       44…197
Diastolic BP       Metric       0…110
Tri Fold Thick     Metric       0…60
Serum Ins          Metric       0…846
BMI                Metric       0…46.8
DP Function        Metric       0.134…2.288
Age                Metric       21…60
Diabetes           Nominal      Sick / Healthy

Figure 1: Distribution of parameters.

The data dimensionality reduction model based on the multidimensional scaling method was implemented in the Python programming language in the Anaconda environment. Table 2 shows the imported data.
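As a brief illustration of the two distance measures defined in Section 2, the sketch below computes both for a pair of hypothetical two-dimensional points and then repeats the computation after rotating the coordinate system by 45 degrees. The function names and the example points are illustrative assumptions and are not taken from the study's implementation.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rotate(point, angle):
    """Rotate a 2D point about the origin by `angle` radians."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


# Hypothetical example points (not patient data).
a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))   # 5.0
print(manhattan(a, b))   # 7.0

# After a 45-degree rotation the Euclidean distance is unchanged
# (up to rounding), while the Manhattan distance becomes 4 * sqrt(2).
ra, rb = rotate(a, math.pi / 4), rotate(b, math.pi / 4)
print(euclidean(ra, rb))
print(manhattan(ra, rb))
```

The rotated pair shows why the choice of metric matters: the same two records can be farther apart or closer together under the Manhattan distance depending on how the feature axes are oriented, whereas the Euclidean distance is not affected.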
Table 2
Input data
#     Pregnancies   PG Concentration   Diastolic BP   …   DP Function   Age   Diabetes
0     6             148                72             …   0.627         50    Sick
1     1             85                 66             …   0.351         31    Healthy
2     8             183                64             …   0.672         32    Sick
3     1             89                 66             …   0.167         21    Healthy
4     0             137                40             …   2.288         33    Sick
…     …             …                  …              …   …             …     …
763   10            101                76             …   0.171         63    Healthy
764   2             122                70             …   0.340         27    Healthy
765   5             121                72             …   0.245         30    Healthy
766   1             126                60             …   0.349         47    Sick
767   1             93                 70             …   0.315         23    Healthy

After the data are imported, the console displays the dissimilarity (distance) matrices, the new data sets, and the stress values for the multidimensional scaling method based on the two metrics, Manhattan and Euclidean. The dissimilarity matrices are shown in the tables below: Table 3 for Manhattan MDS and Table 4 for Euclidean MDS.

Table 3
Manhattan MDS
[[0 312 199 … 432 249 299]
 [312 0 335 … 206 119 83]
 [199 335 0 … 441 320 402]
 … … … … … … …
 [432 206 441 … 0 239 213]
 [249 119 320 … 239 0 118]
 [299 83 402 … 213 118 0]]

Table 4
Euclidean MDS
[[0 178.4320599 106.465957 … 256.4488253 161.78689687 192.82893974]
 [178.4320599 0 201.86876925 … 113.91224693 59.4726828 46.4865572]
 [106.465957 201.86876925 0 … 271.97977866 194.12882321 230.32151441]
 … … … … … … …
 [256.4488253 113.91224693 271.97977866 … 0 116.02154972 102.32790431]
 [161.78689687 59.4726828 194.12882321 … 116.02154972 0 54.55272679]
 [192.82893974 46.4865572 230.32121441 … 102.32790431 54.55272679 0]]

Figure 2 shows a visual representation of the Manhattan distance dissimilarity matrix. Figure 3 shows a visual representation of the Euclidean distance dissimilarity matrix. The graphical representations show that each matrix is symmetric and has zeros on the diagonal.

Figure 2: Visualization of Manhattan distance.

Figure 3: Visualization of Euclidean distance.

Table 5 shows the new dataset obtained with Manhattan MDS. Table 6 shows the new dataset obtained with Euclidean MDS. The stress value for MDS with Manhattan distance is 0.17852952329291213. The stress value for MDS with Euclidean distance is 0.11104963752850103.
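The sequence of steps reported above (dissimilarity matrix, two-dimensional embedding, stress value) can be illustrated with a minimal sketch. The code below is not the implementation used in this study: it runs on randomly generated stand-in data rather than the Pima dataset, uses a plain NumPy SMACOF iteration as one common way to perform metric MDS, and assumes Kruskal's stress-1 as the normalization of the stress value, since the exact formula is not stated in the text. All function names are hypothetical.

```python
import numpy as np


def dissimilarity_matrix(X, metric="euclidean"):
    """Pairwise distances between the rows of X (n_samples x n_features)."""
    diff = X[:, None, :] - X[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=-1))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=-1)
    raise ValueError(f"unknown metric: {metric}")


def smacof(D, n_components=2, n_iter=300, tol=1e-9, seed=0):
    """Metric MDS: iterate the Guttman transform from a random start."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, n_components))
    previous = np.inf
    for _ in range(n_iter):
        dist = dissimilarity_matrix(X)       # distances in the embedding
        raw_stress = ((D - dist) ** 2).sum() / 2.0
        if previous - raw_stress < tol:      # no further improvement
            break
        previous = raw_stress
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(dist > 0, D / dist, 0.0)
        B = -ratio
        B[np.diag_indices(n)] = ratio.sum(axis=1)
        X = B @ X / n                        # Guttman transform
    return X


def stress1(D, X):
    """Kruskal's stress-1 (an assumed normalization, see the lead-in)."""
    dist = dissimilarity_matrix(X)
    return float(np.sqrt(((D - dist) ** 2).sum() / (D ** 2).sum()))


# Hypothetical stand-in data: 30 records with 8 numeric features.
rng = np.random.default_rng(42)
data = rng.uniform(0.0, 100.0, size=(30, 8))

for metric in ("manhattan", "euclidean"):
    D = dissimilarity_matrix(data, metric)
    embedding = smacof(D, n_components=2)
    print(metric, embedding.shape, round(stress1(D, embedding), 3))
```

Passing a larger `n_components` to the hypothetical `smacof` function is a simple way to check how the stress value changes when more than two output dimensions are kept.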
Table 5
New dataset (Manhattan MDS)
[[140.03126997 111.65116183]
 [36.81845208 -142.76590475]
 [305.34294034 -75.87439157]
 … …
 [-98.88342103 -157.0404145]
 [-2.90571001 -94.96936076]
 [-32.55505262 -106.65190576]]

For multidimensional scaling based on the Manhattan distance, the stress value is 0.17, which is sufficient reason to doubt the reliability of the results. This suggests that two output dimensions are not optimal for this dataset; setting the target dimension to more than two would help to avoid this situation.

Table 6
New dataset (Euclidean MDS)
[[-108.25344951 -59.01160171]
 [61.24171203 -47.18164425]
 [-80.4182441 -160.9898698]
 … …
 [126.87588531 7.0477588]
 [49.04460745 -15.20916602]
 [72.37191565 -4.4528538]]

In turn, the stress value for multidimensional scaling using the Euclidean distance is 0.11. This is also not ideal, but it is acceptable for relying on the results obtained, keeping in mind that the embedding still contains some error.

The new data sets contain information about 768 patients, described not by the original eight features (Table 1) but by only two, in accordance with the specified target dimension. The resulting features have a geometric interpretation: each pair of values represents x, y coordinates. These coordinates are used to visualize the result of the dimensionality reduction. It is important to note that the axes in the resulting plots have no meaning on their own and that the orientation of the figures is arbitrary.

Figure 7 shows a visual representation of the results. The plot titled MDS (Manhattan distances) shows multidimensional scaling using the Manhattan distance, and the plot titled MDS (Euclidean distances) shows multidimensional scaling using the Euclidean distance. In the resulting plots, each point corresponds to a patient, so each plot shows information about 768 patients, but this information only shows the dissimilarity between patients.
This can be explained as follows: if two points are close together, the corresponding patients have similar indicators; if two points are far apart, the input features of those patients differ substantially.

Figure 7: Visualization of new data samples.

4. Conclusions

The task of dimensionality reduction is relevant for the application of mathematical modeling methods and information technologies to support doctors' decision-making when making diagnoses in conditions of limited resources.

Within the framework of this study, a model for reducing the dimensionality of medical data was built based on the multidimensional scaling method. An information system for automated data processing was developed in the Python language. Type II Diabetes Mellitus was chosen as the object of study; its containment is especially relevant in the context of the escalation of the Russian war in Ukraine.

As a result of the study, the Pima Indians Diabetes dataset, consisting of 768 records and 9 attributes, was processed. After processing, the new dataset consists of 2 attributes. The stress value is 0.17 for MDS with the Manhattan distance and 0.11 for MDS with the Euclidean distance.

5. Acknowledgements

The study was funded by the National Research Foundation of Ukraine in the framework of the research project 2020.02/0404 on the topic “Development of intelligent technologies for assessing the epidemic situation to support decision-making within the population biosafety management”.

6. References

[1] W. Kerner, J. Bruckel, Definition, classification and diagnosis of diabetes mellitus, Experimental and Clinical Endocrinology & Diabetes 122 (7) (2014): 384-6. doi: 10.1055/s-0034-1366278
[2] D. Glovaci, W. Fan, N.D. Wong, Epidemiology of Diabetes Mellitus and Cardiovascular Disease, Current Cardiology Reports 21 (4) (2019): 21. doi: 10.1007/s11886-019-1107-y
[3] L. Cloete, Diabetes mellitus: an overview of the types, symptoms, complications and management, Nursing Standard 37 (1) (2022): 61-66.
doi: 10.7748/ns.2021.e11709
[4] F. Zaccardi, D.R. Webb, T. Yates, M.J. Davies, Pathophysiology of type 1 and type 2 diabetes mellitus: a 90-year perspective, Postgraduate Medical Journal 92 (1084) (2016): 63-9. doi: 10.1136/postgradmedj-2015-133281
[5] D. Lovic, et al., The growing epidemic of diabetes mellitus, Current Vascular Pharmacology 18 (2) (2020): 104-109. doi: 10.2174/1570161117666190405165911
[6] M.S. Paulo, N.M. Abdo, R. Bettencourt-Silva, R.H. Al-Rifai, Gestational diabetes mellitus in Europe: a systematic review and meta-analysis of prevalence studies, Frontiers in Endocrinology 12 (2021): 691033. doi: 10.3389/fendo.2021.691033
[7] R.M. Stuart, et al., Diabetes care cascade in Ukraine: an analysis of breakpoints and opportunities for improved diabetes outcomes, BMC Health Services Research 20 (1) (2020): 409. doi: 10.1186/s12913-020-05261-y
[8] D. Chumachenko, On intelligent multiagent approach to viral Hepatitis B epidemic processes simulation, Proceedings of the 2018 IEEE 2nd International Conference on Data Stream Mining and Processing, DSMP 2018 (2018): 415-419. doi: 10.1109/DSMP.2018.8478602
[9] D. Chumachenko, et al., Development of an intelligent agent-based model of the epidemic process of syphilis, International Scientific and Technical Conference on Computer Sciences and Information Technologies (2019): 42-45. doi: 10.1109/STC-CSIT.2019.8929749
[10] I. Izonin, R. Tkachenko, N. Shakhovska, N. Lotoshynska, The additive input-doubling method based on the SVR with Nonlinear Kernels: small data approach, Symmetry 13 (4) (2021): 4. doi: 10.3390/sym13040612
[11] R. Radutniy, et al., Automated measurement of bone thickness on SCT sections and other images, Proceedings of the 2020 IEEE 3rd International Conference on Data Stream Mining and Processing (2020): 222-226. doi: 10.1109/DSMP47368.2020.9204289
[12] N. Davidich, et al., Monitoring of urban freight flows distribution considering the human factor, Sustainable Cities and Society 75 (2021): 103168. doi: 10.1016/j.scs.2021.103168
[13] D. Chumachenko, K. Chumachenko, S. Yakovlev, Intelligent simulation of network worm propagation using the code red as an example, Telecommunications and Radio Engineering 78 (5) (2019): 443-464. doi: 10.1615/TELECOMRADENG.V78.I5.60
[14] O. Skitsan, I. Meniailov, K. Bazilevych, H. Padalko, Evaluation of the informative features of cardiac studies diagnostic data using the Kullback method, CEUR Workshop Proceedings 2917 (2021): 186-195.
[15] S. Yakovlev, et al., The concept of developing a decision support system for the epidemic morbidity control, CEUR Workshop Proceedings 2753 (2020): 265-274.
[16] R. Xiang, et al., A comparison for dimensionality reduction methods of single-cell RNA-seq data, Frontiers in Genetics 12 (2021): 646936. doi: 10.3389/fgene.2021.646936
[17] T. Isomura, T. Toyoizumi, Dimensionality reduction to maximize prediction generalization capability, Nature Machine Intelligence 3 (2021): 434-446. doi: 10.1038/s42256-021-00306-1
[18] M.A.A. Cox, T.F. Cox, Multidimensional Scaling, Handbook of Data Visualization (2008): 315-317. doi: 10.1007/978-3-540-33037-0_14
[19] J. Tzeng, H.H.S. Lu, W.H. Li, Multidimensional scaling for large genomic data sets, BMC Bioinformatics 9 (2008): 179. doi: 10.1186/1471-2105-9-179
[20] C. Becavin, et al., Improving the efficiency of multidimensional scaling in the analysis of high-dimensional data using singular value decomposition, Bioinformatics 27 (10) (2011): 1413-1421. doi: 10.1093/bioinformatics/btr143
[21] I. Dokmanic, R. Parhizkar, J. Ranieri, M. Vetterli, Euclidean distance matrices: essential theory, algorithms, and applications, IEEE Signal Processing Magazine 32 (6) (2015): 12-30. doi: 10.1109/MSP.2015.2398954
[22] R. Shahid, S. Bertazzon, M.L. Knudtson, W.A. Ghali, Comparison of distance measures in spatial analytical modeling for health service planning, BMC Health Services Research 9 (2009): 200. doi: 10.1186/1472-6963-9-200
[23] J.W. Smith, et al., Using the ADAP learning algorithm to forecast the onset of diabetes mellitus, Proceedings of the Symposium on Computer Applications and Medical Care (1988): 261-265.