Comparative Analysis of the Machine Learning Dimensionality Reduction Methods on the Example of Fetal Health Determination
Oleksii Khlystun and Kseniia Bazilevych
National Aerospace University “Kharkiv Aviation Institute”, Chkalow str., 17, Kharkiv, 61070, Ukraine
Abstract
The urgent worldwide challenge of decreasing child mortality and guaranteeing fetal health requires innovative approaches. Within the context of the Sustainable Development Goals (SDGs), we analyze the potential of machine learning models to tackle these issues. This research concentrates on forecasting child mortality rates and fetal health outcomes using comprehensive datasets. Methods: our study evaluates the effectiveness of two techniques for reducing dimensions: Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). These methods aim to transform high-dimensional data while still retaining essential information. Our research aims to enhance data analysis efficiency and minimize the risk of model overfitting, particularly in complicated medical research scenarios. Using Python, we applied PCA and UMAP to real-world datasets pertaining to child mortality and fetal health. These techniques aid in identifying crucial attributes impacting fetal well-being and in creating predictive models to enhance healthcare results. Results: our experiments indicate that PCA is faster, but UMAP reduces data dimensionality more effectively in medical research. UMAP outperforms PCA in maintaining complex structural and correlational relationships. The choice between PCA and UMAP should be guided by the data characteristics and research objectives. Conclusions: this research highlights the potential of dimensionality reduction techniques, specifically UMAP, which emerged as the superior method.
The results have critical implications for healthcare providers and policymakers, providing a valuable model for improving data analysis and classification and, ultimately, increasing the likelihood of successful outcomes for at-risk mothers and infants.
Keywords
Child mortality, fetal health, dimension reduction, PCA method, UMAP method
ProfIT AI 2023: 3rd International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2023), November 20–22, 2023, Waterloo, Canada
oleksii.khlystun@student.khai.edu (O. Khlystun); ksenia.bazilevich@gmail.com (K. Bazilevych)
0009-0001-9087-3299 (O. Khlystun); 0000-0001-5332-9545 (K. Bazilevych)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
1. Introduction
In 2015, the world began working toward a new global development agenda, seeking to achieve, by 2030, new targets set out in the Sustainable Development Goals (SDGs). The proposed SDG target for child mortality aims to end, by 2030, preventable deaths of newborns and children under 5 years of age, with all countries aiming to reduce neonatal mortality to at least as low as 12 deaths per 1,000 live births and under-5 mortality to at least as low as 25 deaths per 1,000 live births [1].
The most vulnerable population across nations is newborns, including preterm babies born before 37 weeks of gestation. Child mortality and morbidity continue to manifest in children under 5 years as death, disability, and ill health, most especially in Sub-Saharan Africa and South Asia, where preterm birth accounts for over 60% of newborn deaths worldwide. Of the 15,000,000 babies reported born preterm, over 1,000,000 reportedly die from preterm birth complications, and others are left disabled or ill after birth. In Bangladesh alone, of the 439,000 babies reported born preterm each year, over 26,100 die before the age of 5 from preterm complications [2]. There is a very urgent need to reduce the risks faced by this vulnerable population in order to increase survival and wellbeing for neonates, mothers, and children under 5 years.
Closely related to child mortality is maternal mortality, which accounted for 295,000 deaths during and after pregnancy and childbirth in 2017. The vast majority of these deaths (94%) occurred in resource-limited settings, and most of them could have been prevented. In 2017, Sub-Saharan Africa and South Asia accounted for approximately 86% (254,000) of the estimated maternal deaths worldwide. Sub-Saharan Africa alone accounted for approximately two-thirds (196,000) of maternal deaths, while nearly one-fifth (58,000) were reported in South Asia [3].
With infant mortality being a major concern in resource-limited countries and beyond, research into infant mortality remains an important and relevant topic. With this in mind, cardiotocograms (CTGs) are a simple and affordable option for assessing fetal health, allowing healthcare providers to take steps to prevent infant and maternal mortality. The equipment works by sending out ultrasound pulses and reading their response, thus shedding light on fetal heart rate, fetal movements, uterine contractions, and more. CTG is widely used during pregnancy as a method of assessing the condition of the fetus, mainly in pregnancies with an increased risk of complications.
The contribution of data-driven medicine and ICT tools extends beyond immediate clinical applications to influence healthcare policy and planning [4]. By providing a rich health data repository, these technologies facilitate epidemiological studies and public health research, which is essential for understanding health trends and disease patterns [5].
This information is invaluable for governments and health organizations in formulating policies that target the most pressing health issues, such as high neonatal and maternal mortality rates in certain regions. Data-driven insights enable a more proactive and predictive approach to healthcare, allowing for allocating resources where they are most needed and potentially reducing healthcare disparities [6]. The advent of telemedicine and mobile health (mHealth) applications exemplifies the transformative impact of ICT in healthcare [7]. These platforms offer remote access to medical advice and monitoring, particularly beneficial in underserved and rural areas [8]. In maternal and child healthcare, mHealth initiatives can be pivotal in educating and empowering mothers with information about prenatal care, nutrition, and infant health [9]. This can lead to better health outcomes by increasing awareness and facilitating early intervention in cases of health risks. Additionally, telemedicine initiatives can connect patients with specialists in distant locations, breaking down geographical barriers to quality healthcare. Integrating data-driven medicine and ICT tools in healthcare enhances clinical care and reshapes the broader healthcare landscape. These technologies are pivotal in driving a more equitable, efficient, and effective healthcare system by providing actionable insights, enabling remote care, and informing health policies [10]. As the global community strives towards the 2030 SDGs, the continued advancement and application of these technologies will be crucial in overcoming the challenges of maternal and child mortality and achieving a healthier future for all. Machine learning (ML) is a group of methods in the field of artificial intelligence, a set of algorithms used to create a system that learns from its own experience. As training, huge amounts of input data are processed to find dependencies and patterns in them. 
One of the main advantages of ML compared to traditional data analysis methods is that it allows the system to detect patterns in the data on its own and learn from these patterns. This means that the system can recognize complex relationships between different factors. The result of this learning is a model that can predict fetal health based on the input data [11].
Dimensionality reduction is a technique used to reduce the number of features in a dataset while retaining as much of the important information as possible. In other words, it is a process of transforming high-dimensional data into a lower-dimensional space that still preserves the essence of the original data [12]. Dimensionality reduction methods include feature selection and feature extraction. Reducing the number of features improves the efficiency of data analysis and reduces the amount of computation. One of the main advantages of dimensionality reduction compared to traditional data analysis is the ability to preserve key information while reducing the risk of model overfitting.
One suitable model for dimensionality reduction is the Principal Component Analysis (PCA) method, which finds linear combinations of attributes that best describe the variability of the data. A second is the Uniform Manifold Approximation and Projection (UMAP) method, which effectively reduces the dimensionality of data while preserving its structure and internal dependencies. UMAP is a non-linear method and can account for complex relationships between data attributes better than PCA, which works on the basis of linear combinations of attributes. UMAP can be especially useful in cases where the relationships between data attributes are complex and non-linear, which is often the case in medical research and fetal health data analysis.
Using UMAP reduces the dimensionality of the data while preserving important structural and correlational relationships, which can help improve the quality and accuracy of data analysis and classification in medical research.
This task is important for identifying the factors that influence fetal health and the extent of their impact, so that fetal health can be improved by acting on these factors. Dimensionality reduction can thus help identify the most important attributes that affect fetal health and create a model that predicts fetal health from a reduced set of attributes. This approach will improve the accuracy and efficiency of data analysis in the medical field and help to preserve fetal health and increase the chances of successful fetal development.
2. Current research analysis
In ML modeling, high-dimensional data can cause problems with classification, pattern recognition, and visualization accuracy. Data sets are growing and being updated ever faster, and data is becoming increasingly high-dimensional and unstructured. Massive and complex data contains a lot of useful information, but it is also harder to use effectively. For example, the problem called the “curse of dimensionality” arises from the rapid, large-scale expansion of dimensions. A great deal of computing time, storage space, and manpower is spent on processing the data. Useful information is submerged in complex data, making it difficult to discover the essential characteristics of the data. This problem also harms recognition accuracy: as the data dimension increases, the performance of the classifier improves for a short period and then deteriorates.
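One commonly cited mechanism behind this degradation is distance concentration: in high-dimensional spaces, the nearest and farthest neighbors of a point end up at almost the same distance, so distance-based methods lose discriminative power. The following minimal sketch illustrates the effect on synthetic, uniformly distributed points rather than on the fetal health data; the function name relative_contrast, the sample size, and the chosen dimensions are illustrative assumptions.

```python
import numpy as np


def relative_contrast(n_points: int, n_dims: int, rng: np.random.Generator) -> float:
    """Return (d_max - d_min) / d_min for distances from one random query point."""
    points = rng.random((n_points, n_dims))  # synthetic data in the unit hypercube
    query = rng.random(n_dims)               # a single random query point
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


rng = np.random.default_rng(0)
# Illustrative dimensions only; they are not taken from the paper.
contrasts = {d: relative_contrast(1000, d, rng) for d in (2, 20, 200, 2000)}
for d, c in contrasts.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

The relative contrast shrinks as the number of dimensions grows, which gives one intuition for why reducing the dimension before classification or clustering can help.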
The question is how to analyze the huge amount of information and extract useful features from high-dimensional data while eliminating the influence of related or repetitive factors. In other words, these problems need to be solved by dimension reduction. The basic principle of feature dimensionality reduction is to map a data sample from a high-dimensional space to a relatively low-dimensional space. Its basic task is to find the mapping and obtain an effective low-dimensional structure hidden in high-dimensional observable data [13].
Dimensionality reduction methods are very useful for removing noise and unnecessary information, accelerating learning, and reducing the feature space to 2 or 3 dimensions for data visualization. They allow a multidimensional training dataset to be depicted on a graph, which often yields important insights through visually identified patterns such as clusters. Dimensionality reduction visualization is widely used in data mining, ML, image processing, and other fields. For example, in image processing, high-dimensional image data can be reduced and visually represented to better understand and analyze image characteristics and structure. In ML, high-dimensional datasets can be reduced and visually represented to better understand and analyze the distribution and characteristics of the data, and thus to select appropriate models and algorithms.
Figure 1: Projection of 3D space onto 2D
3. Materials and methods
3.1. Principal Component Analysis
Principal Component Analysis, or PCA, is a technique that uses mathematical principles to transform a number of possibly correlated variables into a smaller number of variables called principal components. The new basis filters out noise and reveals the hidden structure of the original dataset. It has a wide range of applications based on large datasets. PCA uses a vector space transform to reduce the dimensionality of large datasets.
It finds the directions of maximum variance in high-dimensional data, which is equivalent to finding the least-squares line of best fit through the plotted data, and projects the data onto a lower-dimensional subspace while retaining most of the information. PCA reports the contribution of each principal component to the total variance, together with the eigenvectors associated with the non-zero eigenvalues. In practice, it is sufficient to include enough principal components to cover about 70–80% of the data variation [14]. The reduced-dimension dataset is easier to explore and visualize, and data analysis becomes much easier and faster for ML algorithms that no longer have to handle extraneous variables.
This algorithm uses a projection approach, which determines the closest hyperplane to the data and then projects all points in the training set onto this hyperplane. Before projecting the training set, the correct hyperplane must be determined. That is, we need to choose the axis onto which the projections of the points have the largest variance, meaning that the projected points lie far apart along that line. The axis that preserves the maximum amount of variance should be chosen because it loses less information than the other projections. Another way to justify this choice is that it is the axis that minimizes the mean squared distance between the original data set and its projection onto that axis.
Algorithmic model of the Principal Component Analysis (PCA) method:
Step 1. The PCA algorithm starts with the input of a data matrix, where each row corresponds to an observation and each column represents a separate feature (or attribute).
Step 2. To center the data, the average value of each feature is subtracted from every value of that feature in the matrix.
This is an important step because it removes the offset from the origin and allows the distribution of the data to be analyzed about its center.
Step 3. Normalization is an optional additional step in which the data is scaled so that all features share the same scale. This is useful when the input features have different units or orders of magnitude.
Step 4. After centering the data, the correlation matrix is calculated, which contains information about the relationships between the features. The element (i, j) of the correlation matrix indicates the correlation between features i and j.
Step 5. Eigenvectors are calculated by solving the eigenvalue problem for the correlation matrix. Each eigenvector defines a principal component and reflects a direction in which the variance of the data is maximized.
Step 6. Principal components are obtained by selecting the k eigenvectors with the largest eigenvalues, where k is the number of components to extract from the data. The principal components are constructed as linear combinations of the features, where the coefficients of each combination are the entries of the corresponding eigenvector.
After completing all these steps, the result is a new principal component matrix that can be used to reduce the dimensionality of the data, preserve important information, and visualize the data in a lower-dimensional space. Figure 2 shows a flowchart of the algorithmic model.
Figure 2: Block diagram of the PCA algorithm (Start → Input Data Matrix → Data Centering → Data Scaling → Calculation of the Correlation Matrix → Finding Eigen Vectors → Determination of Principal Components → Finish)
3.2. Uniform Manifold Approximation and Projection
UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology.
The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP as described has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for ML [15].
In the simplest sense, UMAP constructs a high-dimensional graph representation of the data and then optimizes a low-dimensional graph to be as structurally similar to it as possible. While the math that UMAP uses to build the high-dimensional graph is advanced, the intuition behind it is simple. To build the initial high-dimensional graph, UMAP creates what is known as a "fuzzy simplicial complex". This is essentially a representation of a weighted graph, where the weights of the edges represent the probability that two points are connected. To determine the connections, UMAP extends a radius outward from each point and connects points when these radii overlap. The choice of this radius is crucial: too small a radius results in small, isolated clusters, while too large a radius lumps everything together. UMAP overcomes this problem by choosing the radius locally, based on the distance to the nth nearest neighbor of each point. UMAP then makes the graph "fuzzy" by reducing the probability of connection as the radius increases. Finally, by assuming that every point must be connected to at least its nearest neighbor, UMAP ensures that the local structure remains in balance with the global structure.
Algorithmic model of the UMAP method:
Step 1. The initial stage involves collecting and preparing the data. The data may be in the form of a feature matrix, where each row represents an object and each column represents a feature. Typically, data is normalized or standardized before applying the UMAP method.
Step 2. First, the k nearest neighbors of every object in the original space are computed. This is done to construct local structures that capture the relationships between objects within small groups.
Step 3. A graph is constructed from these local structures, where each object is a node and edges indicate connections between objects. Edge weights are based on the similarity of neighboring objects.
Step 4. The neighbor graph is optimized to recover the global data structure. This process involves minimizing the distances between connected objects in a low-dimensional space while preserving both the local and the global data structure.
Step 5. The optimization yields a low-dimensional map of the data in which the information can be viewed and studied at a smaller scale.
Step 6. The resulting low-dimensional data mapping can be used to visualize and analyze data in a lower-dimensional space, and for further data analysis, clustering, or classification.
Figure 3 shows a flowchart of the algorithmic model.
Figure 3: Block diagram of the UMAP algorithm (Start → Data Collection → Computing Neighbors → Construction of Local Structures → Building the Neighbors Graph → Graph Optimization → Construction of a Low-Dimensional Space → Mapping and Visualization → Finish)
UMAP is a powerful dimensionality reduction and data visualization method that preserves both local and global data structure in a low-dimensional space.
4. Results
4.1. Data
The input data are presented in Table 1. The experimental study used the open, anonymized dataset “Fetal Health Classification” [16], available on Kaggle. The dataset used in this project is annotated fetal health data obtained from the CTG data of 2126 pregnant women. There are 21 independent variables and 1 dependent variable.
The dependent variable is fetal health, labeled 1 (normal), 2 (suspected), or 3 (high risk). The dataset has 2126 rows and 22 columns.
Table 1
Input data
Name                                                    Type   Range
baseline_value                                          int    106-160
accelerations                                           float  0-0.019
fetal_movement                                          float  0-0.481
uterine_contractions                                    float  0-0.015
light_decelerations                                     float  0-0.015
severe_decelerations                                    float  0-0.001
prolongued_decelerations                                float  0-0.005
abnormal_short_term_variability                         int    12-87
mean_value_of_short_term_variability                    float  0.2-7.0
percentage_of_time_with_abnormal_long_term_variability  int    0-91
mean_value_of_long_term_variability                     float  0-50.7
histogram_width                                         int    3-180
histogram_min                                           int    50-159
histogram_max                                           int    122-238
histogram_number_of_peaks                               int    0-18
histogram_number_of_zeroes                              int    0-10
histogram_mode                                          int    60-87
histogram_mean                                          int    73-182
histogram_median                                        int    77-186
histogram_variance                                      int    0-269
histogram_tendency                                      int    -1 to 1
fetal_health                                            int    1-3
4.2. Experimental results
Step 1. Data upload.
To begin with, consider the input data downloaded from the resource [16]. Figure 4 shows a histogram of the input data.
Figure 4: Histogram of patient data
The dataset was then checked for missing values, which would degrade prediction. Figure 5 shows the output of this check: each column contains 2126 entries and no missing values.
Figure 5: Check for empty values
Step 2. Evaluation of dimensionality reduction.
The dimensionality reduction methods were fitted on a computer with an 8-core AMD Ryzen 7 5800H CPU (3.2 GHz), an NVIDIA GeForce RTX 3060 GPU (6 GB), and 16 GB of RAM, on a dataset of size 2126 × 22. The PCA results for different parameters are visualized in Figures 6–9.
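Before turning to the figures, the PCA steps from Section 3.1 can be made concrete with a short NumPy sketch. It runs on synthetic random data with the same shape as the feature matrix (2126 × 21), not on the actual CTG records, and the helper name pca_fit_transform and the 5-factor data generator are hypothetical. The parameter names in the figure captions (whiten, svd_solver) suggest that the experiments used a library implementation, which this sketch does not attempt to reproduce.

```python
import numpy as np


def pca_fit_transform(X: np.ndarray, variance_to_keep: float = 0.95):
    """Project X onto the fewest principal components reaching the variance target."""
    # Steps 2-3: center each feature and scale it to unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 4: correlation matrix of the features.
    corr = (Z.T @ Z) / (Z.shape[0] - 1)
    # Step 5: eigen-decomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 6: keep the k leading eigenvectors that reach the variance target.
    explained = eigvals / eigvals.sum()
    k = int(np.searchsorted(np.cumsum(explained), variance_to_keep)) + 1
    k = min(k, len(eigvals))
    return Z @ eigvecs[:, :k], explained[:k]


# Synthetic stand-in for the dataset: 21 correlated features driven by
# 5 hidden factors (an assumption made purely for illustration).
rng = np.random.default_rng(42)
latent = rng.normal(size=(2126, 5))
mixing = rng.normal(size=(5, 21))
X = latent @ mixing + 0.1 * rng.normal(size=(2126, 21))

scores, explained = pca_fit_transform(X, variance_to_keep=0.95)
print(scores.shape, explained.round(3))
```

Choosing the number of components from a cumulative variance threshold mirrors the idea of retaining only as many components as are needed to cover a chosen share of the data variation.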
Figure 6: Dimensionality reduction PCA (n_components=20)
Figure 7: Dimensionality reduction PCA (0.95)
Figure 8: Dimensionality reduction PCA (n_components=2, whiten=True, svd_solver='auto', random_state=42)
Figure 9: Dimensionality reduction PCA (n_components=2, whiten=False, svd_solver='full', random_state=42)
The UMAP results for different parameters are visualized in Figures 10–13.
Figure 10: Dimensionality reduction UMAP (n_neighbors=5, min_dist=0.01)
Figure 11: Dimensionality reduction UMAP (n_neighbors=70, min_dist=1)
Figure 12: Dimensionality reduction UMAP (n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean')
Figure 13: Dimensionality reduction UMAP (n_neighbors=40, min_dist=0.5, n_components=2, metric='manhattan')
Comparing Figures 6–9 with Figures 10–13, PCA was fitted in 0.003–0.004 seconds on average, while UMAP was fitted in 12–13 seconds on average. PCA is therefore thousands of times faster but less informative, whereas UMAP takes considerably longer but is able to preserve nonlinear relationships and structures in the data.
5. Conclusions
The use of dimensionality reduction methods is important in the analysis of fetal health data. The main purpose of such analysis is to obtain information about fetal health status and identify factors that may affect it. Dimensionality reduction methods make it possible to reduce a large number of features to a smaller number while retaining important information. In this study, we examined two widely used methods: Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). Specific examples of the use of the described algorithmic models are presented, and the methods and the results obtained with them are compared.
In this research, we demonstrate the use of PCA and UMAP for analyzing fetal health data by comparing the results obtained with these methods and highlighting the advantages of each in identifying important aspects of fetal health. Combining PCA and UMAP can provide a comprehensive approach to dimensionality reduction in fetal health data analysis, helping to better understand fetal health and the impact of various factors on it.
Acknowledgements
The study was funded by the National Research Foundation of Ukraine in the framework of the research project 2020.02/0404 on the topic “Development of intelligent technologies for assessing the epidemic situation to support decision-making within the population biosafety management” [17].
References
[1] M. Sahu and R. Dutt, “Temporal analysis of infant and child health indicators from health management and information system of a vulnerable district of India: Tracking the road toward the Sustainable Development Goal-3,” Indian Journal of Community Medicine, vol. 44, no. 4, p. 397, 2019, doi: 10.4103/ijcm.ijcm_5_19.
[2] Md. A. Rafi, M. M. Z. Miah, Md. A. Wadood, and Md. G. Hossain, “Risk factors and etiology of neonatal sepsis after hospital delivery: A case-control study in a tertiary care hospital of Rajshahi, Bangladesh,” PLOS ONE, vol. 15, no. 11, p. e0242275, Nov. 2020, doi: 10.1371/journal.pone.0242275.
[3] H. Yunida, “Saving of Maternal and Infant Lives with Sustainable Midwifery Services,” PubMed, vol. 10, no. 4, pp. 313–314, Oct. 2022, doi: 10.30476/ijcbnm.2022.95877.2092.
[4] D. Chumachenko, V. Dobriak, M. Mazorchuk, I. Meniailov, and K. Bazilevych, “On agent-based approach to influenza and acute respiratory virus infection simulation,” 2018 14th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET), pp. 192–195, Feb. 2018, doi: 10.1109/tcset.2018.8336184.
[5] D. Chumachenko, V. Balitskii, T. Chumachenko, V. Makarova, and M. Railian, “Intelligent Expert System of Knowledge Examination of Medical Staff Regarding Infections Associated with the Provision of Medical Care,” CEUR Workshop Proceedings, vol. 2386, pp. 321–330, 2019.
[6] K. Bazilevych, et al., “Stochastic modelling of cash flow for personal insurance fund using the cloud data storage,” International Journal of Computing, vol. 17, no. 3, pp. 153–162, Sep. 2018, doi: 10.47839/ijc.17.3.1035.
[7] M. Addotey-Delove, R. E. Scott, and M. Mars, “Healthcare Workers’ Perspectives of mHealth Adoption Factors in the Developing World: Scoping Review,” International Journal of Environmental Research and Public Health, vol. 20, no. 2, p. 1244, Jan. 2023, doi: 10.3390/ijerph20021244.
[8] V. M. Babaiev, et al., “The method of adaptation of a project-oriented organization’s strategy to exogenous changes,” Naukovyi Visnyk Natsionalnoho Hirnychoho Universytetu, vol. 2, pp. 134–140, 2017.
[9] B. L. Burke and R. W. Hall, “Telemedicine: Pediatric Applications,” Pediatrics, vol. 136, no. 1, pp. e293–e308, Jun. 2015, doi: 10.1542/peds.2015-1517.
[10] I. Meniailov and H. Padalko, “Application of Multidimensional Scaling Model for Hepatitis C Data Dimensionality Reduction,” CEUR Workshop Proceedings, vol. 3348, pp. 34–43, 2022.
[11] V. Advani, “What is Machine Learning? How Machine Learning Works and future of it?,” GreatLearning, Apr. 29, 2020. https://www.mygreatlearning.com/blog/what-is-machine-learning/ (accessed Oct. 15, 2023).
[12] A. Uberoi, “Introduction to Dimensionality Reduction - GeeksforGeeks,” GeeksforGeeks, Jun. 2017. https://www.geeksforgeeks.org/dimensionality-reduction/ (accessed Oct. 15, 2023).
[13] W. Jia, M. Sun, J. Lian, and S. Hou, “Feature dimensionality reduction: a review,” Complex & Intelligent Systems, Jan. 2022, doi: 10.1007/s40747-021-00637-x.
[14] N. Salem and S. Hussein, “Data dimensional reduction and principal components analysis,” Procedia Computer Science, vol. 163, pp. 292–299, 2019, doi: 10.1016/j.procs.2019.12.111.
[15] L. McInnes, J. Healy, N. Saul, and L. Großberger, “UMAP: Uniform Manifold Approximation and Projection,” Journal of Open Source Software, vol. 3, no. 29, p. 861, Sep. 2018, doi: 10.21105/joss.00861.
[16] D. Ayres-de-Campos, J. Bernardes, A. Garrido, J. Marques-de-Sá, and L. Pereira-Leite, “Sisporto 2.0: A program for automated analysis of cardiotocograms,” The Journal of Maternal-Fetal Medicine, vol. 9, no. 5, pp. 311–318, 2000, doi: 10.1002/1520-6661(200009/10)9:5<311::AID-MFM12>3.0.CO;2-9.
[17] S. Yakovlev et al., “The Concept of Developing a Decision Support System for the Epidemic Morbidity Control,” CEUR Workshop Proceedings, vol. 2753, pp. 265–274, 2020.