Analysis of Datasets Created to Assess the Risk of Developing Gestational Diabetes Mellitus* Mukhriddin Arabboev1,∗,†, Shohruh Begmatov1,†, Mokhirjon Rikhsivoev1,†, Saidakmal Saydiakbarov1,†, Zukhriddin Khamidjonov1,†, Sardor Vakhkhobov1,†, Khurshid Aliyarov1,† and Khabibullo Nosirov1,† 1 Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 108 Amir Temur St., Tashkent, 100084, Uzbekistan Abstract In recent years, the healthcare field has seen a rise in the use of artificial intelligence. There is growing interest in applying artificial intelligence technology to the field of healthcare. To effectively predict disease and deploy proper artificial intelligence and machine learning algorithms, there is a need for suitable datasets. Datasets are widely used to assess the risk of developing diabetes, one of the most common diseases. Given the preceding, this paper reviews datasets created to assess the risk of developing gestational diabetes mellitus (GDM) used worldwide. Keywords Dataset, gestational diabetes mellitus, machine learning 1. Introduction In recent years, the number of people with diabetes has been increasing worldwide. Diabetes is one of the most common diseases among the population [1]. There are common types of diabetes such as type 1 [2], type 2 [3] and gestational diabetes [4]. Gestational Diabetes Mellitus (GDM) poses significant health risks to both mothers and infants, making its early detection and effective management crucial for maternal and fetal well-being. As the prevalence of GDM continues to rise globally, there is a growing need for robust predictive models to identify women at risk. Key to developing such models is the availability and analysis of high-quality datasets specifically tailored for assessing GDM risk factors. Artificial intelligence (AI) is a rapidly developing field, and its application in treating diabetes may revolutionize the approach to diagnosing and managing this chronic condition [5]. Machine learning algorithms are used to support predictive models for the risk of developing diabetes or its complications [6]. Digital therapy has proven to be an established intervention for lifestyle therapy in the treatment of diabetes. Patients are increasingly empowered to self-manage their diabetes, and both patients and healthcare professionals benefit from clinical decision support. AI enables continuous and remote monitoring of patient symptoms and biomarkers. Technological advances have helped optimize resource utilization in diabetes. Artificial intelligence is changing massive processes in diabetes care, from traditional treatment strategies to creating targeted, data- driven precision care. Our review provides a comprehensive resource for researchers, IT companies involved in developing medical data, and technology companies specializing in the healthcare sector. * IVUS2024: Information Society and University Studies 2024, May 17, Kaunas, Lithuania 1,∗ Corresponding author † These author contributed equally. mukhriddin.9207@gmail.com (M. Arabboev); bek.shohruh@gmail.com (Sh.Begmatov); mrikhsivoev@gmail.com (M.Rikhsivoev); saidakmalflash@gmail.com (S. Saydiakbarov); hamidjanovzuhriddin22@gmail.com (Z. Khamidjonov); sardorakbarivich@gmail.com; (S. Vakhkhobov); uzregxurshid@gmail.com (K. Aliyarov); n.khabibullo1990@gmail.com (K. Nosirov). 0000-0001-5733-5889 (M. Arabboev); 0000-0002-2441-916X (Sh.Begmatov); 0000-0002-4691-1470 (M.Rikhsivoev); 0009-0005-6654-2851 (S. Saydiakbarov). ©️ 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings The contributions of this survey are summarised as follows: 1. Research conducted on creating a dataset for GDM across geographic regions is presented. 2. We present a comprehensive review of research on creating a dataset for assessing the risk of gestational diabetes mellitus (GDM). This review covers the algorithms or models used, datasets used or created, and the results achieved. 3. The dependence of the number of data in the dataset on the accuracy of the model is critically analyzed. The rest of the paper follows this structure: The section “Analysis of studies on creating datasets for gestational diabetes” reviewed up-to-date research done in the related field. The section “Results obtained in the analyzed studies” compared the results gained in the analyzed studies. The section “Conclusion” concludes this paper. 2. Analysis of studies on creating datasets for gestational diabetes This section reviews studies on creating datasets for gestational diabetes. In preparation for this review paper, between January and February 2024, Google Scholar, PubMed, Science Direct, and IEEE Xplore databases were searched for articles with the keywords “gestational, diabetes mellitus, dataset, Artificial Intelligence”. 2557 articles were found as a result of the search, of which 15 papers met specific inclusion. Figure 1: PRISMA Diagram of article review and selection. In [7], it is proposed an ensemble prediction model for the diagnosis of gestational diabetes. The data collection was obtained from the laboratories of the Kurdistan region, which collected data from pregnant women with and without diabetes. The proposed model uses the KMeans clustering method for data reduction, and the elbow method to find the optimal k value and the Mahalanobis distance method to find the cluster most related to new samples. In the study, it is used classification methods such as decision tree (DT), random forest (RF), SVM, KNN, logistic regression (LR), and Naïve Bayes (NB). In [8], a system is proposed to solve the problem of dividing diabetic patients into two categories: diabetic patients with acute illnesses and diabetic patients without acute illnesses. This study is based on the Electronic Health Records (EHR) of Osakidetza (Basque Health Service). Analytical and clinical parameter data for this study were obtained from the PREST database. In [9], it is created a dataset using CERNER records of pregnancies observed at St Mary’s Hospital, London between April 2016 and November 2019. In this study, the researchers conducted a retrospective observational study. The initial search identified 26,063 patients with the following factors: postcode, height, weight, BMI at booking, ethnicity (self-reported), parity, glucose tolerance test offer, test results (0 min and up to 120 min after 75 g glucose load), mode of delivery, estimated total blood loss, gestational age, newborn weight, SCBU admission, length of postpartum stay, fetal sex, and stillbirth are some other factors to consider. In [10], it is aimed at improving the diagnosis of gestational diabetes by using data collection methods. Also, this study analyzed the performance of supervised learning algorithms such as ID3, Naïve Bayes, C4.5, and Random Tree. The results of the experiment showed that the Random Tree algorithm gave the best result with the highest accuracy and the lowest error rate. The dataset used in this study is a clinical dataset collected from St. Isabella’s Hospital, Mylapore, Chennai, and the National Institute of Diabetes, Digestive, and Kidney Diseases, which includes records of about 600 patients. In particular, all patients listed in the dataset were pregnant and over 21 years old. In [11], it is developed a machine learning-based prediction model for gestational diabetes (GDM) in early pregnancy in Chinese women. This study used population-based data from 19,331 pregnant women registered as pregnant up to 15 weeks of gestation between October 2010 and August 2012 in Tianjin, China. The dataset is randomly divided into a training set (70%) and a testing set (30%). Risk factors collected during enrollment were reviewed and used to build a predictive model on the training data set. Machine learning, such as the extreme gradient boosting (XGBoost) method, was used to develop the model. In [12], it is created a dataset based on data from 6822 pregnant women living in a geographic area defined by three regional health boards in New Zealand. The prevalence of GDM was estimated using four commonly used data sources. Coded clinical data on diabetes status were collected from regional health boards and the Ministry of Health’s National Minimum Data Set, and plasma glucose results were collected from laboratories serving the recruitment area and were coded according to the New Zealand Society for the Study of Diabetes diagnostic criteria and collected via self-administered diabetes status questionnaires. In another interesting study [13], the authors created a dataset based on data collected at the West China Second Hospital in Chengdu, Sichuan. A total of 33,935 pregnant women were enrolled in the EHR from 2013 to 2016 for experimental data. The GDM-related data of these samples contained 106 features of archival data, 23 features of audit data, 157 features of laboratory information system test data, and 268 features of EHR first pages. After data cleaning, the authors used a filtering strategy to preselect patients whose EHR data were associated with GDM as candidate samples, excluding pre- gestational diabetes. Through this process, the authors obtained an accurate data set of 10,105 samples with common clinical characteristics. This dataset contains 1649 GDM (positive) cases and 8456 non- GDM (negative) cases. A new dataset for gestational diabetes is created in [14]. Data used in this study were collected from local hospitals in Mysuru, Karnataka, India. Medical records were obtained after anonymizing patients to ensure confidentiality. The dataset was developed by keeping obstetrics and gynecology consultants in feedback. The dataset contains information on 1352 pregnant women. The GDM dataset was developed with the help of physicians by removing less significant and irrelevant features, followed by data cleaning and transformation. In [15], it is created a dataset based on data from the general ward of Kurmitola General Hospital in Bangladesh to test an ML model and predict diabetes for Bangladeshi patients. The authors addressed the group of trainee doctors who participated in the data collection process. They conducted a brief interview with the patients, and after their appropriate consent, the doctors agreed to provide relevant information to the authors. In total, it took about three weeks in November 2019 to collect all relevant information. This split dataset contains data from 181 patients and consists of 4 characteristics: patient age, body mass index, number of pregnancies, and glucose concentration. In [16], the authors created a dataset based on real pregnancy test data from a hospital in Beijing from 2008 to 2018. The dataset contains examination records of 120,396 pregnant women. In the entire sample set, 18,400 pregnant women had gestational diabetes, accounting for approximately 15.28%, and 7,518 pregnant women had gestational hypertension, accounting for approximately 6.24%. In [17], it is created a new dataset based on data from 2016 to 2018 used routinely collected maternity and birth data for singleton pregnancies that ended in birth at Monash Health, Australia’s largest public health service, at Universal Health Melbourne. Within the framework of the health care system, three maternity hospitals served different ethnic populations. In another interesting study, a homicidal diabetes dataset was created in [18]. This dataset included a total of 48,502 singleton pregnancies from January 2016 to June 2021 across the Monash Health maternity network. The incidence of GDM was 21.3%. A randomly selected 80% dataset was used for model development and 20% for validation. Performance, including calibration and discrimination performance, was evaluated. In [19], it is created a dataset obtained from patients attending the Department of Obstetrics and Fetal Medicine at the Hospital Parroquial de San Bernardo, Santiago, Chile [19]. The dataset included data from 1,611 different pregnant patients from 2019 to 2022. The dataset is divided into three parts: the training set (70%), the validation set (10%), and the testing set (20%). In this study, twelve different ML models and their hyperparameters were optimized to achieve early and high predictive performance of GDM. To improve the forecast results, the method of data augmentation was used in training. Three methods were used to select the most suitable variables for GDM prediction. After training, the models with the highest area under the receiver operating characteristic curve were evaluated on the validation set. The models with the best results were evaluated as a measure of generalization performance on the test set. In [20], data from pregnant Mexican women included in the “Cuido mi Embarazo” (CME) group were used for development (107 cases, 469 controls), and data from the "Monica Pretelini Sáenz" Maternal Perinatal Group was used for the investigation (32 cases, 199 controls) [20]. A 2-hour oral glucose tolerance test (OGTT) with 75 g of glucose at 24-28 weeks of gestation was used to diagnose GDM. A total of 114 single nucleotide polymorphisms with predictive power were selected for evaluation. Blood samples collected during the OGTT were used for SNP analysis. The CME group was randomly divided into a training dataset (70% of the group) and a testing dataset (30% of the group). The training dataset is divided into 10 groups: 9 for building the predictive model and 1 for validation. In [21], it is created a dataset obtained from a perinatal database for women who gave birth at seven hospitals in four regions of South Korea, under the authority of the Catholic University of Korea from January 2009 to December 2020. Data on mothers’ demographic characteristics, body mass index, blood pressure measurements, blood and urine laboratory tests, diagnoses recorded by doctors, and prescribed medications were collected from the hospital database through electronic medical cards. In this study, a machine learning algorithm was developed to predict gestational diabetes mellitus (GDM) using retrospective data from 34,387 multicenter pregnancies in South Korea. Figure 2 shows the geographical locations of the countries where the organizations of the authors of the studies analyzed in this review paper. Figure 2: Scope of research to create a dataset for gestational diabetes worldwide (based on studies analyzed in this review paper) 3. Results obtained in the analyzed studies This section provides information on the results obtained in the articles analyzed in Section 2. The table below shows information about the countries of the organizations of the authors in which the GDM datasets were created, and the number of participants registered in the dataset. Table 1 Analysis of datasets created for gestational diabetes mellitus Reference Country Hospital No. of participants in dataset [7] Iraq Kurdistan region laboratories 1012 [8] Spain Osakidetza (Basque Health Service) 149 015 [9] United Kingdom St. Mary’s hospital 26 063 [10] India St. Isabella Hospital 600 [11] China Tianjin regional laboratories 19 331 [12] New Zealand three regional health boards 6 822 [13] China West China Second Hospital 10 105 [14] India local hospitals of Mysuru 1 352 [15] Bangladesh Kurmitola General Hospital 181 [16] China one hospital in Beijing 120 396 [17] Australia Monash Health 2 880 [18] Australia Monash Health maternity hospitals 48 502 [19] Chile Hospital Parroquial de San Bernardo 1 611 Maternal Perinatal Hospital Mónica [20] Mexico 807 Pretelini seven hospitals in four regions of South [21] South Korea 34 387 Korea Data collected in the studies Ref. [21] Ref. [20] Ref. [19] Ref. [18] Ref. [17] Ref. [16] Ref. [14] Ref. [13] Ref. [12] Ref. [11] Ref. [10] Ref. [9] Ref. [8] Ref. [7] 0 20000 40000 60000 80000 100000 120000 140000 160000 Figure 3: Data collected from the female patients in the studies Figure 3 depicts the datasets developed in the analyzed studies. If we compare the number of participants in the developed datasets with each other, the five with the highest quantities belong to Ref. [8], Ref. [16], Ref. [18], Ref. [21], and Ref. [9], respectively. Ref. [15], Ref. [10], Ref. [20], Ref. [7], and Ref. [14], were the five lowest quantities on the number of participants in the developed datasets, according to the analyzed studies in this review paper. Table 2 List of studies evaluating the prediction of gestational diabetes mellitus by machine learning models Ref. Algorithm (s)/model (s) No. of participants Results achieved in dataset (highest) [7] DT, RF, SVM, KNN, LR, and NB 1012 Accuracy: 86.74 % Precision: 69 % [8] Quantum Machine Learning 149 015 Accuracy: 69 % ID3, Naïve Bayes, C4.5 and [10] 600 Accuracy: 93.8 % Random Tree Specificity: 76.9 % [11] LR and XGBoost 19 331 Accuracy: 75.7 % LR, Bayesian network, Neural [13] 10 105 Accuracy: 90 % Networks, SVM and CHAID trees [14] J48 Decision Tree, RF and NB 1 352 Accuracy: 93 % Accuracy: 81.2 % [15] KNN, DT, RF, and NB 181 Precision: 80 % AUC: 84 % Accuracy: 91.7 % [16] LR, XGBoost and LightGBM 120 396 Precision: 75.3 % AUC: 92.1 % LR, KNN, Gaussian Naïve Bayes Accuracy: 85 % (GNB), SVM, DT, multi-layer AUC: 93 % perceptron (MLP), RF, Extreme [18] 48 502 Precision: 90 % randomized tree, AdaBoost, Recall: 78 % Gradient Boosting, CatBoost, and Specificity: 90 % XGBoost Accuracy: 75 % [19] MLP and SVM 1 611 Specificity: 74 % AUC: 81 % [21] XGBoost and LightGBM 34 387 AUC: 80.4 % It can be seen from Table 2 that different types of Machine Learning models/algorithms were used in the analyzed articles. In [7], six ML algorithms were used. In this study, six ML algorithms were used. In this study, the number of participants in the dataset was 1012, and the result was 86.74 % accuracy. In [8], Quantum Machine Learning algorithm was used to develop a prediction model. In this research, the number of participants in the dataset was 149015, and the results were 69 % for both precision and accuracy. In [10], ID3, Naïve Bayes, C4.5 and Random Tree algorithms were used for developing a prediction model. In this work, the number of participants in the dataset was 600, and the result was 93.8 % accuracy. In [11], two ML algorithms, LR and XGBoost, were used to develop a prediction model. In this study, the number of participants in the dataset was 19331, and the results were 76.9 % specificity and 75.7 % accuracy. In [13], five ML algorithms were used for developing a prediction model. In this work, the number of participants in the dataset was 10105, and the result 90 % accuracy. In [14], three ML algorithms were used to develop a prediction model. In this study, the number of participants in the dataset was 1352, and the result was 93 % accuracy. In [15], the prediction model was developed using four ML algorithms: KNN, DT, RF, and NB. In this research, the number of participants in the dataset was 181, and the results were 80 % precision, 84 % AUC and 81.2 % accuracy. A prediction model was developed in [16] using LR, XGBoost, and LightGBM algorithms. In this study, the dataset had 120396 participants. The results showed 91.7% accuracy, 75.3% precision, and 92.1% AUC. In [18], twelve ML algorithms were used to develop a prediction model. In this study, the dataset consisted of 48502 participants. The results showed 85% accuracy, 93% AUC, 90% precision, 78% recall, and 90% specificity. In [19], the prediction model was developed using MLP and SVM algorithms. In this study, the dataset had 1611 participants. The results showed 75 % accuracy, 74 % specificity, and 81% AUC. In [21], XGBoost and LightGBM algorithms were used to develop a prediction model. In this study, the dataset consisted of 34387 participants, and the resulting AUC was 80.4%. Accuracy 100 80 60 40 20 0 Ref.[7] Ref.[8] Ref.[10] Ref.[11] Ref.[13] Ref.[14] Ref.[15] Ref.[16] Ref.[18] Ref.[19] Figure 4: Comparison of analyzed studies based on Accuracy evaluation metric It can be seen from Figure 4 that in almost all of the analyzed articles, results were obtained based on an accuracy evaluation metric. If we compare the results on accuracy with each other, Ref. [10] had the highest result and Ref. [8] had the lowest result. The results of Ref. [11] and Ref. [19] are close to each other. Also, the results of Ref. [10] and Ref. [14] are close to each other. 4. Conclusion The paper analyzes the studies conducted on dataset development to assess the risk of developing gestational diabetes mellitus. The results of the study suggest that creating a dataset for assessing the risk of gestational diabetes is a global research topic. The paper also highlights that active research is being conducted on all continents of the world to create a dataset for gestational diabetes mellitus. Based on the results of the analysis, the following will be conclusions: Having a large amount of data in a dataset always does not necessarily lead to increased accuracy in machine-learning models. It is important to note that although having more data can be beneficial, it is not the only factor that contributes to model accuracy; In addition to the dataset, choosing the right prediction models is an important factor in improving model accuracy; Large amounts of irrelevant or noisy data can mislead the model. Data cleaning and feature engineering are crucial for the effective utilization of large datasets. In our future work, we plan to create a dataset of women in Uzbekistan with gestational diabetes, using the expertise gained from studying leading scientists' remarkable results during the preparation of this review paper. References [1] P. Saeedi et al., “Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: Results from the International Diabetes Federation Diabetes Atlas, 9th edition,” Diabetes Res. Clin. Pract., vol. 157, p. 107843, 2019, doi: 10.1016/j.diabres.2019.107843. [2] K. Nosirov, S. Begmatov, and M. Arabboev, “Design of a model for multi-parametric health monitoring system,” International Conference on Information Science and Communications Technologies, ICISCT 2020. pp. 1–5, 2020. doi: 10.1109/ICISCT50599.2020.9351522. [3] P. Bose, S. K. Bandyopadhyay, A. Bhaumik, and S. Poddar, “Female Diabetic Prediction in India Using Different Learning Algorithms,” Univers. J. Public Heal., vol. 9, no. 6, pp. 460–471, 2021, doi: 10.13189/ujph.2021.090614. [4] M. Arabboev, S. Begmatov, M. Rikhsivoev, S. Saydiakbarov, J. Uraimov, and K. Nosirov, “Gestational diabetes mellitus risk assessment using artificial intelligence: a review,” in International Conference on Information Science and Communications Technologies: Applications, Trends and Opportunities, 2023. [5] J. Shen et al., “An innovative artificial intelligence-based app for the diagnosis of gestational diabetes mellitus (GDM-AI): Development study,” J. Med. Internet Res., vol. 22, no. 9, Sep. 2020, doi: 10.2196/21573. [6] Z. Zhang et al., “Machine Learning Prediction Models for Gestational Diabetes Mellitus: Meta- analysis,” J. Med. Internet Res., vol. 24, no. 3, 2022, doi: 10.2196/26634. [7] R. Jader and S. Aminifar, “Predictive Model for Diagnosis of Gestational Diabetes in the Kurdistan Region by a Combination of Clustering and Classification Algorithms: An Ensemble Approach,” Appl. Comput. Intell. Soft Comput., vol. 2022, 2022, doi: 10.1155/2022/9749579. [8] D. Maheshwari, B. Garcia-Zapirain, and D. Sierra-Soso, “Machine learning applied to diabetes dataset using Quantum versus Classical computation,” 2020 IEEE Int. Symp. Signal Process. Inf. Technol. ISSPIT 2020, 2020, doi: 10.1109/ISSPIT51521.2020.9408944. [9] S. Jeyaparam and R. Agha-Jaffar, “GDM Dataset,” Figshare, 2023. https://doi.org/10.6084/m9.figshare.21806472.v1 [10] S. Nagarajan, P. Ramasubramanian, and R. . Chandrasekaran, “Data Mining Techniques for Performance Evaluation of Diagnosis in Gestational Diabetes,” Int. J. Curr. Res. Acad. Rev., vol. 2, no. 10, pp. 91–98, 2014. [11] H. Liu et al., “Machine learning risk score for prediction of gestational diabetes in early pregnancy in Tianjin, China,” Diabetes. Metab. Res. Rev., vol. 37, no. 5, 2021, doi: 10.1002/dmrr.3397. [12] R. L. Lawrence, C. R. Wall, and F. H. Bloomfield, “Prevalence of gestational diabetes according to commonly used data sources: An observational study,” BMC Pregnancy Childbirth, vol. 19, no. 1, pp. 1–9, 2019, doi: 10.1186/s12884-019-2521-2. [13] H. Qiu et al., “Electronic Health Record Driven Prediction for Gestational Diabetes Mellitus in Early Pregnancy,” Sci. Rep., vol. 7, no. 1, pp. 1–13, 2017, doi: 10.1038/s41598-017-16665-y. [14] N. Prema and M. Pushpalatha, “Analysis of Risk Factors of Gestational Diabetes Mellitus (GDM) Using Data Mining,” J. Women’s Heal. Issues Care, vol. 8, no. 2, pp. 1–4, 2018, doi: 10.4172/2325-9795.1000329. [15] B. Pranto, S. M. Mehnaz, E. B. Mahid, I. M. Sadman, A. Rahman, and S. Momen, “Evaluating machine learning methods for predicting diabetes among female patients in Bangladesh,” Inf., vol. 11, no. 8, 2020, doi: 10.3390/INFO11080374. [16] X. Lu, J. Wang, J. Cai, Z. Xing, and J. Huang, “Prediction of Gestational Diabetes and Hypertension Based on Pregnancy Examination Data,” J. Mech. Med. Biol., vol. 22, no. 3, pp. 1– 18, 2022, doi: 10.1142/S0219519422400012. [17] S. D. Cooray et al., “Temporal validation and updating of a prediction model for the diagnosis of gestational diabetes mellitus,” J. Clin. Epidemiol., vol. 164, pp. 54–64, 2023, doi: 10.1016/j.jclinepi.2023.08.020. [18] Y. Belsti et al., “Comparison of machine learning and conventional logistic regression-based prediction models for gestational diabetes in an ethnically diverse population; the Monash GDM Machine learning model,” Int. J. Med. Inform., vol. 179, no. September, p. 105228, 2023, doi: 10.1016/j.ijmedinf.2023.105228. [19] G. Cubillos et al., “Development of machine learning models to predict gestational diabetes risk in the first half of pregnancy,” BMC Pregnancy Childbirth, vol. 23, no. 1, pp. 1–19, 2023, doi: 10.1186/s12884-023-05766-4. [20] M. Zulueta et al., “Development and validation of a multivariable genotype-informed gestational diabetes prediction algorithm for clinical use in the Mexican population: Insights into susceptibility mechanisms,” BMJ Open Diabetes Res. Care, vol. 11, no. 2, 2023, doi: 10.1136/bmjdrc-2022-003046. [21] B. S. Kang et al., “Prediction of gestational diabetes mellitus in Asian women using machine learning algorithms,” Sci. Rep., vol. 13, no. 1, pp. 1–10, 2023, doi: 10.1038/s41598-023-39680-8.