Detection of Patients with Diabetes Mellitus using Density- Based Spatial Clustering of Applications with Noize Serhii Krivtsov1, Ievgen Meniailov2 and Kyryl Korobchynskyi1 1 National Aerospace University “Kharkiv Aviation Institute”, Chkalow str., 17, Kharkiv, 61070, Ukraine 2 V.N. Karazin Kharkiv National University, Svobody sq., 4, Kharkiv, 61022, Ukraine Abstract Machine learning is an effective tool for data-driven medicine. Machine learning methods show high accuracy in the direction of automated diagnostics. Diabetes Mellitus is a significant global problem. Today, more than 400 million people live with this diagnosis. Within the framework of this study, a model for diagnosing patients with suspected Diabetes Mellitus based on the Density-Based Spatial Clustering of Applications with Noize method was developed. An experimental study of the method was carried out on the PIMA Indians Diabetes open dataset. The model shows high accuracy, allowing it to be used in medical institutions for decision support in diagnosing. Keywords 1 DBSCAN, Diabetes Mellitus, clustering, machine learning 1. Introduction Diabetes Mellitus is a metabolic disorder of multiple etiologies characterized by chronic hyperglycemia with abnormal carbohydrate, fat, and protein metabolism resulting from impaired insulin secretion and action [1]. Diabetes mellitus is a global public health problem. As of 2022, 422 million people worldwide have diabetes, 6.028% of the planet's total population [2]. At the same time, there is an annual increase in the incidence. According to forecasts of the growth dynamics of the incidence of diabetes, by 2025, the number of patients will increase by two times. Furthermore, by 2030, diabetes will be the world's number 7 cause of death [3]. The main threat posed by Diabetes Mellitus is an early disability and high mortality from concomitant cardiovascular diseases. The main consequences of Diabetes Mellitus are as follows [4]: • Diabetes is the leading cause of blindness. • Diabetes is the leading cause of non-traumatic lower limb amputation. • The risk of stroke, kidney failure, heart attacks, and other cardiovascular diseases increases four times. The main risk factors for diabetes include [5]: • Overweight. • Unbalanced nutrition. • Hereditary predisposition. • Physical inactivity. • Chronic gastritis. • Cholecystitis. • Impaired glucose tolerance. 2nd International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2022), December 2-4, 2022, Łódź, Poland EMAIL: krivtsovpro@gmail.com (S. Krivtsov); evgenii.menyailov@gmail.com (I. Meniailov); kirill.korobchinskiy@gmail.com (K. Korobchynskyi) ORCID: 0000-0001- 5214-0927 (S. Krivtsov); 0000-0002-9440-8378 (I. Meniailov); 0000-0002-3676-6070 (K. Korobchynskyi) ©️ 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) • Age over 40 years. • Constant stress. In addition to high social importance, the problem of diabetes is also of economic importance. The fight against diabetes, depending on the country, is spent from 3 to 15% of annual health care budgets [6]. The primary means of combating diabetes are prevention and early diagnosis. Early diagnosis is especially effective in the context of the escalation of Russia's war in Ukraine. Access to facilities and adequate medical care is often difficult in areas with active hostilities. Therefore, physicians' automation tools and decision support systems are of particular relevance. Over the past few years, the global COVID-19 pandemic has driven the digitization of healthcare worldwide. Information technologies are used to solve such problems as modeling the consequences of epidemics [7], medical diagnostics [8], forecasting infectious diseases [9], assessing resources and consequences of disease outbreaks [10], researching viruses [11], etc. This study aims to develop a clustering model for patients with suspected Diabetes Mellitus based on the Density-Based Spatial Clustering of Applications with Noize method. Research is part of a complex, intelligent information system for epidemiological diagnostics, the concept of which is discussed in [12]. 2. Materials and Methods Cluster analysis is the task of grouping a set of objects into subsets so that objects from one cluster are more similar than objects from other clusters according to some criterion. The clustering problem belongs to the class of unsupervised learning problems. Let X be a set of objects, Y be a set of cluster identifiers. The distance function between objects ρ (x, x′) is given on the set X, given a finite training sample of objects Xm = {x1, …, xm} ⊂ X. It is necessary to divide the sample into clusters, that is, to each object xi ∈ Xm, assign a label yi ∈ Y, so that the objects within each cluster are close concerning the metric ρ, and objects from different clusters differed significantly. In medicine, cluster analysis is used for diagnostics [13], determining the severity of a disease in a patient [14], searching for factors influencing the development of disease [15], and identifying treatment regimens [16]. Within the framework of this study, a model for diagnosing patients with suspected Diabetes Mellitus based on the Density-Based Spatial Clustering of Applications with Noize (DBSCAN) method was developed. The DBSCAN method consists of the fact that inside each cluster, a typical density of points (objects) is observed, which is noticeably higher than the density outside the cluster, as well as the density in areas with noise is lower than the density of any of the clusters [17]. At the same time, each cluster point's neighborhood of a given radius must contain at least a certain number of points. A threshold value sets this number of points. Consider a set of points in some space requiring clustering. To perform DBSCAN clustering, points are divided into core points, point density reachable points, and outliers as follows: • A point p is a core point if at least minPts points are at a distance not exceeding ε, the maximum neighborhood radius from p, to it. Such points are reachable from p. • The point q is directly accessible from p if the point q is at a distance not more significant than ε from the point p, and p must be the main point. • A point Aq is reachable from p if there is a path p1, p2, …, pn and pn = q, where every point pi + 1 is reachable directly from pi (all points on the path must be primary except for q). In this case, all points that are not reachable from the main points are considered outliers. If p is a core point, it forms a cluster along with all points (core or non-core) that are reachable from that point. Each cluster contains at least one central point. Non-core points can also be part of a cluster. Reachability is not a symmetric relationship because, by definition, no point can be reached from a non-primary point, regardless of distance. Two points, p, and q, are density related if there is a point o such that both p and q are reachable from o. Density connectivity is symmetrical. Then the cluster satisfies two properties: • All points in the cluster are pairwise connected in density. • If a point is a density reachable from some point in the cluster, it also belongs to the cluster. The DBSCAN algorithm has the following form: • It is necessary to find points in the ε neighborhood of each point and select the main points with more than minPts neighbors. • It is necessary to find the connected components of the core points on the graph of neighbors, ignoring all non-core points. • You must assign each non-principal nearest cluster if the cluster is ε -neighbor. Otherwise, the point is noise. Advantages of DBSCAN: • The method does not require specification of the number of clusters in the data. • The method has the concept of noise and is resistant to outliers. • The method allows finding arbitrary-shaped clusters. • Experts can set method parameters if the data is well interpretable. • The method is insensitive to the order of points in the dataset. Disadvantages of DBSCAN: • Edge points that can be reached from more than one cluster may belong to any of those clusters, depending on the order in which the points are viewed. However, such situations rarely occur for most datasets, so they have practically no effect on the final result. • The quality of DBSCAN depends on the distance measurement. Usually, the Euclidean metric is used for this. • The method cannot cluster data well with a significant difference in density. 3. Results The Python programming language was used for the cluster analysis model's software implementation. For the pilot study, we used data from an open dataset of patients with suspected Diabetes Mellitus PIMA Indian Diabetes, collected by the National Institute of Diabetes and Digestive and Kidney Diseases [18]. The dataset contains 768 records with nine attributes. Dataset parameters are presented in Table 1. Table 1 Medical records description Attribute Scale type Data type Range Pregnancies Metric Decimal 0…17 PGConcentration Metric Integer 0…199 DiastolicBP Metric Integer 0…122 TriFoldThick Metric Integer 0…99 SerumIns Metric Integer 0…846 BMI Metric Integer 0…67.1 DPFunction Metric Integer 0.08…2.42 Age Metric Decimal 21…81 Diabetes Boolean 0/1 0,1 Attribute Scale type Data type Range The data distribution is shown in Figure 1. The data set was divided into objects and objective functions. Figure 2 shows the results of the cluster membership calculation, the clustering quality assessment using the corrected Rand coefficient, and the clustering quality assessment using the normalized mutual information. The model quality assessment using adjusted Rand index is 0,0028. The model quality assessment using normalized mutual information is 0,0021. Figure 1: Data distribution Figure 3 shows the visualization of points belonging to clusters and the display of noise points, i.e. points that took the value "-1". Figure 2: Results of cluster analysis The simulation results show sufficient performance to use the model in practical healthcare to support decision-making on the diagnosis of Diabetes Mellitus. 4. Conclusions The task of clustering patients with suspected diseases is relevant for supporting decision-making by doctors when making diagnoses. Automation of diagnostics and the use of machine learning methods for this increase the accuracy of diagnostics and can be used by medical institutions to improve work efficiency. This task is especially relevant in Russia's war in Ukraine because residents of the temporarily occupied territories and territories where active hostilities occur do not have full access to medical care. Therefore, models for diagnosing various diseases can be used by patients in combination with remote consultation by a family doctor. As part of this study, a clustering model for patients with suspected Diabetes Mellitus was developed based on the DBSCAN method. An open dataset of patients with suspected Diabetes Mellitus PIMA Indian Diabetes, collected by the National Institute of Diabetes and Digestive and Kidney Diseases, was used to verify the model. The developed model showed high performance. The model quality assessment using the adjusted Rand index is 0.0028. The model quality assessment using normalized mutual information is 0.0021. This suggests that the proposed model can effectively be used as a decision support tool for physicians diagnosing Diabetes Mellitus. In the future, it is planned to apply the model to actual data on patients with suspected Diabetes Mellitus in the Kharkiv region. Figure 3: Visualizations of cluster analysis 5. Acknowledgement The study was funded by the National Research Foundation of Ukraine in the frame-work of the research project 2020.02/0404 on the topic “Development of intelligent technologies for assessing the epidemic situation to support decision-making within the population biosafety management” 6. References [1] A.M. Schmidt, Highlighting diabetes mellitus: the epidemic continues. Arteriosclerosis, thrombosis, and vascular biology 38 (1) (2018): e1-e8. doi: 10.1161/ATVBAHA.117.310221 [2] Y. Zheng, S.H. Ley, F.B. Hu, Global aetiology and epidemiology of type 2 diabetes mellitus and its complications. Nature Reviews. Endocrinology (14, iss. 2, 2018, pp. 88-98. doi: 10.1038/nrendo.2017.151 [3] P. Saeedi, et. al., Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: results from the International Diabetes Federation Diabetes Atlas, 9th edition. Diabetes research and clinical practice 157 (2019): 107843. doi: 10.1016/j.diabres.2019.107843 [4] J.B. Cole, J.C. Florez, Genetics of diabetes mellitus and diabetes complications. Nature Reviews Nephrology 16 (7) (2020): 377-390. doi: 10.1038/s41581-020-0278-5 [5] B. Fletcher, M. Gulanick, C. Lamendola, Risk factors for tupe 2 diabetes mellitus. The Journal of Cardiovascular Nursing 16 (2) (2002): 17-23. doi: 10.1097/00005082-200201000-00003 [6] C. Bommer, et. al., The global economic burden of diabetes in adults aged 20-79 years: a cost-of- illness study. The Lancet. Diabetes & Endocrinology 5 (6) (2017): 423-430. doi: 10.1016/S2213- 8587(17)30097-9 [7] D. Chumachenko, V. Dobriak, M. Mazorchuck, I. Meniailov, K. Bazilevych, On agent-based approach to influenza and acute respiratory virus infection simulation. 14th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering, TCSET 2018 – Proceedings (2018): 192-195. doi: 10.1109/TCSET.2018.8336184 [8] A.S. Nechyporenko, et. al., Implementation and analysis of uncertainty of measurement results for lower walls of maxillary and frontal sinuses, 2020 IEEE 40th International Conference on Electronics and Nanotechnology, ELNANO 2020 – Proceedings (2020): 460-463. doi: 10.1109/ELNANO50318.2020.9088916 [9] D. Chumachenko, I. Meniailov, K. Bazilevych, Y. Kuznetsova, T. Chumachenko, Development of an intelligent agent-based model of the epidemic process of syphilis. International Scientific and Technical Conference on Computer Sciences and Information Technologies 1 (2019): 42-45. doi: 10.1109/STC-CSIT.2019.8929749 [10] N. Davidich, et. al. Monitoring of urban freight flows distribution considering the human factor. Sustainable Cities and Society 75 (2021): 103168. doi: 10.1016/j.scs.2021.103168. [11] D. Chumachenko, K. Chumachenko, S. Yakovlev, Intelligent simulation of network worm propagation using the code red as an example. Telecommunications and Radio Engineering 78 (5) (2019): 443-464. doi: 10.1615/TELECOMRADENG.V78.I5.60 [12] S. Yakovlev, et. al., The concept of developing a decision support system for the epidemic morbidity control. CEUR Workshop Proceedings 2753 (2020): 265-274. [13] Wartelle, A., et. al., Clustering of a health dataset using diagnosis co-occurrences. Applied Sciences 11 (5) (2021): 2373. doi: 10.3390/app11052373 [14] M. Liao, Y. Li, F. Kianifard, S. Arcona, Cluster analysis and its application to healthcare claims data: a study of end-stage renal disease patients who initiated hemodialysis. BMC Nephrology 17 (2016): 25. doi: 10.1186/s12882-016-0238-2 [15] O. Skitsan, I. Meniailov, K. Bazilevych, H. Padalko, Evaluation of the informative features of cardiac studies diagnostic data using the Kullback method. CEUR Workshop Proceedings 2917 (2021): 186-195. [16] S. Windgassen, R. Moss-Morris, K. Goldsmith, T. Chalder, The importance of cluster analysis for enchancing clinical practice: an example from irritable bowel syndrome. Journal of Mental Health 27 (2) (2018): 94-96. doi: 10.1080/09638237.2018.1437615 [17] M. Ester, H.P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996): 226-231. [18] J.W. Smith, J.E. Everhart, W.C. Dickson, W.C. Knowler, R.S. Johannes, Using the ADAP learning algorithm to forecast the onset of Diabetes Mellitus. Proceedings of the Symposium on Computer Applications and Medical Care (1988): 261-265.