Comparison of Machine Learning Methods for a Diabetes Prediction Information System

Olexandr Shmatko 1, Olha Korol 2, Andrey Tkachov 3, Vasyl Otenko 4

1 National Technical University "Kharkiv Polytechnic Institute", st. Kirpichova, 2, Kharkiv, 61000, Ukraine
2,3,4 Simon Kuznets Kharkiv National University of Economics, ave. Nauki, 9-A, Kharkiv, 61166, Ukraine

Abstract
Diabetes is a disease for which there is no permanent cure; therefore, methods and information systems are required for its early detection. This paper proposes an information system for predicting diabetes based on data mining methods and machine learning (ML) algorithms. The paper discusses a number of machine learning methods: decision trees (DT), logistic regression (LR) and k-Nearest Neighbors (k-NN). For our research, we used the Pima Indian Diabetes (PID) dataset collected from the UCI machine learning repository. The dataset contains information about 768 patients and their corresponding nine unique attributes. Research has been carried out to improve the prediction index based on the Recursive Feature Elimination method. We found that the logistic regression (LR) model performed well in predicting diabetes. We have shown that, in order to use the created model to predict the likelihood of diabetes mellitus with an accuracy of 78%, it is necessary and sufficient to use the following indicators of the patient's health status: the number of pregnancies, the concentration of glucose in the blood plasma during the oral glucose tolerance test, the BMI and the result of calculating the heredity function "DiabetesPedigreeFunction".

Keywords
Machine learning, Data Mining, Neural Network, Diabetes Prediction Information System, KNN, Logistic regression, Decision tree.

ISIT 2021: II International Scientific and Practical Conference «Intellectual Systems and Information Technologies», September 13–19, 2021, Odesa, Ukraine
EMAIL: oleksandr.shmatko@khpi.edu.ua (A. 1); korol.olha2016@gmail.com (A. 2); andrew.tkachov@hneu.net (A. 3); ovi@hneu.edu.ua (A. 4)
ORCID: 0000-0002-2426-900X (A. 1); 0000-0002-8733-9984 (A. 2); 0000-0003-1428-0173 (A. 3); 0000-0002-5979-1084 (A. 4)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Diabetes mellitus is an "epidemic of the XXI century", an incurable disease of the pancreas, which develops due to absolute or relative insufficiency of the hormone insulin. It is characterized by a steady rise in blood glucose levels, which in turn can lead to complications.

To achieve compensation for diabetes, constant monitoring is required. In addition to taking oral medications and insulin, following a strict diet, exercising, keeping a daily routine, checking blood glucose regularly and keeping a special diary, a diabetic patient should see an endocrinologist regularly for advice and appropriate measures to improve or maintain the condition.

Normally, the level of glucose in the blood varies within fairly narrow limits: from 3.3 to 5.5 mmol / liter. This is because in a healthy person the pancreas produces or stops the release of insulin depending on the actual level of glucose in the blood. In case of insufficiency or complete absence of insulin (type 1 diabetes mellitus) or in case of impaired interaction of insulin with cells (type 2 diabetes mellitus), glucose accumulates in the blood in large quantities, and the body's cells are unable to absorb it. As of 2019, in addition to the already mentioned type 1 and type 2 diabetes, there are gestational diabetes, MODY diabetes and LADA diabetes [2].

Depending on the specifics of the diagnosis, treatment of patients with diabetes involves the use of oral agents to improve the permeability of body tissues to insulin and / or replacement therapy with subcutaneous insulin injections of varying duration to mimic the normal functioning of the pancreas. With mild diabetes, the patient can do without medication, but a strict diet with a clear understanding of the amount of nutrients consumed, moderate exercise, a daily routine, blood glucose control and a self-monitoring diary are mandatory for all patients with this diagnosis.

Under conditions of poor or insufficient treatment (decompensation or subcompensation of diabetes), blood glucose levels in the human body are consistently high. Against this background, complications of diabetes develop, which not only worsen the patient's standard of living, but can also be fatal. These complications include: ketoacidosis (accumulation of a dangerously large number of ketone bodies in the blood), hypoglycemia (decrease in blood glucose below 3.3 mmol / l), hyperosmolar and lactic acidotic coma, polyneuropathy (peripheral nerve damage), nephropathy (kidney damage), retinopathy (damage to the retinal vessels), angiopathy (impaired vascular permeability), diabetic foot syndrome, etc.

To achieve compensation for diabetes, that is, a condition in which the patient has achieved stable normal blood glucose levels during treatment and the risk of complications is reduced, constant monitoring is required. In addition to the above measures, this control also includes regular visits to the endocrinologist for advice and appropriate measures to improve or maintain the patient's health.

2. Literature review

There are a number of studies on predicting diabetes based on machine learning (ML) methods for the Pima Indian Diabetes Dataset (PIDD). The PIDD contains 9 attributes and 768 records describing female patients [1], [2], [3], [4], [5].

In [2], artificial neural networks were used to predict diabetes based on the PIDD dataset, which showed a prediction accuracy of 75.7%.

The authors of [3] showed that among the machine learning methods SVM, NB and DT applied to the PIDD, the NB classifier shows the best accuracy: 76.30%. The authors of [4] applied logistic regression to the PIDD to predict diabetes. The model proposed in that paper showed a fairly good forecast with an accuracy of 75.32%. In the study [5], all patient data were used to train and test a classifier based on Naive Bayes (NB) and decision trees (DT). The research results showed that the best algorithm is the naive Bayesian algorithm with an accuracy of 76.3021%.

The most important problem in a machine learning method is the choice of training parameters and the corresponding classifier. In our work, we used the Recursive Feature Elimination method to improve the prediction rate. The aim of our research is to select the best classifier for the diabetes prediction information system. In this work, various machine learning classification algorithms are used to predict diabetes in a patient: Logistic Regression (LR), K-Nearest Neighbor (KNN) and Decision Tree (DT).

3. System design

The system architecture for the Diabetes Prediction System, shown in Figure 1, is a conceptual model that defines the structure, behavioral interactions, and several systemic representations that underlie the system. The figure shows a formal description of the system, the submodules of the system, and the data flows between them.

Figure 1: System architecture.

4. Methods

Based on a comparison and analysis of the functional properties of leading software solutions in the field of medicine, it was determined that the option "Obtaining prediction of the probability of the patient's disease" is not implemented in modern diabetes management information systems. However, given the statistics on the outcomes of misdiagnosed patients, the need to implement this useful function cannot be denied.

The problem of predicting the incidence of diabetes can be solved using machine learning classification methods.

In the tasks of medical diagnostics, patients act as objects. The feature description of the patient is, in fact, a formalized medical history. Having accumulated a sufficient number of precedents in electronic form, one can use machine learning classification methods to predict the likelihood of the patient's disease.

4.1 An example of solving the problem of classification using machine learning to predict the incidence of diabetes

4.1.1. Description of the source data

To implement the considered machine learning classification methods, we use the popular service "UCI Machine Learning Repository", which provides a large number of real data sets, and consider the initial data presented in the sample "Pima Indians Diabetes Database" (Figure 2).

Figure 2: Example data in the Pima Indians Diabetes Database sample

There are a total of 768 records in the sample, each of which is characterized by the following nine parameters.
1. "Pregnancies" - the number of pregnancies
2. "Glucose" - plasma glucose concentration (in mg / dl) at two hours in an oral glucose tolerance test
3. "BloodPressure" - diastolic blood pressure (in mm Hg)
4. "SkinThickness" - the thickness of the triceps skin fold (in mm)
5. "Insulin" - the two-hour serum insulin concentration (in μU / ml)
6. "BMI" - body mass index, calculated by the formula: weight in kg / (height in m)^2
7. "DiabetesPedigreeFunction" - a function of diabetes heredity
8. "Age" - the age of the patient
9. "Outcome" - the class variable (0 - no diabetes, 1 - diabetes)

The available data show the following distribution: 500 people are healthy (i.e. their "Outcome" parameter is zero) and 268 have diabetes (their "Outcome" parameter is equal to one).

In graphical form, the data of the "Pima Indians Diabetes Database" can be represented as follows (Figure 3).

Figure 3: Graphic representation of data distribution

As can be seen from Figure 3, inaccurate data are found in the sample. For example, these are:
1. blood pressure equal to zero (35 cases);
2. zero blood glucose concentration (5 cases);
3. skin fold thickness less than 10 mm (227 cases);
4. BMI approaching zero (11 cases);
5. zero level of insulin concentration in the blood (374 cases).

To eliminate the above problems, the following options are proposed:
• Delete or ignore records. An undesirable option, because it means the loss of valuable information. The sample contains too many records with zero skin thickness and blood insulin concentration, but this option can be applied to the fields "BMI", "Glucose" and "Blood pressure".
• Use averages. This option may work for some samples, but substituting a mean value, for example for blood pressure, would send the wrong signal to the model.
• Avoid the use of problematic characteristics. This option could work for the thickness of the skin, but at this stage it is difficult to predict the result.

After analyzing possible ways to solve the problem of incomplete data, we decided to remove from the sample the rows in which the attributes "BMI", "Glucose" or "Blood pressure" are zero. As a result, 724 records remain in the database.

4.1.2. Choice of classification method

In order to choose the machine learning classification method best suited to the task of predicting the incidence of diabetes, it is advisable to choose the method whose accuracy on the selected sample is the highest.

Avoiding training and testing on the same data is common practice, because the purpose of the model is to make predictions for data other than the sample. In addition, the model can be overly complex, leading to overfitting. To avoid the above problems, there are two precautionary methods:
• hold-out (retention) method - part of the training set is set aside and used as a validation (test) set;
• cross-validation (cross-checking) - repeating the hold-out method several times, i.e. repeating the division of the sample into training and validation sets.

The accuracy values of the selected classification methods are calculated using the Python programming language, namely the methods of the library "scikit-learn" [39]. As input parameters "x" we give the models data from the columns "Pregnancies", "Glucose", "Blood Pressure", "Skin Thickness", "Insulin", "BMI", "Diabetes Pedigree Function" and "Age". As the expected result "y" we use data from "Outcome".

The results of the calculations are presented in Table 1.

Table 1
The results of calculating the accuracy of the classification methods by the hold-out method and by cross-validation

Method    Hold-out method    Cross-validation
KNN       0.711521           0.711521
LR        0.776440           0.776440
DT        0.681327           0.685494

Figure 4: Comparison of the obtained results of accuracy of classification methods (bar chart of accuracy in percent for KNN, LR and DT, hold-out method and cross-validation)

Based on the obtained data, it can be stated that logistic regression has a higher accuracy on the sample "Pima Indians Diabetes Database" than kNN classification and the decision tree (Figure 4), and therefore it can be used to implement the function "Diabetes prediction" in the information system to automate the admission of patients with diabetes by endocrinologists.

4.1.3. Improving the accuracy of prediction

Usually, not all input attributes improve the performance of the model. In order to correctly identify which of the available attributes have a greater impact on the resulting model, we use the method of Recursive Feature Elimination (RFE). The essence of the method is that it recursively removes attributes and builds models based on those attributes that remain. RFE uses model accuracy to determine which attributes or combinations of them contribute most to target prediction.

Using the library "scikit-learn", we build a graph of the accuracy of the prediction of diabetes as a function of the number of input parameters (Figure 5).

Figure 5: Dependence of accuracy of prediction of diabetes mellitus by logistic regression method on the number of input parameters

As can be seen, the best accuracy of the model is achieved by using only four attributes: "Pregnancies", "Glucose", "BMI" and "DiabetesPedigreeFunction". A comparison of the results of the accuracy calculation is given in Table 2.

Table 2
The results of calculating the accuracy of the method of logistic regression after reducing the number of input attributes

Number of input parameters                 8           4
Accuracy of logistic regression method     0.776440    0.780588

Thus, to further use the created model to predict the probability of diabetes with an accuracy of 78%, it is necessary and sufficient to use the following indicators of the patient's health: the number of pregnancies, plasma glucose concentration in the oral glucose tolerance test, BMI and the result of calculating the heredity function "DiabetesPedigreeFunction".

5. Conclusions

Early detection of diabetes is one of the major health problems. This paper proposes a system architecture and classifier for an information system that can predict diabetes with high accuracy. We have pre-processed the data. Using the feature reduction method, we have abandoned four parameters. We used four input parameters ("Pregnancies", "Glucose", "BMI", "DiabetesPedigreeFunction") and one output parameter ("Outcome") in the PIMA dataset. We used three different machine learning algorithms (DT, KNN and LR) on the PIDD to predict diabetes and evaluated the performance on various parameters. All models show good results in some parameters. All models except DT provided over 70% accuracy. LR provided an accuracy of approximately 77–78%. Improving the prediction index with the Recursive Feature Elimination method allowed us to reduce the number of parameters from 8 to 4.

Among all the proposed models, the forecasting accuracy for logistic regression (78.06%) was better than the accuracy in [1] (LR - 75.32%), [2] (NB - 76.30%), [3] (NB - 76.3021%), [4] (RF - 77.21%) and [5] (ANN - 75.7%).

6. References

[1] Sisodia, Deepti, and Dilip Singh Sisodia. "Prediction of diabetes using classification algorithms." Procedia Computer Science 132 (2018): 1578-1585.
[2] Alam, Talha Mahboob, et al. "A model for early prediction of diabetes." Informatics in Medicine Unlocked 16 (2019): 100204.
[3] Tigga, Neha Prerna, and Shruti Garg. "Prediction of type 2 diabetes using machine learning classification methods." Procedia Computer Science 167 (2020): 706-716.
[4] Diwani, Salim Amour, and Anael Sam. "Diabetes forecasting using supervised learning techniques." Adv Comput Sci an Int J 3 (2014): 10-18.
[5] Zou, Quan, et al. "Predicting diabetes mellitus with machine learning techniques." Frontiers in Genetics 9 (2018): 515.
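
The record filtering chosen in Section 4.1.1 (dropping rows in which "BMI", "Glucose" or blood pressure is zero) can be illustrated with a minimal sketch. It uses only the Python standard library; the helper name `drop_invalid_records` and the five sample rows are illustrative assumptions, not the authors' code and not real records from the dataset. The column name "BloodPressure" follows the attribute list of the dataset.

```python
# Fields in which a zero is treated as a missing measurement (Section 4.1.1).
REQUIRED_NONZERO = ("Glucose", "BloodPressure", "BMI")


def drop_invalid_records(records, required=REQUIRED_NONZERO):
    """Keep only the records whose `required` fields are all non-zero."""
    return [row for row in records if all(row[name] != 0 for name in required)]


# Made-up rows for illustration only (not taken from the real dataset).
sample = [
    {"Glucose": 148, "BloodPressure": 72, "BMI": 33.6, "Insulin": 0, "Outcome": 1},
    {"Glucose": 0, "BloodPressure": 64, "BMI": 23.3, "Insulin": 94, "Outcome": 0},
    {"Glucose": 117, "BloodPressure": 0, "BMI": 34.1, "Insulin": 0, "Outcome": 0},
    {"Glucose": 125, "BloodPressure": 96, "BMI": 0.0, "Insulin": 0, "Outcome": 1},
    {"Glucose": 89, "BloodPressure": 66, "BMI": 28.1, "Insulin": 94, "Outcome": 0},
]

cleaned = drop_invalid_records(sample)
print(len(cleaned))  # 2 of the 5 made-up rows survive
```

Zero values in "Insulin" are deliberately left in place here, mirroring the decision in the text not to discard the many records with zero insulin or skin thickness.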
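
The two evaluation schemes compared in Section 4.1.2, the hold-out method and cross-validation, can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the test fraction, the number of folds, the seed and the callback `fit_predict` (any function that trains a classifier and returns predictions) are all hypothetical choices. In scikit-learn, `train_test_split` and `cross_val_score` play the corresponding roles.

```python
import random


def holdout_split(n_samples, test_fraction=0.25, seed=0):
    """Hold-out: shuffle the indices once and set part of them aside for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)


def kfold_splits(n_samples, k=5, seed=0):
    """Cross-validation: each sample lands in the test part exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def accuracy(y_true, y_pred):
    """Share of correctly predicted labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_val_accuracy(fit_predict, X, y, k=5, seed=0):
    """Mean test accuracy over k folds.

    `fit_predict(X_train, y_train, X_test)` is a hypothetical callback that
    trains a classifier and returns the predicted labels for X_test.
    """
    scores = []
    for train, test in kfold_splits(len(X), k, seed):
        y_pred = fit_predict([X[i] for i in train], [y[i] for i in train],
                             [X[i] for i in test])
        scores.append(accuracy([y[i] for i in test], y_pred))
    return sum(scores) / len(scores)


def majority_baseline(X_train, y_train, X_test):
    """Trivial stand-in classifier: always predict the most frequent label."""
    label = max(set(y_train), key=y_train.count)
    return [label] * len(X_test)


# Made-up data: 20 samples, 13 of class 0 and 7 of class 1.
X = [[i] for i in range(20)]
y = [0] * 13 + [1] * 7

train_idx, test_idx = holdout_split(len(X))
cv_acc = cross_val_accuracy(majority_baseline, X, y, k=5)
print(len(train_idx), len(test_idx), cv_acc)
```

Replacing `majority_baseline` with a callback that wraps a KNN, LR or DT model would yield a comparison of the kind reported in Table 1.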
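
The idea behind the recursive elimination described in Section 4.1.3 (remove attributes one at a time and keep the subset with the best accuracy) can be sketched as a greedy backward search. This is an accuracy-driven simplification that follows the description in the text. It is not scikit-learn's `RFE` class, which ranks features by model coefficients or feature importances rather than by re-scoring each candidate subset. The callback `score` and the toy gain values are hypothetical.

```python
def recursive_elimination(features, score, min_features=1):
    """Greedy backward elimination driven by a scoring function.

    `score(subset)` is a hypothetical callback that trains a model on the
    given feature subset and returns its validation accuracy.
    Returns (best_subset, best_score, history).
    """
    current = list(features)
    best_subset, best_score = list(current), score(current)
    history = [(list(current), best_score)]
    while len(current) > min_features:
        # Try dropping each remaining feature; keep the drop that hurts least.
        candidates = [[f for f in current if f != drop] for drop in current]
        scored = [(score(c), c) for c in candidates]
        top_score, current = max(scored, key=lambda pair: pair[0])
        history.append((list(current), top_score))
        if top_score >= best_score:  # on ties, prefer the smaller subset
            best_subset, best_score = list(current), top_score
    return best_subset, best_score, history


# Toy scorer with made-up numbers: two informative and two noisy features.
GAIN = {"a": 0.10, "b": 0.05, "c": -0.02, "d": -0.01}


def toy_score(subset):
    return 0.60 + sum(GAIN[f] for f in subset)


subset, acc, history = recursive_elimination(["a", "b", "c", "d"], toy_score)
print(sorted(subset))  # ['a', 'b']
```

The `history` list records the best score at each subset size, which is the kind of information plotted in Figure 5 (accuracy against the number of input parameters).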