Comparison of Machine Learning Methods for a Diabetes
Prediction Information System
Olexandr Shmatko 1, Olha Korol 2, Andrey Tkachov 3, Vasyl Otenko 4
1 National Technical University "Kharkiv Polytechnic Institute", st. Kirpichova, 2, Kharkiv, 61000, Ukraine
2,3,4 Simon Kuznets Kharkiv National University of Economics, ave. Nauki, 9-A, Kharkiv, 61166, Ukraine

                  Abstract
                  Diabetes is a disease for which there is no permanent cure; therefore, methods and information
                  systems are required for its early detection. This paper proposes an information system for
                  predicting diabetes based on the use of data mining methods and machine learning (ML)
                  algorithms. The paper discusses a number of machine learning methods: decision trees
                  (DT), logistic regression (LR), and k-Nearest Neighbors (k-NN). For our research, we used the Pima
                  Indian Diabetes (PID) dataset collected from the UCI machine learning repository. The dataset
                  contains information about 768 patients and their corresponding nine unique attributes.
                  Research has been carried out to improve the prediction index based on the Recursive Feature
                  Elimination method. We found that the logistic regression (LR) model performed well in
                  predicting diabetes. We have shown that in order to use the created model to predict the
                  likelihood of diabetes mellitus with an accuracy of 78%, it is necessary and sufficient to use
                  such indicators of the patient's health status as the number of pregnancies, the
                  concentration of glucose in the blood plasma during the oral glucose tolerance test, the body
                  mass index (BMI), and the value of the heredity function "DiabetesPedigreeFunction".

                  Keywords
                  Machine learning, Data Mining, Neural Network, Diabetes Prediction Information System,
                  KNN, Logistic regression, Decision tree.


1. Introduction

    Diabetes mellitus is an "epidemic of the XXI century", an incurable disease of the pancreas,
which develops due to absolute or relative insufficiency of the hormone insulin. It is
characterized by a steady rise in blood glucose levels, which in turn can lead to complications.
    To achieve compensation for diabetes, constant monitoring is required. In addition to taking
oral medications and insulin, following a strict diet, exercising, keeping a daily routine,
checking blood glucose regularly, and keeping a special diary, the patient should see an
endocrinologist regularly for advice and appropriate measures to improve or maintain the
condition.
    Normally, the level of glucose in the blood varies within fairly narrow limits: from 3.3 to
5.5 mmol/l. This is because in a healthy person the pancreas starts or stops the release of
insulin depending on the actual level of glucose in the blood. In case of insufficiency or
complete absence of insulin (type 1 diabetes mellitus) or in case of impaired interaction of
insulin with cells (type 2 diabetes mellitus), glucose accumulates in the blood in large
quantities, and the body's cells are unable to absorb it. As of 2019, in addition to the already
mentioned type 1 and type 2 diabetes, gestational diabetes, MODY diabetes and LADA diabetes are
distinguished [2].

ISIT 2021: II International Scientific and Practical Conference
«Intellectual Systems and Information Technologies», September
13–19, 2021, Odesa, Ukraine
EMAIL: oleksandr.shmatko@khpi.edu.ua (A. 1); korol.olha2016@gmail.com (A. 2);
andrew.tkachov@hneu.net (A. 3); ovi@hneu.edu.ua (A. 4)
ORCID: 0000-0002-2426-900X (A. 1); 0000-0002-8733-9984 (A.
2); 0000-0003-1428-0173 (A. 3); 0000-0002-5979-1084 (A. 4)
              ©️ 2021 Copyright for this paper by its authors. Use permitted under Creative
              Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)
    Depending on the specifics of the diagnosis, treatment of patients with diabetes involves the
use of oral agents to improve insulin permeability to body tissues and/or replacement therapy
with subcutaneous insulin injections of varying duration to mimic the normal functioning of the
pancreas. With mild diabetes, medication may be unnecessary, but a strict diet with a clear
understanding of the amount of nutrients consumed, moderate exercise, a daily routine, blood
glucose control and a self-monitoring diary are mandatory for all patients with this diagnosis.
    Under conditions of poor or insufficient treatment (decompensation or subcompensation of
diabetes), blood glucose levels in the human body are consistently high. Against this
background, complications of diabetes develop, which not only worsen the patient's standard of
living, but can also be fatal. These complications include: ketoacidosis (accumulation of a
dangerously large number of ketone bodies in the blood), hypoglycemia (decrease in blood glucose
below 3.3 mmol/l), hyperosmolar and lactic acidotic coma, polyneuropathy (peripheral nerve
damage), nephropathy (kidney damage), retinopathy (damage to the retinal vessels), angiopathy
(impaired vascular permeability), diabetic foot syndrome, etc.
    To achieve compensation for diabetes - a condition in which the patient has achieved stable
normal blood glucose levels during treatment and the risk of complications is reduced - constant
monitoring is required. In addition to the above measures, this control also includes regular
visits to the endocrinologist for advice and appropriate measures to improve or maintain the
patient's health.

2. Literature review

    There are a number of studies on predicting diabetes with machine learning (ML) methods on
the Pima Indian Diabetes Dataset (PIDD) [1], [2], [3], [4], [5]. The PIDD contains 768 records
describing female patients, each with 9 attributes.
    In [2], artificial neural networks were used to predict diabetes based on the PIDD dataset,
which showed a prediction accuracy of 75.7%. The authors of [3] showed that among the machine
learning methods SVM, NB and DT applied to PIDD, the NB classifier shows the best accuracy,
76.30%. The authors of [4] applied logistic regression to PIDD to predict diabetes. The model
proposed in that paper showed a fairly good forecast with an accuracy of 75.32%. In the study
[5], all patient data were used to train and test a classifier based on Naive Bayes (NB) and
decision trees (DT). The research results showed that the best algorithm is the naive Bayesian
algorithm with an accuracy of 76.3021%.
    The most important problem in applying a machine learning method is the choice of the
training parameters and of a suitable classifier. In our work, we used the Recursive Feature
Elimination method to improve the prediction rate. The aim of our research is to select the best
classifier for the diabetes prediction information system. In this work, several machine
learning classification algorithms are used to predict diabetes in a patient: Logistic
Regression (LR), K-Nearest Neighbor (KNN) and Decision Tree (DT).

3. System design

    The system architecture for the Diabetes Prediction System, shown in Figure 1 below, is a
conceptual model that defines the structure, behavioral interactions, and several systemic
representations that underlie the system. The figure shows a formal description of the system,
the submodules of the system, as well as the data flows between them.
    Figure 1 shows the components of the system architecture.

Figure 1: System architecture.
4. Methods

    Based on the comparison and analysis of the functional properties of leading software
solutions in the field of medicine, it was determined that the option "Obtaining prediction of
the probability of the patient's disease" is not implemented in modern diabetes management
information systems. However, given the statistics on the outcomes of patients with
misdiagnosis, the need for this function cannot be denied.
    The problem of predicting the incidence of diabetes can be solved using machine learning
classification methods.
    In the tasks of medical diagnostics, patients act as objects. The feature description of the
patient is, in fact, a formalized medical history. Having accumulated a sufficient number of
precedents in electronic form, one can use machine learning classification methods to predict
the likelihood of the patient's disease.

4.1. An example of solving the problem of classification using machine learning to predict
the incidence of diabetes

4.1.1. Description of the source data

    To implement the considered machine learning classification methods, we use the popular
service "UCI Machine Learning Repository", which provides a large number of real data sets, and
consider the initial data presented in the sample "Pima Indians Diabetes Database" (figure 2).


Figure 2: Example data in the Pima Indians Diabetes Database sample
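As a minimal sketch of how such a sample could be read in Python, the snippet below parses CSV text whose header row uses the column names of the "Pima Indians Diabetes Database" sample. The helper name `load_records`, the CSV layout and the two-row excerpt are assumptions made for illustration; they are not taken from the paper.

```python
import csv
import io

# Column names as listed for the "Pima Indians Diabetes Database" sample.
COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def load_records(text_stream):
    """Parse CSV text with a header row into a list of dicts of floats."""
    reader = csv.DictReader(text_stream)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [{name: float(row[name]) for name in COLUMNS} for row in reader]


# Illustrative two-row excerpt in the assumed layout (not the full sample).
SAMPLE_CSV = """\
Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
"""

records = load_records(io.StringIO(SAMPLE_CSV))
print(len(records), records[0]["Glucose"])
```

With a local copy of the sample, the same helper could be given an open file object instead of the in-memory excerpt; the file name and location would depend on how the data were downloaded.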

    There are a total of 768 records in the sample, each of which is characterized by the
following nine parameters.
    1. "Pregnancies" - the number of pregnancies
    2. "Glucose" - plasma glucose concentration (in mg/dl) two hours into an oral glucose
    tolerance test
    3. "BloodPressure" - diastolic blood pressure (in mm Hg)
    4. "SkinThickness" - the thickness of the triceps skin fold (in mm)
    5. "Insulin" - two-hour serum insulin concentration (in μU/ml)
    6. "BMI" - body mass index, calculated by the formula: weight in kg / (height in m)²
    7. "DiabetesPedigreeFunction" - a function of diabetes heredity
    8. "Age" - the age of the patient (in years)
    9. "Outcome" - the class variable (0 - no diabetes, 1 - diabetes)
    The available data show the following distribution: 500 people are healthy (i.e. their
"Outcome" parameter is zero) and 268 have diabetes (their "Outcome" parameter is equal to one).
    In graphical form, the data of the "Pima Indians Diabetes Database" can be represented as
follows (figure 3).
    As can be seen from Figure 3, inaccurate data are found in the sample. For example:
    1. blood pressure equal to zero (35 cases);
    2. zero blood glucose concentration (5 cases);
    3. skin fold thickness less than 10 mm (227 cases);
    4. BMI approaching zero (11 cases);
    5. zero level of insulin concentration in the blood (374 cases).
Figure 3: Graphic representation of data distribution
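Checks of this kind can be sketched in a few lines of Python. The helpers `class_distribution` and `count_zeros` and the three-record mini-sample below are hypothetical and serve only to show the idea; the sketch counts exact zeros, whereas two of the checks listed for the sample use thresholds (skin fold thickness below 10 mm, BMI approaching zero).

```python
from collections import Counter


def class_distribution(records):
    """Count records per "Outcome" value (0 - no diabetes, 1 - diabetes)."""
    return Counter(int(r["Outcome"]) for r in records)


def count_zeros(records, fields):
    """For each field, count the records in which its value is exactly zero."""
    return {f: sum(1 for r in records if r[f] == 0) for f in fields}


# Hypothetical mini-sample, for illustration only (not real patient data).
records = [
    {"Glucose": 148, "BloodPressure": 72, "Insulin": 0, "BMI": 33.6, "Outcome": 1},
    {"Glucose": 85, "BloodPressure": 0, "Insulin": 0, "BMI": 26.6, "Outcome": 0},
    {"Glucose": 0, "BloodPressure": 64, "Insulin": 94, "BMI": 0.0, "Outcome": 0},
]

distribution = class_distribution(records)
zeros = count_zeros(records, ["Glucose", "BloodPressure", "Insulin", "BMI"])
print(distribution)
print(zeros)
```

Run on the full sample, the same two functions would be expected to reproduce the class counts and the exact-zero counts quoted above, provided the records are loaded with the same column names.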

    To eliminate the above problems, the following options are proposed:
    • Delete or ignore records. An undesirable option, because it means the loss of valuable
    information. The sample contains too many records with zero skin thickness and blood
    insulin concentration, but this option can be applied to the fields "BMI", "Glucose" and
    "BloodPressure".
    • Use averages. This option may work for some samples, but substituting a mean value for,
    say, blood pressure would give the model a wrong signal.
    • Avoid the use of problematic characteristics. This option could work for the thickness
    of the skin, but at this stage it is difficult to predict the result.
    After analyzing possible ways to solve the problem of incomplete data, we decided to remove
from the sample the rows in which the attributes "BMI", "Glucose" or "BloodPressure" are zero.
As a result, 724 records remain in the database.

4.1.2. Choice of classification method

    In order to choose the machine learning classification method that is better suited for the
task of predicting the incidence of diabetes, it is advisable to choose the method whose
accuracy on the selected sample is the highest.
    Avoiding training and testing on the same data is common practice, because the purpose of
the model is to make predictions for data outside the sample. In addition, the model can be
overly complex, leading to overfitting. To avoid these problems, there are two precautionary
methods:
    • retention (holdout) method - part of the training set is set aside and used as a
    validation (test) set;
    • cross-checking (cross-validation) - repeating the retention method several times, i.e.
    repeating the division of the sample into training and validation sets.
    The accuracy of the selected classification methods is calculated using the Python
programming language, namely the "scikit-learn" library [39]. As input parameters "x" we give
the models data from the columns "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
"Insulin", "BMI", "DiabetesPedigreeFunction" and "Age". As the expected result "y" we use the
data from "Outcome".
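The comparison described in this subsection can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the authors' script: the split ratio (25% held out), the number of folds (10), the default hyperparameters, `max_iter=1000` and the random seed are all assumptions, the helper names are hypothetical, and the data are a synthetic stand-in produced by `make_classification`, not the PID sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def drop_zero_rows(X, y, columns):
    """Remove rows in which any of the given column indices holds a zero."""
    keep = np.all(X[:, columns] != 0, axis=1)
    return X[keep], y[keep]


def compare_classifiers(X, y, test_size=0.25, folds=10, seed=0):
    """Return {name: (holdout_accuracy, mean_cv_accuracy)} for KNN, LR, DT."""
    models = {
        "KNN": KNeighborsClassifier(),
        "LR": LogisticRegression(max_iter=1000),
        "DT": DecisionTreeClassifier(random_state=seed),
    }
    # Holdout (retention) method: one stratified train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    results = {}
    for name, model in models.items():
        holdout = model.fit(X_train, y_train).score(X_test, y_test)
        # Cross-validation (cross-checking): mean accuracy over the folds.
        cv = cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
        results[name] = (holdout, cv)
    return results


# Synthetic stand-in with eight features; NOT the PID sample.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
# Column indices 1, 2, 5 mirror "Glucose", "BloodPressure", "BMI" in the x order.
X, y = drop_zero_rows(X, y, columns=[1, 2, 5])
results = compare_classifiers(X, y)
for name, (holdout, cv) in results.items():
    print(f"{name}: holdout={holdout:.3f} cross-validation={cv:.3f}")
```

Because the data here are synthetic, the printed accuracies carry no medical meaning and will differ from the values obtained on the PID sample.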
    The results of the calculations are presented in table 1.

Table 1
The results of calculating the accuracy of the classification methods by the retention
(holdout) method and by cross-checking (cross-validation)

    Method    Retention method    Cross-checking
    KNN       0.711521            0.711521
    LR        0.776440            0.776440
    DT        0.681327            0.685494

Figure 4: Comparison of the obtained results of accuracy of classification methods (bar chart
for KNN, LR and DT; series: retention method, cross-checking; accuracy axis from 62.0 to 80.0)

    Based on the obtained data, it can be stated that logistic regression has a higher accuracy
on the sample "Pima Indians Diabetes Database" than kNN classification and the decision tree
(figure 4), and therefore it can be used to implement the function "Diabetes prediction" in the
information system for automating the admission of patients with diabetes by endocrinologists.

4.1.3. Improving the accuracy of prediction

    Usually, not all source data improve the performance of the model. In order to correctly
identify which of the available attributes have a greater impact on the resulting model, we use
the method of Recursive Feature Elimination (RFE).
    The essence of the method is that it recursively removes attributes and builds models based
on those attributes that remain. RFE uses model accuracy to determine which attributes or
combinations of them contribute most to target prediction.
    Using the library "scikit-learn" we build a graph of the dependence of the accuracy of
diabetes prediction on the number of input parameters (Fig. 5).

Figure 5: Dependence of accuracy of prediction of diabetes mellitus by logistic regression
method on the number of initial parameters

    As can be seen, the best accuracy of the model is achieved by using only four attributes:
"Pregnancies", "Glucose", "BMI", "DiabetesPedigreeFunction". A comparison of the results of the
accuracy calculation is given in table 2.

Table 2
The results of calculating the accuracy of the method of logistic regression after reducing the
number of input attributes

    Number of input parameters                8           4
    Accuracy of logistic regression method    0.776440    0.780588

    Thus, to use the created model to predict the probability of diabetes with an accuracy of
78%, it is necessary and sufficient to use such indicators of the patient's health as the
number of pregnancies, plasma glucose concentration in the oral glucose tolerance test, BMI and
the value of the heredity function "DiabetesPedigreeFunction".
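A curve of accuracy against the number of retained attributes can be sketched with scikit-learn's `RFE` class. Note that `RFE` itself ranks attributes by the fitted estimator's coefficients (or feature importances); the accuracy for each subset size is computed separately here with cross-validation. The helper names, the 10 folds, `max_iter=1000` and the synthetic stand-in data are assumptions for illustration, so the attributes selected below carry no medical meaning.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def accuracy_by_feature_count(X, y, folds=10):
    """Map each feature count k to the mean CV accuracy of LR on k RFE-selected features."""
    scores = {}
    for k in range(1, X.shape[1] + 1):
        # RFE sits inside the pipeline, so selection is refit on every training fold.
        model = make_pipeline(
            RFE(LogisticRegression(max_iter=1000), n_features_to_select=k),
            LogisticRegression(max_iter=1000),
        )
        scores[k] = cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
    return scores


def selected_features(X, y, names, k):
    """Return the names of the k features that RFE keeps when fit on all data."""
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k).fit(X, y)
    return [name for name, keep in zip(names, rfe.support_) if keep]


# The names are only labels here; the data are synthetic, NOT the PID sample.
names = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
scores = accuracy_by_feature_count(X, y)
best_k = max(scores, key=scores.get)
kept = selected_features(X, y, names, 4)
print(best_k, kept)
```

scikit-learn also provides `RFECV`, which combines the elimination loop and the cross-validated choice of the number of features in a single estimator.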

5. Conclusions
    Early detection of diabetes is one of the major problems in health care. This paper
proposes a system architecture and a classifier for an information system that can predict
diabetes with an accuracy of about 78%. We pre-processed the data and, using feature reduction,
dropped four parameters. We used four input parameters ("Pregnancies", "Glucose", "BMI",
"DiabetesPedigreeFunction") and one output parameter ("Outcome") from the PIMA dataset. We
applied three machine learning algorithms (DT, KNN and LR) to PIDD to predict diabetes and
evaluated their accuracy with the retention method and with cross-checking. KNN and LR provided
over 70% accuracy, while DT reached about 68%. LR provided an accuracy of approximately 77–78%.
Applying the Recursive Feature Elimination method allowed us to reduce the number of input
parameters from 8 to 4. Among all the proposed models, the forecasting accuracy for logistic
regression (78.05%) was better than the accuracy in [1] (LR - 75.32%), [2] (NB - 76.30%),
[3] (NB - 76.3021%), [4] (RF - 77.21%) and [5] (ANN - 75.7%).

6. References
[1] Sisodia, Deepti, and Dilip Singh Sisodia. "Prediction of diabetes using classification
    algorithms." Procedia Computer Science 132 (2018): 1578-1585.
[2] Alam, Talha Mahboob, et al. "A model for early prediction of diabetes." Informatics in
    Medicine Unlocked 16 (2019): 100204.
[3] Tigga, Neha Prerna, and Shruti Garg. "Prediction of type 2 diabetes using machine learning
    classification methods." Procedia Computer Science 167 (2020): 706-716.
[4] Diwani, Salim Amour, and Anael Sam. "Diabetes forecasting using supervised learning
    techniques." Adv Comput Sci an Int J 3 (2014): 10-18.
[5] Zou, Quan, et al. "Predicting diabetes mellitus with machine learning techniques."
    Frontiers in Genetics 9 (2018): 515.