<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>A Novel Ensemble Learning Approach for Diabetes Prediction in Imbalanced Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Djalila Boughareb</string-name>
          <email>boughareb.djalila@univ-guelma.dz</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Said Bouteldja</string-name>
          <email>bouteldja.said@univ-guelma.dz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hamid Seridi</string-name>
          <email>seridi.hamid@univ-guelma.dz</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of 8 May 1945, Guelma</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Diabetes is a chronic illness that results from the body's incapacity to effectively produce or use insulin. Over time, it can damage the heart, blood vessels, eyes, kidneys, and nerves. Timely treatment is essential to stop the progression of diabetes and requires early detection. In this work, we present a hybrid machine learning approach that predicts diabetes by combining two strong algorithms: a voting classifier built from XGBoost (eXtreme Gradient Boosting) and a bagging classifier. We evaluated our model on three distinct datasets: the Pima Indian Diabetes Dataset (PIDD), its extended version, and the Frankfurt Hospital Germany Diabetes Dataset (FHGDD). In comparison with individual algorithms (XGBoost, bagging with decision trees) and other ensemble methods (voting classifier, HM-Bag Moov voting classifier, XGBoost with data feature stitching, and soft voting), our experimental results show that the proposed approach achieved a higher accuracy of 92.7%, precision of 97.1%, recall of 81.7%, and an F1 score of 88.7%. These results suggest that the proposed hybrid machine learning approach can be a dependable tool for the early diagnosis of diabetes, leading to more prompt and efficient therapies and better patient outcomes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Diabetes is becoming more commonplace worldwide, which presents a serious public health
risk. The early identification and treatment of diabetes can improve the quality of life for
persons with the disease by preventing or delaying the onset of complications. Conventional
techniques for identifying diabetes, such as blood glucose monitoring, can be costly, intrusive, and
time-consuming. Because machine learning techniques automate the process and enable more
precise and effective disease identification, they hold the potential to transform the
diagnosis and management of diabetes. Large-scale datasets and sophisticated algorithms are
used to find patterns and risk factors that could be hard for human experts to detect.</p>
      <p>There are two main forms of diabetes. Type 1, usually identified in young people, is caused
by the immune system targeting the cells that produce insulin; it requires daily insulin pumps or
injections to control, and if left untreated it can result in heart disease, retinopathy, neuropathy,
and nephropathy. Type 2, more frequent and usually affecting older persons, is characterized by
either insufficient or inefficient insulin production or utilization by the body; it is controlled via
dietary adjustments, prescription drugs, and occasionally insulin, and if left unchecked the hazards
are comparable to those of Type 1. In addition, gestational diabetes during pregnancy results in
decreased insulin sensitivity and elevated blood sugar levels, raising the risk of Type 2 diabetes
later in life for both the mother and the child.</p>
      <p>Numerous studies have examined the prediction of diabetes by taking into account a
range of characteristics, including lifestyle, Electronic Health Records (EHRs), environment, and
molecular attributes, drawing on prior experience and medical records that contain patient
conditions and vital signs. The most widely used dataset in these studies is the Pima Indians
Diabetes Database [1; 4; 6; 8], which has 768 samples, 268 of which are patients with diabetes,
and 8 independent factors that are used to identify whether a patient has diabetes.</p>
      <p>
        Using the PIMA Indian Dataset, the authors in Ref. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] applied Decision Tree, SVM, and Naive
Bayes classifier techniques to predict diabetes. Using 10-fold cross-validation, they found that
Naive Bayes achieved the best accuracy, at 76.30%. On the basis of the same dataset, the authors in Ref.
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] created a predictive model with XGBoost and feature stitching, which produced an
impressive 80.2% accuracy and identified important predictive factors such as the diabetes pedigree
function, glucose, age, pregnancies, and BMI. A new method for classifying diabetes called
HM-Bag Moov was introduced by Bashir et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and compared with a number of other
approaches, including Naive Bayes, SVM, and Logistic Regression. The accuracy of the
HM-Bag Moov voting classifier was 77.21%, although it employed neither hyperparameter
tuning nor cross-validation and evaluated only a small number of ensembling techniques.
      </p>
      <p>
        Despite the extensive utilization of the Pima dataset and the notable prediction results
derived from it, a significant challenge arises from class imbalance within the dataset:
healthy patients outnumber afflicted ones. This imbalance poses a substantial hurdle for
classification algorithms, as the minority class is overshadowed; an algorithm could
misclassify every minority instance and still exhibit a low error rate [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        One potential remedy for this problem is data augmentation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which entails raising the
minority class's representation in order to avoid overfitting. Also, previous studies on
imbalanced datasets, including those focused on biomedical data, affirm the efficacy and
reliability of ensemble learning methods in alleviating the challenges posed by class imbalance
[
        <xref ref-type="bibr" rid="ref14 ref15 ref16">14-16</xref>
        ]. For instance, in order to mitigate the impact of class imbalance, the authors of [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] present Sample and
Feature Selection Hybrid Ensemble Learning (SFSHEL), a novel approach aimed at tackling the
complexities posed by imbalanced datasets in classification tasks. Base learner weights are
assigned through validation, enabling weighted voting for predictions. SFSHEL-RF, based on
random forest, shows superior performance on clinical datasets, validating its effectiveness. In
response to the limitations of traditional classification methods, the authors of [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduced an innovative
ensemble learning framework tailored for medical diagnosis with imbalanced data. Comprising
three phases—data pre-processing, base classifier training, and final ensemble—the proposed
approach was evaluated across nine imbalanced medical datasets. Results demonstrate its
superiority over other state-of-the-art classification techniques. Furthermore, the authors of [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] introduced a
multi-criteria ensemble training method tailored for imbalanced datasets, simultaneously
optimizing precision and recall. It presents a set of Pareto-optimal solutions, allowing the
end-user to select the most suitable solution based on their preferences. Results confirmed the
method's utility, ensuring high-quality outcomes comparable to single-criterion optimization.
      </p>
      <p>
        The primary aim of this project is to address the challenge of imbalanced datasets in diabetes
diagnosis by leveraging data augmentation techniques and ensemble learning models. The
ultimate goal is to enhance patient outcomes and alleviate the strain on healthcare systems
caused by diabetes. To achieve this objective, the project proposes an effective and efficient
system capable of analyzing clinical data to accurately identify diabetes or determine if an
individual is in the pre-diabetic stage. The project involves the utilization of both bagging and
XGBoost (eXtreme Gradient Boosting) algorithms for diabetes prediction, using an expanded
version of the Pima dataset previously generated via a GAN (Generative Adversarial Network)
algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], in addition to the Frankfurt Hospital Germany Diabetes Dataset (FHGDD) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        Bagging, short for bootstrap aggregating, is a prominent ensemble learning technique that
combines predictions from multiple decision trees trained on bootstrap subsets of the same
dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Another variant, ensemble bagging, involves constructing a collection of classifiers
that iteratively apply a specific algorithm to various versions of the training dataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These
ensemble methods are valuable tools for enhancing predictive performance and addressing
overfitting in classification tasks.
      </p>
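      <p>A minimal sketch of such a bagging setup, using scikit-learn's BaggingClassifier over decision trees on synthetic data (an illustrative configuration, not the exact one used in this work):</p>

```python
# Bagging sketch: many decision trees, each trained on a bootstrap
# resample of the same data, vote on the final class label.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # base learner fitted on each bootstrap sample
    n_estimators=50,           # number of bootstrap resamples / trees
    random_state=42,
)
bag.fit(X_tr, y_tr)
print(f"bagging test accuracy: {bag.score(X_te, y_te):.3f}")
```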
      <p>The proposed combined approach is designed to improve the accuracy and robustness of the
predictive model. Furthermore, the project seeks to advance the field by comparing its proposed
methodology with various state-of-the-art research studies, thereby providing insights into the
efficacy and superiority of the suggested approach.</p>
      <p>The remaining sections of the paper are arranged as follows: Section 2 explains the
research methodology; Section 3 outlines the evaluation details and discusses the results;
and Section 4 concludes the article and discusses potential future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and Method</title>
      <sec id="sec-2-1">
        <title>2.1. Dataset</title>
        <p>
          Our research utilized the PIMA Indian Diabetes dataset [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], originally gathered by the National Institute of Diabetes and Digestive and Kidney Diseases. Widely recognized as a benchmark dataset, it has been extensively employed in machine learning studies to assess the efficacy of various classification and prediction algorithms in diabetes prediction. We utilized two versions of this dataset: one with 768 instances and an extended version containing 1602 instances [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Additionally, we incorporated the Frankfurt Hospital Germany Diabetes Dataset (FHGDD) [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] into our analysis. Each dataset consists of exactly eight attributes:
        </p>
        <list list-type="bullet">
          <list-item><p>Pregnancies: the total number of pregnancies.</p></list-item>
          <list-item><p>Glucose: blood glucose level measured in an oral glucose tolerance test after two hours.</p></list-item>
          <list-item><p>BloodPressure: diastolic blood pressure (mm Hg).</p></list-item>
          <list-item><p>SkinThickness: triceps skin fold thickness (mm).</p></list-item>
          <list-item><p>Insulin: two-hour serum insulin level.</p></list-item>
          <list-item><p>BMI: body mass index.</p></list-item>
          <list-item><p>DiabetesPedigreeFunction: the diabetes pedigree function.</p></list-item>
          <list-item><p>Age: the number of years.</p></list-item>
        </list>
      </sec>
      <sec id="sec-2-6">
        <title>2.2. Dataset Preparation</title>
        <p>"Outcome," the only dependent variable in the dataset, takes binary values of 0 or 1.
To assess the performance of the model, the dataset was split into a training set and a testing
set using a 70:30 ratio. The model was trained on the training set and then assessed using
four-fold cross-validation on the testing set.</p>
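        <p>The split-and-validate protocol above can be sketched as follows (synthetic stand-in data are used here instead of the PIDD CSV, so the class balance is only approximate and the numbers are illustrative):</p>

```python
# Sketch of the evaluation protocol: a 70:30 stratified train/test split,
# then 4-fold cross-validation (synthetic stand-in for the PIDD data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=768, n_features=8,
                           weights=[0.65, 0.35],  # roughly PIDD's 268/768 positives
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          stratify=y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
cv_acc = cross_val_score(clf, X_te, y_te, cv=4, scoring="accuracy")
print("4-fold CV accuracy on the held-out set:", float(cv_acc.mean()))
```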
        <p>
          The extended version of the Pima dataset was generated in a previous work [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] where we used
Generative Adversarial Networks (GANs) for data imputation, a technique introduced by
Goodfellow et al. in 2014 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. GANs employ a game-theoretic framework wherein a generator
network competes against a discriminator network. The generator's objective is to create
synthetic data samples resembling those from the training set, while the discriminator aims to
distinguish between real and synthetic samples. The experiment generated 1602 data lines,
including 602 authentic and 1000 synthetic lines.
        </p>
        <p>The Frankfurt Hospital Germany Diabetes Dataset (FHGDD) serves as another resource in
diabetes prediction and classification research. Comprising the same attributes as the PIDD but
with an expanded size of 2000 instances, it provides a rich data source for analyzing
diabetes-related factors. Table 1 illustrates the distribution of instances across each class, delineating the
counts for Class 1 (diabetic) and Class 2 (non-diabetic).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>
        Our methodology combines XGBoost (eXtreme Gradient Boosting) and bagging techniques.
XGBoost, a gradient boosting decision tree (GBDT) algorithm, efficiently iterates weak models
to create a strong one, proven effective in prediction tasks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. It optimizes model
parameters by merging regression trees and gradient descent. Initially, the model is initialized
with weak learners, typically shallow decision trees. Iteratively, the algorithm fits the gradient
of the loss function to the predictions of current weak learners, then trains new weak learners
based on this gradient information, adding them to the model. This process repeats until a
stopping criterion is met. Predictions are made by aggregating the predictions of all weak
learners. The optimization problem in XGBoost combines a loss function, measuring prediction
error, and a regularization term, penalizing model complexity to prevent overfitting.
      </p>
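      <p>The iterative procedure described above can be illustrated with a toy gradient-boosting loop. For simplicity, this sketch uses a squared-error loss, whose negative gradient is simply the residual, rather than the logistic loss XGBoost uses for binary classification:</p>

```python
# Toy gradient-boosting loop: each round fits a shallow tree to the
# negative gradient of the loss (the residual, for squared error) and
# adds it to the ensemble with a small learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

pred = np.zeros_like(y)          # initial model: predict 0 everywhere
trees, lr = [], 0.1
for _ in range(100):
    residual = y - pred          # negative gradient of 0.5*(y - pred)^2
    t = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(t)
    pred += lr * t.predict(X)    # add the new weak learner to the ensemble

mse = float(np.mean((y - pred) ** 2))
print(f"training MSE after 100 boosting rounds: {mse:.4f}")
```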
      <p>
        Decision trees, particularly CART (Classification and Regression Tree), are pivotal in
XGBoost, where their shallow structure mitigates overfitting risks. XGBoost defines an
objective function to optimize during training, comprising a regularization term controlling
model complexity and a loss function quantifying prediction accuracy against actual values. For
binary classification, the logistic loss function, also called log loss or cross-entropy loss, is
employed as the objective function in XGBoost, denoted by equation (1) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>L = Σ<sub>i=1</sub><sup>n</sup> l(y<sub>i</sub>, ŷ<sub>i</sub>) + Ω(f)
(1)</p>
      <sec id="sec-3-2">
        <title>where,</title>
        <p>L : the objective function.
l : the loss function (the logistic loss, for binary classification).
n : the number of samples.
y<sub>i</sub> : the true label.
ŷ<sub>i</sub> : the predicted label.
Ω(f) : the regularization term, which is a function of the model parameters f.</p>
        <p>XGBoost is a powerful machine learning algorithm that trains an ensemble of decision trees
iteratively. It utilizes gradients to understand instance deviations and constructs trees to
identify patterns efficiently. Weighted updates adjust instance weights based on prediction
errors, while ensemble building combines individual predictions using their importance.
Regularization techniques control model complexity, preventing overfitting, and control
parameters fine-tune its behavior. By aggregating ensemble predictions weighted by
importance, XGBoost produces accurate predictions, making it effective for various machine
learning tasks.</p>
        <p>In our study, we aimed to boost the accuracy and robustness of our model by integrating
Bagging Classifier with XGBoost through a Voting classifier. The Voting Classifier, a form of
ensemble classifier, combines predictions from multiple base classifiers via a majority or
weighted vote. We adopted hard voting, where the final prediction for an input is determined
by the majority vote of individual model predictions. Mathematically, for N individual models
(f1, f2, ..., fN), the final prediction ypred for input x is obtained using equation (2), where argmax
selects the class label with the highest number of votes. Figure 1 illustrates the flowchart of the
proposed model.</p>
        <p>y<sub>pred</sub> = argmax(sum(f<sub>i</sub>(x))) for i = 1 to N
(2)</p>
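        <p>A minimal sketch of this hard-voting ensemble in scikit-learn follows. It is an illustrative configuration rather than the tuned model evaluated in this work, and it falls back to scikit-learn's GradientBoostingClassifier as the boosted member when the xgboost package is unavailable:</p>

```python
# Sketch of the proposed ensemble: a bagging classifier and a gradient
# booster combined by hard (majority) voting, as in equation (2).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

try:
    from xgboost import XGBClassifier as Booster       # use XGBoost if installed
except ImportError:
    from sklearn.ensemble import GradientBoostingClassifier as Booster

X, y = make_classification(n_samples=768, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

vote = VotingClassifier(
    estimators=[
        ("bag", BaggingClassifier(DecisionTreeClassifier(),
                                  n_estimators=50, random_state=1)),
        ("boost", Booster(random_state=1)),
    ],
    voting="hard",  # final label = majority vote of member predictions
)
vote.fit(X_tr, y_tr)
print(f"voting ensemble test accuracy: {vote.score(X_te, y_te):.3f}")
```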
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussion</title>
      <p>The experimental hardware setup comprised an Intel Core i3-3110M CPU clocked at 2.40 GHz,
paired with 8 GB of RAM and a 1 TB HDD for storage. Python 3.7 was used
to develop the machine learning model, with the following libraries employed: NumPy as a
fundamental tool for mathematical operations, pandas for efficient data loading, and scikit-learn
providing a suite of base classifiers.</p>
      <p>Precision = TP / (TP + FP)
(3)
Recall = TP / (TP + FN)
(4)
F1-score = TP / (TP + ½(FP + FN))
(5)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
(6)</p>
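      <p>A small sketch computing equations (3)-(6) directly from confusion-matrix counts (the counts below are illustrative, chosen only to exercise the formulas):</p>

```python
# The four metrics of equations (3)-(6), computed from the four
# confusion-matrix counts of a binary classifier.
def metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)                 # eq. (3)
    recall = tp / (tp + fn)                    # eq. (4)
    f1 = tp / (tp + 0.5 * (fp + fn))           # eq. (5)
    accuracy = (tp + tn) / (tp + tn + fp + fn) # eq. (6)
    return precision, recall, f1, accuracy

# Illustrative counts for a held-out test set.
p, r, f1, acc = metrics(tp=67, tn=120, fp=2, fn=15)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f} accuracy={acc:.3f}")
```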
      <p>In binary classification evaluation, four pivotal terms emerge: True Positive (TP), True
Negative (TN), False Positive (FP), and False Negative (FN). TP denotes correctly classified
positive instances, TN signifies correctly classified negative instances, FP represents negative
instances incorrectly classified as positive, and FN indicates positive instances incorrectly
classified as negative. These counts underpin the evaluation metrics given in equations (3-6):
precision measures the proportion of predicted positives that are correct, recall gauges the
proportion of actual positives correctly identified, F1 score offers a balanced assessment of
precision and recall, and accuracy quantifies the overall correctness of the classifier's
predictions.</p>
      <p>As demonstrated in Table 2, our approach exhibited strong predictive capability across
multiple datasets. Specifically, high accuracy rates of 91% and 92.7% were achieved using the
Pima and Pima extended datasets, respectively, indicating the robustness of our model. The
precision values for the Pima and Pima extended datasets were notably high at 89.55% and
97.1%, respectively, indicating a high proportion of true positive predictions. Similarly, solid
recall values of 81.08% and 81.7% were obtained for the Pima and Pima extended datasets,
respectively, demonstrating the model's ability to accurately identify actual positives.
Maintaining a balance between recall and precision, crucial in medical diagnostic models, our
algorithm delivered F1-scores of 85% and 88.7% for the Pima and Pima extended datasets,
respectively, indicating a suitable balance between the two metrics. Furthermore, when applied
to the FHGDD dataset, our approach achieved an accuracy rate of 85.6% and a precision of
87.8%, albeit with a slightly lower recall of 68.7%. Nevertheless, the F1-score of 77.1%
demonstrates a reasonable balance between precision and recall. These results underscore the
effectiveness of our approach in accurately predicting the presence of diabetes across different
datasets. For further context and comparison, detailed performance metrics relative to other
state-of-the-art methods are provided in Table 3, and Figure 2 respectively.</p>
      <p>Let's explore two contrasting scenarios: one featuring a non-diabetic individual and the other
a diabetic patient. In the first case, the model identifies the patient, with attributes such as one
pregnancy, blood glucose level of 119, blood pressure of 78, skin thickness of 29, insulin level of
180, BMI of 38.19, diabetes pedigree function of 0.53, and age of 25, as non-diabetic. Conversely,
the second scenario portrays a patient with four pregnancies, blood glucose level of 129, blood
pressure of 70, skin thickness of 18, insulin level of 122, BMI of 29.43, diabetes pedigree function
of 1.17, and age of 41, classified as diabetic. This classification suggests that the combined
characteristics in the latter set imply a higher likelihood of diabetes, according to the model's
interpretation.</p>
      <p>Combining XGBoost with decision trees for binary classification tasks leverages the
strengths of both algorithms. Decision trees offer intuitive interpretations, robustness to noisy
data, and the ability to handle both numerical and categorical features effectively. XGBoost
enhances performance through boosting, regularization techniques, and scalability, making it
suitable for large datasets.</p>
      <p>The results obtained from our model using different datasets showcase its effectiveness in
predicting diabetes compared to various existing techniques. Our model achieved high accuracy
rates when applied to the Pima and extended Pima datasets, respectively. Additionally, when
our model was applied to the FHGDD dataset, it achieved a respectable accuracy rate of 85.6%,
indicating its applicability across different datasets.</p>
      <p>Comparing our results to those of other techniques, we observe that our model outperforms
several state-of-the-art methods in terms of accuracy. For instance, the soft voting classifier,
XGBoost, Bagging, Random Forest, and XGBoost+ Data feature stitching techniques achieved
accuracy rates of 79.08%, 75.75%, 74.89%, 77.48%, and 80.2%, respectively. Our model's accuracy
surpasses these benchmarks, highlighting its superior predictive performance. Moreover, our
approach also compares favorably to other ensemble methods, such as HM-Bag Moov Voting
Classifier and Voting Classifier, which achieved accuracy rates of 77.21% and 86%, respectively.</p>
      <p>In this study, we compare the accuracy obtained by our proposed model with other
state-of-the-art models. While ensemble learning, which combines multiple algorithms, often demands
significant computational resources, our approach mitigates this challenge by ensembling just
two robust algorithms, Decision Trees (DT) and XGBoost. Additionally, some methods in
related works utilize more intricate techniques such as deep learning, which may enhance
performance but at the expense of increased computational demands and interpretability
challenges. Moreover, disparities in data preprocessing techniques and evaluation metrics
further complicate direct comparisons between different models.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study aimed to predict diabetes using machine learning techniques on the Pima Indian
Diabetes dataset. XGBoost and Bagging with Decision Trees were among the techniques used,
along with data preprocessing techniques such as median imputation. For diabetes prediction, our
method produced remarkable accuracy rates of 92.7% with the extended Pima dataset and 91%
with the Pima dataset. With a precision of 89.55% for Pima and 97.1% for extended Pima, our
model demonstrated a high percentage of accurate positive predictions. The model's recall
values of 81.08% for Pima and 81.7% for extended Pima showed that it could recognize real
positive cases.</p>
      <p>These findings highlight the efficacy of our approach in predicting diabetes, although with
acknowledgment of potential biases and incomplete data. Future research should prioritize
addressing these limitations by diversifying datasets and incorporating more comprehensive
medical information to reinforce the model's accuracy and robustness. Moreover, ensuring the
model's applicability across diverse demographic groups is imperative for its generalizability.</p>
      <p>In addition, while this study showcased promising outcomes with XGBoost and Bagging
with Decision Trees, it's essential to explore additional algorithms and ensemble methods, such
as stacking, which may offer complementary benefits.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bashir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Qamar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F. H.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <article-title>"IntelliHealth: A medical decision support application using a novel weighted multi-layer classifier ensemble framework,"</article-title>
          <source>Journal of Biomedical Informatics</source>
          , vol.
          <volume>59</volume>
          , pp.
          <fpage>185</fpage>
          -
          <lpage>200</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Boughareb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bensalah</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Seridi</surname>
          </string-name>
          ,
          <article-title>"A Hybrid GAN-ANN-Based Model for Diabetes Prediction,"</article-title>
          <source>International Journal of Scientific Research in Science and Technology</source>
          , vol.
          <volume>10</volume>
          , no.
          <issue>14</issue>
          , pp.
          <fpage>30</fpage>
          -
          <lpage>41</lpage>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <article-title>"An ensemble approach for classification and prediction of diabetes mellitus using soft voting classifier,"</article-title>
          <source>International Journal of Cognitive Computing in Engineering</source>
          , vol.
          <volume>2</volume>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>46</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Larabi-Marie-Sainte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Aburahmah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Almohaini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Saba</surname>
          </string-name>
          ,
          <article-title>"Current techniques for diabetes prediction: Review and case study,"</article-title>
          <source>Applied Sciences</source>
          , vol.
          <volume>14</volume>
          , pp.
          <fpage>2519</fpage>
          -
          <lpage>2528</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Latha</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Jeeva</surname>
          </string-name>
          ,
          <article-title>"Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques,"</article-title>
          <source>Inform. Med. Unlocked</source>
          , vol.
          <volume>16</volume>
          , p.
          <fpage>100203</fpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Fu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>"Diabetes prediction based on XGBoost algorithm,"</article-title>
          <source>in IOP Conference Series: Materials Science and Engineering</source>
          , vol.
          <volume>768</volume>
          , no.
          <issue>7</issue>
          , p.
          <fpage>072093</fpage>
          ,
          Mar.
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Paranjape</surname>
          </string-name>
          et al.,
          <article-title>"An agent-based simulation system for modeling a diabetic patient,"</article-title>
          <source>International Journal of Intelligent Information and Database Systems</source>
          , vol.
          <volume>4</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>264</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          Pima Indian Diabetes Database. Retrieved from https://www.kaggle.com/uciml/pima-indians-diabetes-database
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Katheria</surname>
          </string-name>
          ,
          <article-title>"Ensemble method based predictive model for analyzing disease datasets: A predictive analysis approach,"</article-title>
          <source>Health Technol.</source>
          , vol.
          <volume>9</volume>
          , pp.
          <fpage>533</fpage>
          -
          <lpage>545</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G. N.</given-names>
            <surname>Sundar</surname>
          </string-name>
          et al.,
          <article-title>"Intelligent computational techniques of machine learning models for demand analysis and prediction,"</article-title>
          <source>International Journal of Intelligent Information and Database Systems</source>
          , vol.
          <volume>16</volume>
          , no.
          <issue>1</issue>
          , p.
          <fpage>39</fpage>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F. H. K.</given-names>
            <surname>Tanaka</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Aranha</surname>
          </string-name>
          ,
          <article-title>"Data augmentation using GANs,"</article-title>
          <source>in 2020 IEEE International Conference on Big Data</source>
          , pp.
          <fpage>5048</fpage>
          -
          <lpage>5053</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          XGBoost Documentation. Retrieved from https://xgboost.readthedocs.io/en/latest/
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          et al.,
          <article-title>"Generative adversarial networks,"</article-title>
          <source>in Proceedings of the 27th International Conference on Neural Information Processing Systems</source>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          , Eds. Cambridge, MA: MIT Press,
          <year>2014</year>
          , pp.
          <fpage>2672</fpage>
          -
          <lpage>2680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>"A Novel Ensemble Learning Paradigm for Medical Diagnosis With Imbalanced Data,"</article-title>
          <source>IEEE Access</source>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>171263</fpage>
          -
          <lpage>171280</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>"Sample and feature selecting based ensemble learning for imbalanced problems,"</article-title>
          <source>Applied Soft Computing</source>
          , vol.
          <volume>113</volume>
          , p.
          <fpage>107884</fpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W.</given-names>
            <surname>Węgier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koziarski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <article-title>"Multicriteria classifier ensemble learning for imbalanced data,"</article-title>
          <source>IEEE Access</source>
          , vol.
          <volume>10</volume>
          , pp.
          <fpage>16807</fpage>
          -
          <lpage>16818</lpage>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Diabetes Dataset. Retrieved from https://www.kaggle.com/datasets/johndasilva/diabetes</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>