<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Irish Conference on Artificial Intelligence and Cognitive Science, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Handling Class Imbalance via Counterfactual Generation in Medical Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Asifa Mehmood Qureshi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhishek Kaushik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gilbert Regan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin McDaid</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fergal McCaffery</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Regulated Software Research Centre, Dundalk Institute of Technology</institution>
          ,
          <addr-line>Dundalk</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>Real-world datasets often contain uneven class distributions that, if not handled properly, result in biased Machine Learning (ML) models. Class balancing is therefore important to avoid overfitting, improve model generalisation and ensure fairness. Most state-of-the-art techniques used to balance datasets do not take into account the majority class samples, which contain greater distributional information about the dataset. In this article, we therefore propose a method that generates counterfactuals using majority-class samples. The method takes an imbalanced dataset as input, normalises it, and trains a Support Vector Machine (SVM) classifier on it. Afterwards, the majority class samples that lie near the decision boundary are extracted and perturbed until they are classified as minority class samples. The method is evaluated on two benchmark datasets, i.e., the Diagnostic Wisconsin Breast Cancer dataset and the Eye State Classification Electroencephalogram (EEG) dataset. The results show that our approach produces reasonable accuracy, Area Under Curve (AUC), and Geometric Mean (Gmean) scores. The F1-score for the minority classes also improved when they were oversampled using counterfactuals. Moreover, the model achieved promising results when compared with state-of-the-art techniques.</p>
      </abstract>
      <kwd-group>
        <kwd>Boundary enhancement</kwd>
        <kwd>Over-sampling</kwd>
        <kwd>SVM</kwd>
        <kwd>decision boundary</kwd>
        <kwd>classification</kwd>
        <kwd>counterfactuals</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The class imbalance problem typically occurs when one class, called the majority class, has many more
instances than the others [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It is considered one of the significant challenges in relation to data
quality [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Imbalanced datasets exist in numerous real-world fields such as text classification [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], object
detection [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], network security [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], medical diagnosis [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and many more. Machine Learning (ML)
classifiers trained on imbalanced datasets are skewed towards the majority classes and frequently
misclassify instances from minority classes, resulting in biased outcomes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. These biases may result
in discrimination in automated decision-making especially in critical sectors like healthcare [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. For
example, in a breast cancer dataset, if there are fewer samples with a positive cancer diagnosis than
healthy patient samples, a classifier trained on such a dataset may misclassify a patient with cancer as
healthy, which can lead to life-threatening consequences [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        There are several methods to balance datasets, including algorithm-level methods, data-level methods,
and hybrid methods [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Data-level methods are widely used because these methods directly address
the shortcomings of data thus improving the data quality on which the model is being built. These
methods tend to transform the original dataset to change the class distribution via re-sampling [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Re-sampling includes both under-sampling and over-sampling: under-sampling removes
majority class samples from the dataset, whereas over-sampling increases the number of
minority class samples by generating synthesised data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Under-sampling may remove data
points that contain important information, and it reduces the dataset size which may worsen the ML
model performance [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Conversely, over-sampling adds information to the minority class
without discarding any existing samples, which helps prevent minority instances from being misclassified [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Several over-sampling methods use minority samples for new data generation. However, these
methods ignore the majority class entirely in favour of focusing on minority class characteristics, which
provide little distributional information. Consequently, they do not capture the global properties of
the dataset, which are defined by the majority class distribution, and may produce inaccurate synthetic training
examples [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        In this paper, an over-sampling approach is proposed that uses majority class data samples to generate
minority class data. In this method, the majority class samples, referred to as actual samples, are perturbed
to generate counterfactuals that lie in the minority sample region. The method takes an imbalanced,
normalised binary class dataset as input. A Support Vector Machine (SVM) classifier is trained on the
dataset. The majority class samples that are closest to the classifier decision boundary
are extracted. These data samples are perturbed until they move into the minority class space.
Two publicly available binary class medical datasets are used to validate our proposed model. The
contributions of the paper are as follows:
        <list list-type="bullet">
          <list-item>
            <p>A method that uses majority class samples to generate minority data points. These newly
generated data points can be termed counterfactuals.</p>
          </list-item>
          <list-item>
            <p>In order to lower the computation overhead and enhance the decision boundary, we trained
an SVM classifier to extract data points closest to the decision boundary rather than selecting
random samples from the majority class [<xref ref-type="bibr" rid="ref15">15</xref>].</p>
          </list-item>
          <list-item>
            <p>Selecting data samples near the decision boundary, which contains the support vectors, also
keeps the deviation needed to turn majority class samples into minority class samples to a minimum,
rather than limiting the distance using a constant.</p>
          </list-item>
          <list-item>
            <p>The performance of the model is evaluated on two benchmark medical datasets using various
evaluation metrics.</p>
          </list-item>
        </list>
      </p>
      <p>The remainder of the paper is structured as follows: Section 2 reviews relevant
oversampling techniques. Section 3 explains the overall methodology. Section 4 describes the datasets and
the corresponding evaluation results. Finally, Section 5 concludes the discussion and lists future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        The problem of class imbalance has drawn a lot of attention from the scientific community. This section
gives a summary of the techniques for over-sampling. For better understanding, we categorise the
literature into two streams: Statistical and Machine Learning (ML) Methods and Deep Learning (DL)
methods.
      </p>
      <sec>
        <title>2.1. Statistical and Machine Learning (ML) methods</title>
        <p>
          Several studies have been carried out to handle the issue of class imbalance within datasets. One of the
most used techniques is the Synthetic Minority Over-sampling Technique (SMOTE) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. It generates
new samples by interpolating between minority samples and their nearest neighbours. A variant of
SMOTE is Borderline-SMOTE, which generates minority samples at the borderline to enhance
the decision boundary of the classifier [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. More than 81 variants of SMOTE have been proposed in
the existing research work. The majority of these methods focus on utilising minority-class samples
to produce new artificial samples, which may lead to overfitting. In another study by Sharma et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ],
the majority class samples were used to generate synthetic data. They utilised the Mahalanobis distance
to generate minority samples that are at an equal distance from the majority class samples. However,
this technique does not consider boundary samples in its generation process. In another study [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ],
SVM-SMOTE is combined with ensemble learning to enhance the performance of the classifier. The
primary goal is to find borderline cases in the minority class by using Kernel Density Estimation (KDE).
After the identification of borderline instances, synthetic interpolation is used to generate new samples
between the marginal instances and their current minority-class neighbours. Moreover, Wang et al.
[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] also presented a model that utilises majority-class samples to generate minority-class samples. The
model produces reasonable results, but the random selection of majority class samples increases the
computational cost and requires multiple iterations to generate minority class samples that are at a
minimum distance from the majority samples.
        </p>
      </sec>
      <sec>
        <title>2.2. Deep Learning (DL) methods</title>
        <p>
          Deep learning (DL) has also been used to generate synthetic data due to its advanced capabilities. For
this purpose, Generative Adversarial Networks (GANs) are extensively used. In [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], the authors created
synthetic electroencephalography (EEG) datasets using a GAN. Also, to balance the dataset used for
automatic signal modulation classification, Patel et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] employed a Conditional-GAN (CGAN) for
data augmentation. Although the performance of the model was good, deep learning models are
computationally complex when compared to conventional methods. Additionally, deep learning models
lack explainability, thus providing minimal control over the parameters and the data-generating process
[21, 22].
        </p>
        <p>Therefore, we present a statistical over-sampling method that, unlike other techniques, utilises an SVM
classifier and majority class samples to balance the dataset.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Figure 1 provides an overview of the proposed workflow. Initially, the dataset is normalised
and an SVM classifier is trained on the imbalanced dataset. Then, the majority class samples near the
decision boundary are extracted using the Euclidean distance and their corresponding counterfactuals
are generated. If a generated counterfactual is classified as a minority class
sample by the SVM classifier, it is added to the new dataset; otherwise, it is discarded.
This process is repeated until a balanced dataset is obtained. Afterwards, different machine learning
classifiers are trained on the newly generated balanced dataset and their performance is evaluated in
terms of accuracy, F1-score, Area Under Curve (AUC), and Geometric Mean (Gmean).
      </p>
      <sec>
        <title>3.1. Data normalisation</title>
        <p>Data normalisation transforms numerical features into a common range to prevent
larger numerical feature values from dominating smaller ones [23]. It is an
important preprocessing step to enhance the classification performance of the classifier. The dataset
was normalised as follows:</p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>k' = a + (b - a) \times \frac{k - k_{\min}}{k_{\max} - k_{\min}}</tex-math>
        </disp-formula>
        <p>where k′ is the normalised feature value, a and b are the desired minimum and maximum values of
the normalised range, k represents the original feature value, and kmin and kmax represent the minimum
and maximum of the original feature values. In our case, we set a and b to 5
and 20 because normalising within a narrow range helps preserve the distribution shape and optimise
the performance of the data generation algorithm.</p>
      </sec>
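      <p>The min-max scaling of Equation (1) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes one feature held in a plain Python list and the target range a = 5, b = 20 stated above. The helper name <monospace>min_max_scale</monospace>, the handling of constant features and the sample values are assumptions made for the example.</p>
      <preformat>
```python
def min_max_scale(values, a=5.0, b=20.0):
    """Rescale one feature column to the range [a, b] as in Equation (1)."""
    k_min, k_max = min(values), max(values)
    if k_max == k_min:
        # Constant feature: the ratio is undefined, so map everything to a (assumption).
        return [a for _ in values]
    span = k_max - k_min
    return [a + (b - a) * (k - k_min) / span for k in values]


feature = [2.0, 4.0, 6.0, 10.0]  # illustrative values only
scaled = min_max_scale(feature)
print(scaled)  # the smallest value maps to a, the largest to b
```
      </preformat>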
      <sec id="sec-3-1">
        <title>3.2. Train SVM classifier</title>
        <p>After normalisation, an SVM classifier is trained on the original dataset to learn the decision boundary
that separates the minority and majority class instances. SVM is a supervised learning algorithm
that separates the data linearly by finding the hyperplane with the widest possible margin between
the classes [23]. Then, the samples from the majority class that are nearest to the SVM classifier
decision boundary are extracted based on Euclidean distance, using the imbalance ratio, to generate
counterfactuals as shown in Figure 2.</p>
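        <p>The extraction step can be sketched as below, under the assumption of a linear SVM whose weight vector <monospace>w</monospace> and bias <monospace>b</monospace> have already been learned, so that the Euclidean distance of a point to the boundary is |w·x + b| / ‖w‖. The hyperplane, the toy samples and the parameter <monospace>n_select</monospace> (standing in for the count derived from the imbalance ratio) are hypothetical.</p>
        <preformat>
```python
import math


def distance_to_hyperplane(x, w, b):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))


def nearest_majority_samples(samples, labels, majority_label, w, b, n_select):
    """Return the n_select majority-class samples closest to the boundary."""
    majority = [x for x, y in zip(samples, labels) if y == majority_label]
    majority.sort(key=lambda x: distance_to_hyperplane(x, w, b))
    return majority[:n_select]


# Hypothetical boundary x1 + x2 - 20 = 0 and toy data (label 1 = majority class).
w, b = [1.0, 1.0], -20.0
samples = [[6.0, 7.0], [9.0, 10.0], [5.0, 5.0], [12.0, 12.0], [11.0, 13.0]]
labels = [1, 1, 1, 0, 0]
selected = nearest_majority_samples(samples, labels, 1, w, b, n_select=2)
print(selected)
```
        </preformat>
        <p>With a kernel SVM the same ranking could be based on the magnitude of the classifier's decision function instead of the closed-form distance used here.</p>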
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Counterfactual generation</title>
        <p>To generate counterfactuals, we applied regular perturbation to each of the selected samples from the
majority class. To perturb a sample, we used the truncated normal distribution F(Δ(kp)), which
is the probability distribution obtained from a normally distributed random variable by bounding
the generated counterfactuals from both below and above [25], as shown in Figure 3.</p>
        <p>
          For any qth feature of the actual sample k, we utilise the following conditional probabilities to estimate
the distribution of the perturbation Δ(kpq) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
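        <disp-formula>
          <label>(2)</label>
          <tex-math><![CDATA[F_{pq}\left(\Delta k_{pq} \mid K_{pq}, K_q^{-}, K_q^{+}, \sigma\right) =
\begin{cases}
\dfrac{\frac{1}{\sigma}\,\psi\left(\frac{\Delta k_{pq}}{\sigma}\right)}{\Phi\left(\frac{K_q^{+} - K_{pq}}{\sigma}\right) - \Phi\left(\frac{K_q^{-} - K_{pq}}{\sigma}\right)} & \text{if } K_q^{-} \le K_{pq} + \Delta k_{pq} \le K_q^{+}, \\
0 & \text{otherwise}
\end{cases}]]></tex-math>
        </disp-formula>
        <p>where Kq− and Kq+ represent the minimum and maximum values of the qth feature in the original
dataset K, respectively, and σ represents the standard deviation of the qth feature. ψ(Δkpq/σ) is the
probability density function of the standard normal distribution, given below:</p>
        <disp-formula>
          <label>(3)</label>
          <tex-math>\psi\left(\frac{\Delta k_{pq}}{\sigma}\right) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\Delta k_{pq}}{\sigma}\right)^{2}}</tex-math>
        </disp-formula>
        <p>Φ is the cumulative distribution function, given below:</p>
        <disp-formula>
          <label>(4)</label>
          <tex-math>\Phi(g) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{g}{\sqrt{2}}\right)\right] = \int_{-\infty}^{g} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{t^{2}}{2}}\, dt</tex-math>
        </disp-formula>
        <p>where</p>
        <disp-formula>
          <label>(5)</label>
          <tex-math>g = \frac{K_q^{+} - K_{pq}}{\sigma} \quad \text{and} \quad g = \frac{K_q^{-} - K_{pq}}{\sigma}</tex-math>
        </disp-formula>
        <p>and erf(·) represents the Gaussian error function. With this construction, the perturbed value of the
qth feature never leaves the range [Kq−, Kq+] of that feature.</p>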
        <p>Now, to generate Δkpq that follows the distribution Fpq, we used the inverse transform method
where the perturbation is given as follows:</p>
        <disp-formula>
          <label>(6)</label>
          <tex-math>\Delta k_{pq} = \Phi^{-1}\left(\Phi(\alpha) + R \cdot \left(\Phi(\beta) - \Phi(\alpha)\right)\right)\sigma + K_{pq}</tex-math>
        </disp-formula>
        <p>where R is a random number in the range [0, 1], and α and β are defined as follows:</p>
        <disp-formula>
          <label>(7)</label>
          <tex-math>\alpha = \frac{K_q^{+} - K_{pq}}{\sigma}</tex-math>
        </disp-formula>
        <disp-formula>
          <label>(8)</label>
          <tex-math>\beta = \frac{K_q^{-} - K_{pq}}{\sigma}</tex-math>
        </disp-formula>
        <p>In the end, the perturbation on the actual data sample can be defined as:</p>
        <disp-formula>
          <label>(9)</label>
          <tex-math>S_{pq} = \left\{ \Delta k_p \mid \Delta k_{pq} \sim F_{pq},\; k_p' = k_p + \Delta k_p \right\}</tex-math>
        </disp-formula>
        <p>where</p>
        <disp-formula>
          <label>(10)</label>
          <tex-math>k_p \in K_0, \quad f(k_p) = n, \quad f(k_p') = m</tex-math>
        </disp-formula>
        <p>Here, kp and kp′ are the actual and counterfactual data samples respectively, f(·) is the classifier
function, and n and m are the class labels. After generating counterfactuals, i.e., new data samples that
the SVM classifier assigns to the minority class after perturbation, we obtain a new
balanced dataset that combines actual and synthetic data samples.</p>
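        <p>The inverse transform step can be illustrated with the standard library alone. The sketch below is an assumption-laden reading rather than the authors' code: it draws a zero-centred perturbation and returns it, so the counterfactual feature value is the original value plus the draw and stays inside [Kq−, Kq+]. It assumes the original value already lies in that range. The function name <monospace>sample_perturbation</monospace>, the clamping constant and the numeric values are illustrative.</p>
        <preformat>
```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides cdf() and inv_cdf()


def sample_perturbation(k_pq, k_min, k_max, sigma, rng=random):
    """Draw a perturbation so that k_pq + perturbation stays in [k_min, k_max].

    Inverse transform sampling of a zero-mean normal with standard deviation
    sigma, truncated to [k_min - k_pq, k_max - k_pq].
    """
    alpha = (k_max - k_pq) / sigma  # standardised upper bound, Equation (7)
    beta = (k_min - k_pq) / sigma   # standardised lower bound, Equation (8)
    r = rng.random()                # uniform random number in [0, 1)
    u = STD_NORMAL.cdf(alpha) + r * (STD_NORMAL.cdf(beta) - STD_NORMAL.cdf(alpha))
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf needs a value strictly inside (0, 1)
    return STD_NORMAL.inv_cdf(u) * sigma


rng = random.Random(0)
k_pq, k_min, k_max, sigma = 12.0, 5.0, 20.0, 3.0  # illustrative values
draws = [k_pq + sample_perturbation(k_pq, k_min, k_max, sigma, rng) for _ in range(1000)]
print(min(draws), max(draws))  # both stay within the feature range
```
        </preformat>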
        <p>Algorithm 1 summarises the steps of generating counterfactuals.</p>
        <p>Algorithm 1: Oversampling via counterfactual generation</p>
        <preformat>
Input: imbalanced binary-label dataset K = {k1, k2, k3, ..., kn}
Output: Kbalanced

Knorm = Normalise(K)                                  // normalise the dataset
f = train an SVM classifier on the dataset Knorm
Knorm|near the decision boundary = extract data points of Knorm near the decision boundary of f
Ksynthetic = {}
for each kp ∈ Knorm|near the decision boundary do
    for j = 1 to T do                                 // perturb each sample T times to control the number of perturbations
        Δkp = (Δkpq ∼ Fpq for each feature q)         // perturb features by sampling from Fpq
        kp′ = kp + Δkp
        if f(kp) = n and f(kp′) = m then              // n is the majority class and m is the minority class
            Ksynthetic ← Ksynthetic ∪ {kp′}           // insert the counterfactual into the synthetic dataset
        end if
    end for
end for
Kbalanced = Knorm ∪ Ksynthetic                        // final balanced dataset
return Kbalanced
</preformat>
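        <p>A compact, runnable sketch of the loop in Algorithm 1 is given below. It is not the authors' implementation: the trained SVM is replaced by a hypothetical stand-in function <monospace>classify</monospace>, the near-boundary samples are passed in directly, and the per-feature standard deviation is taken over the whole toy dataset. All names and values are assumptions made for the example.</p>
        <preformat>
```python
import random
from statistics import NormalDist, pstdev

STD_NORMAL = NormalDist()


def truncated_step(value, low, high, sigma, rng):
    """Perturbation for one feature, keeping value + step inside [low, high]."""
    if sigma == 0:
        return 0.0  # constant feature: nothing to perturb (assumption)
    hi_cdf = STD_NORMAL.cdf((high - value) / sigma)
    lo_cdf = STD_NORMAL.cdf((low - value) / sigma)
    u = lo_cdf + rng.random() * (hi_cdf - lo_cdf)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside its open domain
    return STD_NORMAL.inv_cdf(u) * sigma


def oversample(data, near_boundary, classify, majority, minority, trials, rng):
    """Return (data plus accepted counterfactuals, accepted counterfactuals)."""
    columns = list(zip(*data))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    synthetic = []
    for k_p in near_boundary:
        for _ in range(trials):
            k_new = [v + truncated_step(v, lo, hi, s, rng)
                     for v, lo, hi, s in zip(k_p, lows, highs, sigmas)]
            # Keep only perturbations that flip the label from majority to minority.
            if classify(k_p) == majority and classify(k_new) == minority:
                synthetic.append(k_new)
    return data + synthetic, synthetic


def classify(x):
    """Stand-in for the trained SVM: label 0 (minority) above the line x1 + x2 = 20."""
    return 0 if x[0] + x[1] > 20.0 else 1


data = [[6.0, 7.0], [9.0, 10.0], [5.0, 5.0], [8.0, 11.0], [12.0, 12.0], [11.0, 13.0]]
near_boundary = [[9.0, 10.0], [8.0, 11.0]]
balanced, synthetic = oversample(data, near_boundary, classify, majority=1, minority=0,
                                 trials=20, rng=random.Random(42))
print(len(synthetic), "counterfactuals accepted")
```
        </preformat>
        <p>In this sketch the number of trials plays the role of T; stopping as soon as the classes are balanced would be a small extension.</p>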
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Performance evaluation</title>
      <p>4.1. Datasets</p>
      <p>To assess our model, we used two benchmark datasets, i.e., the Diagnostic Wisconsin Breast Cancer and
the Eye State Classification EEG datasets, as these medical datasets have imbalanced binary classes with
different imbalance ratios and only continuous features. Both datasets are described below:</p>
      <p>Diagnostic Wisconsin Breast Cancer Dataset: The Diagnostic Wisconsin Breast Cancer dataset [24] is a
multivariate dataset consisting of 30 features and 569 samples. The binary output label classifies the
tumour as malignant (0) or benign (1). The majority class for this dataset is 1 and the minority class is
0.</p>
      <p>Eye State Classification Electroencephalogram (EEG) dataset: The Eye State Classification EEG dataset
[25] is a multivariate time series dataset comprising 14 features and 14980 samples. The output label
is 0 or 1, indicating that the eye is open or closed, respectively. The majority class for
this dataset is 0 and the minority class is 1.</p>
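      <p>For orientation, the imbalance ratio and the number of minority samples needed to reach balance can be computed as sketched below. The per-class counts are those commonly reported for the UCI versions of the two datasets and are assumed, not confirmed, to match the figures used in this study. The definition of the imbalance ratio as majority count divided by minority count is likewise an assumption.</p>
      <preformat>
```python
def imbalance_summary(n_majority, n_minority):
    """Imbalance ratio and number of minority samples needed to reach balance."""
    ratio = n_majority / n_minority
    to_generate = n_majority - n_minority
    return ratio, to_generate


# (majority count, minority count); assumed from the UCI dataset descriptions.
counts = {
    "Wisconsin Breast Cancer": (357, 212),  # benign (1), malignant (0)
    "EEG Eye State": (8257, 6723),          # open (0), closed (1)
}
for name, (n_maj, n_min) in counts.items():
    ratio, to_generate = imbalance_summary(n_maj, n_min)
    print(f"{name}: IR = {ratio:.2f}, samples to generate = {to_generate}")
```
      </preformat>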
      <p>
        Table 1 displays the imbalance ratio of both datasets as well as the number of samples to be generated
per class.
      </p>
      <p>4.2. Evaluation of our method</p>
      <p>
To evaluate the generated counterfactual samples, we trained commonly used ML classifiers on the
dataset because they generalise well on diverse datasets. These classifiers include Random Forest (RF),
Logistic Regression (LR), K-Nearest Neighbour (KNN) and Decision Tree (DT). All these classifiers are trained
using default parameter settings. The datasets are split into train and test sets in a 70:30 ratio. We used
Accuracy, Area Under Curve (AUC), Geometric Mean (Gmean) and F1-score to evaluate the performance
of our proposed model. These metrics are comprehensive and widely used in the literature to
assess classifier performance on imbalanced datasets [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In these metrics, the False Positive Rate (FPR) is the proportion of actual negative cases that
the classifier labels as positive.
      </p>
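      <p>As a sketch, the four metrics can be computed from binary confusion-matrix counts using their standard definitions. The single-threshold AUC form (1 + TPR − FPR) / 2 is an assumption on our part, suggested by the reference to FPR above, and the counts passed in are illustrative rather than results from this study.</p>
      <preformat>
```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Accuracy, AUC, Gmean and F1 from binary confusion-matrix counts."""
    tpr = tp / (tp + fn)        # true positive rate (recall)
    tnr = tn / (tn + fp)        # true negative rate (specificity)
    fpr = fp / (fp + tn)        # actual negatives labelled as positive
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "auc": (1.0 + tpr - fpr) / 2.0,  # assumed single-threshold form
        "gmean": math.sqrt(tpr * tnr),
        "f1": 2.0 * precision * tpr / (precision + tpr),
    }


scores = imbalance_metrics(tp=40, fp=10, tn=90, fn=10)  # illustrative counts
print(scores)
```
      </preformat>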
      <p>Figure 4 shows the comparison of accuracy and F1-score before and after applying our proposed
method.</p>
      <p>The original dataset was biased toward the majority class, whereas the synthetic dataset generated
using counterfactuals is balanced for each class label. Therefore, although the accuracy for the Wisconsin
dataset in Figure 4(a) is slightly lower than on the original dataset, overall our method
maintains good accuracy scores for both datasets. Moreover, Figure 4(b) demonstrates that the F1-score,
particularly for the minority class, has improved for both datasets, which represents a better
generalisation of the model on each class label. For example, for the Wisconsin breast cancer dataset,
the F1-score of the DT for class 0 (minority) has increased from 0.92 to 0.93. Similarly, for the eye state
classification dataset, the F1-score of RF for class 1 (minority) has increased from 0.91 to 0.94.</p>
      <p>4.3. Comparison with other State-of-the-Art techniques</p>
      <p>The performance is also compared with other conventional methods, including SMOTE,
Borderline, Safe-level, and ADASYN. Table 2 and Table 3 show the values of our evaluation metrics
for the Wisconsin Breast Cancer and Eye State Classification datasets, respectively.
The results indicate that the performance of our algorithm is comparable to the existing conventional
synthetic data generation models. Our approach yields results comparable to the Borderline approach
for all three metrics, i.e., Accuracy, AUC, and Gmean. In terms of classifier performance, LR and KNN
performed well on the Wisconsin Breast Cancer and Eye State Classification datasets, respectively. Additionally, we
statistically compared our method with the Borderline approach, as it performs better than the
other approaches, using a paired t-test. The test is performed on AUC scores, as AUC assesses
classifier performance better in the case of class imbalance. The obtained p-values of 0.91 and 0.31 on the
Wisconsin Breast Cancer and Eye State Classification datasets, respectively, indicate that there is no
statistically significant difference between the performance of the two. Notably, our approach has the potential to
generate counterfactuals with minimum inversion, which enhances the decision boundary of the classifier.</p>
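      <p>The paired t-test mentioned above compares two methods on matched scores. The sketch below computes only the t statistic and its degrees of freedom with the standard library; obtaining the p-value would additionally require the Student t distribution (for example via <monospace>scipy.stats.ttest_rel</monospace>). The AUC values are hypothetical placeholders and are not the scores reported in Table 2 or Table 3.</p>
      <preformat>
```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical AUC scores for four classifiers; NOT values from the paper.
auc_ours = [0.95, 0.97, 0.94, 0.92]
auc_borderline = [0.94, 0.97, 0.95, 0.90]
t_stat, dof = paired_t_statistic(auc_ours, auc_borderline)
print(t_stat, dof)
```
      </preformat>
      <p>A t statistic close to zero relative to its degrees of freedom corresponds to a large p-value, i.e., no evidence of a difference between the two methods.</p>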
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and future work</title>
      <p>In this article, we presented a new counterfactual generation method that generates minority
class samples from majority class samples in order to balance the dataset. The method makes use of the rich
distributional information that lies in the majority class with minimal inversions. The proposed method
is assessed on two benchmark datasets: the Diagnostic Wisconsin Breast Cancer dataset and the Eye
State Classification EEG dataset. The findings indicate that the F1-score for the minority class has
improved, which represents better model generalisation. Furthermore, our method yields promising
AUC and Gmean values in comparison to existing approaches. In future, we will extend our model to
remove any outliers or noisy samples before generating counterfactuals. We will also evaluate our
model on more diverse medical datasets, including different data types and multiclass labels, to increase
its applicability to diversified real-world datasets. Finally, we will extend our experiments to other
classifiers to analyse and address the shortcomings of SVM.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This publication has emanated from research conducted with the financial support of Research Ireland
(RI) under Grant number 21/FFP-A/9255.</p>
      <p>learning, 2020, pp. 31–36.</p>
      <p>[21] W. J. Von Eschenbach, Transparency and the black box problem: Why we do not trust AI, Philosophy
&amp; Technology 34 (2021) 1607–1622.</p>
      <p>[22] S. F. Ahmed, M. S. B. Alam, M. Hassan, M. R. Rozbu, T. Ishtiak, N. Rafa, M. Mofijur, A. Shawkat Ali,
A. H. Gandomi, Deep learning modelling techniques: current progress, applications, advantages,
and challenges, Artificial Intelligence Review 56 (2023) 13521–13617.</p>
      <p>[23] N. G. Ramadhan, Comparative analysis of ADASYN-SVM and SMOTE-SVM methods on the detection
of type 2 diabetes mellitus, Scientific Journal of Informatics 8 (2021) 276–282.</p>
      <p>[24] UCI, Breast cancer Wisconsin (diagnostic), 2024. URL: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic, accessed: 2024-08-15.</p>
      <p>[25] UCI, EEG eye state, 2024. URL: https://archive.ics.uci.edu/dataset/264/eeg+eye+state, accessed:
2024-08-15.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Lalotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sasikala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Rajput</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakshmanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shorfuzzaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alsufyani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Uddin</surname>
          </string-name>
          ,
          <article-title>Addressing binary classification over class imbalanced clinical datasets using computationally intelligent techniques</article-title>
          ,
          <source>in: Healthcare</source>
          , volume
          <volume>10</volume>
          ,
          <publisher-name>MDPI</publisher-name>
          ,
          <year>2022</year>
          , p.
          <fpage>1293</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y. F.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>On the data quality and imbalance in machine learning-based design and manufacturing-a systematic review</article-title>
          ,
          <source>Engineering</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Padurariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Breaban</surname>
          </string-name>
          ,
          <article-title>Dealing with data imbalance in text classification</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>159</volume>
          (
          <year>2019</year>
          )
          <fpage>736</fpage>
          -
          <lpage>745</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Quan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>A class imbalance loss for imbalanced object recognition</article-title>
          ,
          <source>IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing</source>
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>2778</fpage>
          -
          <lpage>2792</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hasanin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Khoshgoftaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Leevy</surname>
          </string-name>
          ,
          <article-title>A comparison of performance metrics with severely imbalanced network security big data</article-title>
          ,
          <source>in: 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>83</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>A novel ensemble learning paradigm for medical diagnosis with imbalanced data</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>171263</fpage>
          -
          <lpage>171280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Napierala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stefanowski</surname>
          </string-name>
          ,
          <article-title>Types of minority class examples and their influence on learning classifiers from imbalanced data</article-title>
          ,
          <source>Journal of Intelligent Information Systems</source>
          <volume>46</volume>
          (
          <year>2016</year>
          )
          <fpage>563</fpage>
          -
          <lpage>597</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <article-title>Leveraging feature bias for scalable misprediction explanation of machine learning models</article-title>
          ,
          <source>in: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1559</fpage>
          -
          <lpage>1570</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Adinarayana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ilavarasan</surname>
          </string-name>
          ,
          <article-title>An efficient decision tree for imbalance data learning using confiscate and substitute technique</article-title>
          ,
          <source>Materials Today: Proceedings</source>
          <volume>5</volume>
          (
          <year>2018</year>
          )
          <fpage>680</fpage>
          -
          <lpage>687</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Khushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shaukat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Hameed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Uddin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Reyes</surname>
          </string-name>
          ,
          <article-title>A comparative performance analysis of data resampling methods on imbalance medical data</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>109960</fpage>
          -
          <lpage>109975</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mohammed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rawashdeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdullah</surname>
          </string-name>
          ,
          <article-title>Machine learning with oversampling and undersampling techniques: overview study and experimental results</article-title>
          ,
          <source>in: 2020 11th international conference on information and communication systems (ICICS)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Douzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bacao</surname>
          </string-name>
          ,
          <article-title>Self-organizing map oversampling (SOMO) for imbalanced data set learning</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>82</volume>
          (
          <year>2017</year>
          )
          <fpage>40</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Shelke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Deshmukh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Shandilya</surname>
          </string-name>
          ,
          <article-title>A review on imbalanced data handling using undersampling and oversampling technique</article-title>
          ,
          <source>Int. J. Recent Trends Eng. Res</source>
          <volume>3</volume>
          (
          <year>2017</year>
          )
          <fpage>444</fpage>
          -
          <lpage>449</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bellinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Krawczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Zaiane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Japkowicz</surname>
          </string-name>
          ,
          <article-title>Synthetic oversampling with the majority class: A new perspective on handling extreme imbalance</article-title>
          ,
          <source>in: 2018 IEEE international conference on data mining (ICDM)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>447</fpage>
          -
          <lpage>456</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
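          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,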
          <article-title>Counterfactual-based minority oversampling for imbalanced classification</article-title>
          ,
          <source>Engineering Applications of Artificial Intelligence</source>
          <volume>122</volume>
          (
          <year>2023</year>
          )
          <fpage>106024</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N. V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Bowyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. O.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. P.</given-names>
            <surname>Kegelmeyer</surname>
          </string-name>
          ,
          <article-title>SMOTE: synthetic minority over-sampling technique</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>16</volume>
          (
          <year>2002</year>
          )
          <fpage>321</fpage>
          -
          <lpage>357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <article-title>Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning</article-title>
          ,
          <source>in: International conference on intelligent computing</source>
          , Springer,
          <year>2005</year>
          , pp.
          <fpage>878</fpage>
          -
          <lpage>887</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nithya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kokilavani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L. A.</given-names>
            <surname>Beena</surname>
          </string-name>
          ,
          <article-title>Balancing cerebrovascular disease data with integrated ensemble learning and SVM-SMOTE</article-title>
          ,
          <source>Network Modeling Analysis in Health Informatics and Bioinformatics</source>
          <volume>13</volume>
          (
          <year>2024</year>
          )
          <fpage>12</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Fahimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Ang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <article-title>Towards EEG generation using GANs for BCI applications</article-title>
          ,
          <source>in: 2019 IEEE EMBS International Conference on Biomedical &amp; Health Informatics (BHI)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <article-title>Data augmentation with conditional GAN for automatic modulation classification</article-title>
          ,
          <source>in: Proceedings of the 2nd ACM Workshop on wireless security and machine</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>