<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The effect of the imbalanced training dataset on the quality of classification of lithotypes via whole core photos</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daria Makienko</string-name>
          <email>dmakienko@slb.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilya Seleznev</string-name>
          <email>iseleznev@slb.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilia Safonov</string-name>
          <email>isafonov@slb.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Schlumberger Moscow Research</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>132</fpage>
      <lpage>136</lpage>
      <abstract>
        <p>Machine learning methods now play an important role in many industries. However, the effectiveness of predictive models depends on the quality of the data sets used to train them. In practice, imbalanced datasets are quite common. For example, in lithotype classification via whole core photos, some lithotypes often dominate the training dataset while others are underrepresented. A significant imbalance in the dataset can affect the quality of the classification, because it is difficult to obtain good generalization for poorly represented classes. First, some characteristics of a given minor lithotype may be absent. Second, some features of a minor class can be ignored due to the imbalance. In this paper, we analyze oversampling of a minor class as one possible way to obtain a balanced dataset, within the framework of speeding up geological core description. We consider examples with different dataset sizes and imbalance characteristics to study the effect of oversampling on the quality of predictive models.</p>
      </abstract>
      <kwd-group>
        <kwd>imbalanced dataset</kwd>
        <kwd>oversampling</kwd>
        <kwd>classification of lithotypes</kwd>
        <kwd>geological core description</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        The lithological description of whole core specimens is a
time-consuming process. Using whole core photos to classify
rocks and mark depth intervals corresponding to these rock
classes can significantly reduce the time required for such
description. Modern methods for automating the description
of rocks by core photographs are based on machine learning.
The most informative features for machine learning are the
color characteristics of core image fragments [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ]. In this
paper we build predictive machine learning-based models
using color characteristics of whole core photos. We consider
an important factor that largely determines the quality of rock
classification, namely the imbalance of the data on which the
predictive model is trained, and one of the approaches to
compensating for this imbalance.
      </p>
      <p>The aim of our study is to determine the parameters of data
samples that can significantly affect the quality of predictive
models, as well as to assess the degree of such influence.
Analyzing characteristics of sample imbalances in a wide
range of values, we want to understand the limitations of the
dataset parameters at which such imbalance can be corrected
to improve the quality of predictive models.</p>
    </sec>
    <sec id="sec-2">
      <title>II. OVERVIEW OF TECHNIQUES FOR PROCESSING IMBALANCED DATA</title>
      <p>
        Real datasets are often incomplete because some data are
difficult to obtain. Different methods are used to compensate for
missing data depending on the data type and the type of task
[
        <xref ref-type="bibr" rid="ref10 ref11 ref3 ref4 ref6 ref7 ref8 ref9">3-11</xref>
        ]. We consider the case of imbalance of classes in the
classification problem, when the data are presented in the form
of numerical features.
      </p>
      <p>
        In the classification problem, it is preferable that the
training examples are evenly distributed among the classes.
Some classifiers weight the errors for different classes equally
and, in the case of imbalance, become more focused on the
overrepresented classes. The reason for this behavior is that
identifying the characteristics of the majority class contributes
more strongly to the target value (quality functional or error
function) than identifying the characteristics of the minority
class. Nevertheless, imbalanced classification data sets are
often observed in applied problems [
        <xref ref-type="bibr" rid="ref10 ref11 ref4 ref6 ref7 ref8 ref9">4-11</xref>
        ]. Data sets for the lithological
description of core are no exception: the class imbalance
reflects the different frequency of occurrence of rocks. The
following methods can be used to train a model on imbalanced
data [9-11]:
      </p>
      <p>1. Balancing, that is changing the ratio of classes in the
sample by increasing the number of instances of the minority
class (oversampling) or reducing the number of instances of
the majority class (undersampling).</p>
      <p>2. Making adjustments to the learning algorithm, for
example, setting different penalties for classes in the support
vector machine or changing the probability threshold for
assigning an example to a class in tree-based models.</p>
      <p>3. Establishing different error costs for classes. The cost
of errors can be taken into account both when changing the
ratio of classes in the sample and when making adjustments
to the learning algorithm.</p>
      <p>4. The use of boosting. Several classifiers that correct
each other's errors can improve the quality of model
predictions based on examples of a minority class.</p>
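      <p>As one illustration of setting different penalties for classes, the sketch below computes inverse-frequency class weights in Python. The weighting rule and the function name inverse_frequency_weights are assumptions made for this example and are not settings reported in this paper; the class counts 1864 and 3260 are taken from one of the training sets described in Section III.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def inverse_frequency_weights(class_counts):
    """Return a weight per class equal to n_total / (n_classes * n_class).

    Rare classes receive weights above 1 and frequent classes weights
    below 1, so every class contributes the same total weight to a
    weighted error function.
    """
    n_total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_total / (n_classes * count)
        for label, count in class_counts.items()
    }


# Hypothetical labels; the counts correspond to the training set 1864:3260.
weights = inverse_frequency_weights({"target": 1864, "other": 3260})
for label, weight in weights.items():
    print(label, round(weight, 3))
```
]]></preformat>
      <p>Under this rule the minority class receives the larger weight, so an error on a minority example is penalized more strongly than an error on a majority example.</p>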
      <p>For lithological description based on full-size core
photographs, we investigate oversampling. This approach
balances samples by increasing the number of examples of the
minority class. Some of the existing oversampling techniques
are as follows:</p>
      <p>1. Random oversampling: Copies of randomly selected
elements of the minority class are created until the required
ratio is reached.</p>
      <p>
        2. SMOTE (Synthetic Minority Oversampling
Technique) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]: New examples are generated by
interpolating between two examples of the minority class: some
i-th example and one of its k nearest neighbors. There are several
options for choosing the i-th example. One can make a random
selection (Regular SMOTE), or select an example depending on
the classes to which the surrounding examples belong (Borderline
SMOTE), on the constructed support vectors, or on the
constructed clusters.
      </p>
      <p>
        3. ADASYN (Adaptive Synthetic) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]: It works
similarly to the SMOTE method but selects the i-th example
of a minority class depending on the coefficient r<sub>i</sub>, which
shows the proportion of examples of other classes around the
i-th example. The greater the coefficient r<sub>i</sub>, the more examples
are generated in the vicinity of the i-th example.
      </p>
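      <p>The interpolation step used by Regular SMOTE can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used in this study; the function name smote_sample, its parameters, and the brute-force neighbor search are assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def smote_sample(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic examples by Regular SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats) of the minority class.
    k: number of nearest minority neighbors considered for each base example.
    """
    n = len(minority)
    if n < 2:
        raise ValueError("at least two minority examples are needed")
    k = min(k, n - 1)  # an example cannot have more than n - 1 neighbors
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(n)  # random selection of the i-th example
        base = minority[i]
        # Brute-force search for the k nearest minority neighbors of base.
        nearest = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbor = minority[rng.choice(nearest)]
        gap = rng.random()  # position on the segment base -> neighbor, in [0, 1)
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbor)])
    return synthetic


# Four hypothetical minority examples with two features each.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sample(points, n_new=3, k=2)
print(len(new_points))
```
]]></preformat>
      <p>Each synthetic example lies on a segment between two existing minority examples, so the generated data cannot leave the region already covered by the minority class.</p>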
    </sec>
    <sec id="sec-3">
      <title>III. SETTING UP AN EXPERIMENT</title>
      <p>There are many lithotypes, so in general lithology
classification is a multiclass problem. For our study we
simplify the problem at this stage and consider a binary
one-vs-rest classifier. To study the influence of the
imbalance on the quality of classification, the depth intervals
were selected so as to obtain balanced data in which the target
lithotype is represented in a variety of ways and its typical
color features are reflected. We call the data obtained
after processing all of the images the initial sample. We then
form subsamples of the initial sample of different sizes, which act
as the minority class with different degrees of imbalance. We train
predictive models on such subsets and try to compensate for the imbalance.</p>
      <p>We tested four initial data sets with different sizes of minority
and majority classes: 2330:4075, 1165:2038, 583:1019, and
292:510, where the first value is the number of examples of
the minority class and the second is the number of examples of the
majority class (other lithotypes). To create subsamples with
different class ratios and study the influence of the initial ratio
on the subsequent augmentation of the sample, the minority class is
reduced by randomly choosing a subset of size m. The
value of m corresponds to a new proportion p relative to
the size of the majority class. Such subsamples are denoted as
p(m). To reduce the influence of the random factor on the
classification results, for each subsample p(m) the examples are
selected 10 times and the results are averaged.</p>
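      <p>The construction of the subsamples p(m) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names proportion and draw_subsamples are hypothetical, the minority examples are represented by placeholder integer identifiers rather than real feature vectors, and p is taken as m divided by the majority class size and rounded to three decimals.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def proportion(m, n_majority):
    """Proportion p of minority examples relative to the majority class size."""
    return m / n_majority


def draw_subsamples(minority, m, n_repeats=10, seed=0):
    """Return n_repeats random subsets of size m, each drawn without replacement."""
    if not 1 <= m <= len(minority):
        raise ValueError("m must be between 1 and the minority class size")
    rng = random.Random(seed)
    return [rng.sample(minority, m) for _ in range(n_repeats)]


# Training set 1864:3260 reduced to m = 50 minority examples.
minority = list(range(1864))  # placeholder identifiers of minority examples
subsets = draw_subsamples(minority, m=50)
print(len(subsets), len(subsets[0]), round(proportion(50, 3260), 3))
# prints: 10 50 0.015
```
]]></preformat>
      <p>A model would then be trained on each of the 10 subsets and the resulting scores averaged.</p>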
      <p>
        While training sets have different levels of imbalance, the
test set is not changed and has the class ratio inherited from
the initial sample. To assess the quality of models, 5-fold
cross-validation is used [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The training data sets consist of
4/5 of the initial data sets and have sizes of minority and
majority classes: 1864:3260, 932:1630, 466:815, 234:408.
Testing is performed 5 times on each of the folds and the
results are averaged.
      </p>
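      <p>The training set sizes quoted above follow from the initial sizes by taking 4/5 of each class and rounding to the nearest integer. The short Python check below reproduces this arithmetic; the function name training_sizes is hypothetical, and the assumption that each class is split in the same 4/5 proportion is made only for this illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def training_sizes(n_minority, n_majority, n_folds=5):
    """Approximate class sizes of the training part in n_folds cross-validation."""
    share = (n_folds - 1) / n_folds  # 4/5 of the data is used for training
    return round(n_minority * share), round(n_majority * share)


# Initial data sets given as (minority, majority) sizes.
initial_sets = [(2330, 4075), (1165, 2038), (583, 1019), (292, 510)]
for n_min, n_maj in initial_sets:
    print((n_min, n_maj), "->", training_sizes(n_min, n_maj))
```
]]></preformat>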
      <p>To balance the training set, we use SMOTE with random
selection of examples. After balancing and training, the
classification accuracy is estimated. We apply a linear
classification algorithm (logistic regression) and
tree-based algorithms (gradient boosting and random forest) to
train the classifier. We employ the F1 score to evaluate the quality
of models. F1 is the harmonic mean of Precision and Recall:
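<disp-formula><tex-math><![CDATA[\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F_1 = \frac{2\,TP}{2\,TP + FP + FN},]]></tex-math></disp-formula>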
where TP (True Positive) and TN (True Negative) are the numbers
of correctly predicted objects of the positive and negative
classes, respectively; FN (False Negative) and FP (False
Positive) are the numbers of objects incorrectly assigned to the
negative and positive classes, respectively. The positive
class is the minority class, corresponding to the target rock, and
the negative class is the majority class, corresponding to other
rock types.</p>
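      <p>The relationship between the counts TP, FP, and FN and the three metrics can be checked with a short Python sketch. The function name precision_recall_f1 and the example counts are hypothetical and serve only to illustrate the definitions; returning zero when a denominator is zero is a convention chosen for this example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts for the positive (minority) class.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
harmonic_mean = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3), round(harmonic_mean, 3))
```
]]></preformat>
      <p>TN does not enter any of the three expressions, which is why the F1 score is not inflated by a large number of correctly classified majority examples.</p>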
    </sec>
    <sec id="sec-4">
      <title>IV. RESULTS</title>
      <p>The quality of models for determining the "silty-clay rock"
lithotype, trained on imbalanced subsamples without
oversampling, increases as the proportion p and the number m
of examples of the minority class grow (Fig. 1).
At the same time, the quality reaches an acceptable level only
in subsamples where the level of imbalance is very small.
Therefore, it becomes necessary to correct the imbalance, as
well as to study the influence of the parameters p and m on the
operation of the classifier. After applying oversampling to
balance classes, the quality of the models improves.</p>
      <p>Fig. 1. The effect of oversampling on classification quality for training set
1864:3260: (a) logistic regression, (b) gradient boosting, (c) random forest.</p>
      <p>Fig. 2 shows plots of the dependence of the F1 score on
the proportion of minority class examples after oversampling
for the two training sets. The plots for training sets 932:1630
and 466:815 are not shown, because they look similar. One
can see the proportion p and the number m in the legend of
plots.</p>
      <p>Comparing Fig. 2(a) and Fig. 2(b), we conclude the
following for the classification of the target lithotype:</p>
      <p>1. The quality of the model depends on the number of
examples m in the minority class before oversampling. The
dependence on p is not significant.</p>
      <p>2. The quality of the model grows with the increasing
number of examples representing the minority class before the
oversampling.</p>
      <p>3. The quality of the model increases when the fraction of
minority class examples increases due to the use of
oversampling.</p>
      <p>4. There is a threshold for the number of examples m at
which the quality of the initial sample can be reached if
oversampling is applied.</p>
      <p>Fig. 3 shows plots of the F1 score dependence on the
proportion of minority class examples after oversampling for
random extractions of examples from the initial sample at p = 0.002
(m = 5), p = 0.005 (m = 15), and p = 0.015 (m = 50). With a
small number of examples of the minority class, the random
factor in choosing these examples has a significant impact on
the accuracy of classification. If the initial training set
1864:3260 is trimmed to an imbalance of p = 0.002 (m = 5) and
the minority class is then oversampled to balance with the
majority class, the average F1 score is 0.62, but the deviation
from the average reaches 0.09. As the number of examples
increases, the average value of the F1 score increases and its
dispersion decreases. This does not hold for every m, but in general
the trend persists. For a subsample with p = 0.015 (m = 50),
the average value of the F1 score after balancing is 0.85 and
the deviation is 0.01.</p>
      <p>To assess the similarity of the subsamples balanced by the
SMOTE method with the initial sample, we used histograms
and cross-plots constructed for the most significant features.</p>
      <p>Fig. 3. Graphs showing the scatter of the F1 score for 10 random versions of
subsamples from the training set 1864:3260 with parameters (a) p = 0.002
(m = 5), (b) p = 0.005 (m = 15), (c) p = 0.015 (m = 50).</p>
      <p>For 10 versions of subsamples with the parameter m = 50,
balanced to an equal ratio of classes, the distribution of
features on histograms and cross-plots is visually similar to
the distribution of the initial sample (Fig. 4 (c), (d)). For
subsamples with the parameter m = 5, balanced to an equal
ratio of classes, the feature distributions may be close to the
distribution of the initial sample, but in most cases, they have
significant differences (Fig. 4 (a), (b)).</p>
      <p>V. IMBALANCE IN THE MULTICLASS CLASSIFICATION PROBLEM</p>
      <p>We consider a multiclass lithology classification and try to
verify whether the approach we applied for the binary classification
can also improve predictive models for the multiclass
problem. As for binary classification, to study the
effect of imbalance, we change the size of the target class until
equality with the largest class is achieved. We use a random
forest classifier and consider a nine-class model. One of these
nine classes, carbonate sandstone, is underrepresented in our
dataset, and we consider it the target minority class.</p>
      <p>Fig. 4. Scatter plots and histograms for (a) - (b) balanced subsample
at m = 5, (c) - (d) balanced subsample at m = 50. Subsamples are balanced
to equal class sizes.</p>
      <p>To establish different levels of imbalance, we randomly
select 10, 30, and 100 examples from the 1249 labeled ones.
Similar to our previous experiments, an increase in the number
of examples selected from the initial sample and an increase in
the minority class fraction after oversampling lead to a higher
F1 score (Fig. 5). The quality of the model is evaluated by
cross-validation.</p>
      <p>The cases A, B, C, and D correspond to the following:</p>
      <p>A: The predictive model is trained on the initial sample,
containing 1249 target examples;</p>
      <p>B: The predictive model is trained on the initial sample
oversampled to equality with the majority class and containing
4884 target examples;</p>
      <p>C: The predictive model is trained on a sample of 100
randomly selected target examples;</p>
      <p>D: The predictive model is trained on a sample of 100
randomly selected target examples, oversampled to equality
with the majority class.</p>
      <p>For cases A, B, C, and D, the red curves show the Gaussian-smoothed
probability (confidence level) that the core specimens
belong to the target class.</p>
      <p>Fig. 5 also contains the labels A, B, C, and D, which
correspond to these cases.</p>
      <p>Thus, applying the oversampling technique can
improve the quality of the predictive model for the multiclass
problem, both in terms of F1 score and in terms of confidence
of the prediction.</p>
    </sec>
    <sec id="sec-5">
      <title>VI. CONCLUSION</title>
      <p>The paper considers the influence of data imbalance on the
quality of lithotype classification from whole core
photographs. It is shown that the quality of predictive models
trained on imbalanced data may depend on the degree of
imbalance, and for some samples the imbalance can
dramatically affect the quality of classification.</p>
      <p>The level of imbalance at which it is possible to obtain a
predictive model that is close in quality to the model trained
on a balanced sample is not constant and depends on the size
of the data sample, as well as on the quality of the data sample.
Quality here refers to how fully the sample reflects the
characteristics of the target lithotype.</p>
      <p>Balancing the data by oversampling with the
SMOTE method can increase the quality of lithology
classification both for the binary problem (detection of silty-clay
rocks) and for the multiclass problem.</p>
      <p>The quality of predictive models, close to the quality of the
model built on the entire balanced data set, was achieved for
those imbalanced samples which let us restore the distribution
of the entire data set with the least influence of the random
factor.</p>
      <p>There is a minimum acceptable number of specimens,
weakly depending on the size of the entire sample, at which
we can claim reproducible quality of model training (with
an acceptable variance of the quality criterion). As the number
of specimens available for training decreases, the variance of
the model quality criterion increases.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.E.</given-names>
            <surname>Baraboshkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.S.</given-names>
            <surname>Ismailova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.M.</given-names>
            <surname>Orlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.A.</given-names>
            <surname>Zhukovskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.A.</given-names>
            <surname>Kalmykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.V.</given-names>
            <surname>Khotylev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.Y.</given-names>
            <surname>Baraboshkin</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.A.</given-names>
            <surname>Koroteev</surname>
          </string-name>
          , “
          <article-title>Deep convolutions for in-depth automated rock typing</article-title>
          ,”
          <source>Computers &amp; Geosciences</source>
          , vol.
          <volume>135</volume>
          ,
          <issue>104330</issue>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Curtis</surname>
          </string-name>
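          and
          <string-name>
            <given-names>A.</given-names>
            <surname>MacArthur</surname>
          </string-name>
          , “
          <article-title>Automated lithology extraction from core photographs</article-title>
          ,”
          <source>First Break</source>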
          , vol.
          <volume>29</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>109</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.R.</given-names>
            <surname>Vorobeva</surname>
          </string-name>
          , “
          <article-title>Approach to the recovery of geomagnetic data by comparing daily fragments of a time series with equal geomagnetic activity</article-title>
          ,”
          <source>Computer Optics</source>
          , vol.
          <volume>43</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>1053</fpage>
          -
          <lpage>1063</lpage>
          ,
          <year>2019</year>
          . DOI:
          <pub-id pub-id-type="doi">10.18287/2412-6179-2019-43-6-1053-1063</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.I.</given-names>
            <surname>Shakhuro</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.S.</given-names>
            <surname>Konushin</surname>
          </string-name>
          , “
          <article-title>Image synthesis with neural networks for traffic sign classification</article-title>
          ,”
          <source>Computer Optics</source>
          , vol.
          <volume>42</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>112</lpage>
          ,
          <year>2018</year>
          . DOI:
          <pub-id pub-id-type="doi">10.18287/2412-6179-2018-42-1-105-112</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.F.</given-names>
            <surname>Sohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.I.</given-names>
            <surname>Jabiullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.S.M.M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.M.H.</given-names>
            <surname>Mahmud</surname>
          </string-name>
          , “
          <article-title>Assessing the Effect of Imbalanced Learning on Cross-project Software Defect Prediction</article-title>
          ,”
          <source>10th International Conference on Computing, Communication and Networking Technologies, ICCCNT</source>
          ,
          <volume>8944622</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Huda</surname>
          </string-name>
          , K. Liu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdelrazek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ibrahim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Alyahya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Al-Dossari</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          , “
          <article-title>An ensemble oversampling model for class imbalance problem in software defect prediction</article-title>
          ,”
          <source>IEEE Access</source>
          , vol.
          <volume>6</volume>
          , pp.
          <fpage>24184</fpage>
          -
          <lpage>24195</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shimizu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Asako</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Morinaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hamada</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Kuroda</surname>
          </string-name>
          , “
          <article-title>Balanced mini-batch training for imbalanced image data classification with neural network</article-title>
          ,”
          <source>1st IEEE International Conference on Artificial Intelligence for Industries, AI4I</source>
          ,
          <volume>8665709</volume>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.B.</given-names>
            <surname>Paklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.V.</given-names>
            <surname>Ulanov</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.V.</given-names>
            <surname>Tsar'kov</surname>
          </string-name>
          , “
          <article-title>The construction of classifiers on imbalanced samples by the example of credit scoring</article-title>
          ,”
          <source>Artificial Intelligence</source>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>528</fpage>
          -
          <lpage>534</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.K.</given-names>
            <surname>Wong</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.S.</given-names>
            <surname>Kamel</surname>
          </string-name>
          , “
          <article-title>Classification of imbalanced data: A review</article-title>
          ,”
          <source>International Journal of Pattern Recognition and Artificial Intelligence</source>
          , vol.
          <volume>23</volume>
          , no.
          <issue>04</issue>
          , pp.
          <fpage>687</fpage>
          -
          <lpage>719</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Seiffert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.M.</given-names>
            <surname>Khoshgoftaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Van</given-names>
            <surname>Hulse</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Napolitano</surname>
          </string-name>
          , “
          <article-title>Building Useful Models from Imbalanced Data with Sampling and Boosting</article-title>
          ,” FLAIRS conference, pp.
          <fpage>306</fpage>
          -
          <lpage>311</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.M.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McCarthy</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Zabar</surname>
          </string-name>
          , “
          <article-title>Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs?</article-title>
          ”
          <source>International Conference on Data Mining</source>
          , vol.
          <volume>7</volume>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>41</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.W.</given-names>
            <surname>Bowyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.O.</given-names>
            <surname>Hall</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.P.</given-names>
            <surname>Kegelmeyer</surname>
          </string-name>
          , “
          <article-title>SMOTE: synthetic minority over-sampling technique</article-title>
          ,”
          <source>Journal of Artificial Intelligence Research</source>
          , vol.
          <volume>16</volume>
          , pp.
          <fpage>321</fpage>
          -
          <lpage>357</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.A.</given-names>
            <surname>Garcia</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          , “
          <article-title>ADASYN: Adaptive synthetic sampling approach for imbalanced learning</article-title>
          ,”
          <source>IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence)</source>
          , pp.
          <fpage>1322</fpage>
          -
          <lpage>1328</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Raschka</surname>
          </string-name>
          , “
          <article-title>Python machine learning</article-title>
          ,”
          <source>Packt Publishing Ltd</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>