The effect of the imbalanced training dataset on the
quality of classification of lithotypes via whole core
                        photos
          Daria Makienko                                          Ilya Seleznev                                      Ilia Safonov
   Schlumberger Moscow Research                          Schlumberger Moscow Research                      Schlumberger Moscow Research
          Moscow, Russia                                         Moscow, Russia                                   Moscow, Russia
        dmakienko@slb.com                                      iseleznev@slb.com                                 isafonov@slb.com

    Abstract—Nowadays machine learning methods play an                           In the classification problem, it is preferable that the
important role in many industries. However, the effectiveness of             training examples are evenly distributed among the classes.
the predictive models depends on the quality of data sets used to            Some classifiers take into account the errors for different
train the model. In practice, the imbalanced datasets are quite              classes with same weights and in case of imbalance they
common. For example, in the problems of lithotypes                           become more focused on the overrepresented classes. The
classification via whole core photos, some lithotypes often                  reason for such behavior of classifiers is that identifying the
predominate the training dataset while some of the other                     characteristics of the majority class contributes stronger to the
lithotypes can be underrepresented. The significant imbalance                target value (quality functional or error function) than
in the dataset can affect the quality of the classification. In this
                                                                             identifying the characteristics of the minority class. However,
case it is difficult to obtain good generalization for poorly
represented classes. First, some characteristics of a given minor
                                                                             the imbalanced classification data sets are often observed in
lithotype may be absent. Second, some features of a minor class              applied problems [4-11]. Data sets for the lithological
can be ignored due to imbalance. In this paper, we analyze the               description of core are no exception. The imbalance of classes
oversampling of a minor class as one of the possible options to              is associated with different rock occurrence. The following
obtain the balanced dataset within the framework of the                      methods can be used to train a model on imbalanced data [9-
problem of speeding-up the geological core description. We                   11]:
considered examples with different dataset sizes and imbalance
                                                                                 1. Balancing, that is changing the ratio of classes in the
characteristics to study the effect of applying the oversampling
                                                                             sample by increasing the number of instances of the minority
approach on the quality of predictive models.
                                                                             class (oversampling) or reducing the number of instances of
     Keywords—imbalanced dataset, oversampling, classification               the majority class (undersampling).
of lithotypes, geological core description                                       2. Making adjustments to the learning algorithm. For
                                                                             example, setting different penalties for classes in the support
                            I. INTRODUCTION
                                                                             vector machine, changing the probability threshold for
    The lithological description of whole core specimens is a                classifying an example as a class in trees.
time-consuming process. Using whole core photos to classify
rocks and mark depth intervals corresponding to these rock                       3. Establishing different error rates for classes. The cost
classes can significantly reduce the time required for such                  of errors can be taken into account both when changing the
description. Modern methods for automating the description                   ratio of classes in the sample, and when making adjustments
of rocks by core photographs are based on machine learning.                  to the learning algorithm.
The most informative features for machine learning are the                      4. The use of boosting. Several classifiers that correct
color characteristics of core image fragments [1,2]. In this                 each other's errors can improve the quality of model
paper we build predictive machine learning-based models                      predictions based on examples of a minority class.
using color characteristics of whole core photos. We consider
an important factor that largely determines the quality of rock                  For lithological description based on full-size core
classification, namely, the influence of data imbalance, on                  photographs, we investigate the oversampling. This approach
which the predictive model is trained, and one of the                        balances samples by increasing the number of examples of the
approaches to compensate the imbalance.                                      minority class. Some of the existing oversampling techniques
                                                                             are as follows:
    The aim of our study is to determine the parameters of data
samples that can significantly affect the quality of predictive                  1. Random oversampling: Copies of randomly selected
models, as well as to assess the degree of such influence.                   elements of the minority class are created until the required
Analyzing characteristics of sample imbalances in a wide                     ratio is reached.
range of values, we want to understand the limitations of the                    2. SMOTE (Synthetic Minority Oversampling
dataset parameters at which such imbalance can be corrected                  Technique) [12]: New examples are generated by
to improve the quality of predictive models.                                 interpolating the examples of the minority class — some i-th
           II. OVERVIEW OF TECHNIQUES FOR PROCESSING                         example and one of its k-nearest neighbors. There are several
                                                                             options for choosing the i-th example. One can make a random
                                IMBALANCED DATA
                                                                             selection (Regular SMOTE), select an example depending on
    Real datasets often lack any data due to the difficulty of               the classes to which the surrounding examples (Borderline
obtaining them. Different methods are used to compensate for                 SMOTE) belong, depending on the constructed support
missing data depending on the data type and the type of task                 vectors or on the constructed clusters.
[3-11]. We consider the case of imbalance of classes in the
classification problem, when the data are presented in the form                 3. ADASYN (Adaptive Synthetic) [13]: It works
of numerical features.                                                       similarly to the SMOTE method but selects the i-th example


Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Data Science

of a minority class depending on the coefficient ri, which                                               IV. RESULTS
shows the proportion of examples of other classes around the                  The quality of models for determining the "silty-clay rock"
i-th example. The greater the coefficient ri, the more examples           lithotype, trained on imbalanced subsamples without using
are generated in the vicinity of the i-th example.                        oversampling, increases with the growth of the proportion p
                 III. SETTING UP AN EXPERIMENT                            and their number m of examples of the minority class (Fig. 1).
                                                                          At the same time, the quality reaches an acceptable level only
    There are a lot of lithotypes. In general, lithology                  in subsamples where the level of imbalance is very small.
classification is a multiclass problem. For our study we                  Therefore, it becomes necessary to correct the imbalance, as
simplify the problem at this stage, considering a binary solver           well as to study the influence of parameters p and m on the
that is the one-vs-rest classifier. To study the influence of the         operation of the classifier. After applying oversampling to
imbalance on the quality of classification the depth intervals            balance classes, the quality of the models improves.
were selected in the manner to obtain balanced and variously
represented target lithotype data, reflecting the typical color
features of this lithotype as well. We denote the data obtained
after processing all of the images the initial sample. To study
the effect of imbalance on the quality of classification, we
form different size subsamples of the initial sample, which act
as minority class with different imbalance. We train predictive
models on such subsets and try to compensate the imbalance.
    We tested 4 initial data sets with different sizes of minority
and majority classes: 2330:4075, 1165:2038, 583:1019,
292:510, where the first value is the number of examples of
the minority, the second is the number of examples of the
majority class (other lithotypes). To create subsamples with                       a)
different class ratios and study the influence of the initial ratios
on the further complement of the sample, the minority class is
reduced by randomly choosing a subset of it of size m. The
value of m corresponds to some new proportion p relative to
the size of the majority class. Such subsamples are denoted as
p(m). To reduce the influence of a random factor on the
classification results, for each subsample p(m), examples are
selected 10 times and the results are averaged.
    While training sets have different levels of imbalance, the
test set is not changed and has the class ratio inherited from
the initial sample. To assess the quality of models, a 5 folds
cross-validation is used [14]. The training data sets consist of
                                                                                   b)
4/5 of the initial data sets and have sizes of minority and
majority classes: 1864:3260, 932:1630, 466:815, 234:408.
Testing is performed 5 times on each of the folds and the
results are averaged.
    To balance the training set, we use SMOTE with random
selection of examples. After balancing and training, the
classification accuracy is estimated. We apply the linear
classification algorithm (logistic regression) and the tree-
based algorithms (gradient boosting and random forest) to
train the classifier. We employ F1 score to evaluate the quality
of models. F1 is the harmonic mean of Precision and Recall:
                          TP                                TP                     c)
         R e c a ll              ; P r e c is io n              ;
                                                                           Fig. 1. The effect of oversampling on classification quality for training set
                        TP  FN                         TP  FP
                                                                           1864:3260: (a) logistic regression, (b) gradient boosting, (c) random forest
                                      2T P
                         F1                            ,                     Fig. 2 shows plots of the dependence of the F1 score on
                                2T P  F P  F N
                                                                          the proportion of minority class examples after oversampling
where TP (True Positive), TN (True Negative) are the number               for the two training sets. The plots for training sets 932:1630
of correctly predicted objects of the positive and negative               and 466:815 are not shown, because they look similar. One
classes correspondingly; FN (False Negative), FP (False                   can see the proportion p and the number m in the legend of
Positive) are the number of objects incorrectly assigned to               plots.
negative and positive classes correspondingly. The positive
class is the minority class corresponding to the target rock, and             By comparing of Fig. 2(a) and Fig. 2(b), for the
the negative class is the majority class, corresponding to other          classification of the target lithotype, we conclude the
rock types.                                                               following:


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                              133
Data Science


         a)                                                                         a)


                                                                                   b)
         b)
Fig. 2. The effect of proportion of minority class examples after
oversampling on classification quality for training sets (a) 233:408, (b)
1864:3260.

   1. The quality of the model depends on the number of
examples m in the minority class before oversampling. The
dependence on p is not significant.
   2. The quality of the model grows with the increasing
number of examples representing the minority class before the
oversampling.
   3. The quality of the model increases when the fraction of
minority class examples increases due to the use of                                c)
oversampling.
                                                                            Fig. 3. Graphs showing the scatter of the F1 score for 10 random versions of
   4. There is the threshold for the number of examples m, at               subsamples from the training set 1864:3260 with parameters (a) p = 0.002
which the quality of the initial sample can be reached if                   (m = 5), (b) p = 0.005 (m = 15), (c) p = 0.015 (m = 50).
oversampling is applied.                                                        For 10 versions of subsamples with the parameter m = 50,
    Fig. 3 shows plots of the F1 score dependence on the                    balanced to an equal ratio of classes, the distribution of
proportion of minority class samples after oversampling for                 features on histograms and cross-plots is visually similar to
random examples extracts from the initial sample at p = 0.002               the distribution of the initial sample (Fig. 4 (c), (d)). For
(m = 5), p = 0.005 (m = 15), and p = 0.015 (m = 50). With a                 subsamples with the parameter m = 5, balanced to an equal
small number of examples of the minority class, the random                  ratio of classes, the feature distributions may be close to the
factor in choosing these examples has a significant impact on               distribution of the initial sample, but in most cases, they have
the accuracy of classification. If the initial training set                 significant differences (Fig. 4 (a), (b)).
1864:3260 is trimmed to an imbalance p = 0.002 (m = 5), then                     V. IMBALANCE IN THE MULTICLASS CLASSIFICATION
when the minority class is oversampled to balance with the                                                  PROBLEM
majority class, the average F1 score is 0.62, but the deviation
from the average reaches 0.09. As the number of examples                        We consider a multiclass lithology classification and try to
increases, the average value of the F1 score increases, and its             verify if the approach we applied for the binary classification
dispersion decreases. This is not true for all m, but in general            can also improve predictive models for the multiclass
this trend persists. For a subsample with p = 0.015 (m = 50),               problem. As well as for binary classification, to study the
the average value of the F1 score after balancing is 0.85 and               effect of imbalance, we change the size of the target class until
the deviation is 0.01.                                                      the equality with the largest class is achieved. We use random
                                                                            forest classifier and consider nine class model. One of these
   To assess the similarity of the subsamples balanced by the               nine classes - carbonate sandstone, is underrepresented in our
SMOTE method with the initial sample, we used histograms                    dataset and we consider it as a target minority class.
and cross-plots constructed for the most significant features.


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                             134
Data Science

                                                                              the minority class fraction after oversampling led to growing
                                                                              the F1 score (Fig. 5). The quality of the model is evaluated by
                                                                              cross-validation.
                                                                                  Fig. 6 (a, b) illustrates the difference of prediction
                                                                              confidence of the two predictive models – before and after
                                                                              oversampling. The first 3 columns show day light (DL),
                                                                              ultraviolet (UV), and gamma corrected ultraviolet (UV
                                                                              gamma corrected) photographs. Fig. 6 (a) contains depth
                                                                              intervals which were involved in training, and Fig. 6 (b)
                                                                              contains depth intervals which were not be involved in the
                                                                              training process.


            a)


                                                                              Fig. 5. Assessment of the classification of the target lithotype relative to the
                                                                              rest.
           b)


                                                                                a)

            c)


                                                                                b)
                                                                              Fig. 6. The predicted probabilities of the presence of the target lithotype in
                                                                              the corresponding core sections (a) areas involved in training, (b) areas that
            d)                                                                were not involved.
Fig. 4. Scattering diagrams and histograms for (a) - (b) balanced subsample
at m = 5, (c) - (d) balanced subsample at m = 50. Subsamples are balanced            The cases A, B, C, and D correspond to the following:
to equal class sizes.
                                                                                 A: The predictive model is trained on the initial sample,
    To establish different levels of imbalance, we randomly                   containing 1249 target examples;
select 10, 30, and 100 examples from 1249 labeled ones.
Similar to our previous experiments, an increase in the number                   B: The predictive model is trained on the initial sample
of examples selected from the initial sample and increase of                  oversampled to equality with the majority class and contained
                                                                              4884 target examples;


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                                   135
Data Science

   C: The predictive model is trained on a sample of 100                  of specimens available for training decreases, the variance of
randomly selected target examples;                                        the model quality criterion increases.
   D: The predictive model is trained on a sample of 100                                                  REFERENCES
randomly selected target examples, oversampled to equality                [1]  E.E. Baraboshkin, L.S. Ismailova, D.M. Orlov, E.A. Zhukovskaya,
with the majority class.                                                       G.A. Kalmykov, O.V. Khotylev, E.Y. Baraboshkin and D.A. Koroteev,
                                                                               “Deep convolutions for in-depth automated rock typing,” Computers
    Cases A, B, C, and D contain in red the gaussian smoothed                  & Geosciences, vol. 135, 104330, 2020.
curves of probability (confidence level) for the core specimens           [2] A. Thomas, M. Rider, A. Curtis and A. MacArthur, “Automated
to belong to the target class.                                                 lithology extraction from core photographs,” First Break, vol. 29, no.
                                                                               6, pp. 103-109, 2011.
    Fig. 5 as well contains labels A, B, C, and D which are               [3] G.R. Vorobeva, “Approach to the recovery of geomagnetic data by
related with corresponding cases                                               comparing daily fragments of a time series with equal geomagnetic
                                                                               activity,” Computer Optics, vol. 43, no. 6, pp. 1053-1063, 2019. DOI:
    Thus, it is seen that applying the oversampling technic can                10.18287/2412-6179-2019-43-6-1053-1063.
improve the quality of the predictive model for the multiclass            [4] V.I. Shakhuro and A.S. Konushin, “Image synthesis with neural
problem, both in terms of F1 score and in terms of confidence                  networks for traffic sign classification” Computer Optics, vol. 42, no.
of the prediction.                                                             1, pp. 105-112, 2018. DOI: 10.18287/2412-6179-2018-42-1-105-112.
                                                                          [5] M.F. Sohan, M.I. Jabiullah, S.S.M.M. Rahman and S.M.H. Mahmud,
                         VI. CONCLUSION                                        “Assessing the Effect of Imbalanced Learning on Cross-project
                                                                               Software Defect Prediction,” 10th International Conference on
    The paper considers the influence of data imbalance on the                 Computing, Communication and Networking Technologies, ICCCNT,
quality of lithotypes classification by the whole core                         8944622, pp. 1-6, 2019.
photographs. It is shown that the quality of predictive models            [6] S. Huda, K. Liu, M. Abdelrazek, A. Ibrahim, S. Alyahya, H. Al-Dossari
trained on imbalanced data may depend on the degree of                         and S. Ahmad, “An ensemble oversampling model for class imbalance
imbalance and for some samples the imbalance can                               problem in software defect prediction,” IEEE Access, vol. 6, pp.
                                                                               24184-24195, 2018.
dramatically affect the quality of classification.
                                                                          [7] R. Shimizu, K. Asako, H. Ojima, S. Morinaga, M. Hamada and T.
     The level of imbalance at which it is possible to obtain a                Kuroda, “Balanced mini-batch training for imbalanced image data
predictive model that is close in quality to the model trained                 classification with neural network,” 1st IEEE International Conference
                                                                               on Artificial Intelligence for Industries, vol. AI4I, 8665709, pp. 27-30,
on a balanced sample is not constant and depends on the size                   2018.
of the data sample, as well as on the quality of the data sample.         [8] N.B. Paklin, S.V. Ulanov and S.V. Tsar'kov, “The construction of
Quality here refers to how fully the sample reflect the                        classifiers on imbalanced samples by the example of credit scoring,”
characteristics of the target lithotype.                                       Artificial Intelligence, no. 3, pp. 528-534, 2010.
                                                                          [9] Y. Sun, A.K. Wong and M.S. Kamel, “Classification of imbalanced
    Applying the oversampling technic of data balancing by                     data: A review,” International Journal of Pattern Recognition and
SMOTE method can increase the quality of the lithology                         Artificial Intelligence, vol. 23, no. 04, pp. 687-719, 2009.
classification for binary problem (detection of silty-clay                [10] C. Seiffert, T.M. Khoshgoftaar, J. Van Hulse and A. Napolitano,
rocks), and for the multiclass problem.                                        “Building Useful Models from Imbalanced Data with Sampling and
                                                                               Boosting,” FLAIRS conference, pp. 306-311, 2008.
    The quality of predictive models, close to the quality of the         [11] G.M. Weiss, K. McCarthy and B. Zabar, “Cost-sensitive learning vs.
model built on the entire balanced data set, was achieved for                  sampling: Which is best for handling unbalanced classes with unequal
those imbalanced samples which let us restore the distribution                 error costs?” International Conference on Data Mining, vol. 7, pp. 35-
of the entire data set with the least influence of the random                  41, 2007.
factor.                                                                   [12] N.V. Chawla, K.W. Bowyer, L.O. Hall and W.P. Kegelmeyer,
                                                                               “SMOTE: synthetic minority over-sampling technique,” Journal of
    There is a minimum acceptable number of specimens,                         artificial intelligence research, vol. 16, pp. 321-357, 2002.
weakly depending on the size of the entire sample, at which               [13] H. He, Y. Bai, E.A. Garcia and S. Li, “ADASYN: Adaptive synthetic
we can claim the reproducible quality of model training (with                  sampling approach for imbalanced learning,” IEEE International Joint
an acceptable variance of the quality criterion). As the number                Conference on Neural Networks (IEEE World Congress on
                                                                               Computational Intelligence), pp. 1322-1328, 2008.
                                                                          [14] S. Raschka, “Python machine learning,” Packt Publishing Ltd, 2015.


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                             136