=Paper=
{{Paper
|id=Vol-2718/paper07
|storemode=property
|title=Classification of a Small Imbalanced Dataset of Vine Leaves Images using Deep Learning Techniques
|pdfUrl=https://ceur-ws.org/Vol-2718/paper07.pdf
|volume=Vol-2718
|authors=Amjad Balawi,Abdullah Al Zoabi,José Luis Seixas Junior,Tomáš Horváth
|dblpUrl=https://dblp.org/rec/conf/itat/BalawiZJH20
}}
==Classification of a Small Imbalanced Dataset of Vine Leaves Images using Deep Learning Techniques==
Amjad Balawi, Abdullah Al Zoabi, José Luis Seixas Junior, and Tomáš Horváth
Department of Data Science and Engineering, Faculty of Informatics
ELTE – Eötvös Loránd University
3in Research Group, Martonvásár, Hungary
http://t-labs.elte.hu/
amjad.balawi20@gmail.com, abdullah.al.zoabi@outlook.com, {tomas.horvath,jlseixasjr}@inf.elte.hu
Abstract: The Convolutional Neural Network (CNN) has become one of the most popular techniques in image classification. CNN models are usually trained on large amounts of data; in this paper, we discuss their use under data shortage and class imbalance. The study is conducted on a small dataset of vine leaf images, in a classification task with five classes, using two different approaches. In the first approach, a simple CNN model is used, while in the second, the Visual Geometry Group (VGG) model with transfer learning is used. It is shown that combining deep learning techniques such as transfer learning, stratified sampling and data augmentation with state-of-the-art CNN models such as VGG gives relatively good model performance, with up to 87% accuracy.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Deep Learning (DL) was inspired by the human brain and tries to simulate how humans learn. In DL, networks of neurons organized in multiple layers analyze large amounts of data to find the underlying structure or pattern. The main idea is to do this automatically, without explicit programming: the computer learns how to classify text, sounds and images. In Computer Vision (CV) tasks, the computer is trained on a huge number of images by encoding their pixels into an internal representation, so that the classifier can find the patterns in the input images [1].

DL outperforms other solutions in multiple domains, including speech, vision, video and natural language processing, and it reduces the need for the feature engineering stage, one of the most time-consuming tasks in machine learning [2]. Another reason DL has become so prominent in the last few years is the huge improvement in the computational power that can be utilized to accomplish such tasks. However, one common problem is poor performance on unseen data (the test set) due to over-fitting; usually, a large dataset is required to increase model performance. Another problem is that it is hard to choose the right model for a given problem.

A Convolutional Neural Network (CNN, or ConvNet) is a kind of Neural Network that is especially popular in image classification [3]. It has fewer connections, and therefore fewer model parameters, which makes it less sensitive to over-fitting. A second reason CNNs are powerful in computer vision tasks is parameter sharing: if a filter is useful on one part of the image, it can be useful on another. Furthermore, CNNs preserve the spatial information of the image, which makes the classifier more robust against affine transformations such as translation and rotation.

In many cases, image data scarcity can be dealt with by frequent acquisition, but there are still situations in which acquisition is not easy or cannot be frequent, as in agriculture, where a plant cannot be grown in an hour or a day. There are also cases where synthetic images are far from real-world images, so a model trained in such a setting would give good controlled results but would not solve real problems.

The goal of this article is to find techniques, procedures or functions that address the problems of using CNNs on small and imbalanced datasets. To this end, two different CNN structures are implemented, combined with different DL techniques and procedures such as data augmentation, transfer learning, stratified sampling and model selection based on validation accuracy, also showing the transition from a simple CNN model to a state-of-the-art model like VGG.

This paper is organized as follows: Section 2 presents the techniques and definitions used in the proposals of this work, followed by Section 3, which describes the steps for constructing the models. Section 4 shows the results obtained, and Section 5 the conclusions that can be inferred.

2 Proposed Approaches

There are many Machine Learning (ML) techniques that could be used for general classification problems, such as K-Nearest Neighbors (KNN), Logistic Regression, Support Vector Machines (SVM) and Artificial Neural Networks (ANN), but for image classification the most popular technique is the Convolutional Neural Network. CNNs are a class of ANNs that has become dominant in various CV tasks [4] due to its ability to extract relevant features from raw data [5].
2.1 CNN and VGG architectures

In general, the CNN architecture is like that of an ordinary Neural Network, but it is stronger and deeper because it preserves the spatial information of images, which mitigates the problem of affine transformations. It also makes the classifier more robust by adding a stack of convolution layers just before the dense layers, and it reduces the number of trained parameters, which speeds up the learning process. A CNN architecture includes several building blocks, such as convolution layers, pooling layers, and fully connected layers. A typical architecture consists of repetitions of a stack of several convolution layers and a pooling layer, followed by one or more fully connected layers [4].

Figure 1: Overview of the CNN architecture.

Figure 1 shows a general overview of the CNN architecture. Convolution layers take the raw image as input, perform convolutions using trainable sliding windows of different sizes, typically called kernels, and produce a vector that serves as input for the dense layers. Each kernel has its own parameters, which are trained just like the dense-layer parameters; the output of a convolution layer goes as input to the next layer, which looks for a higher level of detail, and so on. The pooling layers come after a stack of one or more convolution layers; the purpose of pooling is to reduce the input size and to cope with small translations. There are multiple types of pooling, such as Average, Min and Max pooling.

The Visual Geometry Group (VGG) network was introduced by Simonyan and Zisserman [6] and is, in general, characterized by its simplicity, since it only uses 3 × 3 convolution layers stacked on top of each other with increasing depth. To reduce the volume size or resolution, max pooling is used in this network. After the convolution layers, there are two dense layers with 4,096 neurons each, followed by a softmax classifier, which is a generalization of logistic regression to multiclass probability distributions. There are two versions of VGG, 16 and 19, referring to the number of weight layers in the network.

Simonyan and Zisserman found the convergence of VGG16 and VGG19 in the deeper configurations quite challenging, so they first trained smaller versions of the model, such as the ones shown in Table 1. The main drawbacks of the VGG network are that it is slow to train and that its weights are quite large: the depth and the number of fully connected neurons make it require a large amount of memory, which makes training a tedious task. However, in this paper, we suggest methods to overcome this issue and speed up the training process.

Table 1: VGG architecture (each column is read top to bottom).

11 weight layers          | 16 weight layers
--------------------------+----------------------------------
Input (224 × 224 RGB image)
Conv3-64                  | Conv3-64, Conv3-64
Max pooling
Conv3-128                 | Conv3-128, Conv3-128
Max pooling
Conv3-256, Conv3-256      | Conv3-256, Conv3-256, Conv1-256
Max pooling
Conv3-512, Conv3-512      | Conv3-512, Conv3-512, Conv1-512
Max pooling
Conv3-512, Conv3-512      | Conv3-512, Conv3-512, Conv1-512
Max pooling
FC-4096
FC-4096
FC-1000
SoftMax layer

2.2 Stratified Sampling

Stratified sampling is a probability sampling technique that takes group sizes into account during the sampling process. The elements of the target population are divided into distinct groups, or “strata”, such that within each stratum the elements have similar characteristics [7]. This technique is widely used in ML, especially when the data suffers from class imbalance [8, 9, 10, 11]. It is implemented in the scikit-learn library, a free ML library for Python; we used it while splitting the data into training, validation and test sets, by passing the target variable from which the sample is drawn to the stratify attribute of the train_test_split function.
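As an illustration, here is a minimal sketch of such a stratified split; the array names and the 80/20 ratio are assumptions for illustration (the actual 80%-10%-10% split used in our experiments is described in Section 3.1).

```python
from sklearn.model_selection import train_test_split

# X: image array, y: class labels. Passing y to `stratify` makes
# each subset keep the class proportions of the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```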
2.3 Data Augmentation

DL models, including CNNs, are usually trained on a large amount of data to reach a reasonable performance [12]; in case of data shortage, as in this paper, these models tend to over-fit the training data and lose the ability to generalize, which leads to bad performance on the test set. After the cleaning stage, our dataset contains around 1600 images; 80% of them were used for training, while the remaining 20% was divided equally between the testing and validation sets. This amount of data may not be enough to train a deep neural network to a good accuracy, so, in order to increase accuracy and generalization and to prevent over-fitting, a data augmentation stage was added to the architecture.

Data augmentation means creating more training images from the existing ones by applying simple effects and affine transformations such as shifting, flipping, rotating and zooming. This augmentation increases the number of training images and leads to better generalization and a smoother training curve; it also provides information on small deformations that images may contain due to the acquisition process [13]. Figure 2 shows the result of applying data augmentation to the first image, resized to 256 × 256, which produced the second and third images by rotation and flipping.

Figure 2: Example of Data Augmentation after Resizing the Original Image to 256 × 256.

As can be seen, some important shapes or features for classification that would have been discarded if the acquisition had been made only with the leaf upright now also become part of the training set.
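A minimal sketch of how such augmentation can be expressed with the Keras ImageDataGenerator used in this work; the specific ranges below are illustrative assumptions, not the exact configuration of our experiments.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random rotations, shifts, zoom and flips applied on the fly
# to each training batch.
datagen = ImageDataGenerator(
    rotation_range=30,        # rotate by up to 30 degrees
    width_shift_range=0.1,    # horizontal shift up to 10%
    height_shift_range=0.1,   # vertical shift up to 10%
    zoom_range=0.2,
    horizontal_flip=True)

# x_train: array of shape (n, 256, 256, 3); flow() yields an
# endless stream of augmented batches for model.fit(...).
batches = datagen.flow(x_train, y_train, batch_size=32)
```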
2.4 Transfer Learning

Transfer learning is widely used in machine learning when there is not enough data for model training. The main idea of this technique is to take a model pretrained on a similar problem and apply it to the new problem [14]. In most cases, the last few layers are refined and a simple dense or linear model is added on top.

The ImageNet dataset was used in this paper. It is a large visual dataset designed for object recognition tasks which contains more than 14 million images; at least one million of them have been hand-annotated to indicate which objects are pictured, and bounding boxes are also provided [15, 16]. ImageNet contains more than 20 thousand categories, with typical categories, such as “balloon” or “strawberry”, consisting of several hundred images [17].
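The general recipe can be sketched as follows with the Keras VGG16 application; the size of the intermediate dense layer and the number of frozen layers are placeholder assumptions, and the exact head used in our experiments is described in Section 3.3.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Pretrained convolutional base, without ImageNet's 1000-class top.
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))

# Freeze everything except the last few layers, which are refined.
for layer in base.layers[:-4]:
    layer.trainable = False

# A simple dense classifier added on top of the pretrained base.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),    # placeholder size
    layers.Dense(5, activation='softmax')])  # our five classes
```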
3 Research Methods

All strategies were implemented on the Google Colab cloud service using TensorFlow 2.0 on GPU and the Keras API abstraction framework. TensorFlow is one of the well-known libraries commonly used for image classification in DL; it is an end-to-end open-source ML platform developed by Google in 2015 for numerical processing and computation. Keras is an open-source neural-network library written in Python whose main purpose is to reduce code complexity; it offers a simple and efficient API able to run on top of TensorFlow, Theano and other DL frameworks.

3.1 Dataset creation

In this study, images were collected by our department from the fields of Hungary in the summer of 2019. The study has an industrial background in wine production, and the purpose is to predict the type of wine produced by each vine. Around 2200 images were collected by different people with different devices, which produced images of different sizes, formats and backgrounds, so a filtering and preparation stage was needed. The dataset is divided into five classes, each named in Hungarian after the wine produced from the vine: “Cabernet Franc”, “Kékfrankos”, “Sárgamuskotály”, “Szürkebarát”, and “Tramini”. Figure 3 shows eight random samples from the dataset with their original sizes.

Figure 3: Random samples from the Dataset with their Original Sizes.

The two main problems faced and discussed in this study are data shortage and class imbalance; both can be seen in the histogram presented in Figure 4, which shows how many images the dataset contains for each class.

Figure 4: Histogram of the Raw Dataset.

Since the data was collected by non-experts and this is the first time it is being used, the first step was to clean the dataset by removing noisy images, such as the one shown in Figure 5, so that they would not affect the training process on a small dataset. Figure 6 shows the distribution of the cleaned dataset.

Figure 5: Example of a Noisy Image.

Figure 6: Histogram of the cleaned Dataset.

Then all the different image formats were unified into a common format (PNG), selected to keep as much information as possible in the images, since it uses a lossless compression algorithm. After that, the images were resized into two resolutions, 224 × 224 and 256 × 256 pixels, which are the sizes preferred in practice by different CNN architectures such as VGG16 and ResNet34. In order to speed up the training process, the raw images were converted into NumPy arrays, a vectorized representation. Figure 7 shows an image sample from the cleaned dataset.

Figure 7: Example from the Prepared Dataset Resized to 256 × 256.
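A sketch of this preparation step, assuming PNG files in a hypothetical dataset/cleaned folder; the helper below is illustrative, not the exact script used.

```python
import numpy as np
from pathlib import Path
from PIL import Image

def load_images(folder, size=(256, 256)):
    """Resize every PNG in `folder` and stack them into one array."""
    images = [np.asarray(Image.open(p).convert('RGB').resize(size))
              for p in sorted(Path(folder).glob('*.png'))]
    return np.stack(images)   # shape: (n_images, 256, 256, 3)

x = load_images('dataset/cleaned')   # hypothetical path
```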
As is noticeable from the histogram, the dataset is relatively small, especially for deep learning models, and it suffers from class imbalance. So, in order to tackle these issues, the data was split into training, validation and testing sets using stratified sampling, which takes samples from each class proportionally to the class size [7]. The split used in the experiments was 80%-10%-10% for the training, validation (used for hyper-parameter tuning) and testing sets, respectively. We used this split because the dataset is relatively small, and we incorporated stratified sampling for better generalization. After splitting, the data was normalized using a MinMax scaler in order to speed up the training process by making the objective function rounder, smoother and easier to optimize [18].
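Putting the two steps together, a minimal sketch, assuming images in x and labels in y; dividing 8-bit pixel values by 255 is one common form of min-max scaling, not necessarily the exact scaler call we used.

```python
from sklearn.model_selection import train_test_split

# 80% training, then the remaining 20% split in half for
# validation and testing; `stratify` keeps class proportions.
x_train, x_rest, y_train, y_rest = train_test_split(
    x, y, train_size=0.8, stratify=y, random_state=0)
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Min-max normalization of the pixel values into [0, 1].
x_train, x_val, x_test = x_train / 255.0, x_val / 255.0, x_test / 255.0
```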
3.2 Simple CNN Model

This architecture was built by trial and error, starting from a straightforward model inspired by the LeNet-5 [19] architecture.

The first model consisted of two sets of one convolution and one pooling layer followed by two dense layers, but it showed bad accuracy due to under-fitting. So layers were added, one layer per experiment, until no further improvement was detected.

Then, multiple experiments were made trying different combinations of kernel sizes, hidden layer sizes and pooling types. The best model in terms of accuracy on the two-class classification task was the following (a sketch of it is given after the list):

• Three convolution blocks with 4, 8, and 16 filters.
• Each block consists of two convolutional layers followed by a Max pooling layer.
• A stack of three dense layers of 64, 32 and 5 units.
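A sketch of this architecture in Keras; the 3 × 3 kernel size, ReLU activations, padding and 256 × 256 input resolution are assumptions, since the text does not fix them.

```python
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.InputLayer(input_shape=(256, 256, 3)))
for filters in (4, 8, 16):   # three convolution blocks
    # two convolutional layers per block, then max pooling
    model.add(layers.Conv2D(filters, (3, 3), padding='same',
                            activation='relu'))
    model.add(layers.Conv2D(filters, (3, 3), padding='same',
                            activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(5, activation='softmax'))   # five vine classes
```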
3.3 VGG

As with the simple model, some attempts were made to find a good starting point. In the case of the VGG model, the transfer learning technique using the ImageNet dataset was the very first step, and, from different experiments, it was noticeable that training only the last few layers of the VGG model provided the best results.

The reason for this behavior is that, in CNNs, the first few layers capture low-level features, which in most cases are useful for any image classification problem, whereas the last few layers capture high-level features, which are in most cases dataset (problem) specific. At the top of the model, the 1000-class layer related to the ImageNet dataset was removed, and a final 5-class dense layer was added. The Adam optimizer with a 0.001 learning rate was used.

The other technique used to handle the class imbalance issue was data augmentation on the training set. For reproducibility, a random seed was set while splitting the data into training, validation and test sets, and the model weights with the lowest validation loss were saved in HDF5 format.
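These choices can be sketched as follows, reusing the `model` from the Section 2.4 sketch and the `datagen` augmenter from Section 2.3; the loss, batch size, epoch count and file name are assumptions for illustration.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint

model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy',   # assumes one-hot labels
              metrics=['accuracy'])

# Keep only the weights of the epoch with the lowest validation
# loss, stored in HDF5 format.
checkpoint = ModelCheckpoint('vgg_best.h5', monitor='val_loss',
                             save_best_only=True,
                             save_weights_only=True)

model.fit(datagen.flow(x_train, y_train, batch_size=32),
          validation_data=(x_val, y_val),
          epochs=50, callbacks=[checkpoint])
```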
4 Results

For the simple CNN model, the best result obtained among all experiments was 90%, 90% and 90% for Accuracy, Precision and Recall, respectively, on the pair of classes “Szürkebarát” and “Tramini”. This method of training was chosen as a starting point because it is not time consuming and gives us the ability to do more trials. It also enables dividing the five-class dataset into multiple two-class datasets and monitoring the model performance on each of them.

Figure 8: Model Performance in Two Classes.

Over-fitting is noticeable in Figure 8, but at this point there was no need to seek improvement, since two-class classification was not the intended task and a robust model was of more interest. When verifying the model on four classes, two problems were faced: huge over-fitting, and a tendency of the largest class to attract a large number of False Positives, which leads to bad Precision and Recall. At this point, some steps were taken to smooth the effects of these problems (a sketch of this training schedule is given after the list):

• The number of epochs was increased to 300.
• Every 50 epochs, the training and validation datasets were merged and randomly re-split into new training and validation sets.
• While training, the model from the epoch with the best validation accuracy was saved; at the end, it was compared with the final model based on test accuracy.
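A sketch of this schedule, under the same assumptions as the earlier snippets; the re-split ratio (10% of the pooled 90%) and the file name are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import ModelCheckpoint

# Created once, so it keeps the best validation accuracy seen
# across all rounds, not just the current one.
checkpoint = ModelCheckpoint('cnn_best.h5', monitor='val_accuracy',
                             save_best_only=True)

for _ in range(6):   # 6 rounds of 50 epochs = 300 epochs
    # merge and randomly re-split train/validation
    x_pool = np.concatenate([x_train, x_val])
    y_pool = np.concatenate([y_train, y_val])
    x_train, x_val, y_train, y_val = train_test_split(
        x_pool, y_pool, test_size=0.11, stratify=y_pool)
    model.fit(x_train, y_train, epochs=50,
              validation_data=(x_val, y_val), callbacks=[checkpoint])
```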
Among all the experiments with four classes, the best results were 88.4%, 88.4% and 88.1% for Accuracy, Precision and Recall, respectively. Figure 9 shows the performance of the model while training with four classes, in terms of training and validation accuracy and loss over the epochs.

Figure 9: Model Performance in Four Classes.

Finally, the model was trained with five classes; the best results among all experiments were 83.8%, 84.4% and 84% for Accuracy, Precision and Recall. Figure 10 shows the same information as Figure 9 while training the model with all five available classes using the simple model.

Figure 10: Model Performance in Five Classes.
For the VGG model, some transformations (width shift, height shift, zooming, shearing and rotation) were used in data augmentation, which led the model to achieve almost 87% accuracy on the test set, which served as an unbiased estimate. Precision, Recall and F1-score also reached about the same value.
Class          Precision   Recall   F1-score
0              0.89        0.93     0.91
1              0.82        0.90     0.86
2              0.92        0.79     0.85
3              0.93        0.84     0.88
4              0.80        0.86     0.83
accuracy                            0.87
macro avg      0.87        0.86     0.87
weighted avg   0.87        0.87     0.87

Table 2: Precision, Recall and F1-score of the VGG model.
Table 2 shows the Precision, Recall and F1-score obtained with the VGG model. These metrics were chosen to measure the model’s performance because they take the class imbalance issue into account, and because of the general intuition behind them: precision measures how much noisy data is provided, i.e. it is related to the False Positive rate, while recall measures how much good data is missed. Finally, the F1-score is the harmonic mean of precision and recall. The main reason the harmonic mean is used in the F1-score is to punish large differences between precision and recall: for example, with 100% precision and 0% recall the F1-score is 0%, while the arithmetic mean would be 50%.
5 Conclusion

In this research, we investigated different deep learning techniques to overcome data shortage and class imbalance issues. From our experiments, we observed that even deep learning models, which normally require a lot of data, can perform very well on a small imbalanced dataset when using techniques such as stratified sampling, data augmentation, and transfer learning. In our first experiment, using a simple CNN model, we obtained an accuracy of around 83.8%, and almost the same value for the other metrics (Precision, Recall, and F1-score), while in the second experiment a VGG model was used with a combination of different techniques, reaching very good results of about 87% for the accuracy and the other metrics.

The results indicate that even if a large amount of data is preferable, it is possible to overcome the previously mentioned issues with satisfactory results. In addition, the applied techniques contributed to the non-appearance of over-fitting, making the models less dependent on the database.

It is also possible to realize that, in cases where the required level of accuracy is very high, above 90% or 95%, the techniques applied may not be recommended without further database analysis, since they may sacrifice accuracy to avoid other problems.

It is also important to notice that one of the models is already known in the literature, and the other did not require any major framework to be built, only systematic and incremental analysis while interpreting the results obtained at each step.

Acknowledgement

We would like to thank Telekom, which includes us among the technology partners of the Telekom Innovation Laboratories, and the Tempus Public Foundation for the financial support through the Stipendium Hungaricum Scholarship Programme. The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamental Research Collaborations Grounding Innovation in Informatics and Infocommunications).

References

[1] W. J. Zhang, G. Yang, Y. Lin, C. Ji, and M. M. Gupta. On definition of deep learning. In 2018 World Automation Congress (WAC), pages 1–5, 2018.

[2] Guillaume Chassagnon, Maria Vakalopolou, Nikos Paragios, and Marie-Pierre Revel. Deep learning: definition and perspectives for thoracic imaging. European Radiology, 30:2021–2030, 2019.

[3] Sakshi Indolia, Anil Kumar Goswami, S.P. Mishra, and Pooja Asopa. Conceptual understanding of convolutional neural network: a deep learning approach. Procedia Computer Science, 132:679–688, 2018. International Conference on Computational Intelligence and Data Science.

[4] Rikiya Yamashita, Mizuho Nishio, Richard Do, and Kaori Togashi. Convolutional neural networks: an overview and application in radiology. Insights into Imaging, 9, 2018.

[5] J. Moreira, A. Carvalho, and T. Horvath. A General Introduction to Data Analytics. Wiley, 2018.

[6] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556v6 (ICLR 2015), 2015.

[7] Van L. Parsons. Stratified Sampling, pages 1–11. American Cancer Society, 2017.

[8] Elizabeth Tipton. Stratified sampling using cluster analysis: A sample selection strategy for improved generalizations from experiments. Evaluation Review, 37(2):109–139, 2013. PMID: 24647924.
[9] Kevin Lang, Edo Liberty, and Konstantin Shmakov. Stratified sampling meets machine learning. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pages 2320–2329. JMLR.org, 2016.

[10] Longhua Qian, Guodong Zhou, Fang Kong, and Qiaoming Zhu. Semi-supervised learning for semantic relation classification using stratified sampling strategy. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1437–1445, Singapore, August 2009. Association for Computational Linguistics.

[11] Goldstein M, Uchida S. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE, 11(4):e0152173, 2016.

[12] Luke Taylor and Geoff Nitschke. Improving deep learning using generic data augmentation. CoRR, abs/1708.06020, 2017.

[13] Luis Perez and Jason Wang. The effectiveness of data augmentation in image classification using deep learning, 2017.

[14] Karl Weiss, Taghi M. Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, May 2016.

[15] New computer vision challenge wants to teach robots to see in 3D. New Scientist, 7 April 2017. Retrieved 3 February 2018.

[16] John Markoff. For Web Images, Creating New Technology to Seek and Find. The New York Times. Retrieved 3 February 2018.

[17] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[18] S. García, S. Ramírez-Gallego, J. Luengo, et al. Big data preprocessing: methods and prospects. Big Data Analytics, 1(1), 2016.

[19] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.