=Paper=
{{Paper
|id=Vol-2718/paper07
|storemode=property
|title=Classification of a Small Imbalanced Dataset of Vine Leaves Images using Deep Learning Techniques
|pdfUrl=https://ceur-ws.org/Vol-2718/paper07.pdf
|volume=Vol-2718
|authors=Amjad Balawi,Abdullah Al Zoabi,José Luis Seixas Junior,Tomáš Horváth
|dblpUrl=https://dblp.org/rec/conf/itat/BalawiZJH20
}}
==Classification of a Small Imbalanced Dataset of Vine Leaves Images using Deep Learning Techniques==
Amjad Balawi, Abdullah Al Zoabi, José Luis Seixas Junior, and Tomáš Horváth
Department of Data Science and Engineering, Faculty of Informatics
ELTE – Eötvös Loránd University
3in Research Group, Martonvásár, Hungary
http://t-labs.elte.hu/
amjad.balawi20@gmail.com, abdullah.al.zoabi@outlook.com, {tomas.horvath,jlseixasjr}@inf.elte.hu
Abstract: The Convolutional Neural Network (CNN) has become one of the most popular techniques in image classification. CNN models are usually trained on large amounts of data; in this paper, we discuss their use under data shortage and class imbalance. The study is conducted on a small dataset of vine leaf images, in a classification task with five classes, using two different approaches. In the first approach, a simple CNN model is used, while in the second, the Visual Geometry Group (VGG) model with transfer learning is used. It is shown that combining deep learning techniques such as transfer learning, stratified sampling and data augmentation with state-of-the-art CNN models such as VGG gives relatively good model performance, with up to 87% accuracy.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Deep Learning (DL) was inspired by the human brain and tries to simulate how humans learn. In DL, networks of neurons organized in multiple layers analyze large amounts of data to find the underlying structure or pattern. The main idea is to do this automatically, without explicit programming: the computer learns how to classify text, sounds and images. In Computer Vision (CV) tasks, the computer is trained on a huge number of images by encoding their pixels into an internal representation, so that the classifier can find the patterns in the input images [1].

DL outperforms other solutions in multiple domains, including speech, vision, video and natural language processing, and it reduces the need for the feature engineering stage, one of the most time-consuming tasks in machine learning [2]. Another reason DL has become so prominent in the last few years is the huge improvement in the computational power that can be utilized to accomplish such tasks. However, one common problem is poor performance on unseen data (the test set) due to over-fitting; usually, a large dataset is required to increase model performance. Another problem is that it is hard to choose the right model for a given problem.

A Convolutional Neural Network (CNN, or ConvNet) is a kind of Neural Network that is especially popular in image classification [3]. It has fewer connections, and therefore fewer model parameters, which makes it less sensitive to over-fitting. A second reason CNNs are powerful in computer vision tasks is parameter sharing: if a filter is useful on one part of the image, it can be useful on another. Furthermore, CNNs preserve the spatial information of the image, which makes the classifier more robust against affine transformations such as translation and rotation.

In many cases, image data scarcity can be dealt with by frequent acquisition, but there are still situations in which acquisition is not easy or cannot be frequent, as in agriculture, where a plant cannot be grown in an hour or a day. There are also cases where synthetic images are far from real-world images, so a model trained in such a setting would give good controlled results but would not solve real problems.

The goal of this article is to find techniques, procedures or functions that address the problems of using CNNs on small and imbalanced datasets. To this end, two different CNN structures are implemented, combined with different DL techniques and procedures such as data augmentation, transfer learning, stratified sampling and model selection based on validation accuracy, also showing the transition from a simple CNN model to a state-of-the-art model like VGG.

This paper is organized as follows: Section 2 presents the techniques and definitions used in the proposals of this work, followed by Section 3, which describes the steps for constructing the models. Section 4 shows the results obtained, and Section 5 the conclusions that can be inferred.

2 Proposed Approaches

There are many Machine Learning (ML) techniques that could be used for general classification problems, such as K-Nearest Neighbors (KNN), Logistic Regression, Support Vector Machines (SVM) and Artificial Neural Networks (ANN), but for image classification the most popular technique is the Convolutional Neural Network. CNNs are a class of ANNs that has become dominant in various CV tasks [4] due to its ability to extract relevant features from raw data [5].
2.1 CNN and VGG architectures

In general, the CNN architecture is like that of an ordinary Neural Network, but it is stronger and deeper because it preserves the spatial information of images, which mitigates the problem of affine transformations. It also makes the classifier more robust by adding a stack of convolution layers just before the dense layers, and it reduces the number of trained parameters, which speeds up the learning process. A CNN architecture includes several building blocks, such as convolution layers, pooling layers, and fully connected layers. A typical architecture consists of repetitions of a stack of several convolution layers and a pooling layer, followed by one or more fully connected layers [4].

Figure 1: Overview of the CNN architecture.

Figure 1 shows a general overview of the CNN architecture. Convolution layers take the raw image as input, perform convolutions using trainable sliding windows of different sizes, typically called kernels, and produce a vector that serves as input for the dense layers. Each kernel has its own parameters, which are trained just like the dense-layer parameters; the output of a convolution layer goes as input to the next layer, which looks for a higher level of detail, and so on. The pooling layers come after a stack of one or more convolution layers; the purpose of pooling is to reduce the input size and to cope with small translations. There are multiple types of pooling, such as Average, Min and Max pooling.

The Visual Geometry Group (VGG) network was introduced by Simonyan and Zisserman [6] and is, in general, characterized by its simplicity, since it only uses 3 × 3 convolution layers stacked on top of each other with increasing depth. To reduce the volume size or resolution, max pooling is used in this network. After the convolution layers, there are two dense layers with 4,096 neurons each, followed by a softmax classifier, which is a generalization of logistic regression to multiclass probability distributions. There are two versions of VGG, 16 and 19, referring to the number of weight layers in the network.

Simonyan and Zisserman found the convergence of VGG16 and VGG19 in the deeper configurations quite challenging, so they first trained smaller versions of the model, such as the ones shown in Table 1. The main drawbacks of the VGG network are that it is slow to train and that its weights are quite large: the depth and the number of fully connected neurons make it require a large amount of memory, which makes training a tedious task. However, in this paper, we suggest methods to overcome this issue and speed up the training process.

Table 1: VGG architecture (each column is read top to bottom).

11 weight layers          | 16 weight layers
--------------------------+----------------------------------
Input (224 × 224 RGB image)
Conv3-64                  | Conv3-64, Conv3-64
Max pooling
Conv3-128                 | Conv3-128, Conv3-128
Max pooling
Conv3-256, Conv3-256      | Conv3-256, Conv3-256, Conv1-256
Max pooling
Conv3-512, Conv3-512      | Conv3-512, Conv3-512, Conv1-512
Max pooling
Conv3-512, Conv3-512      | Conv3-512, Conv3-512, Conv1-512
Max pooling
FC-4096
FC-4096
FC-1000
SoftMax layer

2.2 Stratified Sampling

Stratified sampling is a probability sampling technique that takes group sizes into account during the sampling process. The elements of the target population are divided into distinct groups, or “strata”, such that within each stratum the elements have similar characteristics [7]. This technique is widely used in ML, especially when the data suffers from class imbalance [8, 9, 10, 11]. It is implemented in the scikit-learn library, a free ML library for Python; we used it while splitting the data into training, validation and test sets, by passing the target variable from which the sample is drawn to the stratify attribute of the train_test_split function.
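As an illustration, here is a minimal sketch of such a stratified split; the array names and the 80/20 ratio are assumptions for illustration (the actual 80%-10%-10% split used in our experiments is described in Section 3.1).

```python
from sklearn.model_selection import train_test_split

# X: image array, y: class labels. Passing y to `stratify` makes
# each subset keep the class proportions of the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```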
2.3 Data Augmentation

DL models, including CNNs, are usually trained on a large amount of data to reach a reasonable performance [12]; in case of data shortage, as in this paper, these models tend to over-fit the training data and lose the ability to generalize, which leads to bad performance on the test set. After the cleaning stage, our dataset contains around 1600 images; 80% of them were used for training, while the remaining 20% was divided equally between the testing and validation sets. This amount of data may not be enough to train a deep neural network to a good accuracy, so, in order to increase accuracy and generalization and to prevent over-fitting, a data augmentation stage was added to the architecture.

Data augmentation means creating more training images from the existing ones by applying simple effects and affine transformations such as shifting, flipping, rotating and zooming. This augmentation increases the number of training images and leads to better generalization and a smoother training curve; it also provides information on small deformations that images may contain due to the acquisition process [13]. Figure 2 shows the result of applying data augmentation to the first image, resized to 256 × 256, which produced the second and third images by rotation and flipping.

Figure 2: Example of Data Augmentation after Resizing the Original Image to 256 × 256.

As can be seen, some important shapes or features for classification that would have been discarded if the acquisition had been made only with the leaf upright now also become part of the training set.
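A minimal sketch of how such augmentation can be expressed with the Keras ImageDataGenerator used in this work; the specific ranges below are illustrative assumptions, not the exact configuration of our experiments.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random rotations, shifts, zoom and flips applied on the fly
# to each training batch.
datagen = ImageDataGenerator(
    rotation_range=30,        # rotate by up to 30 degrees
    width_shift_range=0.1,    # horizontal shift up to 10%
    height_shift_range=0.1,   # vertical shift up to 10%
    zoom_range=0.2,
    horizontal_flip=True)

# x_train: array of shape (n, 256, 256, 3); flow() yields an
# endless stream of augmented batches for model.fit(...).
batches = datagen.flow(x_train, y_train, batch_size=32)
```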
2.4 Transfer Learning

Transfer learning is widely used in machine learning when there is not enough data for model training. The main idea of this technique is to take a model pretrained on a similar problem and apply it to the new problem [14]. In most cases, the last few layers are refined and a simple dense or linear model is added on top.

The ImageNet dataset was used in this paper. It is a large visual dataset designed for object recognition tasks which contains more than 14 million images; at least one million of them have been hand-annotated to indicate which objects are pictured, and bounding boxes are also provided [15, 16]. ImageNet contains more than 20 thousand categories, with typical categories, such as “balloon” or “strawberry”, consisting of several hundred images [17].
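The general recipe can be sketched as follows with the Keras VGG16 application; the size of the intermediate dense layer and the number of frozen layers are placeholder assumptions, and the exact head used in our experiments is described in Section 3.3.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Pretrained convolutional base, without ImageNet's 1000-class top.
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))

# Freeze everything except the last few layers, which are refined.
for layer in base.layers[:-4]:
    layer.trainable = False

# A simple dense classifier added on top of the pretrained base.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),    # placeholder size
    layers.Dense(5, activation='softmax')])  # our five classes
```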
3 Research Methods

All strategies were implemented on the Google Colab cloud service using TensorFlow 2.0 on GPU and the Keras API abstraction framework. TensorFlow is one of the well-known libraries commonly used for image classification in DL; it is an end-to-end open-source ML platform developed by Google in 2015 for numerical processing and computation. Keras is an open-source neural-network library written in Python whose main purpose is to reduce code complexity; it offers a simple and efficient API able to run on top of TensorFlow, Theano and other DL frameworks.

3.1 Dataset creation

In this study, images were collected by our department from the fields of Hungary in the summer of 2019. The study has an industrial background in wine production, and the purpose is to predict the type of wine produced by each vine. Around 2200 images were collected by different people with different devices, which produced images of different sizes, formats and backgrounds, so a filtering and preparation stage was needed. The dataset is divided into five classes, each named in Hungarian after the wine produced from the vine: “Cabernet Franc”, “Kékfrankos”, “Sárgamuskotály”, “Szürkebarát”, and “Tramini”. Figure 3 shows eight random samples from the dataset with their original sizes.

Figure 3: Random samples from the Dataset with their Original Sizes.

The two main problems faced and discussed in this study are data shortage and class imbalance; both can be seen in the histogram presented in Figure 4, which shows how many images the dataset contains for each class.

Figure 4: Histogram of the Raw Dataset.

Since the data was collected by non-experts and this is the first time it is being used, the first step was to clean the dataset by removing noisy images, such as the one shown in Figure 5, so that they would not affect the training process on a small dataset. Figure 6 shows the distribution of the cleaned dataset.

Figure 5: Example of a Noisy Image.

Figure 6: Histogram of the cleaned Dataset.

Then all the different image formats were unified into a common format (PNG), selected to keep as much information as possible in the images, since it uses a lossless compression algorithm. After that, the images were resized into two resolutions, 224 × 224 and 256 × 256 pixels, which are the sizes preferred in practice by different CNN architectures such as VGG16 and ResNet34. In order to speed up the training process, the raw images were converted into NumPy arrays, a vectorized representation. Figure 7 shows an image sample from the cleaned dataset.

Figure 7: Example from the Prepared Dataset Resized to 256 × 256.
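A sketch of this preparation step, assuming PNG files in a hypothetical dataset/cleaned folder; the helper below is illustrative, not the exact script used.

```python
import numpy as np
from pathlib import Path
from PIL import Image

def load_images(folder, size=(256, 256)):
    """Resize every PNG in `folder` and stack them into one array."""
    images = [np.asarray(Image.open(p).convert('RGB').resize(size))
              for p in sorted(Path(folder).glob('*.png'))]
    return np.stack(images)   # shape: (n_images, 256, 256, 3)

x = load_images('dataset/cleaned')   # hypothetical path
```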
As is noticeable from the histogram, the dataset is relatively small, especially for deep learning models, and it suffers from class imbalance. So, in order to tackle these issues, the data was split into training, validation and testing sets using stratified sampling, which takes samples from each class proportionally to the class size [7]. The split used in the experiments was 80%-10%-10% for the training, validation (used for hyper-parameter tuning) and testing sets, respectively. We used this split because the dataset is relatively small, and we incorporated stratified sampling for better generalization. After splitting, the data was normalized using a MinMax scaler in order to speed up the training process by making the objective function rounder, smoother and easier to optimize [18].
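Putting the two steps together, a minimal sketch, assuming images in x and labels in y; dividing 8-bit pixel values by 255 is one common form of min-max scaling, not necessarily the exact scaler call we used.

```python
from sklearn.model_selection import train_test_split

# 80% training, then the remaining 20% split in half for
# validation and testing; `stratify` keeps class proportions.
x_train, x_rest, y_train, y_rest = train_test_split(
    x, y, train_size=0.8, stratify=y, random_state=0)
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Min-max normalization of the pixel values into [0, 1].
x_train, x_val, x_test = x_train / 255.0, x_val / 255.0, x_test / 255.0
```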
3.2 Simple CNN Model

This architecture was built by trial and error, starting from a straightforward model inspired by the LeNet-5 [19] architecture.

The first model consisted of two sets of one convolution and one pooling layer followed by two dense layers, but it showed bad accuracy due to under-fitting. So layers were added, one layer per experiment, until no further improvement was detected.

Then, multiple experiments were made trying different combinations of kernel sizes, hidden layer sizes and pooling types. The best model in terms of accuracy on the two-class classification task was the following (a sketch of it is given after the list):

• Three convolution blocks with 4, 8, and 16 filters.
• Each block consists of two convolutional layers followed by a Max pooling layer.
• A stack of three dense layers of 64, 32 and 5 units.
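A sketch of this architecture in Keras; the 3 × 3 kernel size, ReLU activations, padding and 256 × 256 input resolution are assumptions, since the text does not fix them.

```python
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.InputLayer(input_shape=(256, 256, 3)))
for filters in (4, 8, 16):   # three convolution blocks
    # two convolutional layers per block, then max pooling
    model.add(layers.Conv2D(filters, (3, 3), padding='same',
                            activation='relu'))
    model.add(layers.Conv2D(filters, (3, 3), padding='same',
                            activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(5, activation='softmax'))   # five vine classes
```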
3.3 VGG

As with the simple model, some attempts were made to find a good starting point. In the case of the VGG model, the transfer learning technique using the ImageNet dataset was the very first step, and, from different experiments, it was noticeable that training only the last few layers of the VGG model provided the best results.

The reason for this behavior is that, in CNNs, the first few layers capture low-level features, which in most cases are useful for any image classification problem, whereas the last few layers capture high-level features, which are in most cases dataset (problem) specific. At the top of the model, the 1000-class layer related to the ImageNet dataset was removed, and a final 5-class dense layer was added. The Adam optimizer with a 0.001 learning rate was used.

The other technique used to handle the class imbalance issue was data augmentation on the training set. For reproducibility, a random seed was set while splitting the data into training, validation and test sets, and the model weights with the lowest validation loss were saved in HDF5 format.
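These choices can be sketched as follows, reusing the `model` from the Section 2.4 sketch and the `datagen` augmenter from Section 2.3; the loss, batch size, epoch count and file name are assumptions for illustration.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint

model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy',   # assumes one-hot labels
              metrics=['accuracy'])

# Keep only the weights of the epoch with the lowest validation
# loss, stored in HDF5 format.
checkpoint = ModelCheckpoint('vgg_best.h5', monitor='val_loss',
                             save_best_only=True,
                             save_weights_only=True)

model.fit(datagen.flow(x_train, y_train, batch_size=32),
          validation_data=(x_val, y_val),
          epochs=50, callbacks=[checkpoint])
```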
4 Results

For the simple CNN model, the best result obtained among all experiments was 90%, 90% and 90% for Accuracy, Precision and Recall, respectively, on the pair of classes “Szürkebarát” and “Tramini”. This method of training was chosen as a starting point because it is not time consuming and gives us the ability to do more trials. It also enables dividing the five-class dataset into multiple two-class datasets and monitoring the model performance on each of them.

Figure 8: Model Performance in Two Classes.

Over-fitting is noticeable in Figure 8, but at this point there was no need to seek improvement, since two-class classification was not the intended task and a robust model was of more interest. When verifying the model on four classes, two problems were faced: huge over-fitting, and a tendency of the largest class to attract a large number of False Positives, which leads to bad Precision and Recall. At this point, some steps were taken to smooth the effects of these problems (a sketch of this training schedule is given after the list):

• The number of epochs was increased to 300.
• Every 50 epochs, the training and validation datasets were merged and randomly re-split into new training and validation sets.
• While training, the model from the epoch with the best validation accuracy was saved; at the end, it was compared with the final model based on test accuracy.
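A sketch of this schedule, under the same assumptions as the earlier snippets; the re-split ratio (10% of the pooled 90%) and the file name are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import ModelCheckpoint

# Created once, so it keeps the best validation accuracy seen
# across all rounds, not just the current one.
checkpoint = ModelCheckpoint('cnn_best.h5', monitor='val_accuracy',
                             save_best_only=True)

for _ in range(6):   # 6 rounds of 50 epochs = 300 epochs
    # merge and randomly re-split train/validation
    x_pool = np.concatenate([x_train, x_val])
    y_pool = np.concatenate([y_train, y_val])
    x_train, x_val, y_train, y_val = train_test_split(
        x_pool, y_pool, test_size=0.11, stratify=y_pool)
    model.fit(x_train, y_train, epochs=50,
              validation_data=(x_val, y_val), callbacks=[checkpoint])
```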
Among all the experiments with four classes, the best results were 88.4%, 88.4% and 88.1% for Accuracy, Precision and Recall, respectively. Figure 9 shows the performance of the model while training with four classes, in terms of training and validation accuracy and loss over the epochs.

Figure 9: Model Performance in Four Classes.

Finally, the model was trained with five classes; the best results among all experiments were 83.8%, 84.4% and 84% for Accuracy, Precision and Recall. Figure 10 shows the same information as Figure 9 while training the model with all five available classes using the simple model.

Figure 10: Model Performance in Five Classes.
For the VGG model, some transformations (width shift, height shift, zooming, shearing and rotation) were used in data augmentation, which led the model to achieve almost 87% accuracy on the test set, which served as an unbiased estimate. Precision, Recall and F1-score also reached about the same value.
Class          Precision   Recall   F1-score
0              0.89        0.93     0.91
1              0.82        0.90     0.86
2              0.92        0.79     0.85
3              0.93        0.84     0.88
4              0.80        0.86     0.83
accuracy                            0.87
macro avg      0.87        0.86     0.87
weighted avg   0.87        0.87     0.87

Table 2: Precision, Recall and F1-score of the VGG model.
Table 2 shows the Precision, Recall and F1-score obtained with the VGG model. These metrics were chosen to measure the model’s performance because they take the class imbalance issue into account, and because of the general intuition behind them: precision measures how much noisy data is provided, i.e. it is related to the False Positive rate, while recall measures how much good data is missed. Finally, the F1-score is the harmonic mean of precision and recall. The main reason the harmonic mean is used in the F1-score is to punish large differences between precision and recall: for example, with 100% precision and 0% recall the F1-score is 0%, while the arithmetic mean would be 50%.
5 Conclusion

In this research, we investigated different deep learning techniques to overcome data shortage and class imbalance issues. From our experiments, we observed that even deep learning models, which normally require a lot of data, can perform very well on a small imbalanced dataset when using techniques such as stratified sampling, data augmentation, and transfer learning. In our first experiment, using a simple CNN model, we obtained an accuracy of around 83.8%, and almost the same value for the other metrics (Precision, Recall, and F1-score), while in the second experiment a VGG model was used with a combination of different techniques, reaching very good results of about 87% for the accuracy and the other metrics.

The results indicate that even if a large amount of data is preferable, it is possible to overcome the previously mentioned issues with satisfactory results. In addition, the applied techniques contributed to the non-appearance of over-fitting, making the models less dependent on the database.

It is also possible to realize that, in cases where the required level of accuracy is very high, above 90% or 95%, the techniques applied may not be recommended without further database analysis, since they may sacrifice accuracy to avoid other problems.

It is also important to notice that one of the models is already known in the literature, and the other did not require any major framework to be built, only systematic and incremental analysis while interpreting the results obtained at each step.

Acknowledgement

We would like to thank Telekom, which includes us among the technology partners of the Telekom Innovation Laboratories, and the Tempus Public Foundation for the financial support through the Stipendium Hungaricum Scholarship Programme. The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamental Research Collaborations Grounding Innovation in Informatics and Infocommunications).

References

[1] W. J. Zhang, G. Yang, Y. Lin, C. Ji, and M. M. Gupta. On definition of deep learning. In 2018 World Automation Congress (WAC), pages 1–5, 2018.

[2] Guillaume Chassagnon, Maria Vakalopolou, Nikos Paragios, and Marie-Pierre Revel. Deep learning: definition and perspectives for thoracic imaging. European Radiology, 30:2021–2030, 2019.

[3] Sakshi Indolia, Anil Kumar Goswami, S.P. Mishra, and Pooja Asopa. Conceptual understanding of convolutional neural network: a deep learning approach. Procedia Computer Science, 132:679–688, 2018. International Conference on Computational Intelligence and Data Science.

[4] Rikiya Yamashita, Mizuho Nishio, Richard Do, and Kaori Togashi. Convolutional neural networks: an overview and application in radiology. Insights into Imaging, 9, 2018.

[5] J. Moreira, A. Carvalho, and T. Horvath. A General Introduction to Data Analytics. Wiley, 2018.

[6] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556v6 (ICLR 2015), 2015.

[7] Van L. Parsons. Stratified Sampling, pages 1–11. American Cancer Society, 2017.

[8] Elizabeth Tipton. Stratified sampling using cluster analysis: A sample selection strategy for improved generalizations from experiments. Evaluation Review, 37(2):109–139, 2013. PMID: 24647924.
[9] Kevin Lang, Edo Liberty, and Konstantin Shmakov. Stratified sampling meets machine learning. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pages 2320–2329. JMLR.org, 2016.

[10] Longhua Qian, Guodong Zhou, Fang Kong, and Qiaoming Zhu. Semi-supervised learning for semantic relation classification using stratified sampling strategy. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1437–1445, Singapore, August 2009. Association for Computational Linguistics.

[11] Goldstein M, Uchida S. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE, 11(4):e0152173, 2016.

[12] Luke Taylor and Geoff Nitschke. Improving deep learning using generic data augmentation. CoRR, abs/1708.06020, 2017.

[13] Luis Perez and Jason Wang. The effectiveness of data augmentation in image classification using deep learning, 2017.

[14] Karl Weiss, Taghi M. Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, May 2016.

[15] New computer vision challenge wants to teach robots to see in 3D. New Scientist, 7 April 2017. Retrieved 3 February 2018.

[16] John Markoff. For Web Images, Creating New Technology to Seek and Find. The New York Times. Retrieved 3 February 2018.

[17] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[18] S. García, S. Ramírez-Gallego, J. Luengo, et al. Big data preprocessing: methods and prospects. Big Data Analytics, 1(1), 2016.

[19] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.