<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Convolutional Architecture Capabilities for Image Classification Tasks with Insufficient Amount of Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrii Matsevytyi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>prof. Vytautas Rudzionis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vilnius University, Kaunas Faculty</institution>
          ,
          <addr-line>Muitinės g. 8</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Nowadays Convolutional Neural Networks are used everywhere, from facial recognition to malware detection and flat evaluation, and are considered to have brought significant changes to computer vision. They offer solutions to such problems as insufficient and low-quality datasets. However, they tend to suffer from the same problems as other Machine Learning and Deep Learning techniques. The paper considers and analyses the most common methods for image classification involving the feed-forward convolutional architecture. The object of the study is a self-collected dataset consisting of 7 classes that provide low-, middle- and high-level features. The subject of the study is to explore the capabilities of the key CNN architecture blocks and their combinations.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional neural networks</kwd>
        <kwd>image classification</kwd>
        <kwd>kernel</kwd>
        <kwd>pooling</kwd>
        <kwd>accuracy metrics</kwd>
        <kwd>optimization</kwd>
        <kwd>low quality dataset</kwd>
        <kwd>small dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Classifying images is one of the fundamental tasks in Machine Learning and Data Analysis. It plays a
crucial role in our everyday life – from facial recognition[1] to flat filtering, evaluation and
classification[2]. There are many image recognition methods used by various researchers. Among
them can be mentioned linear discrimination methods, the nearest neighbours approach, SVM and other
approaches. Using these methods, different degrees of success were achieved for various tasks. The relative
drawback of many of those methods is the necessity to use somehow derived features. Many feature
extraction procedures have been proposed for use in a biometric system, including principal component
analysis (PCA), independent component analysis (ICA), local binary patterns (LBP), the histogram
method and others.</p>
      <p>However, these methods require a large amount of good-quality data to achieve a competitive
benchmark, and nowadays there are fields where it is very hard to collect such a dataset. A good example
is the task of recognizing the symbols widely used by various youth subcultures. There are amateur
periodicals containing images with those symbols, but the number of available images is restricted.
Additionally, many of those images are of low quality, since they were produced (captured or painted) by
amateur authors using not the best techniques available. This means that it is hard to increase the available
amount of training data. But recognition that is as good as possible could be of great help to the people
interested in these youth groups, in particular psychologists and anthropologists.</p>
      <p>Convolutional Deep Neural Networks appear to be an extremely powerful tool in image classification. In
fact, they have revolutionized computer vision[3]. The choice of convolution and pooling in CNNs is
motivated by the desire to endow the networks with invariance to irrelevant cues such as image
translations, scalings and other small deformations[4-5]. For the task of helping the people working in
these areas, a CNN seems to be one of the best approaches due to its high efficiency and robustness. At the
same time, it is necessary to find the best way to solve the particular task of symbol recognition using
a limited amount of low-quality data.</p>
      <p>The aim of this work is to conduct research on how different layers and their combinations influence
the accuracy of a convolutional network in the context of an elusive dataset, and to address such issues as
overfitting predisposition and the usage of a small dataset of insufficient quality. The main layers of a
convolutional network, their hyperparameters and their capabilities for extracting different and similar
low-level, middle-level and high-level features in image classification tasks will be overviewed.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>This research was inspired by Chiheb Chebbi’s book “Mastering Machine Learning for Penetration
Testing”, particularly chapter 4, where a CNN is used to detect malware applications[7]. One of the first
successful approaches to using CNNs for image classification is LeNet-5, introduced in 1989, where a
simple CNN was used for handwritten digit recognition[8]. This was a simple model, but it became a
powerful tool, and with a best test error rate of less than 0.3% it approached the human level[9]. However,
while being good at digit recognition, the CNN-like approaches of that time showed poor capabilities in
real-world scenarios[10]. Probably the first powerful example to overcome the issue is AlexNet, introduced by
Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton in 2012 [11]. It achieved only a 17% top-5
error rate (the percentage of times that the target label does not appear among the 5 highest-probability
predictions; many earlier methods could not get below 25%) and was a breakthrough in this field, despite
the fact that it looked similar to LeNet-5, having 9 times bigger input height and width as well as
additional Convolutional and Dense layers. The next approach taken was using a similar architecture but
pushing model depth to 16-19 layers; as a result, the VGG model appeared with a result of 7.3% top-5 error [12].</p>
      <p>Convolutional Neural Networks still face a few weaknesses, in particular issues with transfer learning,
limitations of interpretability and computational complexity. These drawbacks are not covered in this
paper. This paper will try to address the issues related to overfitting predisposition and the usage of a small
dataset of insufficient quality.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Related theory</title>
      <p>A convolutional neural network is a deep learning model for processing data that has a grid structure,
for example images. The appearance of CNNs was inspired by the organization of the visual cortex of the
animal brain; they are designed for the automatic and adaptive learning and extraction of the spatial hierarchy
of features present in an image, from low-level features (i.e. angles, straight lines, horizontal lines) to
more complex, high-level patterns[13].</p>
      <p>This network usually consists of three types of layers (groups of neurons): convolutional, pooling
(subsampling) and fully connected layers, the latter forming a reduced fully connected neural network. The first
two types of layers, convolution and pooling, perform the function of feature extraction, while fully
connected layers translate the extracted features into a final result, such as the probabilities of an image
belonging to classes in the case of a classification task.</p>
      <sec id="sec-3-1">
        <title>Convolutional Layer</title>
        <p>The convolutional layer is the cornerstone of all convolutional networks. This layer slides
different kernel filters over its input to capture different patterns and consists of a combination of linear and non-linear
operations – convolution operations and activation functions.</p>
        <p>Convolution is a special type of linear operation used for feature extraction, where a small array of
numbers, called a kernel, is applied across the input data.</p>
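        <p>As an illustrative sketch (not the paper’s code), the convolution operation for a single-channel input can be written in NumPy:</p>

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a kernel over a single-channel image ('valid' padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the current patch with the kernel and sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 3x3 vertical-edge kernel applied to a 4x4 image yields a 2x2 feature map
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)
feature_map = convolve2d(image, kernel)  # shape (2, 2)
```

        <p>Each output value measures how strongly the local patch matches the kernel pattern; learning many such kernels is what lets a convolutional layer extract different features.</p>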
      </sec>
      <sec id="sec-3-2">
        <title>Commonly used activation functions</title>
        <p>Activation functions introduce non-linearity into the network. In a neural network, each neuron in
the same layer has the same activation function. As the model is trained by computing gradient descent and
backpropagating error signals for each neuron, and a CNN consists of millions of neurons, the ReLU
(Rectified Linear Unit) function is usually used for hidden layers to keep the computational complexity low.
Also, due to its ability to handle values dropping below 0, Leaky ReLU is sometimes used[14]; one example
of such a situation is a network with a prevailing number of negative inputs.</p>
        <p>The output layer is responsible for classifying the signals passed from the hidden layers into classes.
Depending on the task, either the softmax or the logistic activation function may be used. Softmax is used
for multi-class classification, while the logistic function is its version for binary classification. Tanh
(hyperbolic tangent) maps the output to the range (-1, 1), which brings benefits in some situations[15-16].
It is symmetrical in comparison to the softmax/logistic function and is claimed to be more balanced in
binary classification tasks. Also, for tanh, 0 is the fastest point (representing the highest gain), while for
the logistic function 0 is the lowest point, and it becomes a trap for anything going below it.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Fully Connected Layer (Flatten and Dense Layers)</title>
        <p>The output feature maps of the last convolutional or pooling layer are usually flattened, i.e. turned
into a one-dimensional array of numbers, and connected to one or more fully connected layers. Here each
input is associated with each output by weights that are learned. After patterns of features are created,
extracted by the convolutional layers and reduced by the pooling layers, they are connected by a subset of
fully connected layers with class probabilities as the final network outputs.</p>
        <p>In practice, max pooling is traditionally used with a kernel (filter) of size 2 × 2 and a stride of 2.
This halves both main spatial dimensions. Unlike height and width, the depth of the feature maps remains
unchanged, since pooling is done for each depth slice separately.</p>
      </sec>
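      <p>As a sketch of the activation functions discussed above (standard textbook definitions, not code from the paper):</p>

```python
import numpy as np

def relu(x):
    """ReLU: cheap to compute, zeroes out negative inputs."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: keeps a small slope for negative inputs instead of zeroing them."""
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    """Softmax: maps a vector of logits to class probabilities summing to 1."""
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])
probs = softmax(logits)  # probabilities over 3 classes
```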
    </sec>
    <sec id="sec-4">
      <title>3. Deep learning model</title>
      <p>In this study, different architectures of CNNs will be tested to figure out their capabilities for
classifying different level features on a low-quality dataset with an insufficient amount of data. The
choice of a Convolutional Neural Network (CNN) architecture is particularly suitable for this study due
to its inherent ability to extract hierarchical features from images, which is crucial for classifying
symbols and patterns in low-quality datasets. CNNs are designed to capture spatial relationships in data,
making them robust to variations such as image translations, scaling, and deformations, which are
common challenges in recognizing symbols captured by amateur authors using diverse techniques. Also, the
convolutional approach is chosen due to its ability to learn from comparatively small amounts of data.</p>
      <p>The initial configuration of the model is inspired by the LeNet-5 architecture[8]. It is also influenced
by the AlexNet approach[11], due to its good performance in image classification tasks on large datasets.
However, in this research the task provides a dataset with a small number of classes and an insufficient amount
of data of imperfect quality; therefore the original AlexNet may not be the best solution, and
higher-scale approaches such as VGG[12] are not taken into account. Instead, the number of layers and their
configurations will be modified to explore different model complexities and performance outcomes. All
models consist of combinations of Conv2D and MaxPooling2D layers, followed by Flatten and Dense
layers. For the Conv2D layers, a 3×3 kernel is used. To cover all reasonable kernel-count variations,
the number of kernels is doubled with each layer, starting from 32. For the MaxPooling2D
layers, a 2×2 kernel is used.</p>
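      <p>The way the spatial dimensions evolve through such a stack can be sketched as follows (an illustration, not the training code, assuming Keras’ default 'valid' padding and the 244×244×3 input used in this study):</p>

```python
def conv_out(size: int, kernel: int = 3) -> int:
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size: int, window: int = 2) -> int:
    """Spatial size after max pooling with a 2x2 window and stride 2."""
    return size // window

size = 244                    # input images are 244 x 244 x 3
filters = [32, 64]            # kernel count doubles with each pair, starting from 32
for _ in filters:             # two Conv2D + MaxPooling2D pairs
    size = pool_out(conv_out(size))
flattened = size * size * filters[-1]  # number of values the Flatten layer emits
```

      <p>For two pairs, 244 shrinks to 59×59 with 64 channels, so the Flatten layer hands 222,784 values to the Dense layers; each extra pair roughly halves the spatial size again.</p>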
      <p>Firstly, different numbers of pairs of Convolutional-MaxPooling layers will be tested. Since the dataset is
small, the best approach should be having two or three pairs of Conv2D+MaxPooling2D layers. Then
the best result is taken, its performance is analysed, and an additional Convolutional layer is added
in different parts of the network with further analysis. Eventually, different numbers of Dense layers are tried
to better reveal how the captured features were useful in this task.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Experimental setup</title>
    </sec>
    <sec id="sec-6">
      <title>4.1 Dataset and data preprocessing</title>
      <p>Due to the lack of existing datasets in this field, a self-collected balanced dataset is used for testing
the capabilities of the neural network. It consists of 700 open-source images belonging to 7 classes:
Horn gesture, Anarchy graffiti, Pentagram, Inverse Pentagram, Skull, Punks and Metalists.</p>
      <p>Pentagram, Anarchy and Inverse Pentagram mainly require low-level features to be extracted. Horn
gesture and Skull contain mostly middle-level features, while Punks and Metalists have high-level
features prevailing. During data preprocessing, each photo is converted to shape (244, 244, 3) – a square of
244×244 pixels with 3 colour (RGB) channels, similarly to how it is done in AlexNet[11]. After that, in order
to decrease model overfitting, augmentation – generating new data from existing data by applying different
transformations – is applied, a very powerful technique for reducing overfitting[17]. In particular, the following techniques were applied:</p>
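      <p>A minimal sketch of such augmentation with NumPy (assuming horizontal flips and small random shifts were among the transformations; the exact set used in the study is not reproduced here):</p>

```python
import numpy as np

def horizontal_flip(img: np.ndarray) -> np.ndarray:
    """Mirror the image left-to-right; the class label is unaffected."""
    return img[:, ::-1, :]

def random_shift(img: np.ndarray, max_px: int, rng: np.random.Generator) -> np.ndarray:
    """Shift the image by a few pixels (wrapping at the border for simplicity)."""
    dy, dx = rng.integers(-max_px, max_px + 1, size=2)
    return np.roll(img, (dy, dx), axis=(0, 1))

img = np.ones((244, 244, 3), dtype=np.float32)  # a preprocessed 244x244 RGB image
augmented = [horizontal_flip(img), random_shift(img, 10, np.random.default_rng(0))]
```

      <p>Each transformed copy counts as a new training example, which multiplies the effective dataset size without collecting new images.</p>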
    </sec>
    <sec id="sec-7">
      <title>4.2 Software and hardware stack that was used</title>
      <p>For the experiments in this work, the Kaggle Environment was used – a cloud-based Python environment
with access to 2× Tesla T4 GPUs and 16 GB of connected memory. For data preprocessing, the PIL and NumPy
libraries were used. For model training, evaluation and results visualization, such open-source
libraries as Keras, scikit-learn, seaborn and matplotlib were used.</p>
    </sec>
    <sec id="sec-8">
      <title>4.3 Training parameters</title>
      <p>All hidden layers use ReLU activation function, while the last Dense layer uses Softmax function.
Kernel size for Conv2D layers is 3x3, for MaxPooling2D layers is 2x2. Kernel amount for Conv2D starts
from 32 and gets doubled each layer in depth except from separately added layer, that keeps the amount
the same on its turn. For choosing hyperparameters manual approach was used, and main emphasis was
done on existing conclusions[5, 11, 12, 18]. While compiling model, rmsprop (Root Mean Square
Propagation) optimizer,
3 Data augmentation is generating new data from existing ones by applying different transformations.[17]
categorical crossentropy loss and accuracy metrics were used. Below there are formulas of Root Mean
Square Propagation optimizer and categorical crossentropy loss.</p>
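      <p>In their standard textbook form (a sketch, with η the learning rate, β the decay rate, gₜ the gradient at step t and ε a small stability constant), the referenced formulas are:</p>

```latex
% RMSProp: running average of squared gradients, then a scaled parameter update
E[g^2]_t = \beta\, E[g^2]_{t-1} + (1 - \beta)\, g_t^2
\qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t

% Categorical crossentropy over C classes (y: one-hot target, \hat{y}: softmax output)
L = -\sum_{i=1}^{C} y_i \log \hat{y}_i
```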
    </sec>
    <sec id="sec-9">
      <title>4.4 Performance metrics</title>
      <p>While a common evaluation technique among classification tasks is the top-5 error rate, in this research
classification is done over only 7 classes; therefore the top-1 error rate (also known as test accuracy) will be
used. Test accuracy represents the ratio of the number of correctly classified test data points to the whole test
batch. The test dataset contained 10 images, while the validation dataset contained 20 images before augmentation.
Accuracy will be observed over 200 epochs to track the model’s learning progress and assess its performance
stability. This duration allows for sufficient training iterations without risking overfitting or excessive
computational cost. Early stopping isn’t used, because this research is focused on tracking and
comparing model performance, especially in the context of overfitting predisposition. Monitoring the curve
helps to identify when the model converges and whether further training or adjustments are needed.
Below is the formula for calculating test accuracy.</p>
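      <p>In standard notation, the formula reads:</p>

```latex
\mathrm{Test\ Accuracy} = \frac{\text{number of correctly classified test images}}{\text{total number of test images}}
```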
      <p>Also, to better understand model mispredictions across classes and specific feature types, a confusion
matrix will be used. It shows the True Positive Rate for each class and the False Negative Rates with respect
to all other classes[19]. It may be represented by a formula:</p>
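      <p>A sketch of how such a matrix and the per-class rates can be computed (a NumPy version of what scikit-learn’s confusion_matrix helper does; rows are true classes, columns are predicted classes):</p>

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes: int) -> np.ndarray:
    """Count how often true class i was predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def true_positive_rate(cm: np.ndarray) -> np.ndarray:
    """Per-class TPR: diagonal entry divided by the row sum."""
    return np.diag(cm) / cm.sum(axis=1)

# Toy labels for 3 classes (the study uses 7)
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
tpr = true_positive_rate(cm)
```

      <p>Off-diagonal entries in a row show which classes the model confuses with the true one, which is exactly the per-feature-type analysis used in section 5.</p>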
    </sec>
    <sec id="sec-10">
      <title>5. Obtained Results</title>
      <sec id="sec-10-1">
        <title>Convolutional-pooling layer pairs</title>
        <p>Firstly, the performance of different numbers of pairs of Convolutional-MaxPooling layers was tested.
The highest benchmark of 72.006% was achieved by the model with 4 pairs of Conv2D-MaxPool2D layers,
while CMx2 and CMx3 were able to achieve almost the same results.</p>
        <p>It is notable that even a CNN with one pair of convolutional-pooling layers was able to capture some
simple and middle-level features – Horn gesture (1) and Inverse Pentagram (2) significantly differ from the
others. At the same time, it is hard for the first model to distinguish between classes with similar low-level
features, like Inverse Pentagram (2) and Pentagram (5), and high-level features aren’t captured at all
(Punks-Metalists, 3-4).</p>
        <p>The second model, with two pairs of feature-extraction layers, provides much better results: it can
distinguish high-level features and similar low-level features. However, it still has a small problem in
mispredicting Pentagram (5) as Inverse Pentagram (2) or Anarchy (1). The third model tends to mispredict
high-level features in Metalists (3) and Punks (4), while the fourth model behaves similarly to the second one,
despite Pentagram (5) being misclassified more often as Inverse Pentagram (2). Consequently, having 2 pairs
of feature-extracting layers is enough to capture all types of features; increasing their number doesn’t bring
any improvement and may even decrease performance. This may be caused by having too many MaxPooling2D
layers, which reduce some useful but small features that therefore can’t be recognized by subsequent kernel
sliding. To address this problem, only an extra Conv2D layer will be added in different parts of the
2xCM model.</p>
      </sec>
      <sec id="sec-10-2">
        <title>Adding an extra convolutional layer and Dense layers</title>
        <p>It may be seen already from the testing curve that model 2 with an additional Conv2D layer inside
captures too much noise, while adding a Conv2D layer at the end helped to reach a benchmark of 83.382%.
The next option to try is to increase the number of Dense layers to make the model better understand the
features it extracted.</p>
        <p>While there are almost no differences in confusion matrices between adding one or two layers, they
significantly differ from the original model by building a much stronger relationship between groups of
features and classes. In particular, they were much more successful in distinguishing high-level features
and objects with similar low-level features. The models with one and two additional Dense layers approached
benchmarks of 89.9496% and 91.3690% respectively.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>6. Conclusion</title>
      <p>In this work, the convolutional approach was studied from scratch, and it was examined how modifying
its parts influences the model’s performance. The biggest attention was given to CNN application in
classification tasks that involve an elusive dataset of questionable quality. In particular, the result of this
research may be used in youth-symbol classification tasks. A self-collected dataset was used, which helped
to provide the necessary evidence. The dataset consisted of 700 images belonging to 7 classes, while
different classes contained only low-level, low- and middle-level, or all levels of possible features.
Additionally, different augmentation approaches were applied, which helped to increase model accuracy by
20% (in relative score). The obtained experimental results showed that to extract more features, and to
better distinguish similar features, an additional convolutional layer should be added. Adding Pooling and
Dense layers helps to prevent overfitting in the long run. However, increasing only one part of the model
(i.e. the part responsible for feature extraction or for feature manipulation) leads back to model overfitting.
During the experiments, different combinations of layers were tried to find the most suitable ones. The
results showed that a model consisting of two pairs of Conv2D-MaxPooling2D layers, followed by a Conv2D
layer and three Dense layers, was able to achieve more than a 91% accuracy benchmark. To further optimize
the results, more extensive research should be conducted; in particular, high attention should be paid to
Recurrent Neural Networks and existing solutions in Image Segmentation tasks[21-23].</p>
      <p>7. References</p>
      <p>[1] C. Ranjeeth Kumar, Saranya N, M. Priyadharshini, Derrick Gilchrist E, Kaleel Rahman M, “Face
recognition using CNN and siamese network”, 2023, https://doi.org/10.1016/j.measen.2023.100800</p>
      <p>[2] V. Kubytskyi, T. Panchenko, “An Effective Approach to Image Embeddings for E-Commerce”, 2023,
https://ceur-ws.org/Vol-3347/Short_5.pdf</p>
      <p>[3] A. Azulay, Y. Weiss, “Why do deep convolutional networks generalize so poorly to small image
transformations?”, 2019, https://jmlr.org/papers/volume20/19-519/19-519.pdf</p>
      <p>[4] K. Fukushima, S. Miyake “Neocognitron: A self-organizing neural network model for a
mechanism of visual pattern recognition.”, in “Competition and cooperation in neural nets”, 1982,
https://link.springer.com/book/10.1007/978-3-642-46466-9</p>
      <p>[5] Matthew D. Zeiler, Rob Fergus, “Visualizing and understanding convolutional networks”, in
“European conference on computer vision”, 2014,
https://link.springer.com/chapter/10.1007/978-3-319-10590-1_53</p>
      <p>[6] Tejaswi Potluri, Somavarapu Jahnavi, Ravikanth Motupalli, “Mobilenet V2-FCD: Fake Currency
Note Detection”, 2021, https://link.springer.com/chapter/10.1007/978-981-16-3660-8_26</p>
      <p>[7] Chiheb Chebbi, “Mastering Machine Learning for Penetration Testing”, 2018, Packt Publishing,
https://github.com/PacktPublishing/Mastering-Machine-Learning-for-Penetration-Testing</p>
      <p>[8] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel,
“Backpropagation applied to handwritten zip code recognition”, Neural Computation, 1989,
https://doi.org/10.1162%2Fneco.1989.1.4.541</p>
      <p>[9] D. Cireşan, U. Meier, J. Schmidhuber, “Multi-column deep neural networks for image
classification”, 2012, https://doi.org/10.48550/arXiv.1202.2745</p>
      <p>[10] Xiu-Shen Wei, Yi-Zhe Song, Oisin Mac Aodha, Jianxin Wu, Yuxin Peng, Jinhui Tang, Jian Yang,
Serge Belongie, “Fine-Grained Image Analysis with Deep Learning: A Survey”, 2021, IEEE,
https://doi.org/10.48550/arXiv.2111.06119</p>
      <p>[11] A. Krizhevsky, I. Sutskever, Geoffrey E. Hinton, “ImageNet Classification with Deep
Convolutional Neural Networks”, 2012,
https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf</p>
      <p>[12] K. Simonyan, A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image
Recognition”, 2014, https://doi.org/10.48550/arXiv.1409.1556</p>
      <p>[13] Haiguang Wen, Junxing Shi, Yizhen Zhang, Kun-Han Lu, Jiayue Cao, Zhongming Liu, “Neural
Encoding and Decoding with Deep Learning for Dynamic Natural Vision”, 2016,
https://doi.org/10.1093/cercor/bhx268</p>
      <p>[14] Bing Xu, Naiyan Wang, Tianqi Chen, Mu Li, “Empirical Evaluation of Rectified Activations in
Convolution Network”, 2015, https://doi.org/10.48550/arXiv.1505.00853</p>
      <p>[15] C. E. Nwankpa, W. Ijomah, A. Gachagan, S. Marshall, “Activation Functions: Comparison of
Trends in Practice and Research for Deep Learning”, 2018, https://doi.org/10.48550/arXiv.1811.03378</p>
      <p>[16] S. R. Dubey, S. K. Singh, B. B. Chaudhuri, “Activation Functions in Deep Learning: A
Comprehensive Survey and Benchmark”, 2022, https://doi.org/10.48550/arXiv.2109.14545</p>
      <p>[17] Alex Hernandez-Garcia, “Data augmentation and image understanding”, 2020,
https://doi.org/10.48550/arXiv.2012.14185</p>
      <p>[18] S. Pandian, “A Comprehensive Guide on Hyperparameter Tuning and its Techniques”, 2022,
https://www.analyticsvidhya.com/blog/2022/02/a-comprehensive-guide-on-hyperparameter-tuning-and-its-techniques/</p>
      <p>[19] Margherita Grandini, Enrico Bagli, Giorgio Visani, “Metrics for multi-class classification: an
overview”, 2020, https://doi.org/10.48550/arXiv.2008.05756</p>
      <p>[20] Wen Zhu, Nancy Zeng, Ning Wang, “Sensitivity, Specificity, Accuracy, Associated Confidence
Interval and ROC Analysis with Practical SAS® Implementations”, 2010,
https://lexjansen.com/nesug/nesug10/hl/hl07.pdf</p>
      <p>[21] Song Yuheng, Yan Hao, “Image Segmentation Algorithms Overview”, 2017,
https://doi.org/10.48550/arXiv.1707.02051</p>
      <p>[22] Hebei Li, Yueyi Zhang, Zhiwei Xiong, Zheng-jun Zha, Xiaoyan Sun,“Deep Spiking-UNet for
Image Processing”, 2023, https://doi.org/10.48550/arXiv.2307.10974</p>
      <p>[23] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, Wei Xu, “CNN-RNN: A
Unified Framework for Multi-label Image Classification”,2016, https://doi.org/10.48550/arXiv.1604.04573</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>