<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exponential data augmentation methods for improving YOLO performance in computer vision tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yurii Myshkovskyi</string-name>
          <email>yurii.i.myshkovskyi@lpnu.ua</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariia Nazarkevych</string-name>
          <email>mariia.a.nazarkevych@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Victoria Vysotska</string-name>
          <email>victoria.a.vysotska@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rostyslav Yurynets</string-name>
          <email>rostyslav.v.yurynets@lpnu.ua</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ivan Franko University of Lviv</institution>
          ,
          <addr-line>Lviv, 1 Universytetska str., 79007 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>12 S. Bandery str., 79000 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>120</fpage>
      <lpage>132</lpage>
      <abstract>
        <p>This article examines data augmentation methods in the task of image recognition, specifically introducing the exponential augmentation approach to enhance the performance of deep neural networks, particularly YOLO, in object detection tasks. The proposed methodology is based on the sequential and repeated application of various transformations, including horizontal and vertical flipping, 90° rotation, Gaussian Blur, brightness and contrast adjustment. This approach ensures exponential dataset growth and significantly increases the diversity of training data, which is critical for improving the model's generalisation capability. Experimental results demonstrate that applying exponential augmentation leads to a significant improvement in detection performance, as indicated by increased mean Average Precision (mAP), Precision, and Recall, even when the initial dataset is limited. Additionally, the integration of the proposed approach with other effective augmentation techniques, such as Mosaic and MixUp, has been explored. The results indicate that combining exponential augmentation with these methods leads to more robust models that can better recognise objects under different lighting conditions, viewpoints, and noise levels. Beyond accuracy analysis, the study also investigates the impact of exponential augmentation on training stability, including the convergence speed of gradient descent and resistance to overfitting. It is shown that multiple data enrichment cycles allow neural networks to adapt more efficiently to challenging conditions and reduce the likelihood of memorising only specific examples from the training set. The proposed method can be particularly useful in computer vision tasks with limited or imbalanced datasets, as well as in scenarios where improving model accuracy is required without significantly increasing computational costs. 
The obtained results confirm that exponential augmentation is a promising approach for enhancing the performance of YOLO and other modern object detectors in complex image recognition scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>exponential augmentation</kwd>
        <kwd>YOLO</kwd>
        <kwd>object detection</kwd>
        <kwd>computer vision</kwd>
        <kwd>small datasets</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Modern computer vision methods, in particular the YOLO (You Only Look Once) architecture, have
become widespread in object detection and classification tasks. They have been successfully
applied in various fields, including autonomous driving, video surveillance systems, and robotics.
Despite significant progress in improving deep learning models, the diversity and volume of
training data remain a critical issue. An insufficient number of images or an unrepresentative
distribution of them in the dataset can reduce the accuracy and reliability of object detection,
especially under difficult shooting conditions (changes in angle, lighting, noise, etc.).</p>
      <p>One of the most common approaches to dealing with data limitations is augmentation, which is
an artificial increase in the volume and diversity of the dataset using geometric and colour
transformations. However, most existing methods involve either a simple random application of
transformations or a limited set of operations, which does not always guarantee a significant
improvement in the quality of training. In addition, in many tasks, it is crucial to ensure that all
image variants are balanced so that the model learns on the full range of possible situations and
avoids overfocusing on certain types of data.</p>
      <p>To address these challenges, an exponential data augmentation method for YOLO is proposed,
which expands the training set of images step by step. At each step, a transformation is applied
(horizontal or vertical reflection, 90° rotation, blurring, etc.), and the resulting new images are
themselves passed on to the subsequent augmentation stages. This approach increases the number of
distinct examples exponentially and can substantially improve the generalisability of the model.
In addition, applying more advanced techniques (Mosaic, MixUp, etc.) at the final stages further
enhances the result by combining several images into one and training the network to recognise
objects in unusual, artificially created scenes.</p>
      <p>The scientific novelty lies in the creation and experimental confirmation of the effectiveness of a
step-by-step “exponential” expansion of training data, which involves the sequential application of
transformations to already augmented images. This makes it possible to cover a variety of possible
angles, lighting and distortions more evenly and comprehensively. Unlike traditional random or
one-step augmentations, the proposed approach generates a much wider sample of training
examples, increasing the generalisability of YOLO and thus providing better detection results in
real-world applications.</p>
      <p>The relevance of the proposed topic stems from the growing need to improve the reliability
and accuracy of computer vision systems, especially when training sets are limited or unbalanced.
The proposed exponential augmentation approach addresses this need by effectively expanding the
space of training examples and increasing the robustness of models to input data variability.
This makes the study a contribution to the development of applied deep learning that is
practically relevant for a wide range of applications in which the accuracy of object detection
is critical.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Analysis of the latest research and publications</title>
      <p>
        The idea of a holistic (one-step) approach to object detection was first proposed in the
original YOLO (You Only Look Once) paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This model showed that a convolutional neural network can
directly predict the coordinates and classes of objects without sequentially dividing the task into
feature extraction and classification. The next important step was the release of YOLOv4 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which
paid increased attention to both the detector architecture and augmentation methods such as
Mosaic. With this combination, the model achieved a balance on the COCO dataset between an accuracy
of ~43.5% mAP and a speed of ~65 FPS, which was a significant improvement over previous versions
of YOLO.
      </p>
      <p>
        Further development in this direction has taken place in Scaled-YOLOv4 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This paper
demonstrates that the model can be scaled both towards lightweight versions (YOLOv4-tiny, with an
mAP of about 20–25%) and towards larger versions (YOLOv4-large, exceeding 55% mAP). This provides
a flexible choice of trade-off between Precision, Recall, mAP and speed. According to the
authors, augmentation plays an important role in these results, as it increases the variety of
training examples and reduces the risk of overfitting.
      </p>
      <p>
        In the context of the variety of augmentation methods, the Albumentations library stands out
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It covers more than 70 different transformations, from basic ones (shifts, reflections, rotations) to
complex ones (CutOut, GridDistortion, etc.). A significant advantage of Albumentations is its
optimisation for OpenCV and NumPy, which allows faster processing of large image sets. More
specialised compositional techniques include MixUp [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which linearly mixes pixels of two
examples, and CutMix [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which cuts random parts of one image and pastes them into another. In
classification tasks, both of these methods have demonstrated an increase in generalisability;
however, in detection, their effectiveness requires more fine-tuning so that the model does not lose
context for object localisation.
      </p>
      <p>
        A separate area is approaches to synthesis or random distortion of a part of the image. For
example, Random Erasing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] paints over random fragments, simulating the conditions of partial
shading or frame damage. The Copy-Paste method [
        <xref ref-type="bibr" rid="ref7">7</xref>
          ] proposes “cutting” entire objects from one image
and pasting them into another, which is especially useful for increasing the number of samples of
complex scenes. Such techniques further enhance the ability of models to recognise objects under
non-standard conditions and add examples to the training set that are not present in the original
data.
      </p>
      <p>
        In a series of review papers [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], researchers provide an overview of modern
augmentation methods for deep learning, emphasising the importance of this process in the
context of limited or non-standard datasets. The authors emphasise that the right augmentation
strategy increases mAP and Precision, as the model is able to “see” more variations in potential
objects and background situations.
      </p>
      <p>
        In addition, the works “AutoAugment: Learning augmentation policies from data” [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and
“AutoAugment: Learning augmentation strategies from data” [11] demonstrated that an automatic
search for optimal augmentation policies can significantly
improve the accuracy of detectors. For example, in experiments with the RetinaNet model on
COCO, mAP increased by about 2–3% solely through a carefully selected combination of
transformations, without any changes to the architecture or the hyperparameters of the optimiser. This
approach is particularly useful when the researcher cannot manually search through all
augmentation options.
      </p>
      <p>At the same time, [12] demonstrated that knowledge distillation techniques can efficiently
transfer high-level representations from larger networks to smaller ones while maintaining
competitive accuracy. In the future, this approach could be combined with advanced augmentation to
create more compact yet robust detectors that are resistant to various distortions.</p>
      <p>
        Studying a method that exponentially increases the diversity of the training set
is therefore a logical step, since the findings of [
        <xref ref-type="bibr" rid="ref5">5, 13, 14</xref>
        ], “AutoAugment: Learning augmentation
policies from data” [15] and [16] show that augmentation can significantly improve the
mAP, Recall and Precision of models in computer vision tasks. When properly
configured with sequential transformations, this approach can significantly improve results even on relatively
small datasets, expanding the range of possible scenarios and reducing the risk of overfitting.
      </p>
      <p>The object of the study is the process of training YOLO networks in object detection tasks,
including data preparation, model parameter optimisation, and the general YOLO architecture for
different types of images.</p>
      <p>The subject of the study is the method of exponential data augmentation, which includes
multiple geometric and colour transformations of images, as well as their combination (Mosaic,
MixUp) to improve the quality of training and generalisation of YOLO networks.</p>
      <p>The aim of this study is to develop, substantiate and experimentally test the method of
exponential data augmentation to improve the accuracy and generalisability of YOLO models in
object detection tasks. The study is aimed at identifying the impact of gradual sample expansion
using multiple augmentations on the quality of detector training, in particular, on the mAP (Mean
Average Precision), Precision, and Recall indicators. Achieving the research goal involves solving
the following tasks:</p>
      <list list-type="bullet">
        <list-item>
          <p>Studying the effect of exponential augmentation on the efficiency of YOLO training by
conducting a series of experiments with different sample expansion configurations.</p>
        </list-item>
        <list-item>
          <p>Comparing the effectiveness of the developed method with basic or random
augmentation strategies.</p>
        </list-item>
        <list-item>
          <p>Studying the impact of the proposed methodology on the stability of the training process, in
particular the rate of convergence of gradient descent and the model’s resistance to
overfitting.</p>
        </list-item>
      </list>
      <p>It is expected that the proposed exponential augmentation method will significantly improve
the accuracy of YOLO detection without the need to expand the original dataset. In addition, its
application can be especially useful for computer vision tasks where there are limitations on the
number of available training images or where the dataset contains an imbalance of classes.</p>
      <p>The results of this research can be applied in a wide range of real-world scenarios, including
video surveillance, autonomous driving, military and industrial applications where the accuracy
and robustness of the object detector to changing shooting conditions are important [17].</p>
      <p>
        The exponential increase in the training dataset through the sequential application of various
transformations is a logical development of the ideas of classical augmentation, which has long
been proven effective in improving the accuracy and robustness of computer vision models [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
The essence of the exponential approach is that the results of the previous step (new images)
become the “input” to the next stage of augmentation, which leads to a geometric (exponential)
increase in the total number of examples.
      </p>
      <sec id="sec-2-1">
        <title>2.1. The concept of data augmentation</title>
        <p>
          Classical random augmentations (rotations, shifts, brightness changes) often do not cover the full
range of spatial and colour distortions [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], while their sequential combination significantly expands
the possible variations. At the same time, the more unique images the model sees, the lower the
risk of memorising particular samples, and thus the higher the generalisability, which is
confirmed by a number of studies within YOLO [12, 16]. In addition, exponential augmentation
can be applied to any dataset, does not require complex setup, and consists of basic transformations
(flip, rotate, blur, etc.) that can be easily implemented using libraries such as Albumentations [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Step-by-step implementation of exponential data augmentation</title>
        <p>The first step is a simple copying of the original images and YOLO label files from the
dataset_converted/ directory to the dataset_converted_augmented/ directory, while maintaining
the division into train, valid and test. At this stage, no transformation is performed—only a basic
set of files is created, on which augmentation will be performed later. This approach is consistent
with the practice of researchers first saving the “clean” original data and then working with a copy
of it [13].</p>
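        <p>The copy step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' script: the function name copy_dataset is hypothetical, and the sketch assumes that each split (train, valid, test) is a subdirectory holding the images and label files.</p>
        <preformat><![CDATA[
```python
import shutil
from pathlib import Path


def copy_dataset(src, dst, splits=("train", "valid", "test")):
    """Copy each split directory unchanged and return the number of source files."""
    src, dst = Path(src), Path(dst)
    copied = 0
    for split in splits:
        split_src = src / split
        if not split_src.is_dir():
            continue  # tolerate datasets that lack one of the splits
        # dirs_exist_ok (Python 3.8+) lets the copy be re-run safely.
        shutil.copytree(split_src, dst / split, dirs_exist_ok=True)
        copied += sum(1 for path in split_src.rglob("*") if path.is_file())
    return copied
```
]]></preformat>
        <p>A call such as copy_dataset("dataset_converted", "dataset_converted_augmented") would then leave the original data untouched and create the working copy on which all later steps operate.</p>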
        <p>The horizontal flip (A.HorizontalFlip) mirrors the images from left to right and
doubles the set: each image is read, flipped, and saved with
the _hflip suffix (labels are processed in the same way). The
vertical flip (A.VerticalFlip) works identically but mirrors from top to bottom and
reads all files, including those already flipped horizontally, which causes another doubling
(4× the original number). The 90° rotation (A.Rotate(limit=(90, 90))) changes the image orientation and
provides new viewing angles: all images are rotated by 90° clockwise and saved with the _rot90
suffix, giving 8× the original volume. Finally, Gaussian blur (A.GaussianBlur(blur_limit=(37, 37)))
simulates adverse shooting conditions such as defocus or camera shake (Buslaev, A. et
al., 2020). Each image is passed through a Gaussian blur filter (37×37 kernel)
and saved with the _blur suffix, which once again doubles the total size (16× the original).
It is important to note that at each step the transformations are applied to all images, including
those created at the previous stage. As a result, the total number of examples grows
exponentially rather than linearly.</p>
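        <p>The doubling logic described above can be illustrated on file names alone. The sketch below is an assumption-laden simplification: the function expand_exponentially and the example stems are hypothetical, and the actual image and label transformations (for example via Albumentations) are deliberately omitted so that only the growth pattern is shown.</p>
        <preformat><![CDATA[
```python
def expand_exponentially(stems, suffixes):
    """Return all file stems after applying every doubling step in order."""
    current = list(stems)
    for suffix in suffixes:
        # Every stem present so far (originals and earlier copies) gets a new copy.
        current = current + [stem + suffix for stem in current]
    return current


STEPS = ["_hflip", "_vflip", "_rot90", "_blur"]
stems = expand_exponentially(["img_000", "img_001"], STEPS)
print(len(stems))  # 2 originals * 2**4 = 32
```
]]></preformat>
        <p>Because each step iterates over everything produced so far, four steps multiply the set by 16, whereas applying each transformation only to the originals would yield just 5× the original number.</p>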
        <p>
          After the “exponential” phase is complete, an additional pass is performed on the current 16×
images, in which RandomBrightnessContrast (p=0.2), RandomGamma (p=0.2), and CoarseDropout
(p=0.5) are applied with a certain probability. Each image receives only one additional copy
(depending on whether the transformations “worked”), which roughly doubles the total number
compared to the folder after the previous steps. In the last step, the entire set is scanned again, and
two advanced augmentations known as Mosaic [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and MixUp are applied with probability
mosaic_prob=0.3 or mixup_prob=0.3. Mosaic randomly selects 4 images (including the current one)
and places them in a 2×2 grid with a given input_size, adjusting the bounding boxes to account for
the offset, which helps the model to “see” composite scenes with different types of objects [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
MixUp, in turn, selects two images and mixes them linearly using the formula
          <disp-formula>
            <label>(1)</label>
            <tex-math>I_{\mathrm{mix}} = \alpha I_1 + (1 - \alpha) I_2,</tex-math>
          </disp-formula>
where I_1 and I_2 denote the two source images and α is a random variable drawn from the Beta(0.5, 0.5) distribution; the bounding boxes of
both images are merged. The added variety of examples reduces the risk of overfitting. Although Mosaic and MixUp
do not increase the number of images as aggressively as flips or rotations, they further expand the
set of possible scenes and angles.
        </p>
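        <p>The MixUp step can be sketched in plain Python as follows. This is an illustrative assumption rather than the authors' implementation: images are represented as nested lists of grayscale values of equal size, the function name mixup is hypothetical, and the blend is taken to be the usual convex combination with weight α.</p>
        <preformat><![CDATA[
```python
import random


def mixup(image_a, image_b, boxes_a, boxes_b, alpha=None):
    """Blend two equally sized grayscale images and merge their box lists."""
    if alpha is None:
        alpha = random.betavariate(0.5, 0.5)  # Beta(0.5, 0.5), as in the text
    mixed = [
        [alpha * pa + (1.0 - alpha) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    # Labels are not interpolated: the mixed image keeps the boxes of both sources.
    return mixed, list(boxes_a) + list(boxes_b), alpha
```
]]></preformat>
        <p>Passing a fixed alpha makes the blend reproducible, which is convenient when checking that the merged labels still match the visible objects.</p>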
        <p>In terms of dataset size, the first step copies the original N images to a new folder without changes,
resulting in N images in each sample. Next, exponential augmentation takes place: four operations
(horizontal flip, vertical flip, 90° rotation, and GaussianBlur, each with probability p = 1.0) are
performed sequentially on all images in the folder, doubling the number each time. The
horizontal flip yields 2N instead of N, the vertical flip doubles this to 4N, the
90° rotation gives 8N, and GaussianBlur brings the total to 16N. At the stage of
random augmentation “in one pass” (the all_at_once_augmentation function), each of the 16N images
receives another copy, which again doubles the number to 32N. Then, for each of the 32N images,
Mosaic is applied with probability 0.3 and MixUp independently with probability 0.3; each of them (if
triggered) generates one new image, and if both are triggered, two new images are obtained for one
original. Thus, with probability 0.7 × 0.7 = 0.49 no additional image is created, with probability
2 × 0.3 × 0.7 = 0.42 exactly one new image is created, and with probability 0.3 × 0.3 = 0.09 two new
images are created. On average, 0.42 + 2 × 0.09 = 0.6 new images are generated per input image,
i.e. 32N + 0.6 × 32N = 51.2N, although the
actual value can range from 32N (when neither augmentation is triggered for any image) to 96N (if both are
triggered for every image). Examples of the obtained images are shown in Figure 1.</p>
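        <p>The size arithmetic above can be checked with a few lines of Python. The function name dataset_growth and its dictionary keys are hypothetical; the sketch assumes, as the probabilities in the text imply, that Mosaic and MixUp are triggered independently for each image.</p>
        <preformat><![CDATA[
```python
def dataset_growth(n, mosaic_prob=0.3, mixup_prob=0.3):
    """Expected dataset size after each stage for n original images."""
    after_exponential = n * 2 ** 4             # hflip, vflip, rot90, blur
    after_random_pass = 2 * after_exponential  # one extra copy per image
    p_none = (1 - mosaic_prob) * (1 - mixup_prob)
    p_two = mosaic_prob * mixup_prob
    p_one = 1 - p_none - p_two
    extra_per_image = p_one + 2 * p_two        # expected new images per input
    return {
        "after_exponential": after_exponential,
        "after_random_pass": after_random_pass,
        "p_none": p_none,
        "p_one": p_one,
        "p_two": p_two,
        "expected_total": after_random_pass * (1 + extra_per_image),
        "min_total": after_random_pass,
        "max_total": 3 * after_random_pass,
    }


growth = dataset_growth(1000)
print(round(growth["expected_total"]))
```
]]></preformat>
        <p>For N = 1000 the stages give 16,000 and 32,000 images, an expected final size of about 51,200, and bounds of 32,000 and 96,000, matching the factors 16N, 32N, 51.2N and 96N.</p>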
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Transforming labels during geometric transformations</title>
        <p>It is critical to correctly convert the coordinates of bounding boxes (labels) during all operations
that change the geometry of the image. If a flip or rotation is performed without proper label
correction, the detector will receive incorrect training data.</p>
        <p>
          In the YOLO format, each label is described by five numbers, (class, x, y, w, h), where x and y are
the coordinates of the centre of the object and w and h are the width and height of the bounding box,
normalised by the size of the image (i.e. in the range [0, 1]). Below are the transformation formulas.
        </p>
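        <p>As an illustration of this format, the sketch below parses one label line and converts it to pixel corner coordinates. The function names parse_yolo_line and to_pixel_box are hypothetical, and the whitespace-separated layout "class x y w h" is assumed as described in the text.</p>
        <preformat><![CDATA[
```python
def parse_yolo_line(line):
    """Parse 'class x y w h' into (int, float, float, float, float)."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    cls = int(parts[0])
    x, y, w, h = (float(value) for value in parts[1:])
    for value in (x, y, w, h):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"normalised value out of [0, 1]: {value}")
    return cls, x, y, w, h


def to_pixel_box(label, img_w, img_h):
    """Convert a normalised centre-format label to (left, top, right, bottom) in pixels."""
    _, x, y, w, h = label
    return (
        (x - w / 2) * img_w,
        (y - h / 2) * img_h,
        (x + w / 2) * img_w,
        (y + h / 2) * img_h,
    )
```
]]></preformat>
        <p>Validating the range at parse time catches labels that were written in pixel units by mistake before they reach the augmentation steps.</p>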
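        <p>Horizontal flip: When an image is mirrored from left to right (horizontal flip), the box (x, y, w, h) goes to
          <disp-formula>
            <tex-math>x' = 1 - x,\quad y' = y,\quad w' = w,\quad h' = h.</tex-math>
          </disp-formula>
The width and height of the object do not change, as the image is simply mirrored.</p>
        <p>Vertical flip: Mirroring from top to bottom changes the y-coordinate:
          <disp-formula>
            <tex-math>x' = x,\quad y' = 1 - y,\quad w' = w,\quad h' = h.</tex-math>
          </disp-formula>
        </p>
        <p>Rotate by 90°: During a 90° rotation, the coordinates (x, y) are transformed using
the formula:
          <disp-formula>
            <tex-math>x' = y,\quad y' = 1 - x,\quad w' = h,\quad h' = w.</tex-math>
          </disp-formula>
The width and height w and h are also interchanged here. With the image origin in the top-left corner and
the y-axis pointing down, this mapping moves the top-left corner (0, 0) to the bottom-left corner (0, 1), which
corresponds to a counter-clockwise turn as displayed; for a clockwise turn the mapping would be
x' = 1 − y, y' = x, so the formula must match the direction actually produced by the rotation routine.</p>
        <p>Photometric transformations (blur, contrast, etc.): If the augmentation only changes pixels
(brightness, contrast, blur, fill, fragmentation), without affecting spatial dimensions, then the
bounding boxes remain unchanged:
          <disp-formula>
            <tex-math>x' = x,\quad y' = y,\quad w' = w,\quad h' = h.</tex-math>
          </disp-formula>
        </p>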
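        <p>The label mappings for flips and the 90° rotation can be written as small pure functions. This is a minimal sketch with hypothetical function names; labels are assumed to be tuples (class, x, y, w, h) in normalised YOLO format, and the rotation uses the mapping stated in the text.</p>
        <preformat><![CDATA[
```python
def hflip_label(label):
    """Horizontal flip: x' = 1 - x, everything else unchanged."""
    cls, x, y, w, h = label
    return (cls, 1.0 - x, y, w, h)


def vflip_label(label):
    """Vertical flip: y' = 1 - y, everything else unchanged."""
    cls, x, y, w, h = label
    return (cls, x, 1.0 - y, w, h)


def rot90_label(label):
    """90-degree rotation as given in the text: x' = y, y' = 1 - x, w and h swap."""
    cls, x, y, w, h = label
    return (cls, y, 1.0 - x, h, w)


def photometric_label(label):
    """Blur, brightness or contrast changes leave the box untouched."""
    return label
```
]]></preformat>
        <p>Two quick sanity checks follow from the formulas: applying either flip twice, or the rotation four times, must return the original label.</p>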
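        <p>Label correction for Mosaic: For Mosaic (or similar “collages”), the logic is more complex and
requires a transition to absolute coordinates. If each image is reduced or enlarged to
input_size×input_size, then:
          <disp-formula>
            <tex-math>x_{\mathrm{abs}} = x \cdot \mathrm{input\_size},\quad y_{\mathrm{abs}} = y \cdot \mathrm{input\_size},\quad w_{\mathrm{abs}} = w \cdot \mathrm{input\_size},\quad h_{\mathrm{abs}} = h \cdot \mathrm{input\_size}.</tex-math>
          </disp-formula>
        </p>
        <p>When placing 4 images in a 2×2 matrix, each image has an offset O_x, O_y for its quadrant:
          <disp-formula>
            <tex-math>x'_{\mathrm{abs}} = x_{\mathrm{abs}} + O_x,\quad y'_{\mathrm{abs}} = y_{\mathrm{abs}} + O_y.</tex-math>
          </disp-formula>
        </p>
        <p>The resulting collage (of size 2⋅input_size×2⋅input_size) is usually reduced back to
input_size×input_size. Accordingly:
          <disp-formula>
            <tex-math>x_{\mathrm{final}} = \frac{x'_{\mathrm{abs}}}{2 \cdot \mathrm{input\_size}},\quad y_{\mathrm{final}} = \frac{y'_{\mathrm{abs}}}{2 \cdot \mathrm{input\_size}},\quad w_{\mathrm{final}} = \frac{w_{\mathrm{abs}}}{2 \cdot \mathrm{input\_size}},\quad h_{\mathrm{final}} = \frac{h_{\mathrm{abs}}}{2 \cdot \mathrm{input\_size}}.</tex-math>
          </disp-formula>
        </p>
        <p>
          Label correction for MixUp: For MixUp (or similar techniques like CutMix [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]), both images are
usually pre-scaled and overlaid on top of each other. The pixels are blended using formula (1),
and the box sets are simply merged:
          <disp-formula>
            <label>(9)</label>
            <tex-math>\mathrm{AllBoxes} = \mathrm{Boxes}_1 \cup \mathrm{Boxes}_2.</tex-math>
          </disp-formula>
        </p>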
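        <p>The Mosaic label correction can be sketched as follows. The function name mosaic_label and the (column, row) quadrant convention are assumptions made for illustration; the sketch also assumes that each tile is resized to exactly input_size×input_size, so that the offsets O_x and O_y are multiples of input_size, and it omits clipping and random crop centres that practical implementations often add.</p>
        <preformat><![CDATA[
```python
def mosaic_label(label, quadrant, input_size=640):
    """Map a normalised label of one tile into the normalised 2x2 mosaic."""
    cls, x, y, w, h = label
    col, row = quadrant  # (0, 0) top-left, (1, 0) top-right, (0, 1) bottom-left, (1, 1) bottom-right
    offset_x, offset_y = col * input_size, row * input_size
    # Absolute coordinates inside the tile, shifted by the quadrant offset.
    x_abs = x * input_size + offset_x
    y_abs = y * input_size + offset_y
    w_abs = w * input_size
    h_abs = h * input_size
    # The 2*input_size canvas is scaled back to input_size, so divide by the canvas size.
    canvas = 2 * input_size
    return (cls, x_abs / canvas, y_abs / canvas, w_abs / canvas, h_abs / canvas)


def merge_boxes(boxes_1, boxes_2):
    """MixUp keeps the boxes of both source images."""
    return list(boxes_1) + list(boxes_2)
```
]]></preformat>
        <p>For example, a box centred in the top-right tile with half the tile's width and height ends up centred at (0.75, 0.25) with width and height 0.25 in the final image.</p>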
        <p>Therefore, each geometric transformation requires a strict correspondence between the
transformed image and its annotation. Errors in the implementation inevitably lead to a
deterioration in the detector’s accuracy.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. YOLO model</title>
        <p>
          YOLO (You Only Look Once) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is one of the most popular deep learning architectures for
real-time object detection tasks. The main idea behind the model is a one-step approach: the image is
divided into a grid, and for each cell, the network predicts the object’s coordinates, class, and
confidence in its presence. This allows high image processing speed while maintaining a
high level of accuracy, which is especially valuable for real-time applications.
        </p>
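        <p>The grid idea can be illustrated by finding the cell that contains an object's centre. This sketch is only a conceptual illustration: the function name responsible_cell is hypothetical, the default grid size of 7 follows the original YOLO paper, and later versions such as YOLOv8 predict on several feature-map scales with their own assignment rules.</p>
        <preformat><![CDATA[
```python
def responsible_cell(x, y, grid_size=7):
    """Return (row, col) of the grid cell containing the normalised centre (x, y)."""
    # A centre lying exactly on the right or bottom border is clamped to the last cell.
    col = min(int(x * grid_size), grid_size - 1)
    row = min(int(y * grid_size), grid_size - 1)
    return row, col
```
]]></preformat>
        <p>Because the cell index depends only on the normalised centre, any augmentation that moves the centre (flip, rotation, Mosaic) also changes which cell is expected to predict the object, which is one reason label correction matters.</p>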
        <p>In the basic YOLO implementation, the input image is scaled to a fixed size (e.g. 416×416 or
640×640 pixels) and fed to a convolutional neural network consisting of several feature processing
units. The architecture uses a backbone (e.g., CSPDarknet in YOLOv4/YOLOv5 versions), a neck
(PANet or FPN), and a head to predict boundaries and object classes.</p>
        <p>One of the key advantages of YOLO is that it performs object detection in just one run of the
image through the model, unlike two-step approaches (such as Faster R-CNN) that first generate
candidate regions and then classify them. This results in outstanding performance and
compactness, allowing YOLO to be used even on devices with limited computing resources.</p>
        <p>Today, there are several versions of YOLO, from the original YOLOv1 to the more recent YOLOv8.
This study used the YOLOv8s (small) version, which provides a good balance between accuracy
and performance, making it suitable for training with augmented datasets even on
medium-performance systems. YOLOv8s has an improved architecture compared to previous versions,
including a modular structure and block optimisation for more efficient spatial feature
learning. The model accepts data in the YOLO annotation format (.txt files with
normalised rectangle coordinates), which facilitates integration with various image sets and
automated augmentation pipelines. YOLO also supports training with advanced strategies such as
Mosaic and MixUp and, in this case, exponential augmentation, which makes it possible to significantly
increase the number of training examples and improve detection results on test samples.</p>
        <p>With its combination of speed, accuracy and flexibility, YOLO is a well-suited platform for
experimentally investigating the effect of augmentation on the performance of an object detector.
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Research metrics</title>
        <p>
          According to [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and [11], one of the key metrics for evaluating pattern recognition systems is
Accuracy, which is the ratio of correctly identified examples to their total number. An
important addition is the confusion (error) matrix, which helps to visualise how often the model confuses
classes with each other. Precision characterises the quality of the positive
predictions (the share of predicted positives that are correct), while Recall shows how many of the
actual positive samples the model finds.
The F-measure (F1), in turn, is the harmonic mean of Precision and Recall and balances the two values, allowing
for a comprehensive assessment of the algorithm’s performance.
        </p>
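        <p>These metrics can be computed from the counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). The sketch below uses the standard definitions; the function name detection_metrics is hypothetical, and returning 0.0 for an empty denominator is a convention chosen here for illustration.</p>
        <preformat><![CDATA[
```python
def detection_metrics(tp, fp, fn, tn=0):
    """Precision, Recall, F1 and Accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```
]]></preformat>
        <p>In object detection, true negatives are usually not counted (the background has no boxes), so Accuracy is meaningful mainly when background is treated as an explicit class, as in the confusion matrices discussed later.</p>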
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Results</title>
        <p>
          The proposed method of exponential augmentation, which uses basic flips, rotations and blurring, as
well as more complex techniques (Mosaic, MixUp), makes it possible to significantly increase the variety of
training examples. Unlike single-stage or random augmentation, where transformations are applied
only once, the consistent “multiplication” of the set increases the model’s chances of “seeing” a
wide variety of scenes, angles, and shooting conditions.
This methodology is consistent with the findings of the studies [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and [12], which emphasise
the importance of multidimensional augmentation during YOLO training. Accurate label correction
is an essential component for maintaining detector accuracy, which has been repeatedly
emphasised in augmentation work [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], [13], [18]. The following sections present the
experimental results of the described scheme and compare it with simpler (one-step) augmentation
strategies.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiment and results</title>
      <p>
        To evaluate the impact of exponential augmentation on the detector accuracy, a number of
experiments were conducted with YOLOv8s, based on the ideas of YOLO [
        <xref ref-type="bibr" rid="ref1 ref9">1, 9, 12</xref>
        ]. All input
images were scaled to 640×640, and training continued until gradient descent stopped
producing tangible progress. First, we considered a scenario without augmentation, in which the
model was trained for 75 epochs with a batch size of 16, using 5,000 images for training and
validation, and limited to basic preprocessing (normalisation, scaling). Next, we tested a
configuration with exponential augmentation using stepwise transformations (horizontal/vertical
flip, 90-degree rotation), as well as all-at-once random transformations, Mosaic and MixUp [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4, 12</xref>
        ]; in this case, training lasted 22 epochs with a batch size of 64 and covered a total of 220,000
images for training and validation. Importantly, in order to select the best model, the validation
sample was also expanded with the same augmentations, so the detector was evaluated
on a much wider range of scenes and angles than in the baseline “no augmentation” mode.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Results without augmentation</title>
        <p>The YOLOv8s model, trained without exponentially increasing the dataset, demonstrates the
following features:</p>
        <p>Precision and recall averaged 0.88–0.91 depending on the class, and mAP@50 was around 0.90.</p>
        <p>The normalised error matrix indicates about 0.87 correct classifications for the bmp class, about
0.89 for btr and about 0.95 for tank.</p>
        <p>At the same time, a significant number of objects were confused with the background, especially
in cases of partial visibility or fragments of equipment.</p>
        <p>Accuracy for all classes (including background) reached approximately 0.85–0.86, and the
F1measure ranged from 0.85–0.88.</p>
        <p>Overall, the baseline scenario without augmentation recognises most of the key classes
successfully, but the model still confuses some related objects (bmp/btr) and sometimes misidentifies
background as equipment (and vice versa).</p>
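        <p>For reference, the quantities reported in this subsection can all be derived from a confusion matrix. The sketch below uses hypothetical counts (not the experimental data) and a simplified classification-style matrix in which rows are true classes and columns are predicted classes; detection toolkits may orient or normalise the matrix differently.</p>
        <preformat>
```python
from typing import Dict, List

CLASSES = ["bmp", "btr", "tank", "background"]

# Hypothetical counts for illustration only:
# rows are true classes, columns are predicted classes.
CONFUSION = [
    [87, 6, 1, 6],
    [5, 89, 1, 5],
    [1, 1, 95, 3],
    [4, 3, 2, 91],
]


def per_class_metrics(matrix: List[List[int]],
                      classes: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall and F1 for every class."""
    metrics = {}
    for i, name in enumerate(classes):
        true_pos = matrix[i][i]
        actual = sum(matrix[i])                    # all objects of this class
        predicted = sum(row[i] for row in matrix)  # all predictions of this class
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Share of all samples that lie on the main diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


results = per_class_metrics(CONFUSION, CLASSES)
accuracy = overall_accuracy(CONFUSION)
```
        </preformat>
        <p>With a row-normalised matrix, the diagonal entry of a class equals its recall, which is how values such as “0.87 correct classifications for bmp” can be read.</p>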
      </sec>
      <sec id="sec-3-2">
        <title>Results with exponential augmentation</title>
        <p>By applying step-by-step augmentation (flips, rotations, GaussianBlur, Mosaic, MixUp, etc.), the
training set grew tenfold, and the validation set was also augmented (proportionally, with the same
transformations). This allowed the model to see a much wider range of angles, lighting, and
background scenes.</p>
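        <p>As an illustration of the Mosaic step mentioned above, the following simplified sketch tiles four equally sized images into a fixed 2×2 grid and shifts their bounding boxes accordingly. Production implementations typically also use a random mosaic centre, rescaling and cropping; those parts are deliberately left out, and the function name and the box format (class, x1, y1, x2, y2 in pixels) are assumptions made for this example.</p>
        <preformat>
```python
from typing import List, Tuple

Image = List[List[int]]
Box = Tuple[str, int, int, int, int]  # (class, x1, y1, x2, y2) in pixels


def mosaic_2x2(tiles: List[Image],
               boxes_per_tile: List[List[Box]]) -> Tuple[Image, List[Box]]:
    """Tile four equally sized images into one canvas and shift their boxes."""
    if len(tiles) != 4 or len(boxes_per_tile) != 4:
        raise ValueError("exactly four tiles and four box lists are required")
    height = len(tiles[0])
    width = len(tiles[0][0])
    # Concatenate rows: tiles 0 and 1 form the top half, 2 and 3 the bottom half.
    top = [row_a + row_b for row_a, row_b in zip(tiles[0], tiles[1])]
    bottom = [row_c + row_d for row_c, row_d in zip(tiles[2], tiles[3])]
    canvas = top + bottom
    # Pixel offset of each tile inside the canvas.
    offsets = [(0, 0), (width, 0), (0, height), (width, height)]
    boxes = []
    for (dx, dy), tile_boxes in zip(offsets, boxes_per_tile):
        for name, x1, y1, x2, y2 in tile_boxes:
            boxes.append((name, x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return canvas, boxes


# Four 2x2 toy tiles filled with the values 0..3, one box in the last tile.
toy_tiles = [[[k, k], [k, k]] for k in range(4)]
toy_boxes = [[], [], [], [("tank", 0, 0, 1, 1)]]
canvas, boxes = mosaic_2x2(toy_tiles, toy_boxes)
```
        </preformat>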
        <p>Precision and recall increased to 0.90–0.91 and 0.86–0.88, respectively, which confirms a more
balanced recognition of all classes.</p>
        <p>The mAP@50 reached about 0.92 (versus ~0.90 without augmentation), and the mAP@50-95
was about 0.74.</p>
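        <p>The two mAP variants differ only in the IoU threshold at which a detection counts as correct: mAP@50 uses a single threshold of 0.5, while mAP@50-95 follows the COCO convention of averaging over thresholds from 0.50 to 0.95 in steps of 0.05. A minimal sketch of the underlying IoU test is given below; the box format (x1, y1, x2, y2) and the helper names are assumptions made for this example.</p>
        <preformat>
```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# IoU thresholds 0.50, 0.55, ..., 0.95 used by mAP@50-95.
THRESHOLDS = [t / 100 for t in range(50, 100, 5)]


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def thresholds_passed(pred: Box, truth: Box) -> List[float]:
    """Thresholds at which the prediction would count as a true positive."""
    score = iou(pred, truth)
    return [t for t in THRESHOLDS if score >= t]


pred = (0.0, 0.0, 10.0, 10.0)
truth = (2.0, 0.0, 12.0, 10.0)
print(round(iou(pred, truth), 3))           # 0.667
print(len(thresholds_passed(pred, truth)))  # 4 (0.50, 0.55, 0.60, 0.65)
```
        </preformat>
        <p>A box with IoU of about 0.67 is therefore a hit for mAP@50 but a miss at the stricter thresholds, which is why mAP@50-95 is typically noticeably lower than mAP@50.</p>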
        <p>The confusion matrix shows approximately 0.86 correct detections for bmp, 0.93 for btr and 0.97 for
tank. The proportion of confusion with background has decreased, although it remains at
around 0.24–0.45, depending on the class.</p>
        <p>Accuracy increased by 2–3% on average, and F1-measure by 1–2% depending on the class.</p>
        <p>
          Thus, the exponentially enlarged dataset contributed to improved results, with a particularly
noticeable reduction in errors between related classes and some reduction in cases where
background is falsely detected as equipment. Similar conclusions about the benefits of augmentation are drawn in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ],
[14], and [15], where it is emphasised that a balanced increase and diversity of the sample
significantly increase the robustness of the detector.
        </p>
        <p>Comparison of the models without and with augmentation shows an increase in Accuracy from
about 0.91 to 0.92–0.93, higher mAP values (by 2–3% depending on the IoU threshold),
improvements in Precision and Recall by 1–3 percentage points, and a decrease in the frequency of
confusion with background and between certain classes (bmp/btr).
In addition, the model remains robust even when validated on an extended set (built with the
same augmentations as during training), which suggests that it has not simply “memorised” specific
angles. Thus, exponential augmentation has a positive effect on the final quality of the detector,
reducing the number of errors and improving generalisation across all major metrics, including
Accuracy, Precision, Recall, F1 and mAP. This is consistent with the findings of a number of
studies on the importance of proper expansion and diversification of datasets in computer vision.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>
        The results correlate with the findings of a number of researchers who emphasise the importance
of diverse and balanced datasets to improve detector accuracy. In particular, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [12] emphasise
that the introduction of diverse augmentations (Mosaic, MixUp) significantly enhances the ability
of YOLO family architectures to capture complex patterns and reduces the risk of overfitting.
Our experiments, which combine an exponential increase in sample size with stepwise colour and geometric
transformations, support these claims: precision, recall, and mAP increased by 1–3% compared to
the baseline scenario.
      </p>
      <p>
        At the same time, the proposed approach is consistent with general reviews on image
augmentation [
        <xref ref-type="bibr" rid="ref10 ref7">7, 10</xref>
        ], which emphasise that artificially increasing and diversifying data has a
positive effect not only on the quality of detection but also on the stability of the learning process.
This is especially true when the initial dataset is relatively small or has significant heterogeneity.
Also, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and “AutoAugment: Learning augmentation policies from data” [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] emphasise that a
successful combination of simple basic transformations (flip, rotate, blur) with techniques such as
MixUp [14], CutMix [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or Copy-Paste [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is more effective than using any one method alone. In
our implementation, the exponential transformation chain provided broad coverage of diverse
variations, and the subsequent “mixing” of images (Mosaic, MixUp) allowed us to further increase
the variety of artificially created scenes.
      </p>
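      <p>The “mixing” step can be sketched as follows for MixUp in a detection setting: two images are blended pixel-wise with a weight drawn from a Beta distribution, and the bounding boxes of both images are kept. This is a minimal illustration on toy list-of-lists images; the function name, the box format and the Beta(32, 32) default are assumptions for the example rather than the settings used in the experiments.</p>
      <preformat>
```python
import random
from typing import List, Optional, Tuple

Image = List[List[float]]
Box = Tuple[str, int, int, int, int]  # (class, x1, y1, x2, y2) in pixels


def mixup(img_a: Image, boxes_a: List[Box],
          img_b: Image, boxes_b: List[Box],
          alpha: float = 32.0,
          rng: Optional[random.Random] = None) -> Tuple[Image, List[Box], float]:
    """Blend two equally sized images and keep the boxes of both."""
    rng = rng or random.Random()
    # Mixing weight; Beta(alpha, alpha) with a large alpha stays close to 0.5.
    lam = rng.betavariate(alpha, alpha)
    mixed = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]
    return mixed, boxes_a + boxes_b, lam


dark = [[0.0, 0.0], [0.0, 0.0]]
bright = [[100.0, 100.0], [100.0, 100.0]]
mixed, boxes, lam = mixup(
    dark, [("bmp", 0, 0, 1, 1)],
    bright, [("btr", 1, 1, 2, 2)],
    rng=random.Random(0),
)
```
      </preformat>
      <p>Because the blended image contains objects from both sources at reduced contrast, such samples add scene variety beyond what the geometric chain alone provides.</p>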
      <p>
        The findings are also consistent with the results of “AutoAugment: Learning augmentation
policies from data”
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], where the authors showed that a carefully selected augmentation policy can increase mAP by
several percentage points without changing the network architecture. Our experiments suggest
that a similar effect can be achieved without an automated policy search, simply by consistently
“multiplying” the set.
      </p>
      <p>
        In contrast to classical augmentation, which is mostly applied once or randomly, the
stepwise approach allows us to spread variations more evenly across the training space,
reducing the sensitivity of the detector to specific shooting conditions or angles. Important
evidence of its effectiveness is the reduction in confusion between similar classes and an
increase in overall accuracy by 2–3%, which is in line with the trends described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Thus, in comparison to the reviewed works, the proposed exponential augmentation method
stands out for its holistic approach to sequential expansion and multi-stage data modification. This
makes it possible to combine the advantages of basic spatial transformations and more complex
techniques such as MixUp without signs of overfitting, as evidenced by the empirical results.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This study provides a detailed overview of modern image augmentation methods and their
application to improve the accuracy of deep learning models in detection tasks. Particular attention
is paid to YOLO architectures, which have proven to be effective in real-time object analysis in
recent years.</p>
      <p>The main achievement of the work is the development and implementation of exponential
augmentation, a step-by-step approach to artificially increasing the training set, which involves the
consistent application of basic geometric and photometric transformations (flips, rotations,
GaussianBlur), as well as more complex techniques such as Mosaic and MixUp. Comparative
analysis has shown that the proposed method:</p>
      <p>Significantly increases the diversity and volume of the dataset, improving the model’s
robustness to different angles, backgrounds and noise;</p>
      <p>Improves key accuracy metrics (mAP, Precision, Recall) by 1–3% and overall accuracy by
about 2–3% compared to a model trained without augmentations;</p>
      <p>Reduces the risk of overfitting by creating a wider space of training examples and
accelerates convergence in the early stages of training.</p>
      <p>Thus, exponential augmentation demonstrates its effectiveness in improving the quality of the
YOLO detector and can be readily adapted to other computer vision tasks where there is a need to
further expand or diversify the image sample. Further development in this area could include
automating the search for optimal transformation sequences or combining exponential
augmentation with other regularisation techniques, which would further enhance the
generalisability of deep learning models.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The research was carried out with the grant support of the Ministry of Education and Science of
Ukraine, “Methods and tools for detecting disinformation in social networks based on deep
learning technologies” under Project No. 0125U001852. During the preparation of this
manuscript, the authors used ChatGPT 4o, Gemini 2.5 Flash, and Grammarly to
correct the style and improve the quality of the text, as well as to eliminate grammatical errors.
The research results obtained in the article are entirely original. The authors have reviewed and
edited the output and take full responsibility for the content of this publication.</p>
      <p>Declaration on Generative AI</p>
      <p>While preparing this work, the authors used Grammarly Pro to correct text
grammar and Strike Plagiarism to check for possible plagiarism. After using these tools, the authors
reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
      <p>
[11] C. Y. Wang, A. Bochkovskiy, H. Y. M. Liao, Scaled-YOLOv4: Scaling Cross-Stage Partial
Network, in: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, 13029–13038.
doi:10.1109/CVPR46437.2021.01283
[12] S. Yun, et al., CutMix: A Regularisation Strategy to Train Strong Classifiers with Localisable
Features, in: IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, 6023–6032.
doi:10.1109/ICCV.2019.00612
[13] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, Mixup: Beyond Empirical Risk Minimisation,
in: Int. Conf. Learn. Represent. (ICLR), 2018. https://openreview.net/forum?id=r1Ddp1-Rb
[14] Z. Zhong, et al., Random Erasing Data Augmentation, in: Proc. AAAI Conf. Artif. Intell., 34(7)
(2020) 13001–13008. doi:10.1609/aaai.v34i07.7000
[15] M. Nazarkevych, V. Buriachok, N. Lotoshynska, S. Dmytryk, Research of Ateb-Gabor filter in
biometric protection systems, in: IEEE 13th Int. Sci. Tech. Conf. Comput. Sci. Inf. Technol.
(CSIT), vol. 1, 2018, 310–313. doi:10.1109/STC-CSIT.2018.8526650
[16] M. Nazarkevych, et al., Application Perfected Wave Tracing Algorithm, in: IEEE 1st Ukraine
Conf. Electr. Comput. Eng. (UKRCON), 2017, 1011–1014. doi:10.1109/UKRCON.2017.8100532
[17] V. Sokolov, P. Skladannyi, A. Platonenko, Video Channel Suppression Method of Unmanned
Aerial Vehicles, in: IEEE 41st Int. Conf. on Electronics and Nanotechnology (2022) 473–477.
doi:10.1109/ELNANO54667.2022.9927105
[18] M. Nazarkevych, et al., Identification of Biometric Images by Machine Learning, in: IEEE 12th
Int. Conf. Electron. Inf. Technol. (ELIT), 2021, 95–98. doi:10.1109/ELIT53502.2021.9501178</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bochkovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. Y. M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <article-title>YOLOv4: Optimal Speed and Accuracy of Object Detection</article-title>
          , arXiv preprint,
          <year>2020</year>
          . doi:10.48550/arXiv.2004.10934
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Buslaev</surname>
          </string-name>
          , et al.,
          <article-title>Albumentations: Fast and Flexible Image Augmentations</article-title>
          ,
          <source>Information</source>
          ,
          <volume>11</volume>
          (
          <issue>2</issue>
          ) (
          <year>2020</year>
          ) 125. doi:10.3390/info11020125
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Cubuk</surname>
          </string-name>
          , et al.,
          <article-title>AutoAugment: Learning Augmentation Policies from Data</article-title>
          , in:
          <source>IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR)</source>
          ,
          <year>2019</year>
          ,
          <fpage>113</fpage>
          -
          <lpage>123</lpage>
          . doi:10.1109/CVPR.2019.00020
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ghiasi</surname>
          </string-name>
          , et al.,
          <article-title>Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation</article-title>
          , in:
          <source>IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR)</source>
          ,
          <year>2021</year>
          ,
          <fpage>2918</fpage>
          -
          <lpage>2928</lpage>
          . doi:10.1109/CVPR46437.2021.00293
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          , et al.,
          <article-title>Face Model Compression by Distilling Knowledge from Neurons</article-title>
          , in:
          <source>AAAI Conf. Artif. Intell.</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          ) (
          <year>2016</year>
          ). https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12311
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mumuni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mumuni</surname>
          </string-name>
          ,
          <article-title>Data Augmentation: A Comprehensive Survey of Modern Approaches</article-title>
          ,
          <source>Array</source>
          ,
          <volume>16</volume>
          (
          <year>2022</year>
          ) 100258. doi:10.1016/j.array.2022.100258
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Myshkovskyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nazarkevych</surname>
          </string-name>
          ,
          <article-title>Method of fingerprint identification based on convolutional neural networks</article-title>
          ,
          <source>SISN</source>
          ,
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          . doi:10.23939/sisn2024.15.001
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>You Only Look Once: Unified, Real-Time Object Detection</article-title>
          , in:
          <source>IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR)</source>
          ,
          <year>2016</year>
          ,
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          . doi:10.1109/CVPR.2016.91
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shorten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Khoshgoftaar</surname>
          </string-name>
          ,
          <article-title>A Survey on Image Data Augmentation for Deep Learning</article-title>
          ,
          <source>J. Big Data</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ) (
          <year>2019</year>
          ) 60. doi:10.1186/s40537-019-0197-0
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks</article-title>
          ,
          <source>in: 36th Int. Conf. Mach. Learn. (ICML)</source>
          ,
          <year>2019</year>
          ,
          <fpage>6105</fpage>
          -
          <lpage>6114</lpage>
          . http://proceedings.mlr.press/v97/tan19a.html
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>