<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A method for efficient training of road sign recognition models for resource-dependent ADAS systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maksym Hovorukha</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anatoliy Doroshenko</string-name>
          <email>a-y-doroshenko@ukr.net</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Software Systems of the National Academy of Sciences of Ukraine</institution>
          ,
          <addr-line>Akademika Glushkova Ave 40, 03187 Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”</institution>
          ,
          <addr-line>Peremohy Ave 37, 03056 Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The article is devoted to the development and training of a deep learning model for automatic road sign recognition based on computer vision technologies. It examines the process of forming and preprocessing the training dataset, including scaling, normalization, and the use of data augmentation methods to improve model accuracy and generalization. Special attention is given to comparing different approaches to neural network design, including recurrent networks, transformers, and convolutional neural networks (CNNs), in order to determine the most effective architecture for real-time classification of traffic sign images. As a result of this architectural analysis, the MobileNetV2 model was selected: a lightweight, fast, and accurate neural network specifically adapted for devices with limited computational resources. Within the scope of the study, the network was optimized through regularization techniques, the addition of dropout layers, quantization, and data variation methods to enhance training quality. The model was implemented in Python using the TensorFlow and Keras libraries, which provide ease of development, scalability, and hardware acceleration support. Training was performed on the Kaggle platform with GPU usage, enabling high efficiency without compromising performance. The article also discusses the results of experimental validation, in which the model demonstrated high accuracy and robustness in recognizing road signs under various conditions. The proposed approach lays the foundation for deploying efficient, low-cost, and accessible road sign recognition systems that can be integrated into driver assistance systems and mobile applications, providing drivers with real-time, accurate information about traffic signs and thereby contributing to improved road safety.</p>
      </abstract>
      <kwd-group>
        <kwd>deep learning</kwd>
        <kwd>road sign recognition</kwd>
        <kwd>computer vision</kwd>
        <kwd>training dataset</kwd>
        <kwd>data augmentation</kwd>
        <kwd>convolutional neural network (cnn)</kwd>
        <kwd>model optimization</kwd>
        <kwd>real-time processing</kwd>
        <kwd>mobile application</kwd>
        <kwd>video-based detection</kwd>
        <kwd>driver assistance</kwd>
        <kwd>road safety</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The growing number of vehicles on the road increases the risk of accidents, many of which
are caused by human error such as inattention, fatigue, or misinterpretation of road signs.
The development of effective automatic sign recognition systems is therefore an important direction
for improving road safety.</p>
      <p>Despite the availability of modern ADAS solutions, their high cost and the need for additional
equipment limit their widespread use. Low-cost alternatives often have inferior accuracy or
significant delays in operation, making it impossible to use them effectively in real time. The
variability of road signs, which can vary in size, lighting, viewing angles and weather conditions,
creates additional complexity.</p>
      <p>This article discusses an approach to developing an affordable and efficient model that can be
implemented in road sign recognition systems that do not require specialised hardware, working
on the basis of a video stream from a smartphone camera or dashcam. The main challenge is to
ensure high accuracy and processing speed, which requires model optimisation and the use of data
augmentation methods for training.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Choosing a neural network type for road sign recognition</title>
      <p>Traffic sign recognition is a fundamental and challenging task in the field of computer vision, as
it requires not only high classification accuracy but also real-time processing and
computational efficiency—especially in the context of autonomous driving and advanced
driver-assistance systems (ADAS). One of the most critical factors influencing the success of
such systems is the choice of neural network architecture, as it directly affects both
performance and resource consumption. Therefore, selecting the most appropriate model for
this specific application is essential.</p>
      <p>
        A variety of neural network architectures can be utilized for analyzing visual data, including
recurrent neural networks (RNNs), Vision Transformers (ViTs), and convolutional neural
networks (CNNs). Each of these paradigms has unique strengths and trade-offs. RNNs are
well-suited for sequential data, ViTs have shown great potential in capturing long-range
dependencies in images, and CNNs excel at extracting local spatial features through
hierarchical layers. However, in the context of traffic sign recognition, convolutional neural
networks remain the most effective and practical choice due to their balance of accuracy,
speed, and relatively low computational demands [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Recurrent neural networks (RNNs) and their limitations</title>
        <p>
          Recurrent neural networks (RNNs) are designed to work with sequential data. The ability to take
into account the temporal context is their main characteristic. This is achieved through the use of
hidden states that ‘remember’ the results of previous processing stages. The main idea of the model
is described by the formula [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]:
        </p>
        <p>h_t = f(W_x·x_t + W_h·h_(t-1) + b), (1)
where h_t – is the hidden state at step t, which stores information about the current and previous
elements of the sequence;
x_t – is the input vector at step t;
W_x, W_h – weight matrices responsible for the connections between the input and the state, as
well as between the current and the previous states;
b – is the bias vector;
f – is a nonlinear activation function (usually ReLU or tanh).</p>
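      <p>As an illustration of formula (1), a single recurrent step can be sketched in Python with NumPy; the dimensions and random weights below are purely illustrative:</p>

```python
import numpy as np

# Illustrative dimensions: input size 4, hidden size 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(3, 4))   # input-to-state weight matrix
W_h = rng.normal(size=(3, 3))   # state-to-state weight matrix
b = np.zeros(3)                 # bias vector

def rnn_step(x_t, h_prev):
    # h_t = f(W_x x_t + W_h h_(t-1) + b), with f = tanh
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Process a sequence of 5 input vectors, carrying the hidden state forward
h = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):
    h = rnn_step(x_t, h)
```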
        <p>
          However, they should not be confused with regression models, which, on the contrary, are used
to predict outcomes based on input variables, either continuous or categorical [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Although RNNs are sometimes applied to image sequence processing tasks (e.g., a video
stream), road sign recognition requires local processing of each frame: the context between frames
plays a minor role, since each sign is recognized independently. Using an RNN for this task would
therefore only complicate the model without adding any significant advantage. It should be noted,
however, that this type of network can still be useful elsewhere in ADAS, for example for
predicting the movement of potential obstacles.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Transformers and their disadvantages for sign recognition</title>
        <p>
          Transformers, in particular Vision Transformers (ViT), are modern architectures that have
demonstrated high accuracy in many computer vision tasks. Their work is based on the
self-attention mechanism, which allows the model to analyse global dependencies between parts of the
data (in our case, an image). This mechanism is described by the formula [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]:
        </p>
        <p>A(Q, K, V) = softmax(QKᵀ/√d_k)·V, (2)
where Q, K, V – are matrices of queries, keys and values obtained through linear
transformations from the input data;
d_k – the dimension of the key space;
QKᵀ – characterises the similarity between data elements;
softmax – is a function that normalises the weights so that they form probabilities.</p>
        <p>To process images, transformers break them into small patches:
patch ∈ ℝ^(P×P×C) → patch·W ∈ ℝ^D, (3)
where P×P – is the size of the patch, C – is the number of channels (for images, this is usually
RGB: C = 3);
W – is a weight matrix that converts the patch into a vector of dimension D.</p>
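      <p>Formula (2) can be sketched directly in NumPy; the six 8-dimensional "patches" below are random stand-ins for real patch embeddings:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stabilised
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity between data elements
    return softmax(scores) @ V       # probability-weighted sum of values

rng = np.random.default_rng(1)
n, d = 6, 8                          # 6 patches, embedding dimension 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
out = attention(Q, K, V)             # one (n, d) matrix of attended features
```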
        <p>
          Transformers are able to process global context, which is important for classifying complex
scenes or analysing interactions between objects. However, their self-attention has quadratic
computational complexity (O(n²)), which makes them difficult to use on mobile devices even for
medium-resolution images. Transformers are also excessive for the task of road sign recognition:
road signs usually have distinct local features that can be extracted efficiently by simpler types
of neural networks. Transformer-based methods can, however, be beneficial in tasks that require
complex environment analysis and trajectory planning, such as maze navigation using
coevolutionary algorithms like SAFE, where spatial awareness and exploratory behavior are
prioritized [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Convolutional neural networks (CNN) - the optimal solution</title>
        <p>Convolutional Neural Networks (CNNs) are particularly well suited to image processing and
visual data analysis. Their architecture is inspired by the human visual system, in particular by
its mechanisms of local feature extraction. Classical CNNs combine several types of layers in
different architectural arrangements, each of which performs the detection and selection of
certain features. The convolutional layer is the main one: it uses a set of filters (called
convolutional kernels) to scan the image and extract local features such as edges, corners,
and textures. The convolution operation is given by the formula:</p>
        <p>y(i, j) = Σ_(m=0)^(k-1) Σ_(n=0)^(k-1) x(i + m, j + n)·w(m, n) + b, (4)
where x(i,j) – is the value of the input image pixel at coordinates (i,j);
w(m,n) – is the value of the convolution kernel at position (m,n);
k×k – the size of the convolution kernel;
b – is the bias added for each output neuron;
y(i,j) – is the output value resulting from the convolution.</p>
        <p>
          Convolutional kernels allow the model to automatically detect local features such as edges,
corners, and textures, which is critical for road sign recognition. After convolutional layers,
pooling is usually used to reduce the dimensionality of the feature map and increase the robustness
to small shifts [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]:
        </p>
        <p>y(i, j) = max_(m,n){x(i + m, j + n)}, (5)
where m×n – is the size of the pooling window;
x(i,j), y(i,j) – are the same as in formula (4).</p>
        <p>Max-pooling selects the maximum value in each region. The final stage is the
fully connected layers that perform the classification:</p>
        <p>y = Wx + b, (6)
where x – is the feature vector obtained from the previous convolutional layers;
W – the weight matrix;
b – the bias;
y – is the output vector of classes (in our case, the probability for each sign).</p>
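      <p>The operations in formulas (4) and (5) can be sketched in plain NumPy; the toy 6×6 image and the 2×2 edge kernel are illustrative examples:</p>

```python
import numpy as np

def conv2d(x, w, b=0.0):
    # Formula (4): y(i,j) = sum_m sum_n x(i+m, j+n) * w(m,n) + b
    k = w.shape[0]
    H, W = x.shape[0] - k + 1, x.shape[1] - k + 1
    y = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            y[i, j] = np.sum(x[i:i+k, j:j+k] * w) + b
    return y

def max_pool(x, m=2):
    # Formula (5): the maximum value in each m x m region
    H, W = x.shape[0] // m, x.shape[1] // m
    return x[:H*m, :W*m].reshape(H, m, W, m).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)    # toy 6x6 "image"
edge = np.array([[1., -1.], [1., -1.]])           # vertical-edge kernel
feat = conv2d(img, edge)                          # 5x5 feature map
pooled = max_pool(feat)                           # 2x2 after pooling
```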
        <p>
          Thus, in the initial convolutional layers, CNNs detect basic features such as edges or corners of
each object. Then, as the depth increases, the network begins to identify more complex patterns,
such as geometric shapes or even individual parts of road signs. Compared to fully connected
networks, convolutional operations significantly reduce the number of parameters. For example,
for a 64 × 64 × 3 image, if only a fully connected layer is used, 12.3 million parameters are required
for a layer of 1000 neurons. Using the principle of local filters described above, this figure in CNN
can be reduced to several thousand. For road sign recognition, local patterns such as sign shape,
textures, or contrast are important, and CNNs automatically extract these patterns due to their
architecture [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
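      <p>The parameter counts mentioned above can be verified with simple arithmetic; the choice of 32 filters for the convolutional layer is illustrative:</p>

```python
# Fully connected: every one of the 64*64*3 inputs connects to each of 1000 neurons
fc_params = 64 * 64 * 3 * 1000          # about 12.3 million weights

# Convolutional: 32 shared 3x3 kernels over 3 channels, plus one bias per filter
conv_params = 3 * 3 * 3 * 32 + 32       # 896 parameters in total
```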
        <p>Thus, convolutional neural networks are the best choice for road sign recognition due to their
ability to extract local features, robustness to biases, computational efficiency, and ability to work
in real time.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Architecture selection</title>
      <p>Convolutional neural networks (CNNs) are well-suited for traffic sign recognition due to their
ability to extract local visual features like shape and texture, which are crucial under varying
conditions. A wide range of CNN architectures exist, from simple models like LeNet-5 to more
advanced ones such as VGG, Xception, ResNet, EfficientNet, and MobileNet.</p>
      <p>For real-time or resource-constrained applications, lightweight models like MobileNet offer a
good balance of speed and accuracy. More complex networks like ResNet-50 are better suited for
high-performance environments. This section focuses on selecting an architecture that aligns with
the system’s computational constraints and performance requirements.</p>
      <sec id="sec-3-1">
        <title>3.1. Xception architecture</title>
        <p>
          The Xception architecture is an extension of the Inception model, built on the concept of depthwise
separable convolutions. This technique splits standard convolutions into two steps: a depthwise
convolution, which processes each input channel separately, and a pointwise convolution, which
combines information across channels. This significantly reduces the number of parameters and
computations. Xception takes input images of size 299x299x3 (RGB). It begins with a standard
convolution layer, followed by a series of depthwise separable convolutions combined with ReLU
activations and batch normalization. The core of the model consists of 36 convolutional layers
organized into 14 modules with skip connections, similar to those in ResNet [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. A schematic
representation is shown in Figure 1.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. EfficientNet architecture</title>
        <p>
          EfficientNet, introduced by Google Research in 2019, is an optimized architecture for image
classification. Its core idea is Compound Scaling, a method that uniformly scales a model’s depth,
width, and input resolution to improve performance efficiently. The baseline model,
EfficientNetB0, is built upon MobileNetV2 and incorporates inverted residual blocks along with the Swish
activation function. The architecture includes convolutional layers combined with batch
normalization and activation, ending with global average pooling and a final fully connected layer
[
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ].
        </p>
        <p>EfficientNet achieves a strong balance between accuracy and efficiency through compound
scaling of depth, width, and resolution. As shown in Figure 2, it outperforms many traditional
models, though even its compact versions are typically more resource-intensive than MobileNet,
making the latter more suitable for real-time or resource-constrained applications.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. MobileNet architecture</title>
        <p>
          MobileNet is specifically designed for mobile devices, employing depthwise separable convolutions
to minimize computational complexity. Unlike traditional convolutions, which process both spatial
data (within an image) and channel dependencies (across different color channels) in a single
operation, depthwise separable convolutions split this into two stages: the depthwise convolution
operates on each channel individually, focusing only on spatial data, while the pointwise
convolution uses 1x1 filters to merge information across channels. This method significantly
reduces the number of computations required. While a conventional convolution has a
computational complexity of O(D_k²·M·N), where D_k is the size of the convolution kernel, M is the
number of input channels, and N is the number of output channels, a depthwise separable convolution
has a complexity of O(D_k²·M + M·N). The key difference is also shown in Figure 3 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
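      <p>The per-position multiplication counts implied by these complexity expressions can be compared directly; the channel counts below are illustrative:</p>

```python
D_k, M, N = 3, 32, 64           # 3x3 kernel, 32 input channels, 64 output channels

standard = D_k**2 * M * N       # conventional convolution: D_k^2 * M * N
separable = D_k**2 * M + M * N  # depthwise step plus 1x1 pointwise step
speedup = standard / separable  # roughly 7.9x fewer multiplications here
```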
        <p>MobileNet is optimized for speed and efficient use of limited memory resources, making it ideal
for mobile and embedded devices. Its architecture is composed of a series of sequential blocks that
include depthwise separable convolutions, Batch Normalization, and ReLU6 activations, which help
prevent oversaturation of activations in resource-constrained environments. Pointwise (1x1)
convolutions are employed to combine information across channels, while convolutional pooling is
used in some configurations to reduce the spatial dimensions of data.</p>
        <p>MobileNet’s flexibility is further enhanced by key parameters: the Width Multiplier (α), which
adjusts the number of channels in each layer (e.g., α=0.5 reduces the number of channels by half),
and the Resolution Multiplier (ρ), which alters the resolution of input data to balance speed and
performance. In its default configuration, MobileNet processes 224×224×3 images, generates a class
probability vector, and uses Global Average Pooling to minimize the risk of overfitting, ensuring
efficient learning in resource-limited settings.</p>
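      <p>The effect of the two multipliers can be shown with a quick calculation; the values α = 0.5 and ρ = 160/224 are illustrative choices, not ones prescribed by the architecture:</p>

```python
alpha = 0.5                             # width multiplier
rho = 160 / 224                         # resolution multiplier

base_channels = 64
channels = int(alpha * base_channels)   # each layer keeps half its channels
resolution = round(rho * 224)           # input shrinks from 224 to 160 pixels
```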
        <p>
          Given the system's requirements, the MobileNetV2 architecture was selected, as its inverted
residual bottleneck design minimizes the number of parameters without compromising accuracy. While Xception and
EfficientNet offer higher accuracy, their computational complexity exceeds the needs of a mobile
application. MobileNetV2 guarantees performance and compatibility with most devices, which is a
critical factor [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Additionally, accuracy is improved by augmenting the data during training,
incorporating Dropout layers to mitigate overfitting, and quantizing the model to reduce its size
and accelerate computation without a notable loss in performance.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Technologies for implementing a neural network</title>
      <p>The choice of programming language is essential for implementing a neural network for traffic
sign classification. Python, C++, and C# are the main contenders, each with unique advantages and
drawbacks. Python is widely used in machine learning due to its simple syntax and a vast
ecosystem of libraries like TensorFlow, Keras, PyTorch, and NumPy, which support data
processing, model training, and deployment. It also integrates well with cloud platforms like
Google Colab and Kaggle, providing easy access to computational resources. However, Python’s
performance is lower than C++ for computationally intensive tasks.</p>
      <p>C++ is known for its high performance and control over hardware, making it ideal for
resource-demanding applications. It supports GPU computations via CUDA and libraries like TensorFlow
and PyTorch, but its complexity and limited tools for neural network development make it less
flexible than Python. C# is commonly used for Windows and mobile app development but has a
less developed machine learning ecosystem, with tools like ML.NET not offering the same
functionality as Python-based frameworks.</p>
      <p>Given Python’s advantages, it was chosen for the neural network development due to its
flexibility, ease of use, and extensive library support, which facilitates rapid model prototyping and
integration with cloud platforms. Python’s ability to leverage GPUs and optimize models with
TensorFlow Lite compensates for its performance limitations, ensuring real-time processing speed.</p>
      <p>
        For the framework, TensorFlow with Keras was selected due to its support for distributed
computing, easy model building, and deployment on various platforms. TensorFlow’s extensive
tools for data handling, optimization, and integration make it ideal for this project. Keras simplifies
the process of creating, training, and evaluating models with its high-level API, while TensorFlow’s
advanced features like the TensorFlow Data API and TensorFlow Addons provide additional
support for data augmentation and model customization [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>For model training, the Kaggle platform was chosen due to its significant advantages over other
free environments. Kaggle provides access to two NVIDIA Tesla T4 GPUs simultaneously, allowing
efficient processing of large datasets and faster model training. The platform offers up to 30 hours
per week of free GPU usage, with long sessions of up to 9 hours, enabling continuous
experimentation and lengthy computations. In contrast, Google Colab provides free GPU access but
limits continuous sessions to 4 hours and a total of 12 hours per day, with breaks between sessions.
Kaggle also offers an easy way to upload custom datasets, which will be useful during model
training.</p>
      <p>For data preparation, Python is used due to its versatility and extensive library support.
Libraries like NumPy are used for working with numerical arrays, pandas for handling tabular
data, OpenCV for image preprocessing (resizing, normalization, augmentation), and Matplotlib for
visualizing preparation stages.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Creating a dataset</title>
      <p>The problem of creating an effective dataset for traffic sign recognition is crucial for quality
model training. Most existing datasets, like GTSRB, use images with a size of 33×33 pixels. While
this size is optimal for neural network training due to its compactness, it doesn't reflect real-world
conditions. In practice, traffic signs are often captured in high-resolution video streams (e.g.,
512×512 pixels or more), and resizing them to 33×33 pixels leads to a loss of important details. This
is especially critical for real-time systems on client devices.</p>
      <p>One possible solution is to "cut" the input image into smaller sections that match the model’s
input size. For instance, a large image can be split into 33×33 pixel fragments, and predictions are
made for each fragment. However, this approach has several drawbacks: it significantly increases
the number of predictions, affecting real-time processing speed, requires more memory and
computational power, and complicates the client-side application architecture.</p>
      <p>Another approach is to use deep learning methods like Super-Resolution to enhance image
resolution before inputting them into the model. While this provides more details and potentially
improves prediction accuracy, it has its own limitations: Super-Resolution increases processing
time per frame, and artificially enhanced images may contain artifacts, which could reduce model
accuracy.</p>
      <p>Considering these limitations and popular solutions for adapting models to existing datasets, it
was concluded that this approach would not significantly improve performance. Therefore, a
decision was made to create a custom dataset tailored to the specific requirements of the task.
The main idea is to use road sign images with an initially higher resolution, which avoids the need
to reduce the input layer of the model. However, collecting ready-made images of a higher
resolution will be much more time-consuming, which does not fit into the timeframe of the system
development. Therefore, it was decided to emulate the dataset through the following preparation
stages:
• to create the background and prepare for the generation of compositions, about 2000
random images with natural environment, city roads, etc. were collected. These images
correspond to the type of data that the system will receive in real use;
• due to the limited resources for training the model in this work, it was decided to use only
20 classes of road signs (Figure 4), which are the most common and important for traffic;
• part of the sign images were taken from open datasets, such as Traffic Signs in Post-Soviet
States, which contain real high-resolution images of road signs. The rest of the images were
collected manually.</p>
      <p>After completing the preparatory stage, a Python script was developed that contains an
algorithm for creating a dataset with the overlay of objects (road signs) on random images. A
diagram of this algorithm is shown in Figure 5.</p>
      <p>At the initial stage, the necessary directories for image processing are created. The input
directory contains base images that will serve as backgrounds for overlaying. The second directory
holds the set of objects (traffic signs) to be overlaid on the base images. An output directory is also
created to store the results. At this stage, it is checked whether all images are in compatible formats
for processing and whether there are enough base images and objects to ensure the required data
volume. To ensure proper scaling of objects before overlaying, the average size of the base images
is calculated. The algorithm computes the average width and height of all images in the input
directory, and these values are used as a reference to resize the objects to the correct proportions,
maintaining the natural appearance of the overlaid elements.</p>
      <p>Next, the base images are scaled or made square, if necessary, to ensure consistency with other
images in the dataset. The objects to be overlaid are loaded and resized according to the average
size of the base images. During processing, random rotation is applied within a specified range
(−20° to +20°), adding variation to the appearance of the overlaid elements. For each base image, a
random position is selected for the object overlay. The algorithm ensures that the object fully fits
within the bounds of the base image and does not extend beyond its edges. The overlay is applied
considering transparency, if present in the object images. After processing, the image is saved in
the output directory, typically in PNG format. If the dataset includes multiple object classes, the
algorithm repeats the steps for each class separately.</p>
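      <p>The overlay step described above can be sketched in NumPy alone: random placement with a full-fit bounds check and per-pixel transparency. The white square stands in for a real sign image, and rotation is omitted for brevity:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def overlay(background, sign, alpha_mask):
    # Choose a random position where the sign fits fully inside the background
    H, W = background.shape[:2]
    h, w = sign.shape[:2]
    top = int(rng.integers(0, H - h + 1))
    left = int(rng.integers(0, W - w + 1))
    out = background.copy()
    region = out[top:top + h, left:left + w].astype(float)
    a = alpha_mask[..., None]        # per-pixel transparency, 0..1
    blended = a * sign + (1 - a) * region
    out[top:top + h, left:left + w] = blended.astype(np.uint8)
    return out

bg = rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)  # stand-in background
sign = np.full((48, 48, 3), 255, dtype=np.uint8)          # stand-in sign image
mask = np.ones((48, 48))                                  # fully opaque
composite = overlay(bg, sign, mask)
```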
      <p>The finished images are divided into classes and split into training (80%) and test (20%)
samples. An example of the generated data is shown in Figure 6. This approach ensures the
variability of the dataset and its adaptation to the real conditions of the model.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Creating and training a model</title>
      <p>
        The architecture of our modified MobileNetV2 is designed to efficiently extract object features
while minimizing computational costs. It consists of several groups of layers, each of which
processes certain aspects of the input data and gradually builds the feature space (Figure 7). The
model's input data is 224×224×3, and it starts with an augmentation block that applies random
scaling and rotation. Reflections are not used because they could, for example, turn a “Turn Left”
sign into a “Turn Right” sign while maintaining the original class, which would confuse the model
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>The model begins with a standard 3×3 convolution layer with stride 2 and ReLU6 activation,
reducing spatial dimensions and detecting basic features. The core of the architecture is a series of
inverted residual blocks, each following a specific pattern:
1. First, an expansion phase increases the number of channels in the input tensor by a certain
factor, allowing the model to shift toward higher-level features that better capture complex
structures like sign contours and textures.
2. Then, depthwise convolutions process each channel individually to extract localized spatial
features, such as circular shapes or distinctive angles common in traffic signs.
3. Next, a compression phase reduces the number of channels back to the original size,
improving computational efficiency by retaining only the most relevant features.
4. Finally, a skip connection links the input and output of the block, preserving low-level
features like colors or contrast, which are crucial for recognizing traffic signs.</p>
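      <p>The four steps above can be sketched as a Keras function. This is a simplified block assuming expansion factor 6, stride 1, and a toy 56×56×24 input; the actual MobileNetV2 blocks in tf.keras.applications differ in detail:</p>

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, expansion=6, stride=1):
    in_ch = x.shape[-1]
    # 1. Expansion: 1x1 convolution raises the channel count
    h = layers.Conv2D(in_ch * expansion, 1, use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    # 2. Depthwise 3x3 convolution: spatial features, one channel at a time
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    # 3. Compression: linear 1x1 convolution back to the original width
    h = layers.Conv2D(in_ch, 1, use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    # 4. Skip connection, possible when input and output shapes match
    if stride == 1:
        h = layers.Add()([x, h])
    return h

inp = tf.keras.Input((56, 56, 24))
out = inverted_residual(inp)
```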
      <p>After passing through the bottleneck layers, the model performs GlobalAveragePooling, which
compresses each feature map into a single value. This operation creates a compact, high-level
feature vector that summarizes the entire input image and significantly reduces the number of
parameters. This vector is then passed to a fully connected (dense) layer, where each output
corresponds to a specific class. The softmax activation function ensures the outputs represent class
probabilities that sum to one. To improve training stability and generalization, the model uses
BatchNormalization to normalize intermediate layer outputs and Dropout (rate 0.5) to randomly
deactivate neurons during training. These techniques help prevent overfitting, especially when
training data is limited or imbalanced.</p>
      <p>Initially, the base MobileNetV2 model is loaded with pretrained weights (e.g., from ImageNet)
and frozen, meaning its parameters remain unchanged. Only the new top layers are trained to
adapt to the specific traffic sign recognition task, allowing fast and stable convergence. After the
top layers are trained, the model enters a fine-tuning phase. Some base layers are unfrozen, and
training continues with a lower learning rate. This gradual adaptation refines deeper features to
better match the new dataset while preserving the benefits of pretraining, often resulting in
significantly improved accuracy.</p>
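      <p>The two-stage scheme can be sketched as follows. Here weights=None avoids downloading the pretrained weights; in the actual setup weights="imagenet" would be used, and the number of base layers kept frozen in stage two is an illustrative choice:</p>

```python
import tensorflow as tf

# Stage 1: frozen base, train only the new classification head
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights=None)
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(20, activation="softmax"),  # 20 sign classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stage 2: unfreeze the upper part of the base and fine-tune gently
base.trainable = True
for layer in base.layers[:100]:   # keep the earliest layers frozen
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

pred = model(tf.zeros((1, 224, 224, 3)))
```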
      <p>Now that the model is ready for training, we proceed to load the dataset. For deep learning tasks
with large image collections, loading all images into memory at once is inefficient and often leads
to memory overflow. In our case, with 16,000 images sized 224×224 pixels, doing so could crash the
runtime environment. To handle this, we use TensorFlow’s tf.data.Dataset API, which allows for
streaming and preprocessing data on the fly in small batches, greatly reducing memory usage.</p>
      <p>The dataset is organized into train and test folders, each containing subfolders for every class.
File paths and labels are generated automatically based on these subfolder names. Each image is
read from disk using tf.io.read_file, decoded into RGB format, resized to 224×224 pixels with
tf.image.resize, and normalized to values between 0 and 1. Using tf.data.Dataset.from_tensor_slices,
the file paths and labels are combined into a dataset object. The pipeline then applies:
• shuffle to randomize data order,
• batch(32) to process in small chunks,
• prefetch to load future batches in the background for better performance.</p>
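      <p>A minimal version of this pipeline might look as follows. The file paths and labels below are hypothetical placeholders for the entries collected from the class subfolders, and nothing is read from disk until the dataset is iterated:</p>

```python
import tensorflow as tf

def load_image(path, label):
    # Read, decode to RGB, resize to 224x224, and scale into [0, 1].
    img = tf.io.read_file(path)
    img = tf.image.decode_image(img, channels=3, expand_animations=False)
    img = tf.image.resize(img, [224, 224])
    return img / 255.0, label

# In the paper these are generated from the train/test subfolder names;
# the entries below are hypothetical placeholders.
paths = ["train/stop/0001.png", "train/yield/0002.png"]
labels = [0, 1]

ds = (tf.data.Dataset.from_tensor_slices((paths, labels))
      .shuffle(buffer_size=1000)                           # randomize order
      .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(32)                                           # small chunks
      .prefetch(tf.data.AUTOTUNE))                         # load ahead
```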
      <p>Training proceeds in two stages. In the first stage, the base MobileNetV2 layers remain frozen to
retain knowledge from pretraining. Only the added classification layers are trained, allowing the
model to adapt to the new dataset. The model is compiled with the Adam optimizer, which
adapts each parameter's step size using running estimates of the gradient's first and second
moments, and sparse categorical crossentropy is used as the loss
function, suitable for multiclass classification with integer labels. The process is controlled by
several callbacks:
• ModelCheckpoint – saves the best version of the model,
• EarlyStopping – halts training when there is no improvement for 15 epochs,
• ReduceLROnPlateau – reduces the learning rate in case of stagnation.</p>
      <p>At this stage, the model is trained for a limited number of epochs (15 in our case), primarily to
quickly adapt the newly added classification layers to the new dataset. The goal is not full
convergence, but rather initial tuning of the top layers.</p>
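      <p>The stage-one setup can be sketched as follows. The 15-epoch limit and the EarlyStopping patience come from the text; the checkpoint filename and the ReduceLROnPlateau settings are assumptions, and the small stand-in model and the commented-out fit call are placeholders for the actual network and dataset:</p>

```python
import tensorflow as tf

# Stand-in for the MobileNetV2-based model built earlier; in stage one
# only the new classification layers on top of the frozen base train.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(19, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="sparse_categorical_crossentropy",  # integer class labels
    metrics=["accuracy"])

callbacks = [
    # Keep only the best weights seen on validation data (filename assumed).
    tf.keras.callbacks.ModelCheckpoint("best_model.keras",
                                       save_best_only=True),
    # Stop when validation stops improving for 15 epochs.
    tf.keras.callbacks.EarlyStopping(patience=15),
    # Lower the learning rate on stagnation (factor/patience assumed).
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.2, patience=5),
]

# history = model.fit(train_ds, validation_data=val_ds,
#                     epochs=15, callbacks=callbacks)
```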
      <p>Once this initial training is complete, we unfreeze a portion of the base model—typically the
upper layers, which capture more task-specific features. This allows the model to refine not only
the new top layers but also adjust deeper feature representations for better accuracy.</p>
      <p>In our case, we unfreeze all layers after index 100, keeping the earlier ones frozen to maintain
stability. For this fine-tuning phase, we use a lower learning rate (1e-5) to avoid drastic weight
updates that could disrupt the pretrained knowledge. This phase runs longer (around 50 epochs) to
allow the model to gradually refine its internal representations. Training curves (Figure 8) show
how the model’s performance improves over time. Initially, validation accuracy may remain low as
the model focuses on learning basic patterns. However, with continued training and fine-tuning,
both accuracy and loss improve significantly, indicating successful adaptation.</p>
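      <p>The unfreezing step can be sketched as follows, assuming the base network is held in a variable named base; weights=None again keeps the sketch offline, and the recompile call is shown as a comment because it needs the full model:</p>

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights=None)

# Unfreeze all layers after index 100; earlier layers stay frozen
# to keep the generic low-level features stable.
base.trainable = True
for layer in base.layers[:100]:
    layer.trainable = False

# After changing trainability the model must be recompiled with a
# much lower learning rate, e.g.:
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
#               loss="sparse_categorical_crossentropy",
#               metrics=["accuracy"])
```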
      <p>Throughout both training stages, the model was trained within optimal limits to avoid
underfitting or overfitting. Early in fine-tuning, training accuracy reached 75% while validation
accuracy was 68%, indicating initial adaptation. As training progressed, these improved to 85% and
76%, respectively. By the final epochs, the model achieved 99% training accuracy and 87%
validation accuracy, demonstrating strong generalization to unseen data. These results confirm the
effectiveness of the chosen training strategy and parameter settings.</p>
      <p>The confusion matrix analysis (Figure 9) confirms strong model performance, with most
predictions correctly aligned along the diagonal, indicating high classification accuracy across
categories. Some misclassifications are observed, primarily between visually similar signs—such as
left and right turn warnings—which is expected and acceptable within the task's scope. The
classification report shows an overall accuracy of 86%, with macro and weighted averages at the same level,
which is a solid result for multi-class image recognition. Classes like 0, 1, 4, 9–14, and 16 achieved
excellent precision, recall, and F1-scores (0.91–1.00), while classes such as 2, 5, 7, 15, and 18 showed
lower scores due to visual similarity. Notably, class 7 had high recall but low precision, indicating
overprediction. Still, balanced macro and weighted averages confirm that the model performs
reliably across all classes.</p>
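      <p>A confusion matrix and per-class report of this kind can be produced with scikit-learn; the label arrays below are hypothetical stand-ins for the ground-truth labels and the model's test-set predictions:</p>

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical ground-truth and predicted class indices.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
report = classification_report(y_true, y_pred, digits=2)
print(report)
```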
      <p>An additional manual test was conducted using a separate set of 30 randomly selected images
not involved in training or validation. The model correctly classified 28 out of 30, achieving an
accuracy of 93.3%, with only 2 misclassifications (Figure 10). Overall, the results confirm that the
model is both effective and reliable for traffic sign classification. Despite initial fluctuations in
accuracy, the model consistently improved and maintained strong performance across training,
validation, and independent testing stages.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>The study presented an efficient method for developing a road sign recognition model suitable for
mobile devices with limited computational resources. A key accomplishment was the enhancement
of the training sample generation process, where data augmentation and synthetic data generation
techniques increased the model's robustness to variations in lighting, perspective, and noise.</p>
      <p>The selection of the MobileNetV2 architecture proved to be a practical choice for image
classification in resource-constrained environments. By employing stepwise training—freezing the
initial layers to preserve pre-trained features and adapting the model to a new class set—we
achieved high classification accuracy. Additionally, model optimization through quantization
significantly reduced its size while maintaining the necessary predictive accuracy.</p>
      <p>Experimental results validated the effectiveness of the proposed approach: the model achieved
high accuracy in road sign recognition under real-world conditions, demonstrating its potential for
practical deployment. These methods can be applied to further enhance computer vision models
designed for environments with limited computational capacity.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this article, the authors used Gemini 2.5 Flash artificial intelligence tools
to assist with grammar and spelling correction, as well as to check the translation of some syntactic
structures. The final content has been carefully reviewed and edited by the authors, who are solely
responsible for the accuracy and integrity of the publication.</p>
    </sec>
  </body>
  <back>
  </back>
</article>