<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Efficiency Assessment of Neural Network Classifiers on Single-board Computers for Batch Image Processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pylyp Prystavka</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Technical University of Ukraine Igor Sikorsky Kyiv Polytechnic Institute</institution>
          ,
          <addr-line>Prospect Beresteiskyi (former Peremohy) 37, 03056, Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>60 Volodymyrska Street, 01033 Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The State University “Kyiv Aviation Institute”</institution>
          ,
          <addr-line>1 Lubomyr Huzar Avenue, 03058 Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The aim of this study is to experimentally evaluate the performance of contemporary deep learning models under constrained computational resources and to establish formalized dependencies between inference speed and input batch size. The Raspberry Pi 4 Model B single-board computer was used as the computing platform, representing a typical example of low-power, resource-limited hardware employed onboard unmanned aerial vehicles (UAVs). In contrast to existing YOLO-based approaches that remain ill-suited for low-resource UAV platforms, this study proposes an alternative method focused on modeling classification time using linear regression. The results provide a framework for the development of onboard vision subsystems for domestic UAVs, capable of near-real-time image analysis under batch-processing constraints. This work presents, for the first time, linear regression models that describe the classification time of neural networks with varying architectures (specifically, mobile, balanced, and deep-level models) depending on the number of images in the input batch. The statistical significance of the resulting regression models has been experimentally validated, as well as their consistency with real-world measurements obtained during actual single-board computer operation. These regression equations enable inference time prediction without the need for repeated empirical testing, thus significantly improving the efficiency of neural system configuration in embedded environments. Based on the developed models, boundary conditions were identified for the applicability of each considered architecture to ensure near-real-time data processing. The results demonstrate that some models can handle relatively large data batches without exceeding critical time thresholds, whereas others, despite offering higher classification accuracy, exhibit excessive computational complexity and require hardware acceleration or optimization. The relevance of this research is driven by the growing demand for autonomous image analysis systems, particularly in the context of UAV deployment for military operations, reconnaissance, search and rescue missions, and monitoring applications. The proposed approach can be integrated into hardware-software systems to enable adaptive selection of neural architectures according to operational conditions and resource constraints. This creates a foundation for the further development of intelligent UAV systems with enhanced autonomy.</p>
      </abstract>
      <kwd-group>
        <kwd>unmanned aerial vehicles (UAVs)</kwd>
        <kwd>aerial imaging data</kwd>
        <kwd>neural network classifiers</kwd>
        <kwd>machine learning</kwd>
        <kwd>limited computational resources</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        One of the key challenges in processing aerial imaging data is to ensure the high efficiency of
visual analysis systems under the constrained computational resources typical of onboard
equipment used in unmanned aerial vehicles (UAVs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Traditional methods for processing
streaming video from surveillance cameras require substantial computational power and time,
rendering them unsuitable for real-time operation—particularly in scenarios where the timely
detection of target objects is critical [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In military applications, for instance, delays in object
identification may lead to a loss of tactical advantage, while in search and rescue operations, they
may reduce the effectiveness of victim detection [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        This issue is further exacerbated by the limited energy supply available to UAVs, necessitating the
use of energy-efficient algorithms. The application of optimized models for real-time classification
and segmentation can significantly reduce the volume of data transmitted to ground stations,
thereby improving the overall autonomy and functionality of the system [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>To enable effective processing of video streams from UAV onboard cameras, an approach that
combines high processing speed with an acceptable level of detection and classification accuracy is
required. Given the constrained computational resources of such platforms, real-time processing of
high-resolution full frames using deep models is largely impractical due to considerable hardware
demands.</p>
      <p>This study proposes a two-stage image processing procedure. It involves an initial downscaling of
the frame resolution, followed by detailed processing of selected fragments using more accurate
classification models that are capable of efficient operation on single-board computers. This
approach reduces the volume of input data processed at each stage, thereby decreasing the overall
computation time and increasing the system’s efficiency.</p>
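      <p>The sketch below illustrates this two-stage flow; it is a simplified illustration rather than the
authors' implementation, and it assumes OpenCV for the downscaling step together with placeholder
callables for the fragment-selection and classification stages discussed later.</p>
      <preformat>
# Hypothetical sketch of the two-stage procedure described above:
# stage 1 downscales the full frame for fast preliminary analysis,
# stage 2 classifies selected 64x64 fragments with a more accurate model.
import cv2
import numpy as np

FRAGMENT = 64  # fragment side length in pixels, as used in this study

def two_stage_process(frame, select_fragment_corners, classify_batch, scale=0.5):
    # Stage 1: downscale the full frame for the fast preliminary analysis.
    small = cv2.resize(frame, None, fx=scale, fy=scale,
                       interpolation=cv2.INTER_AREA)
    # Placeholder selector: returns top-left corners of informative
    # fragments, expressed in the coordinates of the original frame.
    corners = select_fragment_corners(small, scale)
    # Stage 2: cut 64x64 fragments from the original frame and pass them
    # to the more accurate classifier as a single batch.
    batch = np.stack([frame[y:y + FRAGMENT, x:x + FRAGMENT]
                      for (x, y) in corners])
    return classify_batch(batch)
      </preformat>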
      <p>However, it is important to note that downscaling an image is functionally equivalent to
smoothing or filtering out high-frequency components. This may result in the loss of critical
information, particularly when target objects are small or when images are captured from high
altitudes. Consequently, the likelihood of missing objects of interest increases—an important
drawback in tasks that require precise detection and localization.</p>
      <p>Therefore, optimizing video stream processing onboard UAVs requires careful balancing of
processing speed and image informativeness. This can be achieved through a combination of
downscaling, pre-filtering, and the use of adapted deep models.</p>
      <p>We begin by reviewing some of the most widely used deep learning models for object detection in
UAV-acquired imagery.</p>
      <p>
        YOLO (You Only Look Once) is one of the most popular deep learning architectures for real-time
object detection [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Its main advantage lies in simultaneously determining object boundaries and
performing classification in a single network pass, enabling high-speed inference with acceptable
accuracy (up to 75% mAP in YOLOv8 on the COCO dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). YOLO performs well in scenarios
with fixed camera positions or predictable movement, such as ground vehicle navigation systems.
However, YOLO presents several critical limitations when applied to aerial monitoring with UAVs.
First, it is poorly adapted to scale variations and perspective distortions, which are inevitable due
to changes in altitude and camera angles during flight. Second, in dynamic environments with
highly variable backgrounds and lighting conditions, YOLO often exhibits reduced accuracy due to
limited contextual adaptability. Even simplified versions like Tiny-YOLO remain too
resource-intensive for deployment on single-board computers such as the Raspberry Pi or OrangePi.
Thus, despite its overall effectiveness in ground-based applications, YOLO is not an optimal choice
for onboard deployment in UAVs operating under constrained computational conditions and
complex aerial dynamics. To overcome these limitations, this paper proposes a different approach
aimed at quantifying classification performance on single-board computers under batch processing
conditions. The proposed method involves estimating classification time via linear regression
models as a function of batch size for various neural architectures. This approach enables a deeper
understanding of performance boundaries and supports the development of onboard computer
vision solutions tailored to the constraints of domestic UAVs.
      </p>
      <p>Several other modern architectures also merit discussion for their potential applicability to various
computer vision tasks involving UAV imagery.</p>
      <p>
        Vision Transformers (ViT) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] employ a self-attention mechanism that effectively captures spatial
dependencies in images. This architecture demonstrates high accuracy, especially on large datasets.
However, its significant computational requirements and dependence on powerful hardware limit
its use in low-resource environments like single-board computers.
      </p>
      <p>
        NASNet (Neural Architecture Search Network) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is the result of automated architecture
optimization tailored for classification and detection tasks. While it delivers high accuracy and can
be adapted to resource constraints, it is impractical for real-time deployment due to its resource
demands.
      </p>
      <p>
        Faster R-CNN is one of the most widely used architectures for object detection, offering excellent
accuracy in complex scenes with multiple objects [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Nevertheless, its computational complexity
and substantial processing delays make it unsuitable for real-time applications on platforms like
the Raspberry Pi.
      </p>
      <p>
        SqueezeNet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], by contrast, is explicitly designed for resource-constrained environments. Its
compact architecture (~1.2 million parameters) enables efficient operation on embedded platforms.
However, its main drawback remains the notably lower recognition accuracy compared to more
advanced models.
      </p>
      <p>Each of these models exhibits specific strengths, yet all suffer from considerable limitations. Their
complexity, energy consumption, and processing delays hinder real-time deployment on
autonomous UAV platforms. This creates a demand for thorough analysis of models that not only
deliver acceptable classification accuracy but also meet strict performance and energy-efficiency
criteria.</p>
      <p>The objective of this study is to experimentally assess the performance of contemporary deep
learning models under limited computational resources and to formalize the relationship between
processing speed and input batch size. The Raspberry Pi 4 Model B was used as the target device
for evaluation, serving as a prototypical low-power, resource-constrained hardware platform
commonly employed onboard UAVs.</p>
      <p>For the experiments, four widely adopted convolutional neural network architectures were
selected, differing in complexity, accuracy, and computational requirements: EfficientNetV2S,
MobileNetV2, ResNet50, and ResNet101. This selection is based on their prevalence in applied
computer vision tasks, widespread support across deep learning frameworks, and availability of
open-source models with well-documented performance metrics. Furthermore, these architectures
represent distinct categories in terms of the trade-off between accuracy and performance—ranging
from lightweight mobile models to deep high-precision networks. Evaluating their behavior on the
Raspberry Pi platform enables the formulation of informed recommendations regarding
architecture selection based on task-specific requirements and computational constraints.</p>
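      <p>As an illustration of how the selected architectures can be instantiated for such an evaluation, the
sketch below loads them through tf.keras.applications; the choice of TensorFlow/Keras and of a
64×64 input shape are assumptions made here for illustration and are not prescribed by the
experimental setup described below.</p>
      <preformat>
# Sketch (assumption: TensorFlow/Keras; the study does not name a framework).
# Loads the four compared architectures so their size and complexity can be
# inspected before deployment on the single-board computer.
import tensorflow as tf

ARCHITECTURES = {
    "EfficientNetV2S": tf.keras.applications.EfficientNetV2S,
    "MobileNetV2":     tf.keras.applications.MobileNetV2,
    "ResNet50":        tf.keras.applications.ResNet50,
    "ResNet101":       tf.keras.applications.ResNet101,
}

models = {}
for name, ctor in ARCHITECTURES.items():
    # include_top=False permits non-default input sizes such as the 64x64
    # fragments used later; a classification head for the target classes
    # would be added and trained separately.
    models[name] = ctor(weights="imagenet", include_top=False,
                        input_shape=(64, 64, 3))
    print(name, f"{models[name].count_params():,} parameters")
      </preformat>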
    </sec>
    <sec id="sec-2">
      <title>2. Materials and Methods</title>
      <sec id="sec-2-1">
        <title>Neural network models and experimental setup</title>
        <p>The following neural network models were considered in this study.</p>
        <p>
          EfficientNetV2S (Efficient Network V2 Small) is a representative of the second generation of deep
convolutional neural networks, optimized for fast inference and training. The model was developed
using Neural Architecture Search (NAS) and the compound scaling strategy. A key feature of
EfficientNetV2S is the combination of traditional mobile blocks (MBConv) with the newer
FusedMBConv blocks, which significantly reduces processing time without compromising accuracy. Due
to its balanced architecture, EfficientNetV2S achieves high performance on the ImageNet
benchmark with a relatively small number of parameters, making it suitable for both server-based
and embedded applications [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          MobileNetV2 is a lightweight neural network architecture designed for use on mobile and
embedded devices. It combines depthwise separable convolutions with inverted residual blocks and
linear bottlenecks, significantly reducing computational overhead. A unique characteristic of
MobileNetV2 is the transformation order within each block: instead of reducing the number of
output channels as in conventional networks, the number of channels is first expanded and then
compressed. This helps retain informative features and minimizes information loss during
convolutional processing. The architecture is especially well-suited for real-time tasks under
resource-constrained conditions [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          ResNet50 (Residual Network-50) is a classic deep convolutional neural network architecture based
on shortcut (residual) connections. ResNet was introduced to address the vanishing gradient
problem in very deep networks. ResNet50 employs residual blocks with three layers (bottleneck
blocks), enabling effective signal propagation across deep layers. Due to its high accuracy and
moderate computational requirements, ResNet50 is widely used in various computer vision tasks
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          ResNet101 is an extended version of the ResNet architecture, consisting of 101 convolutional layers
and using the same bottleneck blocks as ResNet50. The increased depth allows the model to
capture more complex features, which enhances classification performance, particularly on large
datasets. However, the larger number of parameters and FLOPs increases processing time and
makes ResNet101 less suitable for real-time inference. This model is best suited for high-accuracy
tasks where computational resources are not a limiting factor [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>For model deployment and testing, the hardware platform selected was the Raspberry Pi 4
Model B single-board computer, in order to evaluate the suitability of the networks for integration
onboard UAVs.</p>
        <p>
          To ease the processing load on the models by limiting the number of regions passed for
classification, a study [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] explored a fast filtering method to extract informative image fragments
from each frame. Specifically, it was demonstrated that for a frame of 1920×1080 pixels, the
authors’ custom rapid selection algorithm for informative fragments (each 64×64 pixels) processes
the frame using a sliding window in less than 0.1 seconds. Thus, within this time frame, a
classification model receives a batch consisting of a variable number of 64×64-pixel images. In
general, the number of images in a batch may range from several dozen to several hundred,
depending on the hyperparameters of the frame processing procedure. The authors of the study
suggest that this number is primarily determined by two factors: the texture variability of
individual image fragments and the low likelihood of encountering multiple informative fragments
within a single frame [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
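        <p>The fast selection algorithm itself is the one proposed by the authors and is not reproduced here;
the following sketch only illustrates the sliding-window mechanics over a 1920×1080 frame, with a
hypothetical variance threshold standing in for the actual informativeness criterion.</p>
        <preformat>
import numpy as np

def select_informative_fragments(gray_frame, window=64, stride=64,
                                 var_threshold=200.0):
    """Collect 64x64 fragments whose local variance exceeds a threshold.

    The variance test is only a placeholder for the fast informativeness
    criterion of the cited filtering method; it merely shows how a frame is
    turned into a batch of fragments for the classifier.
    """
    fragments = []
    height, width = gray_frame.shape[:2]
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            patch = gray_frame[y:y + window, x:x + window]
            if patch.var() &gt; var_threshold:
                fragments.append(patch)
    return np.stack(fragments) if fragments else np.empty((0, window, window))

# A synthetic 1920x1080 frame: prints the number of fragments selected,
# which becomes the classifier's input batch size for that frame.
frame = (np.random.rand(1080, 1920) * 255).astype(np.uint8)
print("fragments in batch:", len(select_informative_fragments(frame)))
        </preformat>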
        <p>The goal of this work is to determine the maximum allowable batch size for which the
classification model’s inference time does not exceed 0.85 seconds. This duration ensures that both
the information filtering and the classification of informative image fragments are guaranteed to be
completed within one second on the given computational device.</p>
        <p>This paper presents the results of an experimental evaluation of the performance of four selected
convolutional neural network architectures: EfficientNetV2S, MobileNetV2, ResNet50, and
ResNet101.</p>
        <p>For each model, a series of inference time measurements (i.e., image processing times) was
conducted across varying input batch sizes, ranging from 1 to 512 images. During testing, a fixed
hardware configuration and software environment were maintained to eliminate external factors
that could influence the results. All measurements were performed on Raspberry Pi 4 Model B
single-board computers, emulating the resource-constrained environment typical of onboard UAV
systems. This platform presents significant limitations in terms of CPU power, RAM, and energy
consumption, thereby providing a realistic simulation of real-time operation conditions.</p>
        <p>For further analysis, the data were grouped into six batch size intervals: 0–16; 17–32; 33–64;
65–128; 129–256; and 257–512. For each interval, observations were aggregated, and the average batch
processing time was computed in seconds (Table 2). Outlier data were removed using the
interquartile range (IQR) method to ensure statistical robustness of the results.</p>
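        <p>A condensed sketch of this measurement protocol is given below; the timed prediction call, the set
of batch sizes, the number of repetitions, and the 1.5·IQR factor are assumptions consistent with the
description above rather than the exact benchmarking script used in the experiments.</p>
        <preformat>
# Sketch of the timing protocol: measure per-batch inference time for batch
# sizes up to 512, drop outliers with the interquartile-range (IQR) rule,
# and average within the batch-size intervals used in Table 2.
import time
import numpy as np
import pandas as pd

def measure(model, sizes=(1, 2, 4, 8, 16, 32, 64, 128, 256, 512), repeats=5):
    rows = []
    for n in sizes:
        batch = np.random.rand(n, 64, 64, 3).astype("float32")
        for _ in range(repeats):
            t0 = time.perf_counter()
            model.predict(batch, verbose=0)
            rows.append({"batch_size": n, "seconds": time.perf_counter() - t0})
    return pd.DataFrame(rows)

def aggregate(df):
    # Remove outliers per batch size with the 1.5*IQR rule, then average
    # the cleaned timings within the intervals reported in the paper.
    def drop_outliers(group):
        q1, q3 = group["seconds"].quantile([0.25, 0.75])
        iqr = q3 - q1
        return group[group["seconds"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    df = df.groupby("batch_size", group_keys=False).apply(drop_outliers)
    bins = [0, 16, 32, 64, 128, 256, 512]
    df["interval"] = pd.cut(df["batch_size"], bins)
    return df.groupby("interval", observed=True)["seconds"].mean()
        </preformat>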
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Discussion</title>
      <p>Based on the results presented in Table 2, the following conclusions can be drawn.</p>
      <p>MobileNetV2 demonstrates the lowest average processing time across all batch sizes, confirming its
efficiency for mobile and real-time applications. Its architecture, characterized by a small number
of parameters and low computational complexity, enables high-speed performance even under
increased processing loads.</p>
      <p>EfficientNetV2S provides the highest classification accuracy among all models considered, while
also maintaining high performance. This makes it an optimal choice for systems where a balance
between recognition quality and processing speed is essential.</p>
      <p>ResNet50 exhibits moderate inference time and represents a well-balanced trade-off between
accuracy and speed. As a result, it is frequently used as a baseline model in production-level
computer vision systems.</p>
      <p>ResNet101 is the slowest of all the models across all batch size intervals. Despite offering slightly
higher accuracy compared to ResNet50, it demonstrates low efficiency in terms of performance,
which limits its practical applicability in real-time tasks.</p>
      <p>For all models, an approximately linear relationship between batch size and processing time is
observed; however, the rate of increase (i.e., the slope of the regression line) varies by architecture.
Further results are presented below.</p>
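      <p>For each architecture, the regression reported below can be reproduced from such aggregated
measurements with an ordinary least-squares fit. The sketch below, which assumes the measurement
format of the previous listing, uses numpy.polyfit and derives the largest admissible batch size from
the fitted coefficients and the 0.85-second budget introduced in Section 2.</p>
      <preformat>
import numpy as np

def fit_time_model(batch_sizes, seconds, threshold=0.85):
    """Fit T(n) = a*n + b, report R^2, and derive the largest admissible batch."""
    n = np.asarray(batch_sizes, dtype=float)
    t = np.asarray(seconds, dtype=float)
    a, b = np.polyfit(n, t, deg=1)                 # slope (s/image), intercept (s)
    residuals = t - (a * n + b)
    r2 = 1.0 - residuals.var() / t.var()           # coefficient of determination
    max_batch = int(np.floor((threshold - b) / a)) # largest n within the budget
    return a, b, r2, max_batch
      </preformat>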
      <p>Figure 1 illustrates the empirical relationship between input batch size and the average processing
time per batch for the EfficientNetV2S model. Prior to regression modeling, the data were cleaned
to remove statistical outliers. The fitted linear regression has the form T(n) = a·n + b, where T(n) is
the average processing time (in seconds) and n is the batch size.</p>
      <p>The coefficient of determination is R2 = 0.95618.</p>
      <p>The red line in Figure 1 represents the linear regression result, which approximates the empirical
relationship within the investigated range (batch sizes up to 32). The coefficient of determination
R2 = 0.95618 indicates a strong linear relationship between batch size and processing time.
The horizontal dashed line marks the threshold for acceptable processing time, set at 0.85 seconds,
corresponding to typical real-time constraints when accounting for the frame pre-filtering step.
The vertical dashed line intersects the regression line at the point corresponding to the maximum
permissible batch size that does not exceed the specified time limit. According to the regression
results, this threshold for the given model is 13 images per batch.</p>
      <p>Figure 2 presents the corresponding results for the MobileNetV2 model. The fitted regression has the
same form T(n) = a·n + b, where T(n) is the average processing time (in seconds) and n is the batch size.</p>
      <p>The coefficient of determination is R2 = 0.97811.</p>
      <p>This indicates a very high degree of agreement between the model and the empirical data.
The maximum batch size that ensures processing within 0.85 seconds is 58 images per batch.</p>
      <p>Figure 3 presents the results for the ResNet50 model. The fitted regression has the same form
T(n) = a·n + b, where T(n) is the average processing time (in seconds) and n is the batch size.</p>
      <p>The coefficient of determination is R2 = 0.99373. This value indicates an excellent linear fit to the
empirical data. The maximum batch size processed within 0.85 seconds is 14 images.</p>
      <p>Figure 4 illustrates the relationship between the processing time per image and the input batch
size for the ResNet101 model, based on a subset of the experimental data (batch sizes up to 32 are
included for improved visualization). The fitted regression has the same form T(n) = a·n + b, where
T(n) is the average processing time (in seconds) and n is the batch size.</p>
      <p>The maximum batch size that allows processing within 0.85 seconds is 7 images.</p>
      <p>A summary table (Table 3) is provided below, presenting a comparative overview of the models at
the processing time threshold of 0.85 seconds. The table includes key characteristics: regression
slope, intercept, coefficient of determination, maximum allowable batch size, and Top-1
classification accuracy.</p>
      <sec id="sec-3-1">
        <title>Comparative evaluation of the models</title>
        <p>The results presented in Table 3 support the following conclusions.</p>
        <p>EfficientNetV2S demonstrates the highest classification accuracy (84.9%) while maintaining
acceptable inference speed, with a maximum batch size of 13 under the 0.85-second threshold. This
enables group image processing even in near real-time scenarios, making it one of the most
suitable models for tasks where high recognition precision is critical.</p>
        <p>MobileNetV2 offers the best performance in terms of speed, with the largest maximum batch size
(58) and the lowest regression slope (0.0105 s/image). However, it yields the lowest Top-1 accuracy
(72%) among the models considered. This makes it an attractive option for applications where
speed is paramount and accuracy is of secondary importance.</p>
        <p>ResNet50 provides a balanced compromise between inference speed and accuracy: it supports
batches of up to 14 images within the 0.85-second threshold and achieves a classification accuracy
of 76.2%. This makes it suitable for a wide range of tasks, particularly those requiring moderate
depth of analysis and computational robustness.</p>
        <p>ResNet101 achieves relatively high accuracy (77.4%) but significantly lower throughput, supporting
only 7 images per batch within the same time limit. Due to its computational complexity, it is the
least suitable model for deployment on resource-constrained devices.</p>
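        <p>As a simple illustration of how these figures could support adaptive architecture selection on
board, the sketch below encodes only the maximum admissible batch sizes and Top-1 accuracies from
Table 3 and picks the most accurate model that still fits the 0.85-second budget for an expected
fragment count; the selection rule itself is an assumption and not part of the reported experiments.</p>
        <preformat>
# Adaptive architecture selection sketch using the Table 3 figures quoted
# above: maximum batch size within 0.85 s and Top-1 accuracy (%).
TABLE3 = {
    "MobileNetV2":     {"max_batch": 58, "top1": 72.0},
    "EfficientNetV2S": {"max_batch": 13, "top1": 84.9},
    "ResNet50":        {"max_batch": 14, "top1": 76.2},
    "ResNet101":       {"max_batch": 7,  "top1": 77.4},
}

def pick_architecture(expected_batch, min_top1=0.0):
    """Most accurate model that still meets the 0.85 s budget for the batch."""
    feasible = [(v["top1"], name) for name, v in TABLE3.items()
                if v["max_batch"] &gt;= expected_batch and v["top1"] &gt;= min_top1]
    return max(feasible)[1] if feasible else None

# Example: with about 40 fragments per frame only MobileNetV2 stays in budget.
print(pick_architecture(expected_batch=40))
        </preformat>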
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>The conducted research focused on solving the classification problem under constrained
computational conditions, aiming to provide practical solutions for the development of onboard
vision systems for domestic UAV applications. Based on an analysis of existing methods, including
YOLO-based models that exhibit significant limitations in aerial environments, an alternative
approach was proposed. To support this, the authors constructed regression models describing
classification time as a function of batch size for several neural network architectures. The
obtained results may serve as a foundation for future research in the development of onboard
image processing technologies for aerial surveillance systems.</p>
      <p>In this study, linear regression
models were developed for the first time to describe the classification time of four neural network
architectures—EfficientNetV2S, MobileNetV2, ResNet50, and ResNet101—as a function of the
number of 64×64-pixel images in the input batch. The statistical significance and adequacy of the
regression models were substantiated, confirming their alignment with empirical measurement
results.</p>
      <p>For each model, the maximum permissible batch size that ensures processing within 0.85 seconds
(real-time mode) on a Raspberry Pi 4 Model B single-board computer was determined.
The constructed models enable accurate quantitative estimation of inference time without the need
for repeated physical measurements, which is particularly valuable for rapid performance
evaluation on low-power devices.</p>
      <p>MobileNetV2 demonstrated the highest performance, enabling the processing of up to 58 images
per batch within the 0.85-second threshold, with a strong regression fit (R2=0.993). This makes it an
appropriate choice for real-time systems with limited computational resources.</p>
      <p>EfficientNetV2S, offering higher classification accuracy (84.9%), supports batches of up to 13
images. The high coefficient of determination (R2=0.956) confirms the reliability of the regression
model for this architecture.</p>
      <p>ResNet50 achieves a batch size of 14 images, with strong approximation reliability (R2=0.981),
although it incurs higher computational costs compared to MobileNetV2.</p>
      <p>ResNet101 showed the lowest throughput—only 7 images per batch—due to its substantial
computational complexity. Despite its high classification accuracy (77.4%), this architecture
requires hardware acceleration to be viable for real-time applications.</p>
      <p>The results of this study provide a solid foundation for the informed selection of neural network
architectures in autonomous image analysis tasks, including onboard deployment in UAVs for
monitoring, inspection, and reconnaissance.</p>
      <p>The application of these findings is relevant in domains requiring rapid image analysis, particularly
in military operations, intelligence gathering, search and rescue missions, critical infrastructure
inspection, agrotechnology, and logistics. Furthermore, the results may be integrated into modern
monitoring and data analysis platforms to enhance their performance, adaptability to dynamic
conditions, and overall real-time efficiency.</p>
      <p>In conclusion, the outcomes of this work have practical value both for direct implementation in
modern embedded AI systems and for guiding the design of next-generation intelligent image
analysis modules with a focus on performance, autonomy, and energy efficiency.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4o to generate the images in Figures 1–4.
After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Prystavka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Chyrkov Suspicious</surname>
          </string-name>
          Object Search in Airborne Camera Video Stream // In: Z.,
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petoukhov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dychka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
          </string-name>
          (eds)
          <article-title>Advances in Computer Science for Engineering and Education</article-title>
          .
          <source>ICCSEEA 2018. Advances in Intelligent Systems and Computing</source>
          , vol.
          <volume>754</volume>
          . Springer, Cham,
          <year>2018</year>
          . - P.
          <fpage>340</fpage>
          -
          <lpage>348</lpage>
          . - DOI: 10.1007/978-3-
          <fpage>319</fpage>
          -91008-6_
          <fpage>34</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Prystavka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Shevchenko</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          ,
          <article-title>Rokitianska Comparative Analysis of Detector-Tracker Architecture for Object Tracking Based on SBC for UAV //</article-title>
          <source>Proceedings of the 2024 IEEE 7th International Conference on Actual Problems of Unmanned Aerial Vehicles Development (APUAVD</source>
          <year>2024</year>
          ). -
          <fpage>2024</fpage>
          . - P.
          <fpage>175</fpage>
          -
          <lpage>178</lpage>
          . - DOI: 10.1109/APUAVD64488.
          <year>2024</year>
          .10765897
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Prystavka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Chyrkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Sorokopud</surname>
          </string-name>
          ,
          <string-name>
            <surname>V.</surname>
          </string-name>
          ,
          <source>Kovtun Automated Complex for Aerial Reconnaissance Tasks in Modern Armed Conflicts // CEUR Workshop Proceedings</source>
          , vol.
          <volume>2588</volume>
          ,
          <year>2019</year>
          . - P.
          <fpage>57</fpage>
          -
          <lpage>66</lpage>
          . - [Online]. Available: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>2588</volume>
          /paper6.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Prystavka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            ,
            <surname>Cholyshkina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Dolgikh</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          ,
          <source>Karpenko Automated Object Recognition System Based on Convolutional Autoencoder // 2020 10th International Conference on Advanced Computer Information Technologies (ACIT)</source>
          .
          <source>- 2020</source>
          . - P.
          <fpage>830</fpage>
          -
          <lpage>833</lpage>
          . - DOI: 10.1109/ACIT49673.
          <year>2020</year>
          .9208945
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Farhadi</surname>
          </string-name>
          <string-name>
            <surname>YOLO9000</surname>
          </string-name>
          : Better, Faster, Stronger //
          <source>Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          .
          <source>- 2017</source>
          . - P.
          <fpage>6517</fpage>
          -
          <lpage>6525</lpage>
          . - DOI: 10.1109/CVPR.
          <year>2017</year>
          .690
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Tan</surname>
          </string-name>
          <string-name>
            <surname>MD</surname>
          </string-name>
          -
          <article-title>YOLO: Multi-scale Dense YOLO for Small Target Pest Detection // Computers and</article-title>
          Electronics in Agriculture. -
          <year>2023</year>
          . - Vol.
          <volume>213</volume>
          . - Article 108233. - DOI: 10.1016/j.compag.
          <year>2023</year>
          .108233
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Kolesnikov</surname>
          </string-name>
          , et al.
          <article-title>An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</article-title>
          . arXiv:
          <year>2010</year>
          .
          <article-title>11929 [cs</article-title>
          .CV],
          <year>2020</year>
          . https://arxiv.org/abs/
          <year>2010</year>
          .11929
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            ,
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Learning Transferable Architectures for Scalable Image Recognition</article-title>
          .
          <source>arXiv:1707.07012 [cs.CV]</source>
          ,
          <year>2017</year>
          . https://arxiv.org/abs/1707.07012
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Sun Faster</surname>
          </string-name>
          R-CNN:
          <article-title>Towards Real-Time Object Detection with Region Proposal Networks</article-title>
          .
          <source>arXiv:1506.01497 [cs.CV]</source>
          ,
          <year>2015</year>
          . https://arxiv.org/abs/1506.01497
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F. N.</given-names>
            ,
            <surname>Iandola</surname>
          </string-name>
          , S., Han,
          <string-name>
            <given-names>M. W.</given-names>
            ,
            <surname>Moskewicz</surname>
          </string-name>
          , et al.
          <article-title>SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and &lt;0.5MB Model Size</article-title>
          .
          <source>arXiv:1602.07360 [cs.CV]</source>
          ,
          <year>2016</year>
          . https://arxiv.org/abs/1602.07360
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Tan</surname>
          </string-name>
          , &amp;
          <string-name>
            <surname>Q. Le</surname>
          </string-name>
          , (
          <year>2021</year>
          ).
          <article-title>EfficientNetV2: Smaller Models and Faster Training</article-title>
          .
          <source>arXiv preprint arXiv:2104</source>
          .00298. https://doi.org/10.48550/arXiv.2104.00298
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Zhmoginov</surname>
          </string-name>
          , &amp;
          <string-name>
            <given-names>L. C.</given-names>
            ,
            <surname>Chen</surname>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>MobileNetV2: Inverted Residuals and Linear Bottlenecks</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <fpage>4510</fpage>
          -
          <lpage>4520</lpage>
          . https://doi.org/10.1109/CVPR.
          <year>2018</year>
          .00474
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            ,
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Ren</surname>
          </string-name>
          , &amp; J.,
          <string-name>
            <surname>Sun</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . https://doi.org/10.1109/CVPR.
          <year>2016</year>
          .90
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Prystavka</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          ,
          <source>Zhultynska Spline Approaches for Anomaly Detection in UAV-Based Aerial Surveillance // Proceedings of the 2024 IEEE 7th International Conference on Actual Problems of Unmanned Aerial Vehicles Development (APUAVD</source>
          <year>2024</year>
          ). -
          <fpage>2024</fpage>
          . - P.
          <fpage>187</fpage>
          -
          <lpage>190</lpage>
          . - DOI: 10.1109/APUAVD64488.
          <year>2024</year>
          .10765898
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Zivakin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            ,
            <surname>Kozachuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Prystavka</surname>
          </string-name>
          ,
          <string-name>
            <surname>O.</surname>
          </string-name>
          ,
          <article-title>Cholyshkina Training set AERIAL SURVEY for Data Recognition Systems From Aerial Surveillance Cameras /</article-title>
          / CEUR Workshop Proceedings. -
          <year>2022</year>
          . - Vol.
          <volume>3347</volume>
          . - P.
          <fpage>246</fpage>
          -
          <lpage>255</lpage>
          . - [Online]. Available: https://ceur-ws.org/Vol3347/Paper_21.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>