<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Modern Data Science Technologies Doctoral Consortium, June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Dnipro University of Technology</institution>
          ,
          <addr-line>Dmytra Yavornytskoho Ave 19, Dnipro, 49005</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>15</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
<p>This paper presents an information technology for car detection on aerial imagery using a modified YOLOv11 architecture with support for oriented bounding boxes (OBB), enabling improved accuracy in vehicle detection within complex scenes on high-resolution images. The core YOLOv11-OBB architecture integrates the Freq-SpaFEM, SPPF, DAGLF, AMSF2, and RAFA modules, designed to optimize feature extraction, refinement, and aggregation, thereby enhancing the quality of vehicle detection in high spatial resolution imagery. Experimental studies demonstrated a significant performance improvement of the proposed approach over existing object detection models according to key metrics: mAP@50 at 96.8%, Precision at 99.5%, Recall at 96.8%, and OBB IoU at 98.5%. The developed information technology can be applied in urban planning for infrastructure analysis and design, and in traffic management systems for monitoring vehicle flows to support informed decision-making processes.</p>
      </abstract>
      <kwd-group>
<kwd>car detection</kwd>
        <kwd>aerial imagery</kwd>
        <kwd>YOLOv11-OBB</kwd>
        <kwd>oriented bounding boxes</kwd>
        <kwd>feature extraction</kwd>
        <kwd>high-resolution images</kwd>
        <kwd>object detection</kwd>
<kwd>deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Object detection is one of the most challenging tasks in the field of computer vision, as it involves
both object classification and accurate determination of their spatial position in an image
(detection). Given the rapid development of digital image processing technologies, artificial
intelligence, and deep learning in particular, object detection methods have undergone significant
growth, which has opened up new prospects for their application in high-precision automated
systems. This task is particularly relevant in the context of processing images obtained from aerial
remote sensing platforms, including uncrewed aerial vehicles (UAVs). Aerial photographs
generated by UAVs are characterized by high spatial resolution, wide coverage bandwidth, and
flexible imaging parameters, including platform height, sensor tilt angle, and choice of scene
coverage geometry. These characteristics make UAV imagery effective in applied tasks such as
automated monitoring of transport infrastructure, urban planning, emergency response, etc.</p>
      <p>
        Despite the numerous advantages of aerial imagery obtained with the help of UAVs, their use in
object detection tasks is accompanied by several technical difficulties. In particular, the
orthoperspective characteristic of aerial imagery ensures that objects are invariant to rotation, but at the
same time reduces the expressiveness of texture and morphological features, which makes it
challenging to identify and recognize objects. In addition, the scale characteristics of objects
depend on the platform's flight altitude, the optical properties of the sensor, and other parameters,
which leads to variability in the apparent size of target objects even in high spatial resolution imagery. This makes it
difficult to form stable feature descriptors for small-scale or crowded objects. Another
factor that reduces detection efficiency is the complex and heterogeneous structure of the
background. One image can combine urbanized areas, natural landscapes, transportation elements,
etc., which causes a high level of background noise and increases the risk of false positives during
classification. Atmospheric phenomena also have a significant impact, such as cloudiness, fog,
shadowy areas, or uneven lighting, which affect the quality of data analysis. They can distort the
visual characteristics of objects, reducing the reliability of automated object identification [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
<p>Among the objects captured in aerial imagery, vehicles, in particular cars, are of special interest, as they
are of practical importance for the implementation of intelligent monitoring systems, traffic flow
modeling, urban planning, and improving the efficiency of security technologies. Detecting cars in
UAV imagery is complicated by their small size (usually 15-30 pixels), high density in transportation hubs,
similarity of color and texture features to the environment (roads, parking lots), and a variety of
shapes and orientations. These factors significantly complicate automated vehicle identification,
which requires the development of models capable of providing accurate localization and
classification of vehicles in conditions of complex backgrounds and large-scale variability.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
<p>Methods for detecting vehicles in aerial images are conventionally divided into the following groups: multi-scale object
detection, tiny object detection, dense object detection, and deep learning-based methods.</p>
      <p>
        One of the most extensively studied areas is the detection of multi-scale objects, which is
particularly relevant in vehicle detection tasks, where object size largely depends on imaging
altitude, camera focal length, and other image parameters. Architectures with multi-level feature
construction are widely used to address the issue of scale variability. One of the most common
solutions is the Feature Pyramid Network (FPN), proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which implements a hierarchical
top-down feature construction approach with lateral connections. This method enables the
generation of feature maps at different hierarchical levels, providing an adequate representation of
objects at various scales and improving the accuracy of small object detection without significantly
increasing computational cost. However, the study [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] highlights several limitations of the
conventional FPN. In particular, the basic feature fusion approach does not account for contextual
alignment, which complicates the formation of high-quality descriptors and may result in the loss
of critical spatial details. Additionally, the use of traditional upsampling when merging multi-scale
features can cause mismatches between spatial representations at different levels. To overcome
these limitations, the authors proposed an improved approach, the Adaptive Feature Pyramid
Network, which allows for more precise integration of spatial information from different levels of
the feature hierarchy. Another direction in enhancing multi-scale detection involves improving
classical one-stage detectors. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a modification of the base Single Shot MultiBox Detector (SSD)
architecture called Feature Fusion SSD (FSSD) was introduced, aimed at improving accuracy with
minimal reduction in processing speed. The key innovation of FSSD lies in the introduction of a
lightweight and efficient feature fusion module, where features from different network levels and
scales are merged into a unified representation. This fused representation is then downsampled to
form a new feature pyramid, which is passed to multiple detection blocks for prediction. This
approach enhances contextual information representation and improves small object detection
performance without significant computational overhead.
      </p>
      <p>
        Tiny object detection methods are particularly relevant for the analysis of high-resolution
images, where objects, such as vehicles, may occupy only a few dozen pixels. One simple yet
effective solution is the adaptation of the training strategy, specifically by lowering the Intersection
over Union (IoU) threshold when assigning positive and negative samples for anchor regions in
Region Proposal Networks (RPNs). It allows for more flexible training sample generation, which
improves the localization of small objects. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a fundamentally new approach to object
representation was proposed, known as RepPoints. Instead of representing an object with a fixed
rectangular bounding box, RepPoints describe it as a set of learnable sampling points that
simultaneously serve for both localization and semantic representation. These points are
automatically trained to spatially cover the object and capture its most significant local regions.
RepPoints eliminate the need for the traditional anchor box mechanism. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a practical and
straightforward approach to dynamic label assignment was introduced, which takes into account
the current training state based on model predictions. The method incorporates the predicted IoU
scores into the positive sample selection process, enabling the selection of the most relevant
examples with better alignment to ground truth objects. This approach helps reduce the
discrepancy between classification and localization scores and enhances the quality of bounding
boxes.
      </p>
      <p>
        Dense object detection methods are aimed at addressing the challenges associated with high
object density and compactness in images. Such conditions are often observed in aerial
photographs, where objects, for example, ships in harbors, airplanes in aircraft graveyards, or
vehicles in parking lots and on roads are located close to one another. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the authors proposed
the SCRDet++ method to improve the detection of densely and compactly arranged objects in
aerial imagery. The core of the approach lies in introducing an instance-level denoising mechanism
in the feature map, which enhances the detection accuracy of small and crowded objects. To
address problems related to object orientation changes, a constant IoU coefficient was added to the
loss function, resolving issues of angular periodicity and boundary interchangeability. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a
Constraint-based Fine-tuning Approach (CFA) was proposed to reduce catastrophic forgetting in
incremental learning tasks without increasing the model’s parametric complexity. The method
adapts Average Gradient Episodic Memory (A-GEM) to the generalized few-shot object detection
(G-FSOD) scenario by introducing additional constraints on gradient search. It enables the
formulation of a new gradient update rule that ensures more effective knowledge alignment
between base and novel classes.
      </p>
      <p>
        Deep learning-based methods [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], particularly convolutional neural networks (CNNs), have
significantly improved the effectiveness of object detection in images over the past decade [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ].
Two-stage architectures such as R-FCN, Faster R-CNN, and MS-Faster R-CNN [13, 14] have gained
particular popularity for achieving state-of-the-art (SOTA) accuracy by first generating region
proposals and then refining object boundaries and classes. In contrast, one-stage models,
particularly the YOLO series [15–18], perform detection without the proposal generation stage,
which provides significantly higher processing speed. These models are better suited for real-time
applications but may fall short of two-stage methods in terms of accuracy when detecting small or
densely clustered objects.
      </p>
      <p>
        Despite the rapid development of object detection methods based on convolutional neural
networks (CNNs), there are still limitations that significantly affect their effectiveness in aerial
image object detection tasks. In particular, although modern CNN architectures exhibit rotation
invariance at the feature level, traditional approaches still predominantly operate with horizontal
bounding boxes (HBBs), which are unable to accurately describe oriented objects typical for aerial
imagery (e.g., ships, airplanes, large vehicles). In such cases, the use of oriented bounding boxes
(OBBs) [
        <xref ref-type="bibr" rid="ref11">11, 19</xref>
        ] is more appropriate, as they allow for more precise localization of objects with
complex geometries, take into account their orientation, and improve the separation of densely
packed instances. The implementation of OBBs effectively forms a separate task, oriented object
detection, which requires modifications to both detector architectures and loss functions [20].
Another issue is the input image size constraints inherent to many existing models [21]. For
instance, Faster R-CNN typically resizes the shorter image side to 600 pixels, while YOLO operates
with a fixed input size of 608×608 pixels. At the same time, the typical dimensions of images from
remote sensing datasets, such as ITCVD or DOTA, far exceed these limitations, necessitating
additional solutions such as image tiling or scalable architectures.
      </p>
      <p>This work aims to develop an information technology for vehicle detection in aerial images
using a modified YOLOv11 architecture with support for oriented bounding boxes (OBB), which
enables improved detection accuracy of vehicles in complex scenes within high spatial resolution
imagery.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed technology</title>
      <p>The proposed information technology is presented in Figure 1 and is functionally structured into
five key stages. Each of these stages is an integral part of the overall data processing workflow and
the model development and validation process, ensuring a systematic approach to solving the task
of vehicle detection in aerial images.</p>
      <p>At the first step of the technology implementation, the input aerial image is loaded, and the
COCO dataset [22] is used as a basic dataset with vehicle markings. The input to the information
technology is an aerial image I ∈ ℝ^(H×W×3),
where H is the image height, W is the image width, and 3 is the number of channels (RGB).</p>
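      <p>As an illustration of this input format, the following Python sketch (not part of the described technology; the file name and the use of OpenCV are assumptions) loads an aerial image and verifies that it conforms to the expected H × W × 3 (RGB) layout.</p>
      <preformat>
import cv2
import numpy as np

def load_aerial_image(path: str) -> np.ndarray:
    """Load an aerial image and confirm it is an H x W x 3 (RGB) array."""
    bgr = cv2.imread(path, cv2.IMREAD_COLOR)      # OpenCV reads images in BGR order
    if bgr is None:
        raise FileNotFoundError(f"Cannot read image: {path}")
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)    # convert to RGB channel order
    assert rgb.ndim == 3 and rgb.shape[2] == 3, "expected an H x W x 3 image"
    return rgb

# Hypothetical usage:
# image = load_aerial_image("aerial_scene.png")
# H, W, _ = image.shape
      </preformat>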
      <p>The second step involves data preparation and OBB annotation. First, the uploaded aerial
images are pre-processed [23]: normalization, resizing, contrast-limited adaptive histogram
equalization (CLAHE), and color correction. To ensure the stability of neural network training, the
primary data undergoes a normalization step: I′(x, y, c) = (I(x, y, c) − μ_c) / σ_c,</p>
      <p>where I(x, y, c) is the intensity of a pixel at coordinate (x, y) in channel c; μ_c and σ_c are the mean
value and standard deviation of pixel values in the channel; I′ is the normalized image.</p>
      <p>The images are then resized in a way that preserves the original scene geometry to avoid distortion of
the aspect ratio. Next, Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to
enhance the local contrast of the image, which is especially important for aerial images where
objects may have low contrast with the background due to lighting conditions or shadows. Unlike
global histogram equalization, CLAHE processes the image in small, non-overlapping regions (tiles)
and then uses bilinear interpolation to smooth the boundaries between tiles. Let p(k) be the
normalized histogram of pixel intensities for each tile, where k is the intensity level. The intensity
transformation from I_in to I_out is performed using the cumulative distribution function (CDF)
C(k) = Σ_{j=0}^{k} p(j), so that I_out = (L − 1)·C(I_in), where L is the number of intensity levels.</p>
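      <p>A minimal sketch of this pre-processing chain is given below, assuming OpenCV is used; the tile size, clip limit, and application of CLAHE to the luminance channel of a LAB representation are illustrative assumptions rather than the exact settings of the proposed technology.</p>
      <preformat>
import cv2
import numpy as np

def preprocess(rgb: np.ndarray, tile: int = 8, clip: float = 2.0) -> np.ndarray:
    """Apply CLAHE to the luminance channel, then standardize each channel."""
    # CLAHE operates on 8-bit single-channel data, so it is applied to the
    # L channel of the LAB representation and merged back into RGB.
    lab = cv2.cvtColor(rgb, cv2.COLOR_RGB2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip, tileGridSize=(tile, tile))
    enhanced = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2RGB)

    # Per-channel standardization: I'(x, y, c) = (I(x, y, c) - mu_c) / sigma_c
    img = enhanced.astype(np.float32)
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True) + 1e-8
    return (img - mean) / std
      </preformat>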
      <p>After preprocessing, the extraction of geometric parameters is performed. For each object, an
oriented bounding box B = (x_c, y_c, w, h, θ) is calculated,
where (x_c, y_c) are the coordinates of the box center; w, h are the width and height of the box; θ is
the rotation angle of the box relative to the x-axis.</p>
      <p>It allows transitioning from HBB to OBB using a transformation matrix: each corner (x, y) of the
horizontal box is mapped to (x′, y′)ᵀ = R(θ)·(x − x_c, y − y_c)ᵀ + (x_c, y_c)ᵀ, where
R(θ) = [cos θ, −sin θ; sin θ, cos θ] is the rotation matrix about the box center.</p>
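      <p>For clarity, the sketch below shows how this rotation about the box center converts the axis-aligned corners of an HBB into the corners of the corresponding OBB; the function name and the NumPy implementation are illustrative.</p>
      <preformat>
import numpy as np

def obb_corners(xc: float, yc: float, w: float, h: float, theta: float) -> np.ndarray:
    """Return the 4 corner points of an oriented box B = (xc, yc, w, h, theta).

    Each axis-aligned corner is rotated about the box center by the angle theta.
    """
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    # Axis-aligned (HBB) corner offsets relative to the center
    half = np.array([[-w / 2, -h / 2],
                     [ w / 2, -h / 2],
                     [ w / 2,  h / 2],
                     [-w / 2,  h / 2]])
    return half @ rot.T + np.array([xc, yc])

# Hypothetical example: a 4 x 2 box centered at (100, 50), rotated by 30 degrees
# corners = obb_corners(100.0, 50.0, 4.0, 2.0, np.radians(30.0))
      </preformat>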
      <p>Next, objects are filtered based on area and overlap, incorrect labels are cleaned, and the data is
saved in a format compatible with oriented bounding box processing. Finally, all OBB data are
organized and formatted to create a unified OBB-annotated dataset ready for model training. The
outputs of this stage are the preprocessed aerial image and the structured OBB-annotated dataset.</p>
      <p>At the third step, data splitting is performed. The oriented bounding box dataset, consisting of
2,758 aerial images, was logically divided into three mutually independent subsets to ensure
objective evaluation of the model's performance and prevent bias in the results. The training set
covers 70% of the total data volume, comprising 1,931 images, and is used directly to optimize the
neural network weights during training. The validation subset contains 552 images (20%) and is
intended for monitoring intermediate results and tuning model hyperparameters to reduce the risk
of overfitting. The test set, formed from the remaining 10% (275 images), is used exclusively at the
final stage to assess the model's generalization ability independently. It allows objective
determination of its performance on previously unseen data.</p>
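      <p>A minimal sketch of such a split is shown below; the fixed random seed and the shuffling strategy are assumptions made for reproducibility rather than details reported in the study.</p>
      <preformat>
import random

def split_dataset(image_ids, train=0.70, val=0.20, seed=42):
    """Shuffle image ids and split them into train/val/test subsets (70/20/10)."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)            # fixed seed -> reproducible split
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * val)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# With 2,758 images this yields 1,931 / 552 / 275 items, matching the reported subsets.
      </preformat>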
      <p>The fourth step is the construction of the YOLOv11-OBB model. This step is the core of the
proposed information technology implementation. The model takes the image Iout as input. The
architecture is conventionally divided into three functional components: the backbone, the neck,
and the detection head. The backbone performs hierarchical feature extraction from the image after
the second stage, using a sequence of convolutional layers (Conv1, Conv2) integrated with
Frequency-Spatial Feature Extraction Modules Freq-SpaFEM1 and Freq-SpaFEM2. Freq-SpaFEM
simultaneously extracts local spatial and global frequency characteristics, which are essential when
processing data with complex textures and small objects typical of aerial images. The backbone
concludes with SPPF (Spatial Pyramid Pooling Fast) [24] and DAGLF (Dual Attention Global Local
Fusion) layers. SPPF enables aggregation of contextual features at different scales. At the same
time, DAGLF allows simultaneous focus on local details and the global scene context, enhancing
the semantic expressiveness of features and reducing the impact of background noise.</p>
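      <p>The Freq-SpaFEM and DAGLF modules are custom components of the proposed architecture and are not reproduced here; as an orientation point, the following is a simplified PyTorch sketch of the standard SPPF block used at the end of the backbone (batch normalization and activation layers are omitted for brevity).</p>
      <preformat>
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast, as popularized by the YOLO family.

    Three successive max-pool operations with a shared kernel emulate parallel
    pooling at several receptive-field sizes; the results are concatenated and fused.
    """
    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_hidden, kernel_size=1, bias=False)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.fuse = nn.Conv2d(c_hidden * 4, c_out, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.fuse(torch.cat((x, p1, p2, p3), dim=1))
      </preformat>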
      <p>The Neck component is an intermediate link between the feature extraction block (Backbone)
and the detection head, enabling the fusion of features from different network depth levels. It
includes the AMSF2 module (Adaptive Multi-Scale Feature Fusion), which enhances generalization
for objects of various sizes. The component also incorporates Upsample layers (to align spatial
dimensions with higher network levels), Freq-SpaFEM (to strengthen spatial-frequency consistency
of feature maps), and Conv layers (for further feature refinement before passing them to the
detection block).</p>
      <p>The detection head is the final component of the architecture responsible for directly generating
predictions for object detection and classification. It contains the RAFA module (Region-Aware
Feature Aggregation), which enables the model to distinguish objects from the background better
and consider the context around each object, as well as the Detect Head, which produces the final
OBB predictions, including coordinates, dimensions, rotation angles, and forms the output oriented
bounding boxes.</p>
      <p>The final step of information technology includes the practical application of the trained model
for vehicle detection and evaluation of its performance. A comprehensive analysis of the results is
conducted using standard evaluation metrics [25] for object detection quality (Precision, Recall,
mAP, OBB IoU) to confirm the effectiveness of the proposed technology. Additionally, detection
results are visualized to demonstrate the model’s ability to localize and classify vehicles in aerial
images accurately.</p>
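      <p>As an illustration of the OBB IoU metric used in the evaluation, the sketch below computes the intersection-over-union of two oriented boxes represented by their corner points; the use of the Shapely library is an assumption.</p>
      <preformat>
from shapely.geometry import Polygon

def obb_iou(corners_a, corners_b) -> float:
    """IoU of two oriented boxes, each given as 4 (x, y) corner points."""
    poly_a, poly_b = Polygon(corners_a), Polygon(corners_b)
    inter = poly_a.intersection(poly_b).area
    union = poly_a.union(poly_b).area
    return inter / union if union > 0 else 0.0

# A detection can be counted as a true positive at mAP@50 when
# obb_iou(predicted_corners, ground_truth_corners) >= 0.5 and the class matches.
      </preformat>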
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>Experiments were conducted to study the effectiveness of the proposed information technology for
detecting cars on aerial photographs, the results of which allowed us to formulate reasonable
conclusions about the efficacy of the model.</p>
      <p>During training, the model was evaluated using the following metrics: mean Average Precision
(mAP), mAP@0.50–0.95 (COCO metric), and loss curves. In the initial stages of training, the mAP
value increases rapidly, reaching over 0.85 by the 300th epoch, indicating high localization and
classification accuracy. The mAP@0.50–0.95 value steadily increases, reaching approximately 0.85
by the end of training, which confirms the model's effective generalization capability. The Box Loss
curve (bounding box loss) shows a sharp decrease during the first 50 epochs and then stabilizes at
around 0.25, indicating reliable spatial localization. The Class Loss curve (classification loss)
similarly decreases early in training and stabilizes at approximately 0.15, reflecting effective
learning of class-specific features. Objectness Loss also stabilizes at around 0.1, demonstrating high
sensitivity to object presence.</p>
      <p>The loss curves for the training and validation sets (train/val_box_loss, train/val_cls_loss,
train/val_dfl_loss) show a synchronized decrease in loss functions across both subsets without
significant divergence, indicating the absence of overfitting and effective generalization (Fig. 4).
The precision and recall metrics (precision(B), recall(B), mAP50(B), mAP50–95(B)) demonstrate
steady improvement in both accuracy and recall, with values exceeding 0.85 at the final stage of
training. In particular, the mAP50 and mAP50–95 curves on the validation set show consistent
growth, confirming the reliability and generalization capability of the trained model.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results and discussion</title>
      <p>To visually demonstrate the effectiveness of the proposed information technology for vehicle
detection in aerial images (Fig. 5), a visual comparison was conducted between its results and those
of existing models on images from the test set, with the predicted bounding boxes overlaid.</p>
      <p>Figure 5: A comparison of the visual results of car detection by different models: a) YOLOv11;
b) YOLOv8; c) YOLOv7; d) YOLOv5; e) CNN; f) the proposed model.</p>
      <p>Many false negatives are observed in the image fragments obtained with earlier-generation
models, such as YOLOv5 and the baseline CNN (Fig. 5d, Fig. 5e): some vehicles remain undetected. Additionally,
there are false positives and inaccurate bounding box dimensions. The detected boxes exhibit
distorted geometric properties, indicating a limited ability of such models to precisely isolate small objects,
especially those with arbitrary orientations. The results obtained using the YOLOv7 /
YOLOv8 models (Fig. 5b, Fig. 5c) show noticeably better performance: the number of detected
vehicles increases, and the boxes more accurately correspond to the objects' positions. However,
using only axis-aligned rectangles leads to less effective coverage of angled or elongated objects. In
particular, there are cases of excessive overlap with adjacent objects or unnecessary inclusion of
background. The visual results of the newer YOLOv11 model (Fig. 5a) indicate a further
improvement in detection accuracy compared to YOLOv7/YOLOv8. Nonetheless, in cases where
objects are significantly rotated relative to the image axes, spatial inaccuracies still occur due to the
mismatch between the bounding box geometry and the actual shape of the vehicle. The results of
the proposed YOLOv11-OBB model (Fig. 5f) demonstrate the highest detection accuracy for
vehicles in aerial images. All cars in the image were successfully identified, and the bounding
boxes' shape aligns with the objects' actual geometry. OBB minimizes unnecessary background
coverage and reduces the likelihood of overlap with adjacent objects, which is particularly critical
when vehicles are densely packed.</p>
      <p>For the quantitative analysis of the effectiveness of the developed information technology for
vehicle detection in aerial images, a comparison with existing models was conducted using the
following evaluation metrics (Fig. 6): mean Average Precision at a 50% intersection-over-union
threshold (mAP@50), classification Precision, Recall, and mean spatial overlap accuracy for
oriented bounding boxes (OBB IoU).</p>
      <p>Figure 6: Performance metrics comparison of object detection models: a) mAP@50 (%);
b) Precision (%); c) Recall (%); d) OBB IoU (%).</p>
      <p>The proposed YOLOv11-OBB model demonstrates the highest mean Average Precision
(mAP@50) of 96.8%, surpassing the results of other compared models (CNN – 44.5%, YOLOv5 –
45.6%, YOLOv7 – 78.6%, YOLOv8 – 94.5%, YOLOv11 – 95.8%). This result indicates the model's
strong ability to accurately detect vehicles with sufficient overlap, achieved through the enhanced
architecture and oriented bounding boxes. The Precision metric, which reflects the model's ability
to minimize false positives, reaches 99.5% for the proposed model, significantly outperforming the other models
(CNN – 53.4%, YOLOv5 – 55.5%, YOLOv7 – 83.3%, YOLOv8 – 93.7%, YOLOv11 – 95.9%). The Recall
metric demonstrates a value of 96.8% for YOLOv11-OBB, the highest among the compared models
(CNN – 58.1%, YOLOv5 – 53.9%, YOLOv7 – 71.4%, YOLOv8 – 86.8%, YOLOv11 – 94.2%), indicating
high sensitivity in detecting vehicles with minimal missed detections. The localization quality for
oriented bounding boxes, measured by OBB IoU, reaches 92.5% for YOLOv11-OBB, significantly
higher than CNN – 40.0%, YOLOv5 – 45.0%, YOLOv7 – 70.0%, YOLOv8 – 82.5%, YOLOv11 – 87.0%.
This high OBB IoU score confirms the effectiveness of oriented bounding boxes and specialized
architectural components (Freq-SpaFEM, DAGLF, AMSF2, RAFA), contributing to accurate feature
extraction and detection of objects with complex geometries. It is especially critical in densely
packed vehicles, overlaps, or arbitrary orientations typical in aerial imagery.</p>
      <p>To evaluate each model's ability to correctly classify and identify objects, confusion matrices
[26] were constructed for the test dataset (Fig. 7). The resulting matrices allow for assessing the
number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for
two classes: "Car" and "Non-car" (background/other objects).</p>
      <p>The proposed model (YOLOv11-OBB) exhibits only one false positive case (an object incorrectly
classified as a car) and one false negative case (a vehicle that was not detected), indicating high
accuracy with a minimal number of errors. YOLOv7 and YOLOv8 detect fewer true cars and
produce more false positive detections, reflecting lower precision than the proposed model.
YOLOv11 detects more cars than YOLOv7 and has fewer false positives; however, it still
underperforms relative to the proposed model regarding correctly identified vehicles and exhibits
twice as many false positives. YOLOv5 shows significantly poorer results, detecting only two true
cars, with the highest number of false positive detections and a substantial count of missed
vehicles. The performance of the baseline CNN is comparable to that of YOLOv7 and YOLOv8
according to this confusion matrix.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>A car detection information technology based on the modified YOLOv11 architecture supporting
oriented bounding boxes has been developed and experimentally evaluated. The quantitative and
visual analysis results demonstrate the proposed technology's effectiveness. Specifically, the
modified YOLOv11-OBB model achieved the highest performance among the compared
counterparts across key metrics: mAP@50 (96.8%), Precision (99.5%), Recall (96.8%), and OBB IoU
(98.5%). The significant outperformance of other models, including the standard YOLOv11,
confirms the feasibility and advantages of integrating oriented bounding boxes and optimized
architectural components such as Freq-SpaFEM, DAGLF, AMSF2, and RAFA. These modifications
enabled the model to effectively address challenges related to small object sizes, arbitrary
orientations, and complex backgrounds, which are typical for aerial imagery.</p>
      <p>The analysis of the confusion matrices further confirmed the high accuracy of the proposed
YOLOv11-OBB model, as it achieved the highest number of correct positive detections while
simultaneously minimizing both false positives and false negatives. This result demonstrates the
model's ability to identify vehicles accurately using high-resolution imagery. Compared to other
models, YOLOv11-OBB significantly reduces the number of errors, which is particularly important
for practical applications such as traffic monitoring and urban planning.</p>
      <p>The study was conducted as part of the international educational project “Safe Artificial
Intelligence: The European Legal Dimension” [101176092, a joint project of Dnipro University of
Technology, Erasmus+ Jean Monnet Foundation, and the European Education and Culture
Executive Agency (EACEA)]. Support from the European Commission for the publication of this
work does not imply endorsement of its content, which solely reflects the views and opinions of
the authors, and the Commission cannot be held responsible for any use that may be made of the
information contained therein.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors used Grammarly to check the grammar.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[13] Z. Li, Y. Chen, G. Yu, Y. Deng. "R-FCN++: Towards Accurate Region-Based Fully
Convolutional Networks for Object Detection". Proc. AAAI Conf. Artif. Intell. 32.1 (2018).
doi:10.1609/aaai.v32i1.12265.
[14] D. Avola, L. Cinque, A. Diko, A. Fagioli, G. L. Foresti, A. Mecca, D. Pannone, C. Piciarelli,
MSFaster R-CNN: Multi-Stream Backbone for Improved Faster R-CNN Object Detection and
Aerial Tracking from UAV Images, Remote Sens. 13.9 (2021) 1670. doi:10.3390/rs13091670.
[15] A. Bochkovskiy, C. Wang, H. Liao. "YOLOv4: Optimal Speed and Accuracy of Object
Detection". ArXiv abs/2004.10934 (2020): n. pag.
[16] K. Li, X. Zheng, J. Bi, G. Zhang, Y. Cui, T. Lei, RMVAD-YOLO: A Robust Multi-View Aircraft
Detection Model for Imbalanced and Similar Classes, Remote Sens. 17.6 (2025) 1001.
doi:10.3390/rs17061001.
[17] S. Patil, S. Waghule, S. Waje, P. Pawar, S. Domb, Efficient Object Detection with YOLO: A
Comprehensive Guide, Int. J. Adv. Res. Sci., Commun. Technol. (2024) 519–531.
doi:10.48175/ijarsct-18483.
[18] W. Liu, G. Ren, R. Yu, S. Guo, J. Zhu, L. Zhang. "Image-Adaptive YOLO for Object Detection in
Adverse Weather Conditions". Proc. AAAI Conf. Artif. Intell. 36.2 (2022) 1792–1800.
doi:10.1609/aaai.v36i2.20072.
[19] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, L. Zhang. "DOTA: A
Large-Scale Dataset for Object Detection in Aerial Images". IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR). (2018). doi:10.1109/cvpr.2018.00418.
[20] J. Ding, N. Xue, Y. Long, G.-S. Xia, Q. Lu. "Learning RoI Transformer for Oriented Object
Detection in Aerial Images". IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). (2019). doi:10.1109/cvpr.2019.00296.
[21] V. Hnatushenko, P. Kogut, M. Uvarov, On Flexible Co-registration of Optical and SAR Satellite
Images, in: Advances in Intelligent Systems and Computing, Springer International Publishing,
Cham, 2020, pp. 515–534. doi:10.1007/978-3-030-54215-3_33.
[22] The Common Objects in Context-stuff. URL: https://paperswithcode.com/dataset/coco-stuff.
[23] G. Tariku, I. Ghiglieno, A. Simonetto, F. Gentilin, S. Armiraglio, G. Gilioli, I. Serina, Advanced
Image Preprocessing and Integrated Modeling for UAV Plant Image Classification, Drones 8.11
(2024) 645. doi:10.3390/drones8110645.
[24] K. Xia, Z. Lv, C. Zhou, G. Gu, Z. Zhao, K. Liu, Z. Li, Mixed Receptive Fields Augmented YOLO
with Multi-Path Spatial Pyramid Pooling for Steel Surface Defect Detection, Sensors 23.11
(2023) 5114. doi:10.3390/s23115114.
[25] V. Hnatushenko, D. Mozgovoy, V. Vasyliev. "Accuracy evaluation of automated object
recognition using multispectral aerial images and neural network". Tenth International
Conference on Digital Image Processing (ICDIP 2018), Shanghai, China, SPIE, 2018, p. 72.
doi: 10.1117/12.2502905.
[26] A. Vijayakumar, S. Vairavasundaram, YOLO-based Object Detection Models: A Review and its
Applications, Multimed. Tools Appl. (2024). doi:10.1007/s11042-024-18872-y.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>A feature fusion deep-projection convolution neural network for vehicle detection in aerial images</article-title>
          ,
          <source>PLOS ONE 16.5</source>
          (
          <year>2021</year>
          )
          <article-title>e0250782</article-title>
          . doi:
          <volume>10</volume>
          .1371/journal.pone.
          <volume>0250782</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Hnatushenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. I.</given-names>
            <surname>Kogut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Uvarov</surname>
          </string-name>
          ,
          <article-title>On Optimal 2-D Domain Segmentation Problem via Piecewise Smooth Approximation of Selective Target Mappings</article-title>
          ,
          <source>J. Optim., Differ. Equ. Their Appl.</source>
          <volume>27</volume>
          .2 (
          <year>2019</year>
          )
          <article-title>60</article-title>
          . doi:
          <volume>10</volume>
          .15421/141908.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhong</surname>
          </string-name>
          .
          <article-title>"Adaptive Feature Pyramid Networks for Object Detection"</article-title>
          .
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>107024</fpage>
          -
          <lpage>107032</lpage>
          . doi:
          <volume>10</volume>
          .1109/access.
          <year>2021</year>
          .
          <volume>3100369</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sunil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Beniwal</surname>
          </string-name>
          ,
          <article-title>Sentiment Analysis: A Tool for Mining Opinions and Emotions, SSRN Electron</article-title>
          . J. (
          <year>2020</year>
          ). doi:
          <volume>10</volume>
          .2139/ssrn.3746951.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Zuoxin</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Fuqiang</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>"FSSD: Feature Fusion Single Shot Multibox Detector"</article-title>
          .
          <source>ArXiv abs/1712</source>
          .00960 (
          <year>2017</year>
          ): n. pag.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          , S. Liu,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>"RepPoints: Point Set Representation for Object Detection"</article-title>
          .
          <source>IEEE/CVF International Conference on Computer Vision</source>
          (ICCV),
          <year>2019</year>
          . doi:
          <volume>10</volume>
          .1109/iccv.
          <year>2019</year>
          .
          <volume>00975</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Dynamic Label Assignment for Object Detection by Combining Predicted IoUs and Anchor IoUs</article-title>
          ,
          <source>J. Imaging</source>
          <volume>8</volume>
          .7 (
          <year>2022</year>
          )
          <article-title>193</article-title>
          . doi:
          <volume>10</volume>
          .3390/jimaging8070193.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          .
          <article-title>"SCRDet++: Detecting Small, Cluttered and Rotated Objects via Instance-Level Feature Denoising and Rotation Loss Smoothing"</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach</source>
          . Intell. (
          <year>2022</year>
          )
          <article-title>1</article-title>
          . doi:
          <volume>10</volume>
          .1109/tpami.
          <year>2022</year>
          .
          <volume>3166956</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Guirguis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hendawy</surname>
          </string-name>
          , G. Eskandar,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdelsamad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kayser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Beyerer</surname>
          </string-name>
          .
          <article-title>"CFA: Constraint-based Finetuning Approach for Generalized Few-Shot Object Detection"</article-title>
          .
          <source>IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</source>
          .
          <article-title>(</article-title>
          <year>2022</year>
          ). doi:
          <volume>10</volume>
          .1109/cvprw56347.
          <year>2022</year>
          .
          <volume>00449</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kashtan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hnatushenko</surname>
          </string-name>
          ,
          <article-title>Deep Learning Technology for Automatic Burned Area Extraction Using Satellite High Spatial Resolution Images</article-title>
          ,
          <source>in: Lecture Notes in Data Engineering</source>
          , Computational Intelligence, and Decision Making, Springer International Publishing, Cham,
          <year>2022</year>
          , pp.
          <fpage>664</fpage>
          -
          <lpage>685</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          -16203-9_
          <fpage>37</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ren</surname>
          </string-name>
          .
          <article-title>"Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression"</article-title>
          .
          <source>Proc. AAAI Conf. Artif. Intell</source>
          .
          <volume>34</volume>
          .07 (
          <year>2020</year>
          )
          <fpage>12993</fpage>
          -
          <lpage>13000</lpage>
          . doi:
          <volume>10</volume>
          .1609/aaai.v34i07.
          <fpage>6999</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Kashtan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Hnatushenko</surname>
          </string-name>
          ,
          <article-title>Automated building damage detection on digital imagery using machine learning</article-title>
          ,
          <source>Nauk. Visnyk Natsionalnoho Hirnychoho Universytetu No. 6</source>
          (
          <year>2023</year>
          )
          <fpage>134</fpage>
          -
          <lpage>140</lpage>
          . doi:
          <volume>10</volume>
          .33271/nvngu/2023-6/134.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>