<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Donata Gliaubičiūtė</string-name>
          <email>donata.gliaubiciute@ktu.lt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rokas Janavičius</string-name>
          <email>rokas.janavicius@ktu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aušra Gadeikytė</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukas Paulauskas</string-name>
          <email>lukas.paulauskas@ktu.lt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kaunas University of Technology</institution>
          ,
          <addr-line>Studentų g. 50, Kaunas, 51368</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Nowadays, the engineering application of vehicle detection from aerial images is a challenging task due to the particularity of perspective, the small size of the objects, and the complex background. The aim of this research is to investigate how low-resolution aerial images of vehicles can be utilized for vehicle detection using machine learning models. The research work was conducted using the one-stage deep learning-based object detection algorithms YOLOv5, YOLOv7, and YOLOv8 on two datasets (COWC and VEDAI) that address the task of small vehicle detection. For the training of the models, available pre-trained weights were used as a starting point, and then each model was trained by utilizing transfer learning. The obtained results of the study demonstrate that by reducing the image pixel ratio in steps of 5 cm per pixel from 12.5x12.5 to 27.5x27.5 cm per pixel, the accuracy of the object detection models decreases by an average of 3.51%. When the pixel ratio varies from 30x30 to 32.5x32.5 cm per pixel, the accuracy of the models drops by an average of 2.33% on the COWC dataset and 42.4% on the VEDAI dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Vehicle detection</kwd>
        <kwd>aerial images</kwd>
        <kwd>convolutional neural networks</kwd>
        <kwd>pixel ratio</kwd>
        <kwd>YOLOv5</kwd>
        <kwd>YOLOv7</kwd>
        <kwd>YOLOv8</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <sec id="sec-1-1">
        <title>1. Introduction</title>
        <p>Object detection involves finding the objects present in an image and identifying their locations.
Recently, object detection has been considered one of the most challenging tasks in computer vision because
the appearance of objects varies greatly depending on various circumstances, such as the image capture
technology. One such technology is UAVs (Unmanned Aerial Vehicles) - a critical enabler for a wide
range of applications including automated driving, crowd flow counting, topographic exploration,
environmental pollution monitoring, etc.</p>
        <p>
          Over the years, a lot of effort has been put into identifying vehicles and other small targets in the
images that UAVs collect. According to B. Wang and B. Xu in 2021 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] the most common difficulties
of object detection in aerial images are:
        </p>
        <p>
          The particularity of perspective. Since aerial images are typically taken from above, the objects
have fewer texture features [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. As a result, the targets can be easily mistaken for other objects [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          The size of the objects. In aerial images, the objects are quite small (composed of only 15 to 30
pixels). Moreover, Convolutional Neural Networks (CNNs) down-sampling layers minimize the
amount of information that each object has. For instance, after four down-sampling layers, a 24x24
pixel object retains only around one pixel in the feature maps, making it challenging to identify
small objects from the background [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          The complexity of the background. Usually, aerial images might cover an area of several square
kilometers. The presence of different backgrounds in this receptive field, such as the countryside,
mountains, urban areas, etc., interferes with the object detection process [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          Deep learning algorithms have enabled vehicle identification technologies to attain very high
performance. The deep convolutional neural network may use the dataset to train and enhance its model
independently. Deep learning-based object detection algorithms that are frequently utilized may be split
into two categories: one-stage and two-stage detectors [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          Two-stage object detection algorithms such as Faster R-CNN (Region-based Convolutional Neural
Networks) divide target detection into two stages: first, the Region Proposal Network
(RPN) is used to extract candidate target information, and then a detection network is used to determine the
location and category of the candidate targets [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. One-stage object detection algorithms such as YOLO
(You Only Look Once) do not require an RPN and directly generate the location and category
information of the target through the network, making them end-to-end target detection algorithms.
Therefore, single-stage target detection algorithms have a faster detection speed [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          One of the first methods to use convolutional neural networks for object detection and to show off
their impressive capabilities is region-based CNN (R-CNN) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In R-CNN, a selective search algorithm
selects image regions that could contain target objects, and then the CNN is used to classify the target
objects in the suggested regions. Fast R-CNN used an SPP (Spatial Pyramid Pooling) layer and a RoI
pooling layer to improve accuracy and runtime over R-CNN [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Unlike the R-CNN, which classifies
each region proposal independently, Fast R-CNN computes a feature map from a full image only once
and then categorizes region proposals by projecting each one onto that feature map. However, the Fast
R-CNN algorithm still uses a time-consuming selective search method to look for region suggestions in a
target image. In Faster R-CNN [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], the selective search is replaced with RPN, which calculates region
proposals from an input image. Faster R-CNN is 900% faster than Fast R-CNN and is made up
completely of deep learning networks. Directly connecting the RPNs and the classifier network would
help Faster R-CNN to further advance [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In 2017, T. Tang et al. designed an improved Faster-RCNN
to solve the difficulties of locating the positions of small vehicles and classifying the vehicle from the
background [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          In one pass, YOLO predicts and categorizes bounding boxes of objects. An image is initially divided
into non-overlapping grid cells by YOLO. For each cell in the grid, YOLO forecasts the likelihood
that an object will be present, the coordinates of the anticipated box, and the object's class. Each cell's
bounding boxes and their confidence scores are predicted by the network. The network then determines
the classes' probabilities for each cell [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The first version of YOLO, coined YOLOv1, reportedly
achieves a faster inference time, but lower accuracy compared to a single-shot detector [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In order to
increase the speed and accuracy of detection, YOLOv2 was suggested. Anchor boxes are used in
YOLOv2 together with convolutional layers that are not fully connected [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The accuracy of the
network is further increased by YOLOv2 using batch normalization (BN) and a high-resolution
classifier. YOLOv3 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] uses three detection levels and predicts three box anchors for each cell. To
extract feature maps, YOLOv3 adds a deeper backbone network (Darknet-53) to the system. Due to the
addition of more layers, the prediction is slower than with YOLOv2. Many technical improvements
were made in YOLOv4 while maintaining its computational efficiency. The improvements slightly
affected the inference time but significantly increased accuracy [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
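        <p>
          As a conceptual illustration of the grid-based prediction scheme described above, the following sketch decodes a YOLOv1-style output tensor of shape S x S x (B∙5 + C), keeping boxes whose class-specific confidence exceeds a threshold. The tensor layout, grid size, and threshold are assumptions made for this example and are not taken from the models used in this study.
        </p>
        <preformat>
import numpy as np

def decode_yolo_grid(pred, num_boxes=2, num_classes=1, conf_thresh=0.5):
    """Decode a YOLOv1-style S x S x (B*5 + C) prediction tensor (illustrative only)."""
    grid_size = pred.shape[0]
    detections = []
    for row in range(grid_size):
        for col in range(grid_size):
            cell = pred[row, col]
            # the C class probabilities predicted by this cell
            class_probs = cell[num_boxes * 5 : num_boxes * 5 + num_classes]
            for b in range(num_boxes):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                score = conf * class_probs.max()   # class-specific confidence score
                if score >= conf_thresh:
                    detections.append({
                        "x": (col + x) / grid_size,  # cell-relative centre to image-relative
                        "y": (row + y) / grid_size,
                        "w": w, "h": h,
                        "score": float(score),
                        "class_id": int(class_probs.argmax()),
                    })
    return detections

# usage: pred = np.random.rand(7, 7, 2 * 5 + 1); boxes = decode_yolo_grid(pred)
        </preformat>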
        <p>
          According to A. Ammar et al. in 2021 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], vehicle detection is possible for different data sets with
an accuracy from 85.3% to 98%. However, vehicle detection is still challenging when aerial images are
small in size and contain a large number of objects. It might cause information loss when convolution
operations are performed [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          There are different aerial image data sets such as OIRDS [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], PUCPR [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], COWC[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], and
VEDAI [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] that might be used for the investigation of vehicle detection. The overhead imagery
research data set (OIRDS) project produced a data set with almost 1,000 labeled images suitable for
developing automated vehicle detection algorithms [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. “Overhead imagery research data set” contains
approximately 1,800 labeled targets. For each target, there are over 30 annotations and over 60 statistics,
that describe the target within the context of the image. Image sizes range from 256×256 pixels to
512×512 pixels. The dataset contains five classes of vehicles (“truck”, “pickup”, “car”, “van” and
“unknown”). Annotations give information such as color and distance to the ground [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. On the other
hand, this database is hard to use for benchmarking target detection algorithms because there is no defined
evaluation protocol, the dataset is obtained by aggregating images from multiple sources (20 different
sources), and it does not have sufficient statistical regularity. These issues make the results difficult to
reproduce, preventing other researchers from making any comparisons with this dataset [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ]. An attempt was
made to split this database into easy, medium, and hard subsets. However, the precise set of images in each split
was not defined, preventing the reproduction of results [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          Approximately 17,000 photos in the Pontifical Catholic University of Parana Dataset (PUCPR) are
devoted to car counting in settings of various parking lots. The dataset includes details about 16,456
vehicles. The aerial images in the collection were taken from a drone view at a height of about 40
meters. The image set is annotated by a bounding box per car. All labeled bounding boxes have been
well recorded with the top-left and bottom-right points [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          Nearly 90,000 automobiles were collected using a drone from 4 distinct parking lots for the Car
Parking Lot Dataset (CARPK). This is a large dataset with an emphasis on automobile counting in
various parking lots. The bounding box for each car is annotated in the image set. Top-left and
bottom-right points have been accurately recorded for each labeled bounding box. The dataset supports object
counting, object localization, and further investigations with the bounding box annotation format
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>The purpose of this study is to investigate the change in the accuracy of object detection models for
detecting vehicles in aerial images when the resolution of the images fed to the models is reduced. The
findings of this study might be useful when certain situations require quick, real-time
decision-making regarding the distribution of vehicles in a geographic space, provided that collecting aerial photographs of
this space is possible. When it is known what minimum resolution and which object detection models are
sufficient to obtain acceptable results from aerial images, it is possible to save time for flying UAVs,
processing information, and presenting results. The models selected for this study are among the
best-performing one-stage detectors (YOLOv5, YOLOv7, YOLOv8) that reach high accuracy and speed
when applied to object detection tasks.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2. Methods</title>
      </sec>
      <sec id="sec-1-3">
        <title>2.1. Data Preprocessing</title>
        <p>
          The research was conducted using object detection algorithms on the COWC and VEDAI datasets
that address the task of small vehicle detection. The cars overhead with context (COWC) dataset
contains many unique cars (32,716) from six different image sets, each covering a different
geographical location and produced by different imagers [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The images cover regions from Toronto
(Canada), Selwyn (New Zealand), Potsdam and Vaihingen (Germany), Columbus, and Utah (The
United States). The COWC dataset provides data from overhead at 15 cm (about 5.91 in) per pixel
resolution at ground (all data is EO) and is designed to be challenging for detection models.
Furthermore, it contains 58,247 usable negative targets, many of which have been hand-picked objects
similar to cars such as boats, trailers, bushes, and A/C units. To compensate for the additional difficulty,
the context was included around targets. Context can help identify something that may not be a car or
confirm it is a car. In general, the idea is to allow a deep learner to decide the weight between context
and appearance such that something that looks very much like a car is detected even if it is in an unusual
place.
        </p>
        <p>
          The vehicle detection in aerial imagery (VEDAI) database includes various back-grounds such as
woods, cities, roads, parking lots, construction sites, or fields. In addition, the vehicles to be detected
have different orientations and can be altered by specular spots, occluded, or masked. Each image is
available in several spectral bands and resolutions. VEDAI set has 2950 cars in 512x512 and 1024x1024
images. The dataset with 1024x1024 resolution images has a resolution of 12.5cm×12.5cm per pixel.
Likewise, 512x512 resolution images have a resolution of 25cm×25cm per pixel. The images were
taken during the spring of 2012. Raw images have 4 uncompressed color channels. The dataset has nine
different classes of vehicles: “plane”, “boat”, “camping car”, “car”, “pick-up”, “tractor”, “truck”, “van”,
and “other”. Two meta-classes are also defined and considered in the experiments. The “small land
vehicles” class has the “car”, “pick-up”, “tractor”, and “van” classes included, and the “large land
vehicles” class contains the “truck” and “camping car” classes [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>When preparing the VEDAI dataset for the experiments, these classes have been dropped: “plane”,
“boat”, “camping car”, “tractor”, “truck”, and “other”. These changes were done in order to have
comparable visual data about the vehicles. Only one class "Car" was left in both datasets. In the COWC
dataset, the class of negative samples was removed. Figure 1 depicts a histogram of the number of
cars per image in the VEDAI dataset.</p>
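        <p>
          To illustrate this filtering step, the sketch below keeps only the rows of a single retained class in YOLO-format label files (one "class x_center y_center width height" row per object) and remaps that class to index 0. The directory layout and the class index assumed for "car" are hypothetical; the study performed the equivalent filtering through its dataset tooling.
        </p>
        <preformat>
from pathlib import Path

# Hypothetical class index of "car" in the original label files.
CAR_CLASS_ID = 1

def keep_only_cars(labels_dir, output_dir):
    """Copy YOLO-format label files, keeping only 'car' rows remapped to class 0."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for label_file in Path(labels_dir).glob("*.txt"):
        kept_rows = []
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if parts and int(parts[0]) == CAR_CLASS_ID:
                kept_rows.append(" ".join(["0"] + parts[1:]))  # remap class id to 0
        # Write the file even if empty, so images without cars become null examples.
        (out / label_file.name).write_text("\n".join(kept_rows))

# usage: keep_only_cars("vedai/labels", "vedai_cars_only/labels")
        </preformat>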
        <p>
          The Roboflow [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] platform was used to manage the datasets. For each evaluation sample, the
dataset images were proportionally reduced in size to achieve the desired centimeters-per-pixel ratio.
With the use of the Roboflow platform, an analysis of the data was performed, resulting in the following
findings: the COWC dataset contains 32810 annotations, the average number of bounding boxes per image
is 18.8, and there are 800 images with null examples; the VEDAI dataset contains 2807 annotations, the
average number of bounding boxes per image is 2.2, and there are 233 images with null examples.
Figure 1 demonstrates that the VEDAI dataset has mostly from 2 to 4 vehicles per image, whereas
the COWC dataset has from 2 to 49 vehicles per image.
        </p>
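        <p>
          The resizing step reduces to simple arithmetic: to go from a source ground resolution (e.g., 12.5 cm per pixel) to a coarser target (e.g., 27.5 cm per pixel), both image side lengths are scaled by source/target. The study carried out this step with Roboflow; the sketch below is an equivalent stand-alone illustration using Pillow, with file names chosen only for the example.
        </p>
        <preformat>
from PIL import Image

def resize_to_cm_per_pixel(src_path, dst_path, src_cm_per_px, dst_cm_per_px):
    """Downsample an aerial image so that one pixel covers dst_cm_per_px centimeters."""
    scale = src_cm_per_px / dst_cm_per_px      # e.g. 12.5 / 27.5 = 0.4545...
    img = Image.open(src_path)
    new_size = (max(1, round(img.width * scale)),
                max(1, round(img.height * scale)))
    img.resize(new_size, Image.BILINEAR).save(dst_path)

# usage: a 1024x1024 VEDAI tile at 12.5 cm/px becomes roughly 465x465 at 27.5 cm/px
# resize_to_cm_per_pixel("tile_1024.png", "tile_27_5cm.png", 12.5, 27.5)
        </preformat>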
      </sec>
      <sec id="sec-1-4">
        <title>2.2. Object Detection Algorithms</title>
        <p>
          YOLOv5, released in 2020, adopted the concept of anchor boxes that was introduced to speed up
R-CNN-style algorithms, but abandoned the use of manually chosen anchor boxes. To get a better prior value, K-means
clustering was done on the bounding box dimensions [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
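        <p>
          The paper does not detail the clustering procedure; the sketch below is a minimal illustration of estimating anchor priors by K-means over ground-truth box widths and heights, assuming the 1 − IoU distance commonly used for YOLO anchor estimation.
        </p>
        <preformat>
import numpy as np

def wh_iou(boxes, centroids):
    """IoU between (w, h) pairs, assuming boxes share the same top-left corner."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + centroids[None, :, 0] * centroids[None, :, 1] - inter
    return inter / union

def kmeans_anchors(box_wh, k=9, iters=100, seed=0):
    """Cluster ground-truth box widths/heights into k anchor priors (1 - IoU distance)."""
    rng = np.random.default_rng(seed)
    centroids = box_wh[rng.choice(len(box_wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(box_wh, centroids), axis=1)   # nearest centroid = highest IoU
        new_centroids = np.array([box_wh[assign == j].mean(axis=0) if np.any(assign == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids[np.argsort(centroids.prod(axis=1))]        # sort anchors by area

# usage (toy box dimensions in pixels):
# anchors = kmeans_anchors(np.array([[14.0, 30.0], [22.0, 11.0], [28.0, 27.0],
#                                    [9.0, 15.0], [40.0, 38.0]]), k=3)
        </preformat>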
        <p>
          Introduced in 2022, YOLOv7 surpassed all previously known object detectors in both speed and
accuracy in the range from 5 FPS to 160 FPS and had the highest accuracy (56.8% AP) among all
real-time object detectors with 30 FPS or higher known at that time on a V100 GPU. YOLOv7 was trained
on the MS COCO dataset from scratch without using any other datasets or pre-trained weights. The
YOLOv7 model preprocessing method is integrated with YOLOv5, and the use of Mosaic data
augmentation is suitable for small object detection [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>
          The most recent group of YOLO-based object detection models is called YOLOv8. For detection,
segmentation, and classification, there are five models (Nano, Small, Medium, Large, and Extra Large)
in each category of the YOLOv8. The fastest and smallest is YOLOv8 Nano, and the slowest and most
accurate is YOLOv8 Extra Large (YOLOv8x). All of the YOLOv8 models had improved throughput
when compared to other YOLO models trained at 640 image resolution while using around the same
amount of parameters [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The effectiveness of object detection at 640 image size between YOLOv8
and YOLOv5 is summarized in Table 1.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Evaluation metrics</title>
      <p>
        For the evaluation of the performance of the models, the following evaluation metrics [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] were used in this study: total (pre-process + inference + NMS) detection speed in milliseconds;
mean average precision (mAP) calculated at an Intersection over Union (IoU) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] threshold of 0.5 (mAP@0.5); mAP over IoU thresholds from 0.5 to 0.95 with a step of 0.05 (0.5, 0.55,
0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95); F1 score; and confusion matrix.
      </p>
      <p>The precision represents the proportion of true positive samples in the
positive prediction results (see calculation formula (1)), where TP denotes true positive samples and FP denotes false positive samples:</p>
      <p>Precision = TP / (TP + FP). (1)</p>
      <p>The recall represents the proportion of correctly predicted positive samples among all
actual positive samples in the whole sample set (see calculation formula (2)), where FN
stands for false negative samples:</p>
      <p>Recall = TP / (TP + FN). (2)</p>
      <p>The F1 score is the weighted average of precision and recall, calculated (3) as follows:</p>
      <p>F1 = 2 ∙ Precision ∙ Recall / (Precision + Recall). (3)</p>
      <p>Precision reflects the model’s ability to distinguish negative samples. The higher the precision, the
stronger the model’s ability to distinguish negative samples. Recall reflects the model’s ability to
identify positive samples. The higher the recall, the stronger the model’s ability to detect positive
samples. The F1 score is a combination of the two. The higher the F1 score, the more robust the model.</p>
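      <p>
        The three formulas above translate directly into code. The following sketch computes precision, recall, and the F1 score from TP, FP, and FN counts; the example counts are invented for illustration.
      </p>
      <preformat>
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # formula (1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # formula (2)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)            # formula (3)
    return precision, recall, f1

# usage: a detector with 90 correct detections, 10 false alarms, and 5 missed cars
# precision_recall_f1(90, 10, 5)  ->  (0.9, 0.947..., 0.923...)
      </preformat>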
      <p>
        For object detection tasks, the most common way to determine if a single object proposal is correct
is by using the Intersection over Union (IoU) metric [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. It takes the set A of proposed object pixels
and the set B of true object pixels and calculates the ratio of their intersection area to their union area.
The calculation formula (4) is as follows:
      </p>
      <p>IoU(A, B) = (A ∩ B) / (A ∪ B). (4)</p>
      <p>In most cases, an IoU of over 0.5 means that the object was detected; otherwise, the detection is considered a failure.</p>
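      <p>
        In practice, IoU is usually evaluated on axis-aligned bounding boxes rather than pixel sets. The sketch below is a minimal illustration for boxes given as (x1, y1, x2, y2) corner coordinates; the coordinate convention is an assumption made for the example.
      </p>
      <preformat>
def box_iou(box_a, box_b):
    """IoU (formula (4)) for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# usage: box_iou((0, 0, 10, 10), (5, 0, 15, 10)) -> 1/3, so this proposal would be
# rejected under the common 0.5 threshold mentioned above
      </preformat>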
      <p>The mean average precision (mAP) is computed from average precision (AP) values calculated over recall values from 0 to 1. Average
precision is calculated as the weighted mean of precisions at each threshold, where the weight is the
increase in recall from the prior threshold. The mAP is evaluated (5) by finding each class's average
precision (AP) and then averaging over all n specified classes:</p>
      <p>mAP = (1/n) ∙ Σ APᵢ, where the sum runs over the n classes. (5)</p>
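      <p>
        A minimal sketch of formula (5), assuming the precision-recall points are already ordered by increasing recall: AP is accumulated as precision weighted by the increase in recall, and mAP averages the per-class AP values (a single class, "car", in this study).
      </p>
      <preformat>
def average_precision(recalls, precisions):
    """AP as the mean of precisions weighted by the increase in recall (per-class term of formula (5))."""
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):   # points ordered by increasing recall
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(ap_per_class):
    """mAP: average of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# usage: AP of a toy precision-recall curve, then mAP over the single 'car' class
# ap = average_precision([0.2, 0.5, 0.8, 1.0], [1.0, 0.9, 0.7, 0.5])
# mean_average_precision([ap])
      </preformat>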
      <sec id="sec-2-1">
        <title>4. Experiments and results</title>
        <p>During the experiments, the processes indicated in Figure 2 were carried out, which consisted of
obtaining images with their original pixel ratio, resizing images, dividing the datasets into training,
validation, and testing sets, feeding the training sets to the models, custom-training the models using pre-trained
weights, and evaluating the models with the testing sets.</p>
        <p>For the training of the models, available pre-trained weights were used as a starting point for each
variation of the datasets („yolov5s.pt“ for the YOLOv5, „yolov7.pt“ for the YOLOv7, and „yolov8s.pt“
for YOLOv8). Afterward, each model was trained by utilizing transfer learning for 100 epochs. After
training, testing was performed, by providing models with unseen images and capturing their inference
time, as well as available precision metrics. Regarding the dataset splits, both datasets featured the same
data split: 70% training, 20% validation, and 10% testing. The hardware used for training and testing
each variation is provided in Table 2.
The following issues were observed during model training: YOLOv8 training on the COWC 15x15 cm per pixel variation displayed uncommon
behavior in several repeated runs, where model training performance drops in the middle of training
and the precision drops instantly; YOLOv7 training on the VEDAI 20x20 cm per pixel variation also exhibited inconsistent
behavior, as after the first 20 training epochs the training metrics fluctuate and drop, resetting the
training progress.</p>
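        <p>
          For reference, a minimal sketch of the YOLOv8 case of this training and testing procedure, assuming the Ultralytics Python API, is shown below. The dataset configuration file name and image size are assumptions made for the example; YOLOv5 and YOLOv7 provide analogous training entry points in their respective repositories.
        </p>
        <preformat>
from ultralytics import YOLO

# Start from the released small checkpoint and fine-tune on the custom dataset.
model = YOLO("yolov8s.pt")                 # pre-trained weights used as the starting point
model.train(
    data="cowc_cars.yaml",                 # hypothetical dataset config (70/20/10 split)
    epochs=100,                            # transfer learning for 100 epochs, as in the study
    imgsz=640,                             # assumed input resolution
)

# Evaluate on the held-out test split with unseen images.
metrics = model.val(split="test")
print(metrics.box.map50, metrics.box.map)  # mAP@0.5 and mAP@0.5:0.95
        </preformat>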
        <p>The results of training YOLOv5 on the COWC dataset are provided in Figure 3. The testing results
(see Table 3.) demonstrate the performance of YOLOv5, YOLOv7, and YOLOv8 models when trained
and tested at various image pixel ratios. Specifically, the models were evaluated at 15, 20, 25, 27.5, 30,
and 32.5 cm per image pixel resolution. However, it should be noted that the original pixel ratio of the
VEDAI dataset is 12.5 cm per pixel, while the initial pixel ratio of the COWC dataset is 15 cm per
pixel, meaning that the original resolution of the datasets did not match in the starting evaluation phase.
As a result, the findings of the VEDAI and COWC datasets were not directly comparable during the
experiments when the pixel ratio was 12.5 cm per pixel.</p>
        <p>It was found that on the VEDAI dataset P, R, and mAP indicators values are smaller than on the
COWC dataset (see Table 3). However, the testing speed was higher. When the datasets had the lowest
pixel ratio, which is 32.5 cm per pixel, the YOLOv7 model obtained better test results on the COWC
dataset than the YOLOv5. However, YOLOv5 showed significantly higher values on the VEDAI
dataset than YOLOv7. For the COWC dataset, YOLOv7 achieved a mAP score of 0.93, while YOLOv5
achieved 0.889 under the same conditions. For the VEDAI dataset, YOLOv7 achieved a mAP score of
0.029, while YOLOv5 achieved 0.248. YOLOv7 had a faster total detection speed across all datasets
and their variations. Figure 4 displays how the accuracy metric of all the models changed with the
COWC or VEDAI datasets when the image pixel ratio was reduced by 5 or 2.5 cm, while Figure 5
graphs the detection speed results for the same evaluation task.</p>
      </sec>
      <sec id="sec-2-2">
        <title>5. Conclusion</title>
        <p>In this paper, the change in the accuracy of vehicle detection using reduced pixel ratio aerial images
was investigated. When mAP values from 0.86 to
0.97 need to be reached for object detection in aerial images, the usage of a pixel ratio of 12.5x12.5 to 27.5x27.5 cm per pixel together with the
YOLOv5, YOLOv7, and YOLOv8 object detection algorithms is proposed. When investigating the
dependence of aerial image resolution on the performance of object detection models, it was observed
that one-stage object detection algorithms such as YOLOv5, YOLOv7, and YOLOv8 achieve an
average of 3.51% lower mAP scores when the image pixel ratio is reduced every 5 cm per pixel from
12.5x12.5 to 27.5x27.5 cm per pixel.</p>
        <p>The YOLOv8 model had the most stable results among other models, decreasing by an average of
0.24% when tested on the COWC dataset from 12.5x12.5 to 27.5x27.5 cm per pixel image pixel ratio.
The YOLOv5 model achieved an average mAP reduction of 9.6% with images from 12.5x12.5 to
27.5x27.5 cm per pixel image pixel ratio from the COWC dataset, significantly lagging behind the other
tested models. All models performed better on the COWC dataset than on the VEDAI dataset during testing.
The VEDAI dataset only had 2807 annotated vehicles, while the COWC dataset contained 32810
annotated vehicles, which may have influenced the testing results. Additionally, vehicles in the VEDAI
dataset were labeled, even if some of them were partially hidden by other objects or just partially visible
at the edges of the image. When models were tested with the COWC dataset, results dropped by an
average of 0.68% and by an average of 6.35% with the VEDAI dataset when using image pixel ratios
between 12.5x12.5 and 27.5x27.5 cm per pixel. More pronounced changes in the mean average
accuracy of the models are noticeable when the pixel ratio varies from 30x30 to 32.5x32.5 cm per pixel.
When the pixel ratio was 30x30 cm per pixel then the accuracy dropped by an average of 1.42% on the
COWC dataset and 11.96% on the VEDAI dataset. When the pixel ratio was 32.5x32.5 cm per pixel
then the accuracy dropped by an average of 3.23% on the COWC dataset and 72.84% on the VEDAI
dataset.</p>
        <p>Future research should include other one-stage and two-stage deep learning-based object detection
algorithms and experiments with more image-pixel ratio options in order to collect more data on the
change in accuracy of the object detection models.</p>
      </sec>
      <sec id="sec-2-3">
        <title>6. Acknowledgements</title>
        <p>The analysis was carried out as group work for the subject P170M113 Applied Research Project at
Kaunas University of Technology in the Master's programme Artificial Intelligence in Computer Science.
We also thank dr. Andrius Kriščiūnas for his support, useful suggestions, and mentoring.</p>
      </sec>
      <sec id="sec-2-4">
        <title>7. References</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          , “
          <article-title>A feature fusion deep-projection convolution neural network for vehicle detection in aerial images</article-title>
          ,”
          <source>PLoS One</source>
          , vol.
          <volume>16</volume>
          , no.
          <issue>5</issue>
          , p.
          <fpage>e0250782</fpage>
          ,
          <source>May</source>
          <year>2021</year>
          , doi: 10.1371/journal.pone.
          <volume>0250782</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ammar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koubaa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Benjdira</surname>
          </string-name>
          , “
          <article-title>Vehicle Detection from Aerial Images Using Deep Learning: A Comparative Study</article-title>
          ,”
          <string-name>
            <surname>Electronics</surname>
          </string-name>
          (Basel), vol.
          <volume>10</volume>
          , no.
          <issue>7</issue>
          , p.
          <fpage>820</fpage>
          ,
          <string-name>
            <surname>Mar</surname>
          </string-name>
          .
          <year>2021</year>
          , doi: 10.3390/electronics10070820.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Soviany</surname>
          </string-name>
          and R. T. Ionescu, “
          <article-title>Optimizing the Trade-Off between Single-Stage and Two-Stage Deep Object Detectors using Image Difficulty Prediction,” in 2018 20th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC)</article-title>
          ,
          <year>Sep</year>
          .
          <year>2018</year>
          , pp.
          <fpage>209</fpage>
          -
          <lpage>214</lpage>
          . doi:
          <volume>10</volume>
          .1109/SYNASC.
          <year>2018</year>
          .
          <volume>00041</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          , “
          <string-name>
            <surname>Fast</surname>
            <given-names>R-CNN</given-names>
          </string-name>
          ,” in
          <source>2015 IEEE International Conference on Computer Vision</source>
          (ICCV),
          <year>Dec</year>
          .
          <year>2015</year>
          , pp.
          <fpage>1440</fpage>
          -
          <lpage>1448</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICCV.
          <year>2015</year>
          .
          <volume>169</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Miyazaki</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Shibasaki</surname>
          </string-name>
          , “
          <article-title>A CNN-Based Method of Vehicle Detection from Aerial Images Using Hard Example Mining,” Remote Sens (Basel)</article-title>
          , vol.
          <volume>10</volume>
          , no.
          <issue>1</issue>
          , p.
          <fpage>124</fpage>
          ,
          <string-name>
            <surname>Jan</surname>
          </string-name>
          .
          <year>2018</year>
          , doi: 10.3390/rs10010124.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zou</surname>
          </string-name>
          , and L. Lei, “
          <source>Vehicle Detection in Aerial Images Based on Region Convolutional Neural Networks and Hard Negative Example Mining,” Sensors</source>
          , vol.
          <volume>17</volume>
          , no.
          <issue>2</issue>
          , p.
          <fpage>336</fpage>
          ,
          <string-name>
            <surname>Feb</surname>
          </string-name>
          .
          <year>2017</year>
          , doi: 10.3390/s17020336.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Koay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Chuah</surname>
          </string-name>
          , C.
          <article-title>-</article-title>
          <string-name>
            <surname>O. Chow</surname>
            ,
            <given-names>Y.-L.</given-names>
          </string-name>
          <string-name>
            <surname>Chang</surname>
          </string-name>
          , and
          <string-name>
            <surname>K. K. Yong</surname>
          </string-name>
          , “
          <article-title>YOLO-RTUAV: Towards Real-Time Vehicle Detection through Aerial Images with Low-Cost Edge Devices,” Remote Sens (Basel)</article-title>
          , vol.
          <volume>13</volume>
          , no.
          <issue>21</issue>
          , p.
          <fpage>4196</fpage>
          ,
          <string-name>
            <surname>Oct</surname>
          </string-name>
          .
          <year>2021</year>
          , doi: 10.3390/rs13214196.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          et al.,
          <string-name>
            <surname>“</surname>
            <given-names>SSD</given-names>
          </string-name>
          : Single Shot MultiBox Detector,”
          <year>2016</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>319</fpage>
          - 46448-
          <issue>0</issue>
          _
          <fpage>2</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          , “
          <article-title>YOLOv3: An Incremental Improvement</article-title>
          ,” Apr.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bochkovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and H.
          <string-name>
            <surname>-Y. M. Liao</surname>
          </string-name>
          , “
          <article-title>YOLOv4: Optimal Speed and Accuracy of Object Detection</article-title>
          ,” Apr.
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tanner</surname>
          </string-name>
          et al., “
          <article-title>Overhead imagery research data set &amp;#x2014; an annotated data library &amp;#x00026; tools to aid in the development of computer vision algorithms</article-title>
          ,” in
          <source>2009 IEEE Applied Imagery Pattern Recognition Workshop (AIPR</source>
          <year>2009</year>
          ), Oct.
          <year>2009</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . doi:
          <volume>10</volume>
          .1109/AIPR.
          <year>2009</year>
          .
          <volume>5466304</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kembhavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Harwood</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Davis</surname>
          </string-name>
          , “
          <article-title>Vehicle Detection Using Partial Least Squares,”</article-title>
          <source>IEEE Trans Pattern Anal Mach Intell</source>
          , vol.
          <volume>33</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>1250</fpage>
          -
          <lpage>1265</lpage>
          , Jun.
          <year>2011</year>
          , doi: 10.1109/TPAMI.
          <year>2010</year>
          .
          <volume>182</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>M.-R. Hsieh</surname>
            ,
            <given-names>Y.-L.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
            , and
            <given-names>W. H.</given-names>
          </string-name>
          <string-name>
            <surname>Hsu</surname>
          </string-name>
          , “
          <article-title>Drone-Based Object Counting by Spatially Regularized Regional Proposal Network</article-title>
          ,” in
          <source>2017 IEEE International Conference on Computer Vision</source>
          (ICCV), Oct.
          <year>2017</year>
          , pp.
          <fpage>4165</fpage>
          -
          <lpage>4173</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICCV.
          <year>2017</year>
          .
          <volume>446</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Mundhenk</surname>
          </string-name>
          , G. Konjevod,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Sakla</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Boakye</surname>
          </string-name>
          , “
          <article-title>A Large Contextual Dataset for Classification, Detection and Counting of Cars with Deep Learning</article-title>
          ,”
          <year>2016</year>
          , pp.
          <fpage>785</fpage>
          -
          <lpage>800</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>319</fpage>
          -46487-9_
          <fpage>48</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Razakarivony</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Jurie</surname>
          </string-name>
          , “
          <article-title>Vehicle detection in aerial imagery : A small target detection benchmark,” J Vis Commun Image Represent</article-title>
          , vol.
          <volume>34</volume>
          , pp.
          <fpage>187</fpage>
          -
          <lpage>203</lpage>
          , Jan.
          <year>2016</year>
          , doi: 10.1016/j.jvcir.
          <year>2015</year>
          .
          <volume>11</volume>
          .002.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dwyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nelson</surname>
          </string-name>
          , et al., “
          <source>Roboflow (Version 1.0) [Software]</source>
          ,” https://roboflow.com, computer vision,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rath</surname>
          </string-name>
          , “YOLOv8 Ultralytics:
          <article-title>State-of-the-Art YOLO Models</article-title>
          ,” https://learnopencv.com/ultralytics-yolov8/, Jan. 10,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Jiang</surname>
          </string-name>
          et al.,
          <article-title>“An Attention Mechanism-Improved YOLOv7 Object Detection Algorithm for Hemp Duck Count Estimation</article-title>
          ,” Agriculture, vol.
          <volume>12</volume>
          , no.
          <issue>10</issue>
          , p.
          <fpage>1659</fpage>
          ,
          <string-name>
            <surname>Oct</surname>
          </string-name>
          .
          <year>2022</year>
          , doi: 10.3390/agriculture12101659.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Rezatofighi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tsoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Gwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sadeghian</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Reid</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          , “
          <article-title>Generalized intersection over union: A metric and a loss for bounding box regression</article-title>
          ,” Apr.
          <year>2019</year>
          , arXiv:
          <year>1902</year>
          .09630.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>