<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improved model for detecting randomly oriented objects on remote sensing images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ihor A. Pilkevych</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mykola P. Romanchuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olena M. Naumchak</string-name>
          <email>olenanau@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmytro L. Fedorchuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonid M. Naumchak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>PCWrEooUrckResehdoinpgs ISSNc1e6u1r-3w-0s0.o7r3g</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Korolyov Zhytomyr Military Institute</institution>
          ,
          <addr-line>22 Myru Ave., Zhytomyr, 10004</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>118</fpage>
      <lpage>126</lpage>
      <abstract>
        <p>Object detection in optical remote sensing images is an important task. In recent years, methods based on convolutional neural networks have shown progress. However, due to object variations such as scale, aspect ratio, and random orientation, detection is difficult to improve further. Most convolutional neural networks detect objects with rectangular bounding boxes parallel to the image coordinate axes, which is effective. However, for military objects in satellite images, which may have a large aspect ratio and be randomly oriented, rectangular bounding boxes may not always provide sufficient target localization. In this paper, methods based on the rotation of rectangular frames or other polygonal boundaries are considered, including the Rotation Region Proposal Network (RRPN) and the Rotational Region CNN (R2CNN). One-stage models such as SSD, YOLO, and RetinaNet have demonstrated high speed and accuracy. The new YOLOv11 model, a further development of the one-stage approach, demonstrates an increase in the accuracy and speed of object detection and recognition. The purpose of the study is the analysis of modern neural network models and their improvement to enhance the accuracy of detecting and recognizing small, densely located, randomly oriented objects on satellite images. The paper proposes a model with a five-parameter regression that includes the rotation angle of the bounding box as a parameter. The results of the study show that this model improves the accuracy of object detection in complex scenarios by providing accurate determination of object orientation and scale.</p>
      </abstract>
      <kwd-group>
        <kwd>remote sensing</kwd>
        <kwd>randomly oriented object</kwd>
        <kwd>detector</kwd>
        <kwd>object detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In world practice, computer vision technologies are widely used to process remote sensing images. To
identify objects in remote sensing images, it is necessary to solve the tasks of detecting, recognizing,
assigning accurate bounding boxes or masks for small, randomly oriented objects, separating them
from the background, and providing object class labels [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
      <p>
        Currently, a large number of models based on convolutional neural networks have been developed to
improve the accuracy of object detection and recognition. In the process of recognizing and locating an
object, the neural network model uses a rectangular bounding box to detect it, and then classifies and
distinguishes between the object itself or the background within it [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. In most cases, objects observed from a perspective parallel to the Earth's surface are aligned with the image coordinate axes and have a small aspect ratio. As a result, a rectangular bounding box can cover such objects well and contain little background [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
However, in the case of observing military objects with a large aspect ratio and arbitrary orientation in
images acquired remotely from an observation angle perpendicular to the Earth [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], it is not possible to
accurately surround the object with a rectangular bounding box alone [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        In the field of computer image processing, a detector is a model for detecting and recognizing objects.
To solve the problem of detecting simple objects, one-stage and two-stage detectors are used. One-stage
detectors include: SSD, YOLO, RetinaNet, R³Det, RSDet, RIDet, FCOS, CSL, DCL, GWD, KLD, KFioU,
and two-stage detectors include Fast R-CNN, Faster R-CNN, Mask R-CNN, Cascade R-CNN, RRPN,
R²CNN, SCRDet, SCRDet++ [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Classical object detection is the detection of a simple object in an image using a horizontal bounding
box. Nowadays, many high-performance methods for detecting simple objects, such as the two-stage
models Fast R-CNN [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Faster R-CNN [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], focus on accuracy while reducing the amount of computation to improve detection speed. To address changes in the scale of objects within an image, the Feature Pyramid Network (FPN) method was proposed.
      </p>
      <p>
        Since most approaches are based on the assumption that objects are located along horizontal lines in
the image, the detector uses a rectangular bounding box parallel to the coordinate axis to detect and
locate the object in the image. Then it classifies the object or background directly within this frame
[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. As a result, the task of detecting randomly rotated objects with a large aspect ratio arises: the horizontal bounding box becomes enlarged, which overloads the detector during classification, and for randomly rotated, densely spaced objects the overlapping boxes in complex scenes make it difficult to distinguish individual objects [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] (figure 1).
      </p>
      <p>
        To solve the problem of detecting randomly oriented objects, approaches based on the rotation of
a rectangular bounding box or other polygonal bounding boxes are used. For example, the Rotation
Region Proposal Network (RRPN) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] obtains a region of interest based on the rotated anchor for
feature detection. The Rotational Region CNN (R2CNN) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is based on Fast R-CNN and uses two pooling sizes with different width-to-height ratios. However, models built on the two-stage approach with traditional horizontal region detection do not deliver the required speed and accuracy.
      </p>
      <p>
        The CornerNet [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], CenterNet [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and ExtremeNet methods have gained popularity; they select and group a set of key points of an object, such as corners, peaks, etc., to build a bounding box.
      </p>
      <p>
        Single-stage detection methods (single-frame multi-box SSD detector [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], YOLO family of models,
and RetinaNet [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]) are based on bounding box regression. YOLOv11 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] is the most advanced model in the family; it builds on its predecessors and is improved by a new backbone network, detection head, and loss function [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>Their advantage is the higher speed of object detection and recognition. The disadvantage of the considered approaches is that they do not handle complex scenarios on satellite images in which small, densely located, randomly rotated objects must be detected; detection under such conditions remains a relevant problem.</p>
      <p>The YOLOv11 object detection system is a single-stage system, but its accuracy is higher than most
two-stage detectors, and it is also fast. Therefore, in this paper, we use YOLOv11, on the basis of which
we implement the detection of randomly rotated objects.</p>
      <p>The purpose of the article is the analysis of neural network models and their improvement as a tool
for improving the accuracy of detecting and recognizing small, arbitrarily rotated objects on satellite
images.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Theoretical background</title>
      <p>The detectors considered above use a rectangular bounding box parallel to the coordinate axes. When detecting randomly rotated objects with a large aspect ratio, this enlarges the bounding box and overloads the detector during classification. In addition, it does not provide accurate information about the object’s orientation and scale.</p>
      <p>
        To implement the detection of randomly oriented objects, each detector and dataset provides
its own definition of the rotation angle. The DOTA dataset [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] stores the coordinates of the four
corners of the object’s bounding box. R2CNN [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] uses the coordinates of the first two clockwise corners of the four, (x1, y1) and (x2, y2), and the height of the rectangle to define the frame. A common method is five-parameter regression, which adds an angle parameter θ to the basic parameters x, y, w and h to represent the bounding box in any direction. As shown in figure 2 (left), this is an acute angle formed by the width (or height) of the bounding box and the x axis, in the range of 0–90°. Another method is that the angle formed between the longest side of the rectangle and the x axis lies between −90° and +90°, as shown in figure 2 (right).
      </p>
      <p>In the proposed model, the image labels are pre-processed; the processing mainly concerns the label part, and the angle information is obtained from the spatial label (w, h) of the object. Data preprocessing yields the object's orientation angle in the range from −90° to +90° and the split into width w and height h.</p>
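      <p>A minimal sketch of this preprocessing step is given below (Python with NumPy; the function and variable names are illustrative and show only one possible way to obtain the five-parameter label from a four-corner annotation):</p>
      <preformat>
import numpy as np

def poly_to_five_params(corners):
    """Convert a 4x2 array of corner points (given clockwise) into (cx, cy, w, h, theta).

    The longer side is reported as w; theta is the angle (in degrees) between that
    side and the x axis, folded into the interval (-90, 90].
    """
    corners = np.asarray(corners, dtype=float)
    cx, cy = corners.mean(axis=0)                 # center of the box

    side1 = corners[1] - corners[0]               # first edge of the quadrilateral
    side2 = corners[2] - corners[1]               # adjacent edge
    len1, len2 = np.linalg.norm(side1), np.linalg.norm(side2)

    # The longer edge defines the width and the orientation of the object.
    if len1 &gt;= len2:
        w, h, direction = len1, len2, side1
    else:
        w, h, direction = len2, len1, side2

    theta = np.degrees(np.arctan2(direction[1], direction[0]))
    if theta &lt;= -90:                              # fold the angle into (-90, 90]
        theta += 180
    elif theta &gt; 90:
        theta -= 180
    return cx, cy, w, h, theta
      </preformat>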
      <p>The architecture of the YOLOv11 model, which is designed for improving small object detection
and accuracy while maintaining real-time inference speed, is shown in figure 4. The network consists
of the following parts: input, main, and prediction. Some modules are omitted in the figure and only
the general structure is shown. The input part is used to extract features from the image, from which
three feature maps are sequentially extracted, which pass through the main part, where a number of
operations are performed on them, such as convolution (∗), upsampling (↑), and concatenation (⊕).</p>
      <p>
        The convolution (* ) consists of a 2D convolutional layer and a 2D batch normalization layer with
SiLU activation function [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. YOLOv11 uses C3K2 blocks to handle feature extraction at different processing stages. The C3K2 block optimizes information processing by dividing the feature map and applying a series of smaller 3 × 3 kernel convolutions, which are faster and cheaper to compute than large-kernel convolutions. It consists of convolution blocks at the beginning and end, followed by a series of convolution blocks with interval pooling that disregards residuals when negative, and ends with a pooling and a simple convolution block.
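      <p>The exact C3K2 layout follows the YOLOv11 code base; the following is only a simplified CSP-style sketch of the idea described above (PyTorch; channel counts and names are illustrative): split the features, process one branch with cheap 3 × 3 convolutions, and merge the results.</p>
      <preformat>
import torch
import torch.nn as nn

def conv_bn_silu(c_in, c_out, k=3, s=1):
    """Convolution + batch normalization + SiLU, the basic block used throughout the network."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class C3K2Sketch(nn.Module):
    """Simplified CSP-style block: split the feature map, run one half through a chain
    of small 3x3 convolutions, then concatenate everything and fuse with a 1x1 conv."""
    def __init__(self, channels, n_blocks=2):
        super().__init__()
        half = channels // 2
        self.reduce = conv_bn_silu(channels, channels, k=1)
        self.blocks = nn.ModuleList(conv_bn_silu(half, half, k=3) for _ in range(n_blocks))
        self.fuse = conv_bn_silu(channels + half * n_blocks, channels, k=1)

    def forward(self, x):
        x = self.reduce(x)
        a, b = x.chunk(2, dim=1)          # split the feature map into two halves
        outs = [a, b]
        for block in self.blocks:
            b = block(b)                  # cheap 3x3 convolutions on one branch
            outs.append(b)
        return self.fuse(torch.cat(outs, dim=1))
      </preformat>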
      </p>
      <p>A special feature of YOLOv11 is the use of the Spatial Pyramid Pooling – Fast (SPPF) module, which was developed to combine features from different regions of the image at different scales. To merge features, SPPF uses repeated max pooling operations (which emulate pooling with different kernel sizes) to aggregate multi-scale contextual information. This improves the processing of fine-grained objects in images.</p>
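      <p>A minimal sketch of an SPPF-style module along these lines (PyTorch; channel counts are illustrative):</p>
      <preformat>
import torch
import torch.nn as nn

class SPPFSketch(nn.Module):
    """Spatial Pyramid Pooling - Fast: stacked max-pool operations aggregate
    multi-scale context, then a 1x1 convolution fuses the concatenated maps."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        hidden = c_in // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, hidden, 1, bias=False),
                                 nn.BatchNorm2d(hidden), nn.SiLU())
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = nn.Sequential(nn.Conv2d(hidden * 4, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)          # repeated pooling emulates larger pooling kernels
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))
      </preformat>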
      <p>One of the significant innovations in YOLOv11 is the addition of the Cross Stage Partial with Spatial
Attention (C2PSA) block. This block introduces attention mechanisms that improve the model’s focus
on important areas of the image, such as smaller or partially covered objects, by emphasizing spatial
relevance in feature maps.</p>
      <p>Prediction produces detection blocks for three different scales (low, medium, high) using the feature
maps created by the previous processing steps. This approach ensures that small objects are detected in
greater detail while larger objects are captured by higher-level features.</p>
      <p>As a result of processing, the neural network produces three predictions at the scales 80 × 80, 40 × 40, and 20 × 20. The predicted object label for all scales has the following format: c – general label category, and five parameters of the bounding box (x, y – coordinates of the lower left corner; w, h – width and height; θ – angle of inclination to the x-axis).</p>
      <p>In the proposed five-parameter model (x, y, w, h, θ), regression is used to predict the rotation of the object bounding box, since weapons and military equipment samples on satellite images have a fixed aspect ratio, and the direction parallel to the longer side is defined as the direction of the object’s movement. Therefore, to facilitate the regression task, the longer side is defined as w, and the shorter side is defined as h; thus, the direction parallel to w is the direction of motion of the object. The angle between the longer side w and the x axis is the angle of rotation. Given that the required range of angles is [−90°, 90°], the arctangent function is chosen to calculate the angle θ. The rotation angle is calculated using the expression (figure 4):</p>
      <p>θ = arctan((y2 − y1)/(x2 − x1)), where (x1, y1) and (x2, y2) are the two endpoints of the longest side w: (x1, y1) denotes the point with the smaller value on the x axis, and (x2, y2) the opposite one.</p>
      <p>To accurately determine the angles, it is also necessary to perform a conversion between the five-parameter method and the four-point annotation (x1, y1), (x2, y2), (x3, y3), (x4, y4). The YOLOv11 model module, which performs data preprocessing and affine and color transformations of the image, receives the four corner points of the object as input. After recalculation, the final result of target detection is the coordinates of the four corner points of the rotated bounding box applied to the original image. An example of the calculation for the x coordinate of a corner point:
x'i = (−1)^f(xi, xc) · (w/2) cos θ − (−1)^f(yi, yc) · (h/2) sin θ + xc,
f(p, c) = 0 if p &gt; c, and f(p, c) = 1 if p &lt; c,
where x'i is the final value of the point after the transformation, and f(p, c) indicates the relative location of the initial corner point p with respect to the center point c.</p>
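      <p>A short sketch of this recalculation (Python with NumPy; names are illustrative), in which the sign pattern encoded by f(p, c) is folded into the ± combinations of the half-width and half-height vectors:</p>
      <preformat>
import numpy as np

def five_params_to_corners(cx, cy, w, h, theta_deg):
    """Return the four corner points of a rotated box given center, size and angle (degrees)."""
    t = np.radians(theta_deg)
    cos_t, sin_t = np.cos(t), np.sin(t)
    # Half-extent vectors along the box width and height directions.
    dx = np.array([cos_t, sin_t]) * w / 2.0
    dy = np.array([-sin_t, cos_t]) * h / 2.0
    center = np.array([cx, cy])
    # Corners listed in order around the rectangle.
    return np.stack([center + dx + dy,
                     center + dx - dy,
                     center - dx - dy,
                     center - dx + dy])

# Example: five_params_to_corners(50, 50, 100, 40, 30) gives the corners of a
# 100 x 40 box centered at (50, 50) and rotated by 30 degrees.
      </preformat>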
      <p>The angle value is added to solve the problem of regressing the object’s direction of rotation. For example, when a neural network processes an input image with 80 object detection and recognition categories and uses the four-parameter method (x, y, w, h) to locate the target, the final output matrix is S × S × (80 + 4 + 1), where S is the dimension of the feature map output by the last prediction layer and the extra channel is the probability that a given pixel in the feature map is the center point of the object; the main part module is located between the two layers mentioned above and provides modules such as FPN. Therefore, to use the five-parameter positioning method, an additional channel is added to the main part to predict the angle value (figure 4).</p>
      <p>When using the five-parameter positioning method, the center point of the object in the classification
and positioning prediction matrix in the original layer of the feature map is placed in a rectangular
coordinate system. As a result, during the training of the neural network model, a significant distance
between the training sample and the object prediction can lead to large values of the loss function,
which will not contribute to the convergence of the neural network model.</p>
      <p>Therefore, first, the cell of the coordinate grid where the label is placed is determined; its upper left corner is taken as the origin. The coordinates are then calculated as the offsets of x and y relative to this upper left corner, in the range of values [0; 1], which reduces the value of the loss function. When training the neural network, to increase the accuracy of localization of positive label predictions, YOLOv11 uses one training sample to create three positive predictions, which changes the range of the coordinates x and y to [−0.5; 1.5] (figure 5).</p>
      <p>The result given by the prediction part of the neural network model cannot be used directly to calculate the loss function. To limit it to the required range, coordinate regression functions are used for x and y:
x = 2/(1 + exp(−t_x)) − 0.5 + c_x,
where x is the actual position of the center point of the predicted bounding box, t_x is the output value of the neural network model after calculation, and c_x is the coordinate of the grid cell origin. The angle of inclination θ, limited to [−1.5; 1.5] (calculated in radians), is obtained analogously:
θ = 3/(1 + exp(−t_θ)) − 1.5.</p>
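      <p>A minimal sketch of this decoding step (PyTorch; tensor names are illustrative). The factor 2 keeps the x, y offsets in (−0.5, 1.5) and the factor 3 keeps the angle in (−1.5, 1.5) radians, as described above:</p>
      <preformat>
import torch

def decode_predictions(t_xy, t_theta, grid_xy):
    """Map raw head outputs to cell-relative coordinates and a rotation angle.

    t_xy     : (..., 2) raw x, y outputs of the prediction layer
    t_theta  : (...)    raw angle output
    grid_xy  : (..., 2) coordinates of the grid cell origins
    """
    xy = 2.0 * torch.sigmoid(t_xy) - 0.5 + grid_xy   # offsets in (-0.5, 1.5) plus cell origin
    theta = 3.0 * torch.sigmoid(t_theta) - 1.5       # angle in (-1.5, 1.5) radians
    return xy, theta
      </preformat>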
      <p>The loss function L for training the neural network model to position and orient the bounding box is
L = L_box + L_θ,
where the loss function L_box calculates the size and location of the center, and L_θ calculates the angle of rotation.</p>
      <p>
        The function L_box [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] (figure 5) works with the width w, the height h, the distance ρ between the two center points of the bounding boxes, and the distance c between the outer corners of their union. In figure 5, the bounding box of the training sample b̃ is marked with a solid line, the predicted one b with a dashed line, the intersection I(b̃, b) with a dashed line, and the union U(b̃, b) with a dotted line.
      </p>
      <p>The full loss function L_box can be described as follows:
L_box = 1 − IoU + ρ²(b̃, b)/c² + αv,
where ρ(b̃, b) is the Euclidean distance between the center points of the bounding boxes b̃ and b, c is the minimum diagonal distance of their union, and ρ²(b̃, b)/c² and αv are the penalties of the loss function for the distance between the center points and for the aspect ratio of the bounding boxes, respectively.</p>
      <p>The components of L_box take into account the following. IoU calculates the intersection area over the union of the training sample bounding box and the object prediction:
IoU = I(b̃, b) / U(b̃, b).
The term v is used to measure the consistency of the aspect ratio:
v = (4/π²)(arctan(w̃/h̃) − arctan(w/h))².
The coefficient α takes the aspect ratio penalty into account:
α = v / ((1 − IoU) + v).</p>
      <p>The rotation angle is penalized by the smooth L1 loss:
L_θ = 0.5(θ̃ − θ)² if |θ̃ − θ| &lt; 1, and L_θ = |θ̃ − θ| − 0.5 if |θ̃ − θ| ≥ 1,
where θ̃ is the rotation angle of the training sample b̃ and θ is the predicted angle.</p>
      <p>Backpropagation will gradually reduce the training losses of the neural network model to achieve the expected object detection result.</p>
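      <p>A compact sketch of this combined loss (PyTorch; names are illustrative, the CIoU term is computed on (x, y, w, h) boxes, and the angle is penalized separately with smooth L1, whose default threshold matches the value 1 above):</p>
      <preformat>
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes given as (cx, cy, w, h); the angle is handled separately."""
    px, py, pw, ph = pred.unbind(-1)
    tx, ty, tw, th = target.unbind(-1)

    # Intersection and union of the boxes.
    inter_w = (torch.min(px + pw / 2, tx + tw / 2) - torch.max(px - pw / 2, tx - tw / 2)).clamp(0)
    inter_h = (torch.min(py + ph / 2, ty + th / 2) - torch.max(py - ph / 2, ty - th / 2)).clamp(0)
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter + eps
    iou = inter / union

    # Squared center distance over the squared diagonal of the enclosing box.
    encl_w = torch.max(px + pw / 2, tx + tw / 2) - torch.min(px - pw / 2, tx - tw / 2)
    encl_h = torch.max(py + ph / 2, ty + th / 2) - torch.min(py - ph / 2, ty - th / 2)
    rho2 = (px - tx) ** 2 + (py - ty) ** 2
    c2 = encl_w ** 2 + encl_h ** 2 + eps

    # Aspect-ratio consistency term and its trade-off coefficient.
    v = (4 / math.pi ** 2) * (torch.atan(tw / th) - torch.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

def angle_loss(pred_theta, target_theta):
    """Smooth L1 penalty on the rotation angle (radians)."""
    return torch.nn.functional.smooth_l1_loss(pred_theta, target_theta)

def total_loss(pred_box, target_box, pred_theta, target_theta):
    return ciou_loss(pred_box, target_box).mean() + angle_loss(pred_theta, target_theta)
      </preformat>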
    </sec>
    <sec id="sec-3">
      <title>3. Experimental results</title>
      <p>To evaluate the results of the proposed rotating detector model, comparative experiments were conducted on the DOTA reference dataset. The DOTA images were collected from Google Earth, from GF-2 and JL-1 satellite remote sensing data provided by the China Centre for Resources Satellite Data and Application, and from aerial photographs provided by CycloMedia. DOTA consists of RGB and grayscale images: RGB images are taken from Google Earth and CycloMedia, and grayscale images are taken from the panchromatic band of GF-2 and JL-1 satellite images. All images are saved in png format. The dataset contains 11268 remote sensing images (with sizes starting from 800 × 800 pixels), which are divided into 18 categories. Dataset composition: 4622 images with 621973 instances form the training set; 593 images with 81048 instances form the validation set; 6053 images with 1090637 instances form the test set. Each instance is labeled as a quadrilateral with four corner points given clockwise: (x1, y1), (x2, y2), (x3, y3), (x4, y4). Half of the images in this set were used as a training set, one third as a test set, and one sixth as a validation set.</p>
      <p>To evaluate the performance of the model, we used the mean average precision metric (mAP), which averages the AP scores over a range of IoU thresholds. It penalizes a large number of bounding boxes with incorrect classifications, avoiding over-specialization in a few classes at the expense of weak performance in others.</p>
      <p>The model was trained for 120 epochs with a learning rate of 0.01 and a momentum of 0.937. To finalize the model, three test-time augmentations (TTA) were applied: slicing the image into tiles of 650 × 650, 750 × 750, and 850 × 850 pixels, and rotation by 0°, 90°, 180°, and 270°. To take into account the location of an object in the image (to reduce the influence of objects with larger curved features at the edge of the image), we reduced the predicted probability by a correction factor of 0.8.</p>
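      <p>A rough sketch of how such test-time augmentation can be organized (Python; slice_tiles, rotate, detect, and merge_detections are hypothetical helpers used only for illustration, not the actual pipeline):</p>
      <preformat>
import itertools

TILE_SIZES = (650, 750, 850)
ROTATIONS = (0, 90, 180, 270)

def tta_detect(image, detect, merge_detections, rotate, slice_tiles, edge_factor=0.8):
    """Run detection over sliced tiles and rotated copies, then merge the results.

    detect, merge_detections, rotate and slice_tiles are assumed helpers:
    detect(img) returns a list of detections, rotate(img, deg) a rotated image,
    slice_tiles(img, size) yields (tile, offset) pairs, and merge_detections
    fuses overlapping boxes mapped back to the original image.
    """
    all_dets = []
    for tile_size, angle in itertools.product(TILE_SIZES, ROTATIONS):
        for tile, offset in slice_tiles(image, tile_size):
            for det in detect(rotate(tile, angle)):
                # Down-weight detections near the tile border, where objects are distorted.
                if det.touches_border:
                    det.score *= edge_factor
                all_dets.append(det.map_back(offset, angle))
    return merge_detections(all_dets)
      </preformat>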
      <p>As a result of tuning the developed model, together with enlarging the image set and adding post-processing, the mAP of object detection and recognition was improved by 0.33%, to 81.69, compared to YOLOv11-obb.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <p>To improve the efficiency and reliability of detailed interpretation of remote sensing data, we analyzed methods of automatic image processing. The study of neural network models for detecting and recognizing small, randomly oriented objects in satellite images revealed difficulties that reduce the accuracy of object detection and recognition.</p>
      <p>In this study, a rotating bounding box detection model based on YOLOv11 is proposed to solve the problem of traditional horizontal detectors, which have difficulty detecting targets with high density, high aspect ratio, and overlapping bounding boxes. A rotation angle channel and a corresponding angular loss function were added to the original YOLOv11 model. To achieve the learning effect, data label preprocessing was set up to detect and calculate the width, height, and angle of the objects. A publicly available remote sensing dataset was selected to validate the model results and assess its effectiveness. Experimental data and visual analysis showed that the YOLOv11-based model is an effective choice for detecting and recognizing small-scale, multidirectional objects in remote sensing images. Further research should focus on the problem of detecting and recognizing objects with detector models in adverse meteorological conditions.</p>
      <p>Declaration on Generative AI: The authors have not employed any generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kovbasiuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kanevskyy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chernyshuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romanchuk</surname>
          </string-name>
          ,
          <article-title>Detection of vehicles on images obtained from unmanned aerial vehicles using instance segmentation</article-title>
          ,
          <source>in: 15th International Conference on Advanced Trends in Radioelectronics</source>
          , Telecommunications and Computer Engineering, TCSET
          <year>2020</year>
          ,
          <year>2020</year>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>271</lpage>
          . doi:
          <volume>10</volume>
          .1109/TCSET49122.
          <year>2020</year>
          .
          <volume>235437</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kovbasiuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kanevskyy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romanchuk</surname>
          </string-name>
          ,
          <article-title>A hybrid segmentation cascade model for automatic object decoding on aerial images, Modern information technologies in the field of security and defense 35 (</article-title>
          <year>2019</year>
          )
          <fpage>65</fpage>
          -
          <lpage>70</lpage>
          . doi:
          <volume>10</volume>
          .33099/
          <fpage>2311</fpage>
          -
          <lpage>7249</lpage>
          /
          <fpage>2019</fpage>
          -35-2-
          <fpage>65</fpage>
          -70.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sudha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Priyadarshini</surname>
          </string-name>
          ,
          <article-title>An intelligent multiple vehicle detection and tracking using modified vibe algorithm and deep learning algorithm</article-title>
          ,
          <source>Soft Computing</source>
          <volume>24</volume>
          (
          <year>2020</year>
          )
          <fpage>17417</fpage>
          -
          <lpage>17429</lpage>
          . doi:
          <volume>10</volume>
          .1007/ s00500-020-05042-z.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Dogra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <article-title>Unsupervised classification of erroneous video object trajectories</article-title>
          ,
          <source>Soft Computing</source>
          <volume>22</volume>
          (
          <year>2018</year>
          )
          <fpage>4703</fpage>
          -
          <lpage>4721</lpage>
          . doi:
          <volume>10</volume>
          .1007/s00500-017-2656-x.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Small-scale moving target detection in aerial image by deep inverse reinforcement learning</article-title>
          ,
          <source>Soft Computing</source>
          <volume>24</volume>
          (
          <year>2020</year>
          )
          <fpage>5897</fpage>
          -
          <lpage>5908</lpage>
          . doi:
          <volume>10</volume>
          .1007/ s00500-019-04404-6.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Araujo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fontinele</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.</surname>
          </string-name>
          <article-title>Oliveira, Multi-Perspective Object Detection for Remote Criminal Analysis Using Drones</article-title>
          ,
          <source>IEEE Geoscience and Remote Sensing Letters</source>
          <volume>17</volume>
          (
          <year>2020</year>
          )
          <fpage>1283</fpage>
          -
          <lpage>1286</lpage>
          . doi:
          <volume>10</volume>
          . 1109/lgrs.
          <year>2019</year>
          .
          <volume>2940546</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Mu</surname>
          </string-name>
          , G. Kou,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Object Detection Based on Eficient Multiscale Auto-Inference in Remote Sensing Images</article-title>
          ,
          <source>IEEE Geoscience and Remote Sensing Letters</source>
          <volume>18</volume>
          (
          <year>2021</year>
          )
          <fpage>1650</fpage>
          -
          <lpage>1654</lpage>
          . doi:
          <volume>10</volume>
          .1109/LGRS.
          <year>2020</year>
          .
          <volume>3004061</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Qaddour</surname>
          </string-name>
          ,
          <article-title>Object Detection Performance: A Comparative Study</article-title>
          ,
          <year>2023</year>
          . doi:
          <volume>10</volume>
          .21203/rs.3. rs-
          <volume>3181849</volume>
          /v1.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Fast R-CNN</article-title>
          , in: IEEE International Conference on Computer Vision (ICCV), Santiago, Chile,
          <year>2015</year>
          , p.
          <fpage>1440</fpage>
          -
          <lpage>1448</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICCV.
          <year>2015</year>
          .
          <volume>169</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>39</volume>
          (
          <year>2017</year>
          )
          <fpage>1137</fpage>
          -
          <lpage>1149</lpage>
          . doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2016</year>
          .
          <volume>2577031</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>CDD-Net: A Context-Driven Detection Network for Multiclass Object Detection</article-title>
          ,
          <source>IEEE Geoscience and Remote Sensing Letters</source>
          <volume>19</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:
          <volume>10</volume>
          .1109/LGRS.
          <year>2020</year>
          .
          <volume>3042465</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          , W. Shao,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <article-title>Arbitrary-Oriented Scene Text Detection via Rotation Proposals</article-title>
          ,
          <source>IEEE Transactions on Multimedia</source>
          <volume>20</volume>
          (
          <year>2018</year>
          )
          <fpage>3111</fpage>
          -
          <lpage>3122</lpage>
          . doi:
          <volume>10</volume>
          . 1109/TMM.
          <year>2018</year>
          .
          <volume>2818020</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection</article-title>
          ,
          <source>CoRR abs/1706</source>
          .09579 (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1706.09579. arXiv:
          <volume>1706</volume>
          .
          <fpage>09579</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Law</surname>
          </string-name>
          , J. Deng, CornerNet: Detecting Objects as Paired Keypoints,
          <source>International Journal of Computer Vision</source>
          <volume>128</volume>
          (
          <year>2019</year>
          )
          <fpage>642</fpage>
          -
          <lpage>656</lpage>
          . doi:
          <volume>10</volume>
          .1007/s11263-019-01204-1.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Krähenbühl</surname>
          </string-name>
          , Objects as Points, CoRR abs/
          <year>1904</year>
          .07850 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/
          <year>1904</year>
          .07850. arXiv:
          <year>1904</year>
          .07850.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , SSD: Single Shot MultiBox Detector, in: B.
          <string-name>
            <surname>Leibe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Matas</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Sebe</surname>
          </string-name>
          , M. Welling (Eds.),
          <source>Computer Vision - ECCV</source>
          <year>2016</year>
          , volume
          <volume>9905</volume>
          of Lecture Notes in Computer Science, Springer International Publishing, Cham,
          <year>2016</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>319</fpage>
          -46448-
          <issue>0</issue>
          _
          <fpage>2</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>T.-Y. Lin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Focal Loss for Dense Object Detection</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>42</volume>
          (
          <year>2020</year>
          )
          <fpage>318</fpage>
          -
          <lpage>327</lpage>
          . doi:
          <volume>10</volume>
          .1109/ TPAMI.
          <year>2018</year>
          .
          <volume>2858826</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Khanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hussain</surname>
          </string-name>
          ,
          <article-title>YOLOv11: An Overview of the Key Architectural Enhancements</article-title>
          ,
          <source>CoRR abs/2410</source>
          .17725 (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .48550/ARXIV.2410.17725. arXiv:
          <volume>2410</volume>
          .
          <fpage>17725</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>Gai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , G. Jing,
          <article-title>An improved Tiny YOLOv3 for real-time object detection</article-title>
          ,
          <source>Systems Science &amp; Control Engineering</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>314</fpage>
          -
          <lpage>321</lpage>
          . doi:
          <volume>10</volume>
          .1080/21642583.
          <year>2021</year>
          .
          <volume>1901156</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>G.-S.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Datcu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pelillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>DOTA: A Large-Scale Dataset for Object Detection in Aerial Images</article-title>
          , in: 2018
          <source>IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3974</fpage>
          -
          <lpage>3983</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2018</year>
          .
          <volume>00418</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection</article-title>
          ,
          <source>CoRR abs/1706</source>
          .09579 (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1706.09579. arXiv:
          <volume>1706</volume>
          .
          <fpage>09579</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Elfwing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Uchibe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Doya</surname>
          </string-name>
          ,
          <article-title>Sigmoid-weighted linear units for neural network function approximation in reinforcement learning</article-title>
          ,
          <source>Neural Networks</source>
          <volume>107</volume>
          (
          <year>2018</year>
          )
          <fpage>3</fpage>
          -
          <lpage>11</lpage>
          . doi:
          <volume>10</volume>
          .1016/J. NEUNET.
          <year>2017</year>
          .
          <volume>12</volume>
          .012.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>