<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An approach to matching satellite and UAV images for visual place recognition using color normalization and YOLO</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Volodymyr Vozniak</string-name>
          <email>vozniakvz@khmnu.edu.ua</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuriy Ushenko</string-name>
          <email>y.ushenko@chnu.edu.ua</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Orken Mamyrbayev</string-name>
          <email>morkenj@mail.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Information and Computational Technologies</institution>
          ,
          <addr-line>125, Pushkin str., Almaty, 050010</addr-line>
          ,
          <country>Republic of Kazakhstan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Khmelnytskyi National University</institution>
          ,
          <addr-line>11, Institutes str., Khmelnytskyi, 29016</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Yuriy Fedkovych Chernivtsi National University</institution>
          ,
          <addr-line>Chernivtsi, 58012</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Accurate localization of Unmanned Aerial Vehicles (UAVs) in GPS-denied urban environments is a critical challenge for autonomous navigation and disaster response. The primary difficulty lies in the reliable cross-view matching of onboard UAV imagery with geo-referenced satellite databases, which is often hindered by significant discrepancies in viewpoint, scale, and illumination. In this work, we propose a robust visual place recognition framework that integrates a fine-tuned YOLO11 object detection model with a novel statistical distribution alignment method to bridge the domain gap between aerial and satellite views. Our approach specifically targets building segmentation to extract semantically meaningful vector representations, achieving an F1-score of 0.722 on a dedicated building dataset. Furthermore, we introduce a cumulative distribution function (CDF) alignment technique that standardizes pixel intensity distributions, significantly enhancing visual consistency across modalities. Experimental evaluation on the VPAIR dataset demonstrates that the proposed pipeline achieves a Recall@1 score of 0.195 with a localization radius of 3 in urban scenarios, outperforming existing CNN-based methods such as CosPlace. The significant conclusion of this study is that leveraging YOLO-derived vector representations in conjunction with rigorous statistical preprocessing provides a computationally efficient and accurate solution for global UAV localization in complex urban terrains.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual place recognition (VPR)</kwd>
        <kwd>UAV</kwd>
        <kwd>YOLO</kwd>
        <kwd>image preprocessing</kwd>
        <kwd>deep learning</kwd>
        <kwd>image segmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Accurate localization of Unmanned Aerial Vehicles (UAVs) in complex environments is essential for
applications such as disaster response, environmental monitoring, precision agriculture, and urban
planning. While Global Navigation Satellite Systems (GNSS) such as GPS remain the standard solution, their
performance is often compromised by signal blockage, interference, or multipath effects, particularly
in urban canyons or areas with dense infrastructure [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In such GPS-denied scenarios, vision-based
localization emerges as an attractive alternative, offering low-cost, information-rich positioning that
does not suffer from cumulative drift [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Visual Place Recognition (VPR) has become a prominent vision-based approach, allowing UAVs to
localize themselves by matching onboard camera views to geo-referenced imagery [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Of particular
interest is cross-view matching between UAV images and satellite data, since satellites provide global
coverage and accessible references. However, this task is highly challenging due to substantial viewpoint,
scale, and resolution differences, as well as variations in color, illumination, weather, and seasonal
conditions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Repetitive structures such as rooftops and fields further reduce the availability of
distinctive landmarks. Classical feature-based methods (e.g., SIFT, SURF) have proven unreliable in such
scenarios, highlighting the need for more robust alternatives.
      </p>
      <p>
        Recent progress in deep learning has enabled the extraction of viewpoint- and appearance-invariant
embeddings that significantly improve cross-view recognition. Convolutional neural networks (CNNs),
when trained on large datasets, can learn semantically meaningful features and achieve high recall
despite extreme perspective differences [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In addition, preprocessing methods, including geometric
rectification, color and illumination normalization, and statistical distribution alignment, further reduce
the domain gap between UAV and satellite imagery, thereby stabilizing visual descriptors across diverse
conditions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Considering the limited computational resources on UAV platforms, solutions must combine accuracy
with efficiency. The YOLO family of object detection networks provides an effective balance, offering
real-time inference and high accuracy on resource-constrained hardware [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. YOLO models deliver
multi-scale feature representations while focusing on salient objects, making them suitable backbones
for UAV localization pipelines.
      </p>
      <p>The contributions of this research include:
• development of a method to obtain robust image embeddings from the convolutional layers of a
fine-tuned YOLO11 model trained on a dataset of segmented buildings;
• incorporation of preprocessing and distribution alignment techniques (e.g., cumulative
distribution function normalization) to mitigate domain discrepancies between UAV and satellite
imagery;
• demonstration of an efficient pipeline for global UAV localization that maintains real-time
performance on edge devices while enhancing robustness under GPS-denied conditions.</p>
      <p>The paper is structured as follows: Section 2 presents related work and defines the research objectives.
Section 3 outlines the proposed UAV localization framework, with emphasis on preprocessing and
distribution alignment methods. Section 4 reports experimental results, analyzing the performance of
the fine-tuned YOLO model and comparing it with existing approaches. Section 5 concludes the paper
and discusses directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        Modern visual place recognition (VPR) is commonly formulated as an image retrieval problem: a query
image (e.g., UAV snapshot) is compared against a large database of geo-referenced references (e.g.,
satellite tiles), and the closest match indicates the UAV’s location. The central element of this process is
the image descriptor. Early approaches relied on handcrafted global descriptors or bag-of-visual-words
built from local features such as SURF or SIFT (e.g., FAB-MAP [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], DBoW2 [
        <xref ref-type="bibr" rid="ref9">9</xref>
         ]). While effective under
moderate viewpoint changes, these methods proved unreliable in UAV-to-satellite scenarios, where
viewpoint, scale, and illumination differences are severe.
      </p>
      <p>
        The emergence of deep learning brought about a step-change in descriptor quality. CNN-based
embeddings demonstrated robustness to lighting and viewpoint variations, substantially outperforming
engineered features. A seminal contribution was NetVLAD [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which combined a CNN backbone
with a VLAD aggregation layer, trained on large datasets such as Pitts-250k, to produce compact global
descriptors. Subsequent works refined this idea: Patch-NetVLAD [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] incorporated multi-scale feature
aggregation for improved viewpoint invariance, though at the cost of high-dimensional vectors and
increased memory requirements.
      </p>
      <p>
        Alongside CNN-based models, research has explored specialized datasets and cross-view learning
paradigms. For example, the University-1652 dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] introduced UAV imagery for building-level
geo-localization, enabling Siamese and triplet networks to learn a common embedding space for UAV and
satellite views. Other benchmarks such as VIGOR [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], SUES-200 [14], ALTO [15], and VPAIR [16] have
further pushed evaluation toward multi-altitude, multi-terrain, and rotation-robust settings, highlighting
the importance of large-scale, curated datasets for advancing cross-view VPR.
      </p>
      <p>Lightweight and multimodal descriptors have also been proposed. MinkLoc employed sparse 3D
convolutions for large-scale recognition, effective with LiDAR or depth data, though impractical for UAVs
due to payload and energy constraints. More recently, CosPlace [17] reframed VPR as a classification
problem, training on the massive San Francisco XL dataset of 40 million directional images, while
MixVPR [18] introduced an MLP-based feature mixer trained on GSV-Cities (530,000 images from 62,000
locations) [19]. These advances underline the trend toward dataset-driven improvements and the need
for scalable embeddings.</p>
      <p>The rise of transformer architectures has added further capabilities through attention-based modeling.
LoFTR [20] removed explicit keypoint detectors, directly matching local features, while AnyLoc [21]
leveraged self-supervised DinoV2 [22] features and combined local and global attention modules. Such
approaches achieve high benchmark scores but remain computationally heavy, often unsuitable for
real-time UAV deployment due to high-dimensional descriptors and slow inference.</p>
      <p>Another important strand of research addresses photometric discrepancies between UAV and satellite
imagery. Preprocessing techniques such as histogram matching, mean-variance color normalization, and
style transfer help reduce domain gaps [23]. More advanced strategies, such as cumulative
distribution-based color alignment, standardize UAV imagery toward satellite-like appearance, improving feature
consistency across modalities. In addition, style augmentation during training has been shown to
improve robustness against weather, seasonal, and illumination variations.</p>
      <p>In summary, state-of-the-art VPR methods range from compact CNN-based global descriptors to
transformer-based hybrid models and self-supervised embeddings. While transformers and
foundation models provide highly discriminative features, CNN-based pipelines remain attractive for UAV
applications due to their computational efficiency. Surprisingly, despite YOLO’s established balance of
accuracy and speed in object detection, it has not yet been systematically explored for VPR.</p>
      <p>
        Based on the literature analysis, the identified research gaps are as follows:
• limited deployment of real-time VPR pipelines optimized for UAV hardware with constrained
computational resources;
• high-dimensional descriptors in methods such as NetVLAD [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Patch-NetVLAD [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] hinder
memory efficiency and scalability;
• transformer-based approaches achieve strong accuracy but remain impractical for UAVs due to
inference latency and resource demands;
• color and style normalization techniques exist but are often dataset-specific; generalizable domain
alignment methods are still lacking;
• YOLO-based architectures, despite their proven efficiency in detection tasks, have not been
systematically adapted or evaluated for UAV–satellite VPR.
      </p>
      <p>The objective of this work is to build an enhanced visual place recognition system that aligns UAV
images with satellite views, integrating deep embedding models and color preprocessing techniques to
achieve greater robustness and precision, especially in urban cross-view conditions. In particular, the
target system is designed to reliably recognize places from a low-altitude UAV perspective by matching
against satellite imagery, even under significant viewpoint and appearance changes.</p>
      <p>To achieve this goal, we have formulated the following tasks:
• fine-tune the YOLO11 model on a dataset with segmented buildings to obtain vector
representations for UAV global location determination under challenging conditions;
• develop a method for aligning the statistical distributions of satellite imagery and UAV images to
achieve visual similarity;
• compare results obtained by the proposed method against existing CNN-based methods across
different terrain types.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and methods</title>
      <sec id="sec-3-1">
        <title>3.1. Methods</title>
        <p>Determining the position of Unmanned Aerial Vehicles (UAVs) is commonly formulated as a Visual Place
Recognition (VPR) problem. In the proposed approach, a query image captured by the UAV is compared
against a database of geo-referenced satellite imagery. The UAV’s global location is first estimated by
retrieving the most visually similar satellite image (or set of candidates). This is then refined by aligning
the query image with the selected satellite image to obtain precise geographic coordinates.</p>
        <p>Accordingly, the task can be divided into two main stages:
• global localization: retrieving the most similar satellite tile from a large reference database;
• coordinate refinement: computing the exact position (latitude and longitude) within the matched
tile.</p>
        <p>The overall processing workflow for UAV localization is depicted in Figure 1.</p>
        <p>In this study, convolutional neural network (CNN) architectures are employed to address the problem
of UAV global localization in urban environments. The overall structure of the proposed approach is
illustrated in Figure 2.</p>
        <p>We adopt the following notation: let q denote a query image captured by the UAV at a specific
position and orientation, and let ℛ = {r_1, r_2, . . . , r_N} represent a database of reference images with
known geographic coordinates (typically satellite imagery).</p>
        <p>To obtain image descriptors, we define a mapping function from the image domain I to the feature
space V:</p>
        <p>f : I → V,   (1)
where I is the image and v = f(I) is its corresponding feature vector.</p>
        <p>The task of UAV global localization is then reduced to finding the reference image r_j ∈ ℛ whose
location ℓ(r_j) is closest to that of the query image ℓ(q). Formally, this can be expressed as:
j* = argmin_{1 ≤ j ≤ N} d(f(q), f(r_j)),   (2)
where d(·, ·) is a distance metric (e.g., Euclidean distance) applied to feature vectors, and j* denotes the
index of the retrieved satellite image.</p>
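        <p>For illustration, the retrieval rule in (2) reduces to a nearest-neighbor search over descriptors. A
minimal NumPy sketch is given below; the array names are illustrative and assume the descriptors have
already been computed by the mapping (1):</p>
        <preformat>
import numpy as np

def retrieve(query_vec, ref_vecs):
    """Return the index j* of the reference descriptor closest to the query (Eq. 2)."""
    dists = np.linalg.norm(ref_vecs - query_vec, axis=1)  # Euclidean distance d(f(q), f(r_j))
    return int(np.argmin(dists))
</preformat>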
        <p>To enable global UAV localization, a database of satellite images along a predefined route must first
be established. Using deep learning models, each image can be transformed into a vector representation
(feature embedding) as defined in (1). In this study, the YOLO11 model is fine-tuned on a dataset of
segmented buildings, and descriptors are extracted from its final backbone layer, which encodes the
most comprehensive information produced by YOLO11’s convolutional filters. The same model is
applied to both satellite and UAV images, ensuring consistent feature representations across domains.
Consequently, the fine-tuned YOLO11 network generates embeddings for the satellite database and, in
real time, processes incoming UAV frames to produce query embeddings (1). The UAV’s global location
is then identified by selecting the satellite image whose descriptor minimizes a distance measure (e.g.,
Euclidean distance) with the UAV embedding (2).</p>
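        <p>The descriptor extraction itself can be sketched as follows. This is a minimal example, assuming the
ultralytics YOLO wrapper exposes its underlying torch module as model.model; the weights path, the
backbone layer index, and global-average pooling are illustrative assumptions rather than the exact
configuration used in this study:</p>
        <preformat>
import numpy as np
from ultralytics import YOLO

model = YOLO("yolo11_buildings.pt")   # hypothetical path to the fine-tuned weights
captured = {}

def hook(module, inputs, output):
    # Global-average-pool the final backbone feature map into a flat descriptor
    captured["vec"] = output.mean(dim=(2, 3)).squeeze(0).detach().cpu().numpy()

# Index 9 is an assumed position of the last backbone block; inspect model.model.model to confirm
model.model.model[9].register_forward_hook(hook)

def embed(image):
    """Run one forward pass and return the pooled backbone descriptor."""
    model.predict(image, verbose=False)
    return captured["vec"]
</preformat>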
        <p>To further reduce domain discrepancies between UAV and satellite imagery, we propose a distribution
alignment method that enhances visual similarity at the pixel level. Unlike prior approaches that directly
apply cumulative distribution functions (CDF) from satellite to UAV images – which can introduce
distortions due to mismatched distributions – our method computes transformations individually for
each UAV frame, taking into account its unique statistical properties.</p>
        <p>The approach is grounded in probability theory. For a random variable X with distribution function
F_X(x), one can obtain a uniformly distributed variable U ∼ U(0, 1) by applying the transformation
U = F_X(X). Conversely, applying the inverse CDF to U ∼ U(0, 1) yields a variable with the distribution
F_X(x). Satellite images, acquired using consistent sensors, exhibit stable color characteristics, while
UAV imagery varies significantly depending on illumination and acquisition conditions. Treating
pixel intensities as discrete random variables, we denote UAV and satellite intensities by I_u and I_s
respectively. By computing F_s(x) from the satellite dataset and estimating F_u(x) from UAV images,
the alignment transformation is given by:
I'_u = F_s^{−1}(F_u(I_u)),   (3)
where I'_u denotes the transformed UAV intensity aligned to the satellite domain.</p>
        <p>Figure 3 illustrates the proposed alignment process based on these probabilistic principles.</p>
        <sec id="sec-3-1-1">
          <title>The proposed method comprises two stages:</title>
          <p>1. Computation of the averaged cumulative distribution function from the satellite image set.
2. Application of the averaged function individually to UAV frames for appearance normalization.</p>
          <p>Algorithm 1 provides pseudocode for computing the averaged satellite CDF, while Algorithm 2
outlines its application to UAV images. For each satellite image i, per-channel CDFs are computed,
F_i(x) = [F_{i,R}(x), F_{i,G}(x), F_{i,B}(x)],
and averaged over the N_s satellite images to obtain, for each color channel c,
F̂_c(x) = (1/N_s) Σ_{i=1}^{N_s} F_{i,c}(x).
Each UAV frame k is then transformed channel-wise according to (3):
I'_{k,c} = F̂_c^{−1}(F_{k,c}(I_{k,c})),
where F̂_c^{−1} is the inverse function of the averaged cumulative distribution function for satellite
images in channel c.</p>
          <p>Because F̂_c^{−1} might be non-analytical, interpolation is used as an approximation:
I'_{k,c} = interp(F_{k,c}(I_{k,c}), F̂_c(x), x),
where interp is an interpolation function that finds the corresponding x for each F_{k,c}(I_{k,c}), such
that F̂_c(x) ≈ F_{k,c}(I_{k,c}); here I_{k,c} denotes the pixels of the k-th UAV image for the given color
channel c, and x ∈ [0; 255].</p>
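          <p>A compact NumPy sketch of Algorithms 1 and 2 is shown below; it assumes 8-bit RGB arrays and
approximates the inverse of the averaged satellite CDF with np.interp, following (3):</p>
          <preformat>
import numpy as np

def channel_cdf(channel):
    """Empirical CDF of one 8-bit color channel, evaluated at intensities 0..255."""
    hist = np.bincount(channel.ravel(), minlength=256).astype(np.float64)
    cdf = np.cumsum(hist)
    return cdf / cdf[-1]

def averaged_satellite_cdf(satellite_images):
    """Algorithm 1 (sketch): per-channel CDF averaged over the satellite image set."""
    acc = np.zeros((3, 256))
    for img in satellite_images:                 # img: HxWx3 uint8 array
        for c in range(3):
            acc[c] += channel_cdf(img[:, :, c])
    return acc / len(satellite_images)

def align_uav_image(uav_img, sat_cdf):
    """Algorithm 2 (sketch): map each UAV channel through its CDF, then the inverse averaged satellite CDF (Eq. 3)."""
    out = np.empty_like(uav_img)
    levels = np.arange(256)
    for c in range(3):
        f_uav = channel_cdf(uav_img[:, :, c])           # F_{k,c} evaluated at 0..255
        mapping = np.interp(f_uav, sat_cdf[c], levels)  # approximate inverse of the averaged CDF
        out[:, :, c] = mapping[uav_img[:, :, c]].astype(np.uint8)
    return out
</preformat>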
          <p>Histogram alignment standardizes pixel intensity distributions between UAV and satellite images
by adjusting brightness, contrast, and tonal characteristics. This reduces intra-class variability and
improves consistency for feature extraction, thereby enhancing matching accuracy and recognition
performance.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation</title>
        <p>The performance of image segmentation models is commonly evaluated using metrics such as mean
Average Precision (mAP), Precision, Recall, and F1-score. Precision, Recall, and F1-score [24] are
widely used across machine learning tasks, including binary classification and image segmentation. A
distinctive aspect of segmentation evaluation is the absence of true-negative counts in the confusion
matrix, which, however, does not affect the computation of these metrics.</p>
        <p>Among these measures, mean Average Precision (mAP) is of particular importance and can be
formally defined as:</p>
        <p>AP_c = Σ_k (P_k (R_k − R_{k−1})),   (4)
mAP = (1/C) Σ_{c=1}^{C} AP_c,   (5)
where P_k and R_k are Precision and Recall at the threshold k with R_0 = 0 and R_K = 1, C is the number
of classes, and AP_c is the average precision for the class c.</p>
        <p>Conceptually, AP_c corresponds to the area under the Precision–Recall curve for a given class. Since
this study focuses on segmentation of a single class (buildings), the evaluation reduces to mAP = AP.</p>
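        <p>As a simple illustration, AP in (4) can be approximated directly from sampled precision–recall pairs
(a rectangle-sum sketch; the interpolation used by standard detection toolkits may differ slightly):</p>
        <preformat>
import numpy as np

def average_precision(precisions, recalls):
    """AP as the area under the precision-recall curve, following Eq. (4)."""
    r = np.concatenate(([0.0], np.asarray(recalls), [1.0]))
    p = np.concatenate(([1.0], np.asarray(precisions), [0.0]))
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))
</preformat>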
        <p>For localization tasks, Recall@N [25] is a standard metric. A query is considered correctly localized if
at least one relevant image from the database appears within the top-N retrieved results:
Recall@N = N_correct / N_total,   (6)
where N_total is the total number of query images, and N_correct is the number of queries with at least one
correct match within the top-N results. This measure is particularly suitable when subsequent processing
steps can further refine or filter false positives.</p>
        <p>An extended definition of Recall@N incorporates a localization radius, whereby a result is deemed
correct if the distance between the query and retrieved images falls within a predefined threshold. This
radius can be specified either in physical units (e.g., meters) or in frame indices when satellite images
are sequentially ordered. Such a formulation is especially useful in scenarios with overlapping reference
imagery, allowing precise UAV localization even in the absence of an exact image match.</p>
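        <p>The extended Recall@N with a frame-index radius can be sketched as follows; this is a minimal
example assuming that reference images are sequentially ordered and that query and reference feature
matrices have been precomputed:</p>
        <preformat>
import numpy as np

def recall_at_n(query_feats, ref_feats, query_ids, ref_ids, n=1, radius=3):
    """Recall@N where a retrieval counts as correct if it lies within `radius` frames of the truth."""
    correct = 0
    ref_ids = np.asarray(ref_ids)
    for q_vec, q_id in zip(query_feats, query_ids):
        dists = np.linalg.norm(ref_feats - q_vec, axis=1)   # L2 distance to every reference tile
        top_n = np.argsort(dists)[:n]
        if np.any(np.abs(ref_ids[top_n] - q_id) &lt;= radius):
            correct += 1
    return correct / len(query_feats)
</preformat>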
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussion</title>
      <sec id="sec-4-1">
        <title>4.1. YOLO finetuning</title>
        <p>Since the standard YOLO model does not include buildings as a predefined class, this study fine-tunes
the YOLO11 [26] architecture using a dedicated building segmentation dataset [27]. The dataset contains
9,665 images, with major contributions from cities such as Tyrol (2,999), Tripoli (1,078), Kherson (1,053),
Donetsk (999), Mekelle (951), Mykolaiv (739), and Kharkiv (602). An example of annotated building
segmentation from the dataset is shown in Figure 4.</p>
        <p>Fine-tuning was carried out with the YOLO11 implementation provided by the open-source ultralytics
library [28], executed on an Ubuntu environment with an Nvidia MSI RTX 3060 GPU. The training
targeted only building segmentation, using the default YOLO11 configuration with 100 epochs and an
input resolution of 640 pixels.</p>
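        <p>For reference, the fine-tuning run can be expressed with the ultralytics API roughly as follows; the
checkpoint name and the dataset YAML are assumptions standing in for the building-segmentation
export used in this study:</p>
        <preformat>
from ultralytics import YOLO

# Start from a pretrained YOLO11 segmentation checkpoint and fine-tune on buildings only
model = YOLO("yolo11n-seg.pt")
model.train(data="buildings.yaml", epochs=100, imgsz=640)  # otherwise default configuration
metrics = model.val()  # reports precision, recall, and mAP on the held-out split
</preformat>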
        <p>Model performance was evaluated using mAP, Precision, Recall, and F1-score. Among these, the
F1-score was emphasized as the primary metric due to its balanced consideration of false negatives
(missed buildings) and false positives (incorrect detections). To ensure reliability, results were averaged
(Avg) and standard deviations (Std) were calculated across seven randomized 80/20 splits of the dataset
into training and testing subsets.</p>
        <p>Figure 5 presents an example of segmentation output (original image, true segmentation, and YOLO
segmentation) from the test set used during YOLO11 fine-tuning, while Figure 6 demonstrates detection
and segmentation performance on an image from the VPAIR dataset.</p>
        <p>Training optimized the standard YOLO11 objective, which combines four loss components:
• box_loss – emphasizes the accurate placement of bounding boxes around detected objects
(weight coefficient: 7.5);
• seg_loss – enforces accurate delineation of object segmentation masks (weight: 7.5);
• cls_loss – accounts for correct classification of detected objects (weight: 0.5);
• dfl_loss – improves discrimination between visually similar or ambiguous objects by
emphasizing distinctive features (weight: 1.5).</p>
        <p>Figure 7 shows the resulting training and validation loss curves.</p>
        <p>The loss curves demonstrate the model’s ability to efectively learn from the training data and
generalize to unseen examples. This is reflected in the steady decrease of training loss and the subsequent
stabilization of validation loss, indicating convergence without significant overfitting. Figure 8 presents
the confusion matrix, with true negatives omitted, as they are not well defined in the context
of image segmentation.</p>
        <p>Table 1 reports the evaluation metrics for the YOLO11 model fine-tuned on the building segmentation
dataset. The results include mean Average Precision (mAP), Precision, Recall, and F1-score, calculated
across seven distinct train/test splits. To assess robustness, both average (Avg) values and standard
deviations (Std) are provided, reflecting the stability of the model’s performance.</p>
        <p>Figure 9 shows the Precision–Recall (PR) curve for the best-performing split, with an achieved area
under the curve (AUC [29]) of 0.76. The PR curve is especially informative in segmentation tasks, as
it highlights the model’s ability to correctly detect positive samples (buildings) under conditions of
significant class imbalance between foreground structures and background regions.</p>
        <p>For building segmentation tasks involving partially occluded structures or complex architectural
outlines, the fine-tuned YOLO11 model achieved an F1-score of 0.722 on the test set. This result is
particularly encouraging given the additional advantage of real-time inference inherent to
YOLO-based architectures. The outcome indicates that the model can reliably detect buildings, a capability
essential for generating vector data required in UAV-based global localization pipelines. Moreover, the
standard deviation of all evaluation metrics remained below 0.5% across both training and testing splits,
confirming the robustness and stability of model performance. Future research may explore architectural
modifications to YOLO11 and fine-tuning of hyperparameters to further improve segmentation accuracy,
with a focus on challenging building recognition scenarios.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Visual place recognition with image preprocessing method</title>
        <p>To validate the proposed method for aligning statistical distributions between UAV and satellite images,
experiments were conducted using the VPAIR dataset [16]. This dataset was specifically designed for
UAV localization under challenging real-world conditions. Data were collected during a 107-kilometer
flight from Bonn into the mountainous Eifel region of Germany on October 13, 2020. The imagery
spans diverse terrain types, including urban areas, agricultural fields, and forests. Data acquisition was
performed with a single-lens color camera at a resolution of 1600× 1200 pixels, later downsampled to
800× 600 pixels for dataset inclusion. Each image is paired with highly accurate GNSS/INS ground truth
(rotational error: 0.05∘ , positional accuracy: &lt;1 m). The dataset consists of 2,706 UAV query images, an
equal number of corresponding satellite references, and an additional 10,000 distractor images collected
near Düsseldorf.</p>
        <p>Since the original dataset did not provide terrain-type annotations, a new classification scheme was
introduced, grouping images into four categories: urban (dominated by buildings and roads), field,
forest, and water.</p>
        <p>Validation of the proposed distribution alignment method was performed using the Recall@1 metric
with a localization radius of 3. Experiments evaluated performance across all terrain categories, with a
particular focus on urban environments, reflecting the specialization of the fine-tuned YOLO11 model
on building segmentation for feature extraction.</p>
        <p>Three comparative experiments were carried out to benchmark the proposed approach against
state-of-the-art methods such as CosPlace:
• using the original, unprocessed VPAIR images;
• using grayscale-converted images;
• using UAV imagery preprocessed with the proposed averaged cumulative distribution function
(CDF) alignment method.</p>
        <p>Figure 10 illustrates the visual outcomes of the CDF-based preprocessing, showing UAV images
transformed to statistically match the distribution of corresponding satellite imagery.</p>
        <p>Table 2 provides a concise yet comprehensive overview of how each stage of the evaluation pipeline
influences final localization accuracy. For every UAV query image in the VPAIR dataset [16], the
following sequence was executed under three color-handling variants, after which Recall@1 with a
localization radius of 3 was recorded:
• color handling – retain the original RGB image, convert to single-channel grayscale, or apply the
proposed averaged cumulative distribution function (CDF) transfer to align UAV pixel statistics
with the satellite domain;
• embedding extraction – process the pre-processed image using either CosPlace or the fine-tuned</p>
        <p>YOLO11 model;
• nearest-neighbor retrieval – perform L2-based similarity search against the 2,706-image satellite
reference set, selecting the closest match as the predicted location;
• localization test – classify the prediction as correct if the matched satellite tile lies within three
reference frames of the ground-truth tile; otherwise mark as incorrect;
• metric aggregation – compute Recall@1 for each terrain category (Urban, Field, Forest, Water) by
aggregating correct matches over all 2,706 queries.
Using the proposed averaged cumulative distribution function, the per-terrain Recall@1 scores reported
in Table 2 are:
Method | Urban | Field | Forest | Water
CosPlace [17] | 0.145 | 0.108 | 0.134 | 0.374
YOLO (Ours) | 0.195 | 0.068 | 0.070 | 0.545</p>
        <p>The results demonstrate that CNN-based localization methods (CosPlace, YOLO11) benefit from the
proposed preprocessing strategy, confirming its effectiveness in improving UAV global localization
accuracy. While CosPlace performs better across non-urban terrains such as fields, forests, and water,
the YOLO11-based approach achieves superior results in urban settings, reflecting its fine-tuning on
building segmentation data.</p>
        <p>In urban scenarios, the proposed method achieved a Recall@1 score of 0.195 (19.5%) within a
localization radius of 3, a promising outcome given the inherent difficulty of sustaining high accuracy at
top ranks in UAV global localization. This performance not only underscores the robustness of the
YOLO11-based technique but also demonstrates its competitiveness against established approaches
such as CosPlace.</p>
        <p>Nevertheless, the method has certain limitations. Its applicability is currently constrained to urban
areas under daytime and favorable weather conditions, and it requires prior availability of satellite
imagery along predefined UAV flight routes.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The primary objective of this study was successfully achieved: the development of a robust UAV–
satellite image matching framework that leverages deep embeddings and color normalization to improve
precision and robustness under challenging cross-view and urban conditions. Fine-tuning the YOLO11
model on a dataset of segmented buildings enabled the extraction of vector representations that
substantially enhanced UAV-to-satellite image matching accuracy.</p>
      <p>The proposed preprocessing strategy, based on aligning statistical distributions between UAV and
satellite imagery, further improved visual consistency, outperforming established methods such as
CosPlace, particularly in urban terrain. Quantitative results confirm this advantage, with the model
achieving an F1-score of 0.722 for building segmentation and a Recall@1 of 19.5% within a localization
radius of 3, surpassing existing benchmarks for urban UAV localization.</p>
      <p>Key strengths of this work include the ability to achieve accurate and efficient UAV localization in
GPS-denied environments, supported by YOLO11’s inherent real-time inference capability. This makes
the approach highly relevant for practical applications such as emergency response, urban surveillance,
and infrastructure monitoring.</p>
      <p>Despite these advances, several limitations remain. The method is currently optimized for urban
scenarios under daytime and favorable weather conditions, and its effectiveness depends on the
availability of pre-collected satellite imagery aligned with potential UAV flight paths. To mitigate seasonal
variability, reference datasets should ideally be captured during late spring, summer, or early autumn,
when environmental appearance is most stable. Additionally, to ensure robust feature extraction,
imagery should meet a minimum resolution of 640×640 pixels, consistent with the YOLO input
requirements; this also normalizes resolution differences between UAV and satellite imagery.</p>
      <p>Future work will focus on extending the versatility and precision of the YOLO11-based pipeline
through advanced augmentation strategies, architectural refinements, and hyperparameter optimization.
Expanding localization capabilities beyond urban settings to include forests, agricultural regions, and
aquatic environments will further enhance adaptability, broadening the practical scope of UAV mission
scenarios.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <sec id="sec-6-1">
        <title>The authors have not employed any Generative AI tools.</title>
        <p>[14] R. Zhu, L. Yin, M. Yang, F. Wu, Y. Yang, W. Hu, Sues-200: a multi-height multi-scene cross-view
image benchmark across drone and satellite, IEEE Trans. Circuits Syst. Video Technol. 33 (2023)
4825–4839. doi:10.1109/TCSVT.2023.3249204.
[15] I. Cisneros, P. Yin, J. Zhang, H. Choset, S. Scherer, Alto: a large-scale dataset for uav visual
place recognition and localization, arXiv preprint arXiv:2207.12317 (2022). doi:10.48550/arXiv.
2207.12317.
[16] M. Schleiss, F. Rouatbi, D. Cremers, Vpair: aerial visual place recognition and localization in
large-scale outdoor environments, arXiv preprint arXiv:2205.11567 (2022). doi:10.48550/arXiv.
2205.11567.
[17] G. Berton, C. Masone, B. Caputo, Re-thinking visual geo-localization for large-scale applications,
in: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 4878–4888.
[18] A. Ali-bey, B. Chaib-draa, P. Giguère, Mixvpr: feature mixing for visual place recognition, in: Proc.</p>
        <p>IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), 2023, pp. 2998–3007.
[19] A. Ali-bey, B. Chaib-draa, P. Giguère, Gsv-cities: toward appropriate supervised visual place
recognition, Neurocomputing 513 (2022) 194–203. doi:10.1016/j.neucom.2022.09.127.
[20] J. Sun, Z. Shen, Y. Wang, H. Bao, X. Zhou, Loftr: detector-free local feature matching with
transformers, in: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 8922–
8931.
[21] N. Keetha, R. Arief, S. Adhikari, et al., Anyloc: towards universal visual place recognition, IEEE</p>
        <p>Robot. Autom. Lett. 9 (2024) 1286–1293. doi:10.1109/LRA.2023.3343602.
[22] M. Oquab, T. Darcet, T. Moutakanni, et al., Dinov2: learning robust visual features without
supervision, arXiv preprint arXiv:2304.07193 (2023). doi:10.48550/arXiv.2304.07193.
[23] J. Shao, L. Jiang, Style alignment-based dynamic observation method for uav-view geo-localization,</p>
        <p>IEEE Trans. Geosci. Remote Sens. 61 (2023) 1–14. doi:10.1109/TGRS.2023.3337383.
[24] O. Rainio, J. Teuho, R. Klén, Evaluation metrics and statistical tests for machine learning, Sci. Rep.</p>
        <p>14 (2024) 6086. doi:10.1038/s41598-024-56706-x.
[25] M. Zafar, S. Ehsan, L. Momeni, et al., Vpr-bench: an open-source visual place recognition
evaluation framework with quantifiable viewpoint and appearance change, Int. J. Comput. Vis.
129 (2021) 2136–2174. doi:10.1007/s11263-021-01469-5.
[26] Ultralytics, YOLO11, Available at: https://docs.ultralytics.com/models/yolo11. Accessed
04.05.2025.
[27] Roboflow, Buildings instance segmentation – v1 raw-images, Available at: https://universe.
roboflow.com/roboflow-universe-projects/buildings-instance-segmentation/dataset/1.
Accessed 04.05.2025.
[28] G. Jocher, J. Qiu, A. Chaurasia, Ultralytics yolo, github repository, Available at: https://github.com/
ultralytics/ultralytics, 2023. Accessed 04.05.2025.
[29] Ş. K. Çorbacıoğlu, G. Aksel, Receiver operating characteristic curve analysis in diagnostic accuracy
studies: a guide to interpreting the area under the curve value, Turk. J. Emerg. Med. 23 (2023)
182–187. doi:10.4103/tjem.tjem_182_23.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.</given-names>
            <surname>Melnychenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Scislo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Savenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sachenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Radiuk</surname>
          </string-name>
          ,
          <article-title>Intelligent integrated system for fruit detection using multi-UAV imaging and deep learning</article-title>
          ,
          <source>Sensors</source>
          <volume>24</volume>
          (
          <year>2024</year>
          )
          <year>1913</year>
          . doi:10.3390/s24061913.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Lightweight visual localization algorithm for uavs</article-title>
          ,
          <source>Sci. Rep</source>
          .
          <volume>15</volume>
          (
          <year>2025</year>
          )
          <article-title>6069</article-title>
          . doi:10.1038/s41598-025-88089-y.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>A novel geo-localization method for uav and satellite images using cross-view consistent attention</article-title>
          ,
          <source>Remote Sens</source>
          .
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <article-title>19</article-title>
          . doi:10.3390/rs15194667.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          , E. Zheng,
          <article-title>Uav geo-localization dataset and method based on cross-view matching</article-title>
          ,
          <source>Sensors</source>
          <volume>24</volume>
          (
          <year>2024</year>
          )
          <article-title>6905</article-title>
          . doi:10.3390/s24216905.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Zalutska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Molchanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sobko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Mazurets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pasichnyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Barmak</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Krak</surname>
          </string-name>
          ,
          <article-title>Method for sentiment analysis of Ukrainian-language reviews in e-commerce using RoBERTa neural network</article-title>
          ,
          <source>in: Proceedings of the 7th International Conference on Computational Linguistics and Intelligent Systems (CoLInS 2023), Volume I: Machine Learning Workshop</source>
          , volume
          <volume>3387</volume>
          , CEUR-WS.org, Aachen,
          <year>2023</year>
          , pp.
          <fpage>344</fpage>
          -
          <lpage>356</lpage>
          . URL: https://ceur-ws.org/Vol-3387/paper26.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Melnychenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Savenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Radiuk</surname>
          </string-name>
          ,
          <article-title>Apple detection with occlusions using modified YOLOv5- v1</article-title>
          ,
          <source>in: Proceedings of the 12th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS</source>
          <year>2023</year>
          ), IEEE, New York, NY, USA,
          <year>2023</year>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>112</lpage>
          . doi:10.1109/IDAACS58523.2023.10348779.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>You only look once: unified, real-time object detection</article-title>
          ,
          <source>in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit</source>
          .
          <source>(CVPR)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          . doi:10.1109/CVPR.2016.91.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cummins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <article-title>Fab-map: probabilistic localization and mapping in the space of appearance</article-title>
          ,
          <source>Int. J. Robot. Res</source>
          .
          <volume>27</volume>
          (
          <year>2008</year>
          )
          <fpage>647</fpage>
          -
          <lpage>665</lpage>
          . doi:10.1177/0278364908090961.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gálvez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Tardós</surname>
          </string-name>
          ,
          <article-title>Bags of binary words for fast place recognition in image sequences</article-title>
          ,
          <source>IEEE Trans. Robot</source>
          .
          <volume>28</volume>
          (
          <year>2012</year>
          )
          <fpage>1188</fpage>
          -
          <lpage>1197</lpage>
          . doi:10.1109/TRO.2012.2197158.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Arandjelovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gronat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pajdla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sivic</surname>
          </string-name>
          ,
          <article-title>NetVLAD: CNN architecture for weakly supervised place recognition</article-title>
          ,
          <source>in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit</source>
          .
          <source>(CVPR)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>5297</fpage>
          -
          <lpage>5307</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hausler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Milford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <article-title>Patch-netvlad: multi-scale fusion of locally-global descriptors for place recognition</article-title>
          ,
          <source>in: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit</source>
          .
          <source>(CVPR)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>14141</fpage>
          -
          <lpage>14152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , University-1652:
          <article-title>a multi-view multi-source benchmark for drone-based geo-localization</article-title>
          ,
          <source>in: Proc. 28th ACM Int. Conf. Multimedia</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1395</fpage>
          -
          <lpage>1403</lpage>
          . doi:10.1145/3394171.3413896.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Vigor: cross-view image geo-localization beyond one-to-one retrieval</article-title>
          ,
          <source>in: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit</source>
          .
          <source>(CVPR)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3640</fpage>
          -
          <lpage>3649</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>