<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CITI</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Image matching of satellite and UAV for visual place recognition using YOLO</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Volodymyr Vozniak</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksander Barmak</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iurii Krak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Glushkov Cybernetics Institute</institution>
          ,
          <addr-line>40, Glushkov Ave., Kyiv, 03187</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Khmelnytskyi National University</institution>
          ,
          <addr-line>11, Institutes str., Khmelnytskyi, 29016</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>64/13, Volodymyrska str., Kyiv, 01601</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>3</volume>
      <fpage>11</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Determining the location of UAVs under challenging conditions plays a crucial role in various applications. In this research, we aim to enhance the accuracy of CNN-based methods for global UAV localization in urban environments using models capable of rapid real-time image processing. By utilizing the YOLO11 model, fine-tuned on a dataset of segmented buildings (achieving an F1-score of 0.722), and employing a proposed statistical distribution alignment method to increase visual similarity between satellite and UAV images, we obtained a Recall@1 metric value of 0.195 with a localization radius of 3 for UAV global localization in urban areas. This result surpasses those obtained by existing CNN-based methods. The outcomes indicate that employing YOLO-generated image vector representations combined with our image preprocessing approach is promising for UAV global localization, with significant potential for further improvement.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual Place Recognition (VPR)</kwd>
        <kwd>UAV</kwd>
        <kwd>YOLO</kwd>
        <kwd>image preprocessing</kwd>
        <kwd>deep learning</kwd>
        <kwd>image segmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>limitations, and proposes a method for aligning statistical distributions between satellite imagery and
UAV images to achieve visual similarity. Section 4 details the datasets used, metrics for method
evaluation, and experimental methodology. Section 5 presents results, discusses the fine-tuned YOLO
model using the dataset with segmented buildings, and compares results for UAV location
determination against existing methods.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <p>
        In the task of automatic location determination (Visual Place Recognition, VPR), recent years have
seen significant methodological progress driven by the extensive adoption of deep learning methods
adapted to varying environmental conditions and different imaging angles. Primary approaches rely
on global descriptors derived from convolutional neural networks (CNNs). These methods involve
neural network-generated image features aggregated into a single compact vector representation,
achieving an optimal balance between matching accuracy and retrieval speed. This contrasts with
earlier popular methods relying on local descriptors, which are less robust in matching, especially
when dealing with large datasets; such methods are unsuitable for scenarios with limited
computational resources and prone to instability under severe lighting or seasonal variations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        A literature analysis indicates that the high performance of most modern approaches can be
attributed to the growing size of training datasets designed explicitly for visual localization [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. NetVLAD [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] combines local feature encoding with the VLAD training layer to create robust
global descriptors. Although trained on the extensive Pitts-250k dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], NetVLAD descriptors are
high-dimensional (tens of thousands), necessitating increased computational memory storage and
potentially hindering real-time UAV applications [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [8]. Recent variants like Patch-NetVLAD [9]
incorporate multi-scale feature aggregation to enhance viewpoint robustness, yet the aggregated
representations remain relatively large.
      </p>
      <p>Several studies explored lightweight or multimodal representations. MinkLoc [10] employs sparse
3D convolutions for robust large-scale place recognition, particularly effective with LiDAR or
depthmap data. However, due to payload and power limitations, supplementary sensors may be
impractical for many UAV platforms. Recently, CosPlace [11] combined classification-based learning
using the San Francisco XL dataset, containing 40 million GPS-tagged directional images.
Subsequently, MixVPR [12] introduced an MLP-based feature mixer trained on GSV-Cities [13], a
specialized dataset comprising 530,000 images from 62,000 global locations. These examples
underline the significant volume of specialized datasets utilized in recent studies.</p>
      <p>Advances in attention-based architectures have recently introduced several transformer-based
methods utilizing attention mechanisms for image feature matching. LoFTR [14] matches local
features without detectors, exhibiting robustness to moderate viewpoint changes [15]. AnyLoc [16]
expands transformer paradigms by integrating self-supervised features from DinoV2 [17] and
combining global and local attention modules within a unified framework for place recognition.
Despite considerable metric improvements on various VPR benchmarks, transformer architectures’
main disadvantages remain their high hardware requirements and low prediction speed. For
example, AnyLoc’s heavy ViT-Large architecture and 32,000-dimensional VLAD image feature limit
its practical application for real-time tasks on resource-constrained devices such as UAVs.</p>
      <p>Although computationally efficient, CNN-based visual place recognition methods require training
on large, specialized datasets. VLAD-feature methods deliver enhanced performance but require
significant memory due to high-dimensional vectors. Finally, transformer-based methods provide
high-quality and versatile vector features that are usable without additional fine-tuning but remain
considerably slower than CNN models, complicating deployment on UAV platforms.</p>
      <p>Surprisingly, YOLO [18], a CNN-based model renowned for real-time efficiency and high
accuracy, has not yet gained popularity in VPR.</p>
      <p>Therefore, this study aims to enhance the accuracy of CNN-based methods for UAV global
location determination in urban environments using models capable of high-speed real-time image
processing.</p>
      <p>To achieve this goal, we have formulated the following tasks:
 fine-tune the YOLO11 model [19] on a dataset with segmented buildings to obtain vector
representations for UAV global location determination under challenging conditions;
 develop a method for aligning the statistical distributions of satellite imagery and UAV
images to achieve visual similarity;
 compare results obtained by the proposed method against existing CNN-based methods
across different terrain types.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and methods</title>
      <p>Determining the location of Unmanned Aerial Vehicles (UAVs) is typically addressed through Visual
Place Recognition (VPR). The approach proposed in this study involves comparing a query image
obtained from a UAV with a database of georeferenced satellite images. The system determines the
UAV’s global location by identifying the most similar image (or set of images). Subsequently, a
method is applied to determine precise coordinates by aligning the query image with the
corresponding satellite image.</p>
      <p>Therefore, the considered task can be decomposed into the following sub-tasks:
 determining the global location: identifying the most similar satellite image (or tile) from an
extensive database;
 determining precise coordinates: calculating the exact position (latitude and longitude)
within the identified tile.</p>
      <p>Figure 1 illustrates the general processing scheme for determining UAV location coordinates.</p>
      <p>Using CNN architectures, this study explicitly addresses determining UAVs’ global location in
urban environments. Figure 2 shows a detailed scheme of the proposed process.</p>
      <p>We introduce the following notation. Let $q$ denote a query image obtained by the UAV at a
specific position and orientation, and let $R = \{r_1, r_2, \dots, r_n\}$ denote a set of pre-prepared
images with known coordinates (usually satellite images).</p>
      <p>Additionally, we define the following mapping for obtaining vector representations (feature
vectors) of images:</p>
      <p>$$F : I \rightarrow V,$$ (1)
where $I$ is an image and $V$ is its feature vector.</p>
      <p>Thus, the general task of determining the global UAV location involves identifying the image
$r_j \in R$ whose spatial location $l(r_j)$ is closest to that of the query image, $l(q)$. Formally,
this can be expressed as:
$$k = \underset{1 \le j \le n}{\arg\min}\, d(F(q), F(r_j)),$$ (2)
where $d(\cdot, \cdot)$ is a metric (e.g., Euclidean distance) operating on images represented as
feature vectors and $k$ is the index of the identified satellite image.</p>
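      <p>For illustration, the search in (2) reduces to a nearest-neighbor query over precomputed feature vectors. Below is a minimal NumPy sketch; the array shapes and names are ours, not from the study:</p>
      <preformat>
import numpy as np

def find_global_location(q_vec, ref_vecs):
    """Return the index k of the reference image whose feature vector
    is closest to the query vector in Euclidean distance, as in (2)."""
    # ref_vecs: (n, d) matrix of satellite-image features; q_vec: (d,) query feature
    dists = np.linalg.norm(ref_vecs - q_vec, axis=1)
    return int(np.argmin(dists))
      </preformat>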
      <p>A database of satellite images along a predefined route must be available to determine the UAV's
global location. Using deep learning models, image vector representations or feature sets can be
obtained (1). In this study, the YOLO11 model is employed and fine-tuned on a dataset of segmented
buildings. The image representation vector is obtained from the final backbone layer, which contains
the most comprehensive information about the image derived from YOLO11’s convolutional layers.
Importantly, the same model produces the vector representations of both the satellite images and the
UAV images. Offline, the fine-tuned YOLO11 model generates vector representations for the satellite
image database, facilitating UAV global location determination. In real-time mode, UAV images are
fed into the YOLO11 model to obtain their vector representations (1), subsequently identifying the
most suitable satellite image (2) by minimizing a similarity search function (e.g., Euclidean distance).
The matched satellite image thus represents the UAV’s global location.</p>
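      <p>As a sketch of this pipeline (not the exact code of this study), vector representations could be obtained with the ultralytics Python API; the embed call's extraction layer and the weight-file name below are assumptions:</p>
      <preformat>
import numpy as np
from ultralytics import YOLO

# Fine-tuned segmentation weights (file name is hypothetical).
model = YOLO("yolo11_buildings.pt")

# `embed` returns per-image feature vectors from a late layer; the exact
# layer and behaviour depend on the ultralytics version (assumption).
tile_paths = ["tile_001.png", "tile_002.png", "tile_003.png"]  # satellite database
sat_vecs = np.stack([model.embed(p)[0].cpu().numpy() for p in tile_paths])

uav_vec = model.embed("uav_frame.jpg")[0].cpu().numpy()  # query, computed in flight
k = int(np.argmin(np.linalg.norm(sat_vecs - uav_vec, axis=1)))  # matched tile, as in (2)
      </preformat>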
      <p>To enhance the visual similarity between UAV and satellite images, we propose a method that
aligns their statistical distributions. Unlike methods that directly apply cumulative distribution
functions (CDF) from satellite to UAV images [20] (which may cause distortions due to distribution
mismatch), our approach calculates transformations individually for each UAV frame, considering its
unique distribution.</p>
      <p>The method leverages probability theory: given a random variable $\xi$ with a known distribution
function $F_\xi(x)$, applying the distribution function to the variable itself yields a uniformly
distributed random variable $\gamma = F_\xi(\xi)$. Conversely, applying the inverse distribution
function $F_\xi^{-1}$ to a uniformly distributed variable yields a variable with the distribution
$F_\xi(x)$. Satellite images collected by the same device share consistent color characteristics,
whereas UAV images, acquired under different conditions, exhibit distinct statistical properties.
Pixel intensities can thus be treated as discrete random variables: intensities from UAV and satellite
images yield random variables $Y$ and $X$, respectively. Precisely calculating $F_X(x)$ from the
satellite dataset and estimating $F_Y(y)$ for each UAV image then allows distribution alignment via
the transformation $F_X^{-1}(F_Y(y))$. Figure 3 shows a diagram of the proposed method.</p>
      <p>The proposed method comprises two stages:
1. Computing the averaged cumulative distribution function from satellite images;
2. Applying this function individually to UAV images.</p>
      <p>Algorithm 1 provides pseudocode for computing the averaged cumulative distribution function
from satellite images, and Algorithm 2 details applying this averaged function individually to UAV
images.</p>
      <p>Algorithm 1: Computing the averaged cumulative distribution function from satellite images.</p>
      <p>Input: $R = \{r_1, r_2, \dots, r_n\}$ – $n$ satellite images.</p>
      <p>Output: $\hat{F}_c(x)$ – the averaged cumulative distribution function of the satellite images
for each color channel $c$, $x \in [0; 255]$.</p>
      <p>For $i$ in $1..n$:
Step 1. Compute the normalized histogram (probability density function) of the $i$-th image for each
color channel $c \in \{R, G, B\}$.
Step 2. Accumulate it into the per-image cumulative distribution function.
End. $\hat{F}_c(x)$ is obtained by averaging the per-image cumulative distribution functions over all
$n$ images.</p>
      <p>Algorithm 2: Applying the averaged cumulative distribution function of satellite images
individually to UAV images.</p>
      <p>Input: $Q = \{q_1, q_2, \dots, q_m\}$ – $m$ UAV images.</p>
      <p>Output: $M_{c,j}(y)$ – the function that transforms the input pixels of the $j$-th UAV image for
the given color channel $c$, $y \in [0; 255]$.</p>
      <p>For $j$ in $1..m$:
Step 1. Compute the normalized histogram (probability density function) of the $j$-th image for each
color channel $c \in \{R, G, B\}$ and accumulate it into the cumulative distribution function
$G_{c,j}(y)$.
Step 2. Map each intensity through the inverse of the averaged satellite distribution:
$$M_{c,j}(y) = \hat{F}_c^{-1}(G_{c,j}(y)),$$
where $\hat{F}_c^{-1}$ is the inverse of the averaged cumulative distribution function for satellite
images in channel $c$. Because $\hat{F}_c^{-1}$ may be non-analytical, interpolation is used as an
approximation:
$$M_{c,j}(y) = \mathrm{interp}(G_{c,j}(y), \hat{F}_c(z), z),$$
where $\mathrm{interp}$ finds, for each value $G_{c,j}(y)$, the corresponding $z$ such that
$\hat{F}_c(z) \approx G_{c,j}(y)$.
End.</p>
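      <p>A minimal NumPy sketch of both algorithms, assuming 8-bit RGB arrays; the helper names are ours, and the per-image averaging follows the reconstruction above:</p>
      <preformat>
import numpy as np

def averaged_satellite_cdf(sat_images):
    """Algorithm 1: average the per-channel CDFs of n satellite images."""
    cdfs = np.zeros((3, 256))
    for img in sat_images:                    # img: HxWx3 uint8 array
        for c in range(3):
            hist = np.bincount(img[..., c].ravel(), minlength=256)
            cdfs[c] += np.cumsum(hist / hist.sum())  # per-image CDF
    return cdfs / len(sat_images)             # averaged CDF, one row per channel

def align_uav_image(uav_img, avg_cdf):
    """Algorithm 2: map each UAV channel through F_c^-1(G_c(y)) by interpolation."""
    out = np.empty_like(uav_img)
    z = np.arange(256)
    for c in range(3):
        hist = np.bincount(uav_img[..., c].ravel(), minlength=256)
        g = np.cumsum(hist / hist.sum())      # UAV CDF G_c(y)
        m = np.interp(g, avg_cdf[c], z)       # M_c(y) = interp(G_c(y), F_c(z), z)
        out[..., c] = np.round(m)[uav_img[..., c]].astype(np.uint8)
    return out
      </preformat>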
      <p>Histogram alignment ensures the similarity of pixel intensity distributions between UAV and
satellite images, adjusting brightness, contrast, and tonal characteristics, thus reducing intra-class
variation and enhancing feature matching and recognition tasks.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental setup</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
      </sec>
      <sec id="sec-4-2">
        <title>4.1.1. Dataset for YOLO fine-tuning</title>
        <p>Since the default YOLO model does not include building recognition among its classes, this research
proposes fine-tuning the YOLO11 model [19] on a dataset of segmented buildings [21] containing
only one class – buildings. This dataset consists of 9665 images, predominantly from the following
cities: Tyrol (2999), Tripoli (1078), Kherson (1053), Donetsk (999), Mekelle (951), Mykolaiv (739), and
Kharkiv (602). Figure 4 shows an example of segmented buildings from an image of the training
dataset.</p>
      </sec>
      <sec id="sec-4-2-1">
        <title>4.1.2. VPAir</title>
        <p>To validate the proposed method for aligning the statistical distributions of satellite and UAV
images, the VPAir dataset [22] was selected. This dataset was explicitly created for the challenge of
UAV localization under complex conditions. Data were collected during a flight between Bonn and
the mountainous Eifel region in Germany, spanning 107 kilometers and encompassing diverse
landscapes, including urban, agricultural, and forested areas. Data collection took place on October
13, 2020, using a lightweight aircraft flying at altitudes between 300–400 meters. The equipment
included a single-lens color camera (resolution 1600x1200 pixels, reduced to 800x600 pixels for the
dataset) and a GNSS/INS system providing highly accurate 6-DoF positions (rotation error: 0.05°,
positional error: &lt;1 meter). The dataset comprises 2706 query images captured from the aircraft, 2706
corresponding satellite database images, and 10,000 distractor images from another geographical
region near Düsseldorf.</p>
        <p>Although the authors of [22] provide metrics comparing different algorithms across various
terrain types, this annotation is not publicly available. Therefore, we proposed our annotation of
terrain types for this dataset, classifying them into four categories: urban (dominated by buildings
and streets), field, forest, and water.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.2. Evaluation</title>
        <p>The following metrics are commonly used to evaluate image segmentation model performance: mAP,
Precision, Recall, and F1-score. Precision, Recall, and F1-score [23] are standard metrics applied
broadly in machine learning tasks, ranging from binary classification to image segmentation. A
distinctive feature of segmentation tasks is the absence of true-negative counts in the confusion
matrix, which does not hinder the computation of these metrics.</p>
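        <p>Because only TP, FP, and FN are needed, these metrics can be computed directly from the confusion counts; the function below is a standard formulation shown for clarity, not code from the study:</p>
        <preformat>
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1 from confusion counts; no TN term is required."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
        </preformat>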
        <p>Additionally, mean Average Precision (mAP) deserves separate mention. Formally, it can be
expressed as follows:
$$\mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{AP}_c,$$
where $C$ is the number of classes and $\mathrm{AP}_c$ is the average precision for class $c$.</p>
        <p>$\mathrm{AP}_c$ can also be interpreted as the area under the Precision-Recall curve for class
$c$. Since this study involves segmentation with a single class (buildings), here
$\mathrm{mAP} = \mathrm{AP}_b$.</p>
        <p>Recall@N [24] is commonly used to evaluate localization methods. In this metric, search results
are considered true-positive for a given query if the corresponding image from the database is within
the top N retrieved results:</p>
        <p>$$\mathrm{Recall@}N = \frac{M_Q}{N_Q},$$ (14)
where $N_Q$ is the total number of query images, and $M_Q$ is the number of queries with at least one
correct match within the top-$N$ results.</p>
        <p>This metric is popular within computer vision communities and is suitable for scenarios where
subsequent processing may further filter false-positive matches.</p>
        <p>A modified definition considers a localization radius, meaning a result is true-positive if the
distance between the query and matched database images is within a predefined radius. This radius
can be specified either in meters or in a number of frames (as satellite images are often stored
sequentially). Such a metric is beneficial in scenarios with overlapping reference images, enabling
precise UAV localization even if an exact image match is not found.</p>
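        <p>A short sketch of Recall@N with a frame-index localization radius, assuming reference images are stored sequentially; the helper below is illustrative, not the benchmark's implementation:</p>
        <preformat>
def recall_at_n(retrieved, ground_truth, n=1, radius=3):
    """retrieved[i] is the ranked list of database indices for query i;
    ground_truth[i] is the true database index. A query is a hit if any
    of its top-n results lies within `radius` frames of the truth."""
    hits = sum(
        any(abs(idx - gt) &lt;= radius for idx in ranks[:n])
        for ranks, gt in zip(retrieved, ground_truth)
    )
    return hits / len(ground_truth)
        </preformat>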
      </sec>
      <sec id="sec-4-4">
        <title>4.3. Experimental methodology</title>
      </sec>
      <sec id="sec-4-5">
        <title>4.3.1. YOLO fine-tuning</title>
        <p>The YOLO11 model was fine-tuned on the segmented buildings dataset using the open-source Python
library ultralytics [25] on Ubuntu OS with an Nvidia MSI RTX 3060 graphics card. Only one
segmentation class (Buildings) was used. The default YOLO11 architecture was employed with 100
epochs and an image size of 640 pixels.</p>
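        <p>For reference, fine-tuning with ultralytics follows the pattern below; the dataset YAML and checkpoint variant are assumptions, while the epoch count and image size match the setup above:</p>
        <preformat>
from ultralytics import YOLO

# Start from a pretrained YOLO11 segmentation checkpoint (variant assumed).
model = YOLO("yolo11n-seg.pt")

# Fine-tune on the single-class (Buildings) dataset with the default
# architecture, 100 epochs, and 640-pixel images, as described in the text.
model.train(data="buildings-seg.yaml", epochs=100, imgsz=640)

metrics = model.val()  # mAP, Precision, Recall on the held-out split
        </preformat>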
        <p>To evaluate the performance of the fine-tuned YOLO model, mAP, Precision, Recall, and F1-score
metrics were calculated. F1-score was chosen as the primary metric for this task since both missing a
building when present and incorrectly identifying a non-building as a building constitute equally
adverse segmentation outcomes. Each metric was computed by performing seven random splits of
the dataset into training and testing subsets in an 80%/20% ratio, subsequently calculating average
values (Avg) and standard deviations (Std) for model stability assessment.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.3.2. Image preprocessing method validation</title>
        <p>The proposed method was validated by aligning statistical distributions between satellite and UAV
images for determining UAV global location using the VPAir dataset [22]. The Recall@1 (14) metric
was chosen for evaluation, with a localization radius of 3. This metric was computed across various
terrain types: urban, field, forest, and water. This study considered urban terrain the primary
environment, as the YOLO11 model was fine-tuned, specifically using segmented building images for
generating vector representations of both satellite and UAV images.</p>
        <p>Three experiments were conducted to compare the proposed method (using vector
representations from YOLO11) with known UAV global localization methods (CosPlace [11]):
experiments using unprocessed VPAir dataset [22] images, experiments using grayscale conversions
of these images, and experiments employing the proposed image preprocessing method.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and discussion</title>
      <sec id="sec-5-1">
        <title>5.1. YOLO fine-tuning results</title>
        <p>Architecturally, the YOLO model family employs a weighted sum of various loss functions. For image
segmentation tasks, four specific loss functions are utilized:
 box_loss – emphasizes the accurate placement of bounding boxes around detected objects
(weight coefficient: 7.5);
 seg_loss – emphasizes the accurate placement of segmentation masks around segmented
objects (weight coefficient: 7.5);
 cls_loss – emphasizes the correct classification of objects (weight coefficient: 0.5);
 dfl_loss – emphasizes differentiating between objects that are visually similar or difficult to
distinguish by better capturing their unique features (weight coefficient: 1.5).</p>
        <p>The graphs indicate that the model effectively captures underlying patterns from the building
segmentation dataset, as the training loss consistently decreases and stabilizes on the validation
dataset.</p>
        <p>Figure 8 shows the confusion matrix, which does not include true-negative counts since these
cannot be clearly defined for segmentation tasks.</p>
        <p>Table 1 presents the evaluation metrics obtained from the fine-tuned YOLO11 model on the
segmented building dataset.
This table contains values for the metrics mAP, Precision, Recall, and F1-score, calculated across
seven distinct dataset splits into training and testing subsets. Average values (Avg) and standard
deviations (Std) are also provided to assess model stability.</p>
        <p>Figure 9 shows the Precision-Recall curve for the best-performing training and testing dataset
split, with an area under the curve (AUC) of 0.76. This curve is informative for image segmentation
tasks as it emphasizes accurately identifying the positive class (buildings) in scenarios of class
imbalance (buildings vs. background).</p>
        <p>In the context of building segmentation, especially on masked images where objects may be
partially obscured or possess complex contours, an F1-score of 0.722 for the test dataset can be
considered a positive outcome, especially given the real-time speed advantages inherent to the YOLO
model family [18]. This indicates that the trained model reliably identifies buildings, which is crucial
for generating the vector representations used for UAV global localization. Furthermore, the
standard deviation of each metric across both training and testing datasets is below 0.5%, indicating
consistent model performance. Future improvements could explore modifications to the YOLO11
neural network architecture and hyperparameter tuning to maximize the F1-score, specifically for
building segmentation tasks.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Image preprocessing method results</title>
        <p>Figure 10 illustrates the visual results of the proposed averaged cumulative distribution function
(CDF) method applied to UAV images, aligning their statistical distributions with those of satellite
images.</p>
        <p>Table 2 provides Recall@1 (14) metric results with a localization radius of 3 for CosPlace [11] and
the proposed UAV global localization method using YOLO11 across different terrain types from the
VPAir dataset [22], employing various image processing approaches: no color preprocessing, grayscale
preprocessing, and the proposed averaged cumulative distribution function.</p>
        <p>The results indicate that CNN-based methods (CosPlace [11], YOLO) perform better with the
proposed preprocessing method, confirming its capability to enhance UAV global localization
accuracy. Although CosPlace [11] performs better in fields, forests, and water terrains, the proposed
YOLO-based method achieves superior results in urban environments and the targeted terrain due to
YOLO11’s fine-tuning on segmented building data.</p>
        <p>Considering the inherent complexity of UAV global localization, where many approaches struggle
to maintain high accuracy at top ranks, a Recall@1 value of 0.195 (or 19.5%) with a localization radius
of 3 for urban terrain is a promising outcome. This demonstrates that the YOLO11-based method is
competitive and robust to variations in input data. Furthermore, the achieved result surpasses
existing methods like CosPlace [11], underscoring the effectiveness of the proposed method at this
stage of CNN-based localization research.</p>
        <p>The limitations of the proposed UAV global localization method include its applicability restricted
to urban terrains during daylight hours without extreme weather conditions, and reliance on a
predefined database of satellite images along potential UAV routes.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>The primary objective of this research was successfully achieved: enhancing the accuracy and
robustness of CNN-based methods for UAV global localization in challenging urban environments
using models capable of efficient real-time image processing. The fine-tuning of the YOLO11 model
on a dataset of segmented buildings yielded vector representations that significantly improved
matching accuracy between UAV images and satellite imagery.</p>
      <p>The proposed image preprocessing approach, based on aligning statistical distributions of satellite
imagery and UAV-acquired images, demonstrated clear advantages. It enhanced the visual similarity
necessary for precise localization, outperforming existing methods such as CosPlace [11],
particularly in urban and targeted terrain scenarios. Specifically, the obtained quantitative results
include an F1-score of 0.722, demonstrating reliable and consistent building segmentation
performance, crucial for effective vector representation generation. Furthermore, a Recall@1 metric
of 19.5% with a localization radius of 3 significantly surpasses existing benchmarks in urban terrains,
confirming the competitive advantage and robustness of the proposed method.</p>
      <p>Key benefits of this research include the ability to perform rapid and accurate UAV localization in
GPS-compromised environments, directly addressing critical limitations found in current
localization systems. The integration of YOLO11’s inherent real-time processing capabilities with
improved vector matching techniques represents a practical and efficient solution, particularly suited
to real-world applications involving time-sensitive operations, such as rescue missions, urban
monitoring, and infrastructure inspections.</p>
      <p>Despite these advances, the proposed method’s applicability currently remains focused primarily
on urban environments during optimal conditions (daylight and favorable weather). Additionally, its
effectiveness relies on the availability and quality of a predefined database of satellite images
corresponding to potential UAV operational routes.</p>
      <p>Future research directions should focus on further enhancing the generalizability and accuracy of
the YOLO11 model for broader segmentation and localization tasks through comprehensive image
augmentation, architecture modifications, and rigorous hyperparameter optimization. Expanding the
scope of localization capabilities to include diverse terrains (e.g., forested areas, fields, and water
bodies) would significantly enhance the versatility and operational value of this approach,
broadening its practical application spectrum across various UAV mission scenarios.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[8] N. Suenderhauf, et al., Place recognition with ConvNet landmarks: Viewpoint-robust, condition-robust, training-free, in: D. Hsu (Ed.), Robotics: Science and Systems XI, Robotics: Science and Systems Conference, 2015, pp. 1–10.</p>
      <p>[9] S. Hausler, S. Garg, M. Xu, M. Milford, T. Fischer, Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for Place Recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, IEEE Computer Society, 2021, pp. 14141–14152.</p>
      <p>[10] J. Komorowski, M. Wysoczańska, T. Trzcinski, MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition, in: Proceedings of the 2021 International Joint Conference on Neural Networks, IJCNN 2021, IEEE, 2021, pp. 1–8. doi:10.1109/IJCNN52387.2021.9533373.</p>
      <p>[11] G. Berton, C. Masone, B. Caputo, Rethinking Visual Geo-Localization for Large-Scale Applications, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, IEEE Computer Society, 2022, pp. 4878–4888.</p>
      <p>[12] A. Ali-bey, B. Chaib-draa, P. Giguère, MixVPR: Feature Mixing for Visual Place Recognition, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023, IEEE, 2023, pp. 2998–3007.</p>
      <p>[13] A. Ali-bey, B. Chaib-draa, P. Giguère, GSV-Cities: Toward appropriate supervised visual place recognition, Neurocomputing 513 (2022) 194–203. doi:10.1016/j.neucom.2022.09.127.</p>
      <p>[14] J. Sun, Z. Shen, Y. Wang, H. Bao, X. Zhou, LoFTR: Detector-Free Local Feature Matching With Transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, IEEE Computer Society, 2021, pp. 8922–8931.</p>
      <p>[15] R. Wang, Y. Shen, W. Zuo, S. Zhou, N. Zheng, TransVPR: Transformer-Based Place Recognition With Multi-Level Attention Aggregation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, IEEE Computer Society, 2022, pp. 13648–13657.</p>
      <p>[16] N. Keetha, et al., AnyLoc: Towards Universal Visual Place Recognition, IEEE Robot. Autom. Lett. 9 (2024) 1286–1293. doi:10.1109/LRA.2023.3343602.</p>
      <p>[17] M. Oquab, et al., DINOv2: Learning Robust Visual Features without Supervision, arXiv:2304.07193 (2024). doi:10.48550/arXiv.2304.07193.</p>
      <p>[18] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, IEEE Computer Society, 2016, pp. 779–788.</p>
      <p>[19] Ultralytics, YOLO11, 2025. URL: https://docs.ultralytics.com/models/yolo11.</p>
      <p>[20] J. Shao, L. Jiang, Style Alignment-Based Dynamic Observation Method for UAV-View Geo-Localization, IEEE Trans. Geosci. Remote Sens. 61 (2023) 1–14. doi:10.1109/TGRS.2023.3337383.</p>
      <p>[21] Buildings Instance Segmentation Dataset, Roboflow, 2025. URL: https://universe.roboflow.com/roboflow-universe-projects/buildings-instance-segmentation/dataset/1.</p>
      <p>[22] M. Schleiss, F. Rouatbi, D. Cremers, VPAIR – Aerial Visual Place Recognition and Localization in Large-scale Outdoor Environments, arXiv:2205.11567 (2022). doi:10.48550/arXiv.2205.11567.</p>
      <p>[23] O. Rainio, J. Teuho, R. Klén, Evaluation metrics and statistical tests for machine learning, Sci. Rep. 14 (2024) 6086. doi:10.1038/s41598-024-56706-x.</p>
      <p>[24] M. Zaffar, et al., VPR-Bench: An Open-Source Visual Place Recognition Evaluation Framework with Quantifiable Viewpoint and Appearance Change, Int. J. Comput. Vis. 129 (2021) 2136–2174. doi:10.1007/s11263-021-01469-5.</p>
      <p>[25] ultralytics/ultralytics: Ultralytics YOLO11, 2025. URL: https://github.com/ultralytics/ultralytics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] C. Masone, B. Caputo, A Survey on Deep Visual Place Recognition, IEEE Access 9 (2021) 19516–19547. doi:10.1109/ACCESS.2021.3054937.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] H. Jégou, M. Douze, C. Schmid, P. Pérez, Aggregating local descriptors into a compact image representation, in: Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR '10, IEEE Computer Society, 2010, pp. 3304–3311. doi:10.1109/CVPR.2010.5540039.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. Garg, T. Fischer, M. Milford, Where Is Your Place, Visual Place Recognition?, in: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, International Joint Conferences on Artificial Intelligence Organization, Montreal, Canada, 2021, pp. 4416–4425. doi:10.24963/ijcai.2021/603.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] T. Sattler, et al., Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, IEEE Computer Society, 2018, pp. 8601–8610.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, J. Sivic, NetVLAD: CNN Architecture for Weakly Supervised Place Recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, IEEE Computer Society, 2016, pp. 5297–5307.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] A. Torii, J. Sivic, T. Pajdla, M. Okutomi, Visual Place Recognition with Repetitive Structures, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2013, IEEE Computer Society, 2013, pp. 883–890.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Torii, et al., Are Large-Scale 3D Models Really Necessary for Accurate Visual Localization?, IEEE Trans. Pattern Anal. Mach. Intell. 43 (2021) 814–829. doi:10.1109/TPAMI.2019.2941876.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>