<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Eighth International Workshop on Computer Modeling and Intelligent Systems, May</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Autonomous Identification of Distinctive Landmarks from Earth Surface Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zakhar Ostrovskyi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksander Barmak</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iurii Krak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Glushkov Cybernetics Institute</institution>
          ,
          <addr-line>40, Glushkov Ave., Kyiv, 03187</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Khmelnytskyi National University</institution>
          ,
          <addr-line>11, Institutes str., Khmelnytskyi, 29016</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>64/13, Volodymyrska str., Kyiv, 01601</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>5</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
<p>This article addresses the problem of identifying unique objects in aerial images of urban areas on the Earth's surface, which can serve as stable landmarks for UAV navigation without GPS signals. The main contribution lies in proposing an approach to transforming the image into an object-oriented vector representation (embedding) that retains structural information about those objects. The proposed approach automatically identifies the most distinctive objects, which can serve as navigation landmarks. The study focuses on urban and suburban landscapes, where buildings are chosen as landmarks and YOLOv11 is used as the deep learning model. By employing dimensionality reduction methods, in particular PCA and t-SNE, it is demonstrated that in the proposed embedding space, buildings with atypical structural or visual characteristics differ significantly from other buildings and are easily classified as outliers, making them natural landmarks for navigation. Experimental results confirm the effectiveness and potential of the proposed approach for ensuring stable UAV navigation in scenarios where GPS may be inaccessible: the accuracy of identifying buildings designated as landmarks is twice that of ordinary buildings (Recall@1 = 0.51 vs. 0.28).</p>
      </abstract>
      <kwd-group>
<kwd>Instance embeddings</kwd>
        <kwd>Landmark selection</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>UAV navigation</kwd>
        <kwd>Satellite images</kwd>
        <kwd>GPS-denied environments</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Unmanned Aerial Vehicles (UAVs) increasingly operate in environments where GPS signals are
unreliable or absent [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Under such conditions, visual landmarks identified from onboard camera
images become the sole method for determining UAV location [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For accurate localisation,
landmarks must be distinctive and visually recognisable under various conditions of illumination,
altitude, and imaging type (e.g., UAV camera versus satellite imagery). Thus, using a set of landmarks
for a given area, a route can be planned to a specified point, allowing UAV navigation without GPS
signals, relying solely on environmental image analysis.
      </p>
      <p>Depending on terrain characteristics, various types of objects can serve as landmarks. Given their
critical practical relevance for tasks like search-and-rescue operations, deliveries, and path planning
for long-range strike UAVs through densely populated urban areas, this study focuses on urban and
suburban environments. Buildings, frequently prominent in these areas, have thus been chosen as
potential landmarks in this research.</p>
      <p>
        In recent years, deep neural networks have become widely used to obtain vector representations of
features, such as embeddings. In computer vision tasks, embeddings of images or their fragments can
be extracted from the hidden layers of pre-trained neural networks (such as convolutional networks
or transformers), transforming visual data into compact, information-rich vectors while preserving
meaningful similarity between inputs [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Essentially, a neural network “encodes” an image into a
point in latent space, placing images with similar content close together and enabling comparison
using distance functions. Moreover, networks can be fine-tuned to produce embeddings with
enhanced properties [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], such as invariance to common disturbances (changes in lighting, angles, etc.),
thus improving their robustness in dynamic environments. Instead of raw pixel processing, UAV
navigation systems can reliably recognise relevant objects using embeddings and distance metrics.
      </p>
      <p>Therefore, this research contributes an approach for generating semantically rich
vector representations of objects (buildings) derived from the hidden-layer activations of a
convolutional neural network when processing satellite and UAV images. The approach requires no
additional training and can therefore be applied directly to any segmentation CNN and to other
object types.</p>
      <p>The proposed approach is practically significant, as it addresses the automatic identification of
stable landmarks among numerous similar objects. For example, many urban buildings share
architectural features, materials, or colour schemes, reducing their distinctiveness for reliable visual
identification.</p>
      <p>The structure of this paper is as follows: the Literature Review section surveys previous research on
UAV visual localisation, including marker-based approaches, semantic dictionaries, and feature
aggregation methods. The Materials and Methods section introduces the proposed model for
extracting embeddings and automatically identifying landmark buildings. The Results and Discussion
section presents experimental outcomes using the VPAIR dataset, their analysis, and potential
improvements to the approach.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <p>Below, related research directions are reviewed.</p>
      <p>
        One closely related approach is marker-based localisation, where artificial markers ensure
uniqueness. For instance, YoloTag [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] employs a YOLO-based detector for fiducial markers, enabling
position estimation through geometric algorithms (e.g., EPnP [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). Although effective in indoor or
restricted outdoor environments, it depends on physical marker placement, limiting scalability.
Marker optimisation methods [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] face similar scalability challenges.
      </p>
      <p>
        Semantic mapping and object dictionaries maintain recognised object annotations with geometric
or class characteristics. For example, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] combines detection with depth data (RGB-D cameras) to
create semantic maps (doors, fire extinguishers, etc.). While conceptually similar, these solutions
typically do not produce vector embeddings that differentiate objects within the same class, limiting
the identification of distinctive landmarks. Additionally, dictionary-based approaches usually cover
small-to-medium indoor areas, where objects appear repeatedly from various viewpoints within one
route.
      </p>
      <p>
        Local CNN-descriptor aggregation into global vectors has been investigated in object retrieval
contexts (e.g., SPoC, CroW, R-MAC, NetVLAD [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]), usually tested on datasets for
ground-level place recognition. Despite conceptual similarities, these methods rarely perform precise
object segmentation, particularly from aerial imagery. Furthermore, these methods typically produce
a global vector representation for entire images, not considering individual object vector
representations.
      </p>
      <p>
        Domain adaptation methods such as CLDA-YOLO [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] address environmental variations
(weather, lighting, etc.) in object detection tasks. These methods could enhance embedding robustness
within the developed algorithm.
      </p>
      <p>
        Comprehensive surveys of UAV navigation under GPS-denied conditions [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] emphasise the
importance of tracking visual reference points. Yet, systems like SLAM primarily track key surface
points without treating objects holistically as unique landmarks.
      </p>
      <p>Literature analysis indicates limited attention to landmark-based UAV localisation methods. Most
existing approaches generate descriptor vectors for entire UAV images, matching them against
annotated databases. However, these descriptors may be sensitive to changes in imaging conditions
and background noise.</p>
      <p>Thus, this research aims to improve UAV localisation accuracy using stable landmarks derived
from the hidden layers of a convolutional neural network when processing satellite images. These
landmarks, based exclusively on unique objects, offer robustness to image noise. To achieve this goal,
the following research tasks were formulated.</p>
      <p>1. Develop a method for obtaining embeddings (vector representations) of buildings from
convolutional neural network hidden layers, capable of preserving their visual and structural
characteristics.
2. Create an automatic method for selecting landmark buildings based on embedding space
analysis (e.g., via outlier detection).
3. Experimentally validate the proposed approach for accurately identifying buildings designated
as landmarks (based on prior analysis of satellite images) on UAV images.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and methods</title>
      <sec id="sec-3-1">
        <title>3.1. Process model</title>
        <p>For the problem under consideration, the input data consists of a set of satellite images, denoted by I = {I_1, I_2, …, I_N}, covering a specific geographic area, and a fixed set of landmark object types, defined as LandmarkTypes = {Type_1, Type_2, …, Type_t}, potentially comprising multiple object categories. Each satellite image I_n may contain several objects from the set LandmarkTypes, represented as {O_n^1, O_n^2, …, O_n^{K_n}}. O denotes the set of all objects from all images in I.</p>
        <p>The proposed approach employs convolutional neural networks (CNNs) specialised for image
segmentation tasks. These CNNs are trained to recognise object types from the LandmarkTypes set.
The output of this network for each recognised object O_n^k is the corresponding segmentation mask
M_n^k.</p>
        <p>For each object O_n^k, it is necessary to apply a mapping function f into a d-dimensional real-valued embedding vector space R^d:</p>
        <p>f(I_n, M_n^k) = e_n^k, e_n^k ∈ R^d (1)</p>
        <p>The obtained vector representation of an object should possess the following properties:</p>
        <p>uniqueness – e_n^k must emphasise distinctive visual features of object O_n^k, enabling reliable differentiation from others. Let us denote by S the set of embeddings of objects visually similar to O_n^k, and by D the set of embeddings of visually distinct objects. The following inequality must hold:</p>
        <p>∀ e+ ∈ S, ∀ e− ∈ D : d(e_n^k, e+) &lt; d(e_n^k, e−) (2)</p>
        <p>where d(⋅, ⋅) is a chosen distance metric (e.g., Euclidean distance).</p>
        <p>robustness – under minor transformations of the object, such as changes in viewpoint, lighting conditions, or partial occlusions, the vector representation e_n^k remains practically unchanged. Suppose T is a transformation modelling changes in imaging conditions, and e(T(O_n^k)) is the object embedding after applying transformation T. Stability is ensured if:</p>
        <p>||e(T(O_n^k)) − e_n^k|| &lt; ϵ (3)</p>
        <p>where ϵ is a small constant defining the permissible deviation level, and ||·|| denotes a vector norm (for instance, Euclidean).</p>
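        <p>As an illustration only, the following minimal sketch shows how properties (2) and (3) could be checked for a single embedding, assuming NumPy vectors and Euclidean distance; the function names and the tolerance value are hypothetical.</p>
        <preformat>
# Minimal sketch: checking the uniqueness (2) and robustness (3) properties
# for a single object embedding. Assumes NumPy arrays and Euclidean distance.
import numpy as np

def is_unique(e_k, similar, distinct):
    """Property (2): e_k must lie closer to every visually similar embedding
    than to any visually distinct one."""
    d_sim = max(np.linalg.norm(e_k - e_plus) for e_plus in similar)
    d_dis = min(np.linalg.norm(e_k - e_minus) for e_minus in distinct)
    return d_sim &lt; d_dis

def is_robust(e_k, e_k_transformed, eps=0.1):
    """Property (3): the embedding of the transformed object stays within eps."""
    return np.linalg.norm(e_k_transformed - e_k) &lt; eps
        </preformat>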
        <p>After obtaining all embeddings e_n^k within the dataset, the task arises to automatically identify
instances that stand out significantly in the embedding vector space. Formally, let {e_1, e_2, …, e_M}
denote all object embeddings. A criterion based on outlier detection algorithms is introduced,
distinguishing objects with distinctive features from those forming dense clusters. Objects thus
identified are labelled as landmarks. The set L of such landmark objects represents the final output of
the landmark detection task.</p>
        <p>These landmark objects, identified through the described method, form the basis for UAV route
planning. In scenarios lacking GPS signals, a UAV can determine its location by recognising selected
stable and unique landmarks on the terrain.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Hypothesis</title>
        <p>An object database along the flight trajectory is prepared before the mission to facilitate UAV
navigation based on visual features. These objects are extracted from satellite images, each associated
with geolocation markers. During flight, the UAV identifies objects from onboard camera images and
searches for matching objects in this pre-formed database. Upon finding a match, the UAV uses the
object’s geolocation marker to determine its current position.</p>
        <p>Generally, it is reasonable to assume that if an object identified in a satellite image acts as a
landmark, standing out from surrounding objects due to unique features, this distinction will persist
in images captured by UAV cameras. Based on these assumptions, the core hypothesis can be
articulated as follows: if a mapping into vector space preserving semantic and structural features is
applied to the set of objects in an image, landmark objects will distinctly differ from other objects in
the embedding space, regardless of whether they originate from UAV camera images or satellite
imagery. Consequently, when receiving images from onboard cameras, the UAV can significantly
more accurately identify precisely those objects recognised as landmarks by the approach proposed in
this study.</p>
        <p>Thus, the essence of the proposed method lies in the specific utilisation of convolutional neural
networks to create an embedding vector space that preserves semantic and structural characteristics.
It is well-known that convolutional neural networks learn hierarchical feature representations, with
initial layers capturing local textures and edges, and deeper layers encoding higher-level forms or
semantic features. An embedding with hierarchical features of the specific object is effectively
obtained by constructing a vector from activations corresponding to a particular object from various
hidden layers.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Method steps</title>
        <p>The main steps of the proposed approach are illustrated in Fig. 1.</p>
        <p>The input information of the approach (Fig. 1) consists of a set of satellite images of a specific
territory I and a fixed set of object types that potentially serve as landmarks, referred to as
LandmarkTypes.</p>
        <p>In the first step of the proposed approach, the set O of potential landmark objects is formed by
segmenting images from the satellite imagery database. A CNN-based segmentation model, trained to
recognise objects from the LandmarkTypes set, is applied to each image I_n, resulting in {M_n^k} – a set
of masks and corresponding confidence scores. Only objects with a confidence coefficient above a
threshold θ are selected for reliability.</p>
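        <p>As an illustration of this step, the sketch below collects segmentation masks and confidence scores and keeps only detections above the threshold θ. It assumes the Ultralytics YOLO API; the weights file name and the threshold value are placeholders rather than the exact configuration used in this study.</p>
        <preformat>
# Minimal sketch of step 1: segment candidate objects in each satellite image
# and keep only masks whose confidence exceeds the threshold theta.
from ultralytics import YOLO

THETA = 0.5                                  # illustrative confidence threshold
model = YOLO("building-seg.pt")              # hypothetical pre-trained segmentation weights

def segment_objects(image_paths):
    objects = []                             # the set O of candidate landmark objects
    for n, path in enumerate(image_paths):
        result = model(path)[0]              # one result object per image
        if result.masks is None:
            continue
        for mask, conf in zip(result.masks.data, result.boxes.conf):
            if float(conf) &gt; THETA:          # keep reliable detections only
                objects.append({"image": n, "mask": mask.cpu().numpy()})
    return objects
        </preformat>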
        <p>In step 2, intermediate features of objects are extracted from the hidden layers of the convolutional
neural network. To describe this process, denote the feature map of a CNN at layer
l as F_l ∈ R^{C_l × H_l × W_l}, where C_l is the number of channels at layer l, and H_l and W_l represent the
height and width, in pixels, of the feature maps at layer l, respectively. Let l_1, l_2, …, l_L be parameters
corresponding to the backbone CNN layers used for embedding formation. During segmentation in
step 1, feature maps F_l are extracted from the backbone CNN layers l_1, l_2, …, l_L. Each layer l is
associated with a stride ρ_l, defining the reduction in spatial resolution of feature maps compared to
the original image. Accordingly, each mask M_n^k is resized to dimensions M̃_n^{k,l} based on the
corresponding stride ρ_l of the feature map F_l. This alignment ensures pixels in the mask correspond
precisely to positions on the feature map, thus isolating only the area corresponding to the detected
object O_n^k.</p>
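        <p>A possible implementation of this step is sketched below: forward hooks capture the backbone feature maps F_l, and each object mask is resized to the spatial grid of every selected layer. A PyTorch backbone is assumed, and the choice of layers and of nearest-neighbour interpolation is illustrative.</p>
        <preformat>
# Minimal sketch of step 2: capture hidden-layer feature maps with forward hooks
# and downscale each object mask to the resolution of every chosen layer.
import torch
import torch.nn.functional as F

captured = {}                                # layer name -&gt; feature map F_l

def make_hook(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()     # shape (1, C_l, H_l, W_l)
    return hook

def register_hooks(backbone, layer_names):
    for name, module in backbone.named_modules():
        if name in layer_names:
            module.register_forward_hook(make_hook(name))

def resize_mask(mask, feature_map):
    """Downscale a binary mask (H, W) to the feature map's H_l x W_l grid."""
    m = torch.as_tensor(mask, dtype=torch.float32)[None, None]   # (1, 1, H, W)
    h_l, w_l = feature_map.shape[-2:]
    return F.interpolate(m, size=(h_l, w_l), mode="nearest")[0, 0].bool()
        </preformat>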
        <p>
          Step 3 involves aggregating intermediate features to form the final object embeddings. Since each
object may vary in size, occupying different-sized regions on the activation maps, an aggregation
function must be employed to obtain a fixed-dimensional vector. Generally, the aggregation function
can be a parameterised function trainable via backpropagation, such as a graph neural network [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>For each layer l and object k, the aggregation is computed as:</p>
        <p>z_l^c = agg({F_l[c, u, v] | (u, v) ∈ M̃_n^{k,l}}) (4)</p>
        <p>where the aggregation function can be, for example, max pooling or average pooling.</p>
        <p>Max pooling:</p>
        <p>z_l^c = max_{(u,v) ∈ M̃_n^{k,l}} F_l[c, u, v] (5)</p>
        <p>Average pooling:</p>
        <p>z_l^c = (1 / |M̃_n^{k,l}|) ∑_{(u,v) ∈ M̃_n^{k,l}} F_l[c, u, v] (6)</p>
        <p>Values z_l^c, computed for all C_l channels across selected convolutional layers l_1, l_2, …, l_L, are
concatenated to form the final embedding e_n^k ∈ R^d:</p>
        <p>e_n^k = [z_l^c | c ∈ 1..C_l, l ∈ {l_1, l_2, …, l_L}] (7)</p>
        <p>Thus, the embedding dimension d is determined by the total number of channels in convolutional
layers l_1, l_2, …, l_L, where each dimension effectively represents the presence of patterns detected by
corresponding convolutional filters:</p>
        <p>d = ∑_{l ∈ {l_1, l_2, …, l_L}} C_l (8)</p>
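        <p>Assuming the feature maps and resized masks from step 2 are available as NumPy arrays, the aggregation of (4)–(6) and the concatenation of (7) can be sketched as follows; average pooling is shown, with max pooling as a one-line variant.</p>
        <preformat>
# Minimal sketch of step 3: aggregate masked activations per channel for each
# selected layer and concatenate the results into one embedding of dimension
# d = sum of the channel counts C_l (equation (8)).
import numpy as np

def aggregate_layer(feature_map, mask):
    """feature_map: (C_l, H_l, W_l) array; mask: boolean (H_l, W_l) array."""
    if not mask.any():                       # object too small for this layer
        return np.zeros(feature_map.shape[0])
    region = feature_map[:, mask]            # (C_l, number of masked cells)
    return region.mean(axis=1)               # average pooling; use region.max(axis=1) for max pooling

def build_embedding(feature_maps, masks):
    """Per-layer lists aligned with l_1, ..., l_L; returns e of length d."""
    return np.concatenate([aggregate_layer(f, m) for f, m in zip(feature_maps, masks)])
        </preformat>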
        <p>Step 4 identifies landmark objects. Initially, all embeddings {e_1^1, e_1^2, …} are combined into the set
E. Since each embedding dimension corresponds to a specific convolutional filter trained to recognise
particular image structures, embeddings implicitly represent visual features of objects. Based on the
initial assumption, objects with atypical visual characteristics yield embeddings with atypical values.
Consequently, the final stage entails differentiating "typical" points in the embedding space from
those with rare features. Theoretically, this problem class corresponds to outlier detection methods
aimed at identifying objects statistically deviating from the majority.</p>
        <p>Generally, the outlier detection task can be formulated as follows: let E = {e_1, e_2, …, e_M} ⊂ R^d,
where each vector e_i is an object embedding. Suppose that for most points, the feature distribution
approximates a “typical” (“normal”) subset E_norm, while a few points e_j ∈ E_out significantly deviate
from this distribution. Formally, an evaluation function is assumed:</p>
        <p>s : R^d → R (9)</p>
        <p>which returns the deviation from the typical distribution for each e_i. If s(e_i) exceeds a threshold
s_thr, e_i is considered an outlier (anomaly). In our context, objects with such embeddings possess
distinctive visual characteristics and can serve as stable landmarks. Hence, applying an outlier
detection algorithm to the set E forms the final landmark object set L = {e_i | s(e_i) &gt; s_thr}.</p>
        <p>The landmark object set L is the output of the proposed approach.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Evaluation metrics</title>
        <p>
          The UAV localisation problem considered in this study is classically framed as a retrieval task.
Consequently, literature conventionally evaluates UAV localisation methods using the Recall@N
metric [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. This metric considers a retrieval result as a true-positive for a given query if the
corresponding image from the database appears among the top N retrieved images:
        </p>
        <p>Recall@N = M_Q / N_Q (10)</p>
        <p>where N_Q is the total number of query images, and M_Q is the number of queries with at least one
correct match within the top-N results.</p>
        <p>This metric is popular within computer vision communities and suits applications employing
postprocessing to eliminate false-positive matches.</p>
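        <p>A minimal sketch of computing Recall@N according to (10) is given below; the input format (a ranked list of database identifiers per query) is an assumption made for illustration.</p>
        <preformat>
# Minimal sketch of Recall@N: a query counts as correct if at least one of its
# top-N retrieved database items matches the ground-truth item.
def recall_at_n(retrieved_ids, ground_truth_ids, n):
    """retrieved_ids: one ranked id list per query;
    ground_truth_ids: one correct database id per query."""
    hits = sum(1 for ranked, gt in zip(retrieved_ids, ground_truth_ids)
               if gt in ranked[:n])
    return hits / len(ground_truth_ids)
        </preformat>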
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>
          The VPAIR dataset [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] was selected for conducting experiments – a dataset designed explicitly for
evaluating visual place recognition tasks and UAV localisation based on images from onboard
cameras. Data collection occurred on October 13, 2020, during a flight of a light aircraft at altitudes
ranging from 300 to 400 meters above ground, covering an area between Bonn, Germany, and the Eifel
Mountain range, with a total route length of 107 km. The dataset includes imagery captured
perpendicular to the Earth’s surface and high-precision pose/orientation data obtained using
GNSS/INS systems. The VPAIR dataset contains 2,788 aerial photographs paired with corresponding
satellite images and does not provide any annotations about the objects in the images. The satellite
images were gathered from Geobasis NRW, a state-funded geodata repository under a permissive
open data license. It provides comprehensive coverage of the entire state of Nordrhein-Westfalen,
Germany. During image capture, the aircraft maintained a speed of 150 km/h and a frame rate of 1 Hz,
resulting in approximately 41.7 meters between consecutive image centres.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiment description</title>
        <p>The YOLOv11 segmentation convolutional neural network [20], pre-trained for building
segmentation in satellite images, was utilised in the experiments. It is important to emphasise that the
proposed method uses the pre-trained CNN that segments the objects of interest and requires no
additional training on the target dataset.</p>
        <p>For outlier detection – specifically to identify landmark objects – Isolation Forest [21], a tree-based
algorithm, was chosen. The selection of this algorithm was motivated by three main reasons:
tree-based algorithms are robust against variations in feature value ranges and thus do not require
normalisation; they operate rapidly; and they need only two primary parameters that are easily
adjustable (the proportion of objects considered outliers and the number of trees). This
straightforward algorithm facilitated focusing on hypothesis verification and proved sufficient to
confirm it. Subsequent experiments present results obtained with Isolation Forest configured with 500
trees and a 1% outlier proportion.</p>
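        <p>With this configuration, the landmark-selection step can be sketched using scikit-learn's IsolationForest as follows; the random seed is an illustrative addition for reproducibility.</p>
        <preformat>
# Minimal sketch of landmark selection with Isolation Forest
# (500 trees, 1% of objects treated as outliers).
import numpy as np
from sklearn.ensemble import IsolationForest

def select_landmarks(embeddings):
    """embeddings: (M, d) array of object embeddings; returns outlier indices."""
    detector = IsolationForest(n_estimators=500, contamination=0.01, random_state=0)
    labels = detector.fit_predict(embeddings)    # -1 marks outliers (landmarks)
    return np.flatnonzero(labels == -1)
        </preformat>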
        <p>To validate the proposed hypothesis, the following sub-hypotheses must be tested:
1) The proposed embedding generation approach encodes structural and semantic information
about objects.
2) The accuracy of landmark building retrieval from UAV images is significantly higher than that
of typical (non-landmark) buildings.</p>
        <p>It should be noted that the VPAIR dataset contains no specific annotations for buildings; thus, the
set of buildings used in this study was obtained using the YOLOv11 segmentation model.</p>
        <p>The absence of building ground-truth annotations in the dataset makes it impossible to quantify
misclassifications, false positives, and false negatives in the object detection process on the VPAIR
dataset.</p>
        <p>However, for the pre-trained YOLOv11 model used in the experiments, its developers report the
following metrics: 18,794 true positives, 8,462 false positives, and 5,628 false negatives (true-negative
background pixels are undefined for segmentation). Across seven random splits, the model attained
mAP 0.754, precision 0.771, recall 0.680, and F1 0.722. These reported values establish a realistic error
bound when the model is applied to the VPAIR dataset, and manual inspection of the predictions
confirms its strong performance and generalisation to this dataset.</p>
        <p>To verify the first hypothesis, visualisation of the building embeddings—obtained from segmented
satellite images—was conducted using two dimensionality reduction methods: PCA for analysing
linear dependencies and t-SNE for non-linear dependencies. The resulting visualisations were then
inspected visually and assessed qualitatively.</p>
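        <p>The visualisation procedure can be sketched as follows, assuming the embeddings are stored as a NumPy matrix and the landmark labels from the outlier detector are available as a boolean mask; plotting details such as colours and marker sizes are illustrative.</p>
        <preformat>
# Minimal sketch of the embedding-space visualisation: project embeddings to 2D
# with PCA and t-SNE and colour landmark buildings differently.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def plot_embeddings(embeddings, landmark_mask):
    """embeddings: (M, d) NumPy array; landmark_mask: boolean array of length M."""
    projections = [("PCA", PCA(n_components=2).fit_transform(embeddings)),
                   ("t-SNE", TSNE(n_components=2).fit_transform(embeddings))]
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, (name, points) in zip(axes, projections):
        ax.scatter(points[~landmark_mask, 0], points[~landmark_mask, 1],
                   c="blue", s=5, label="typical")
        ax.scatter(points[landmark_mask, 0], points[landmark_mask, 1],
                   c="black", s=12, label="landmark")
        ax.set_title(name)
        ax.legend()
    plt.show()
        </preformat>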
        <p>Validation of the second hypothesis required manual data labelling to create a benchmark set, as
the VPAIR dataset contains no building annotations. Given corresponding satellite and UAV images
and buildings previously segmented by YOLOv11, matching identical buildings across UAV and
satellite images was necessary. Considering the time-intensive nature of manual labelling, a random,
non-repetitive sample of 100 landmark buildings and 100 typical buildings was selected for
annotation. For the embeddings of each of the selected 200 UAV buildings, the five nearest
embeddings from satellite images were identified using the L2 norm. The metrics Recall@1 and
Recall@5 were calculated separately for landmark and typical buildings.</p>
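        <p>This retrieval evaluation can be sketched as follows; the ground-truth index array stands in for the manual UAV-to-satellite building matches, and its format is a hypothetical assumption.</p>
        <preformat>
# Minimal sketch of the retrieval evaluation: for each UAV building embedding,
# find the five nearest satellite embeddings by L2 distance and compute
# Recall@1 and Recall@5 against the manually matched ground truth.
import numpy as np

def evaluate_retrieval(uav_emb, sat_emb, gt_index):
    """uav_emb: (Q, d); sat_emb: (M, d); gt_index: length-Q array of the correct
    satellite building index for each UAV query."""
    dists = np.linalg.norm(uav_emb[:, None, :] - sat_emb[None, :, :], axis=2)
    top5 = np.argsort(dists, axis=1)[:, :5]          # five nearest per query
    recall1 = np.mean(top5[:, 0] == gt_index)
    recall5 = np.mean([gt in row for gt, row in zip(gt_index, top5)])
    return recall1, recall5
        </preformat>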
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Analysis of the obtained embedding space</title>
        <p>Figure 3: Visualisation plots of building embeddings from satellite images using dimensionality
reduction methods: a) PCA; b) t-SNE. Black points represent landmark buildings, and blue points
represent typical buildings.</p>
        <p>Visualisation results of building embeddings obtained via dimensionality reduction methods (Fig.
3) demonstrate that the embedding space is structured.</p>
        <p>The PCA plot shows that most buildings concentrate on the left side, with the remaining points
forming an elongated, sparse tail. It is logical to hypothesise that the dense concentration corresponds
to numerous typical buildings, while the progressively extending tail represents buildings with
increasing visual uniqueness. Visual inspection of points in these areas (Fig. 4) confirms this
assumption (Fig. 5 and Fig. 6). Thus, the selection of buildings at the tail end of this distribution by the
outlier detection algorithm as landmarks aligns with expectations, as these points correspond to the
most distinctive structures.</p>
        <p>The t-SNE visualisation, which reveals non-linear relationships, displays multiple small clusters
grouping visually similar buildings or identical buildings from adjacent frames. The fact that
landmark buildings cluster at the edges of the point cloud, rather than being dispersed throughout,
indicates good embedding space structure. A particularly notable cluster emerges distinctly in the left
region of the t-SNE plot. Visual inspection revealed that this cluster corresponds to small buildings
with typical structures positioned at image boundaries (so buildings partially extend beyond the frame
edge). Examples of these buildings and their corresponding points in the t-SNE visualisation are
illustrated in Fig. 7–9.</p>
        <p>Thus, the proposed method effectively distinguishes landmark buildings from typical ones within
the embedding space. Selected landmark buildings exhibit unique characteristics, often large or
irregular shapes. Visually similar buildings in size, colour, and form have close embeddings. The
neighbourhoods around embeddings situated in regions of greater uniqueness mostly contain
embeddings of the same buildings from adjacent frames, indicating stability of the vector
representation across different viewpoints. However, as uniqueness decreases, the neighbourhoods
increasingly include buildings that, although visually similar, originate from spatially distant
locations.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Comparison of retrieval accuracy for typical and landmark buildings</title>
        <p>The quantitative measurements presented in Table 1 demonstrate that the accuracy of UAV-based
searches for landmark buildings nearly doubles compared to searches for typical buildings, thus
confirming the efficacy of the proposed approach. The Recall@5 value indicates that incorporating a
post-filtering stage for the top-5 most similar buildings could potentially increase the current
implementation’s Recall@1 up to 0.66.</p>
        <p>Several special cases were observed during the evaluation phase. For example, at the building
segmentation stage YOLOv11 may detect the same building in both images, yet in one of them merge
it with an adjacent building, as illustrated in cases a) and b) in Figure 10. Nevertheless, the proposed
approach generates embeddings that robustly encode semantic and structural information, rendering
these representations resilient to CNN segmentation errors; in both cases, the correct matching
building was identified successfully.</p>
        <p>While cases a) and b) focused on searches of landmark buildings, case c) involved a building
classified as typical, characterised by medium size and a visually distinct angular shape. Despite the
corresponding satellite image building being rotated by more than 90°, it was accurately identified as
the top-1 match, demonstrating the embeddings’ robustness to object rotations. Case d) involved a
typical building—a long residential structure with an orange roof. Such buildings are numerous in the
dataset, and semantically retrieved buildings were correct, matching the elongated rectangular shape
and roof colour. However, none matched the UAV-captured building, emphasising the importance of
selecting truly unique buildings for search accuracy.</p>
        <p>Figure 10: Example retrieval cases a)–d): a), b) landmark buildings matched correctly despite
segmentation errors, demonstrating the robustness of the embeddings; c) a typical building with a
distinct angular shape, rotated by more than 90° in the satellite imagery yet retrieved as the top-1
match; d) a typical elongated residential building with an orange roof, for which the retrieved
buildings are semantically correct but none matches the query, emphasising the importance of
selecting unique landmark buildings.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Limitations</title>
        <p>The experimental validation in this study was conducted within an urban environment, using
buildings as landmarks. Both satellite and UAV images were captured during daylight from the same
vertical, top-down perspective.</p>
        <p>Significantly, the specific set of landmark objects for a given set of satellite images depends on the
convolutional neural network used. Different neural networks might segment the same image
differently, potentially failing to identify buildings or merging multiple adjacent buildings into a
single segment. Additionally, object embeddings generated by these networks may differ, resulting in
variations in the final landmark object set.</p>
        <p>The calculation of Recall@1 and Recall@5 metrics for matching accuracy between UAV and
satellite images required human labelling of search results. Due to the time-intensive nature of this
task, the test dataset size was limited to 200 unique buildings.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Future work</title>
        <p>Future directions for improvement include extending this approach to other types of urban
landmarks (e.g., intersections, roads, sports fields) and different environments (e.g., forests, fields).</p>
        <p>An interesting research aspect involves the impact of aggregation functions on the embedding
space and the objects forming the final landmark set. Combining graph neural networks, trainable via
backpropagation, with Contrastive Learning methods [22] could enhance the invariance of object
embeddings to variations in viewing angles or lighting conditions.</p>
        <p>To create landmark sets without the strict requirement for a fixed number (as seen in Isolation
Forest), and considering additional practical constraints, future improvements may involve more
flexible outlier detection algorithms. An alternative approach could replace outlier detection with
clustering algorithms that do not require a fixed cluster count. Here, landmarks could be represented
by objects in tiny clusters or those lying outside of any cluster, with the introduction of
supplementary constraints, such as a maximum allowable distance between neighbouring landmarks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This work proposes an approach for identifying unique landmark objects by analysing embeddings
obtained from convolutional neural networks. The study aims to enhance UAV localisation by
isolating distinctive landmark buildings within an embedding space encoding structural and visual
features.</p>
      <p>Experimental results successfully met this goal, revealing landmark building identification
accuracy nearly twice as high as typical building recognition (Recall@1 = 0.51 and Recall@5 = 0.66
versus 0.28 and 0.46, respectively).</p>
      <p>Nevertheless, the current implementation has limitations, notably its application to navigation
within urban and suburban environments under good lighting conditions, and dependency on a
particular segmentation model.</p>
      <p>Future work aims to broaden this approach to various object and terrain types and investigate
more adaptable feature aggregation and anomaly detection methods. Such enhancements could
expand the system’s applicability and navigational accuracy. Ultimately, refining this approach could
enable fully automated UAV route planning based on visual features in GPS-denied environments.
The only parameters needed would include satellite surface imagery, specific landmark set constraints
derived from UAV technical specifications, the selected convolutional neural network deployed on the
UAV, and defined start and end route points.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly for grammar and spelling
checking. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Masone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          <article-title>Survey on Deep Visual Place Recognition, IEEE Access 9 (</article-title>
          <year>2021</year>
          )
          <fpage>19516</fpage>
          -
          <lpage>19547</lpage>
          . doi:
          <volume>10</volume>
          .1109/ACCESS.
          <year>2021</year>
          .
          <volume>3054937</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ayala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Portela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Buarque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. J. T.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <article-title>UAV control in autonomous objectgoal navigation: a systematic literature review</article-title>
          ,
          <source>Artif. Intell. Rev</source>
          .
          <volume>57</volume>
          (
          <year>2024</year>
          )
          <article-title>125</article-title>
          . doi:
          <volume>10</volume>
          .1007/s10462-024-10758-7.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Maurício</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Domingues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bernardino</surname>
          </string-name>
          ,
          <article-title>Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review</article-title>
          , Appl. Sci.
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <article-title>9</article-title>
          . doi:
          <volume>10</volume>
          .3390/app13095521.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rundo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Militello</surname>
          </string-name>
          ,
          <article-title>Image biomarkers and explainable AI: handcrafted features versus deep learned features</article-title>
          ,
          <source>Eur. Radiol. Exp</source>
          .
          <volume>8</volume>
          (
          <year>2024</year>
          )
          <article-title>130</article-title>
          . doi:
          <volume>10</volume>
          .1186/s41747-024-00529-y.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R. Shwartz</given-names>
            <surname>Ziv</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <article-title>LeCun, To Compress or Not to Compress-Self-Supervised Learning</article-title>
          and
          <source>Information Theory: A Review, Entropy</source>
          <volume>26</volume>
          (
          <year>2024</year>
          )
          <article-title>3</article-title>
          . doi:
          <volume>10</volume>
          .3390/e26030252.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Raxit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al Redwan</surname>
          </string-name>
          <string-name>
            <surname>Newaz</surname>
          </string-name>
          ,
          <article-title>YoloTag: Vision-based Robust UAV Navigation with Fiducial Markers</article-title>
          ,
          <source>in: Proceedings of the 2024 33rd IEEE International Conference on Robot and Human Interactive Communication, ROMAN '24</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>316</lpage>
          . doi:
          <volume>10</volume>
          .1109/ROMAN60168.
          <year>2024</year>
          .
          <volume>10731319</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lepetit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Moreno-Noguer</surname>
          </string-name>
          , P. Fua,
          <article-title>EPnP: An Accurate O(n) Solution to the PnP Problem</article-title>
          ,
          <source>Int. J. Comput. Vis</source>
          .
          <volume>81</volume>
          (
          <year>2009</year>
          )
          <fpage>155</fpage>
          -
          <lpage>166</lpage>
          . doi:
          <volume>10</volume>
          .1007/s11263-008-0152-6.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. DeGol</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Fragoso</surname>
            ,
            <given-names>S. N.</given-names>
          </string-name>
          <string-name>
            <surname>Sinha</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          <string-name>
            <surname>Leonard</surname>
          </string-name>
          ,
          <article-title>Optimizing Fiducial Marker Placement for Improved Visual Localization</article-title>
          , IEEE Robot. Autom.
          <source>Lett. 8</source>
          (
          <year>2023</year>
          )
          <fpage>2756</fpage>
          -
          <lpage>2763</lpage>
          . doi:
          <volume>10</volume>
          .1109/LRA.
          <year>2023</year>
          .
          <volume>3260700</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bersan</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. F. M. Campos</surname>
            ,
            <given-names>E. R.</given-names>
          </string-name>
          <string-name>
            <surname>Nascimento</surname>
          </string-name>
          ,
          <article-title>Extending Maps with Semantic and Contextual Object Information for Robot Navigation: a Learning-Based Framework Using Visual and Depth Cues</article-title>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Intell</surname>
          </string-name>
          . Robot. Syst.
          <volume>99</volume>
          (
          <year>2020</year>
          )
          <fpage>555</fpage>
          -
          <lpage>569</lpage>
          . doi:
          <volume>10</volume>
          .1007/s10846-019-01136-5.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Babenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lempitsky</surname>
          </string-name>
          ,
          <article-title>Aggregating Local Deep Features for Image Retrieval</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          , ICCV '15, IEEE Computer Society,
          <year>2015</year>
          , pp.
          <fpage>1269</fpage>
          -
          <lpage>1277</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tolias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sicre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Particular object retrieval with integral max-pooling of CNN activations</article-title>
          ,
          <year>2016</year>
          . arXiv:
          <volume>1511</volume>
          .05879. doi:
          <volume>10</volume>
          .48550/arXiv.1511.05879.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kalantidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mellina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Osindero</surname>
          </string-name>
          ,
          <article-title>Cross-Dimensional Weighting for Aggregated Deep Convolutional Features</article-title>
          , in: G. Hua, H. Jégou (Eds.),
          <source>Computer Vision - ECCV 2016 Workshops</source>
          , Springer International Publishing, Cham,
          <year>2016</year>
          , pp.
          <fpage>685</fpage>
          -
          <lpage>701</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>319</fpage>
          -46604-0_
          <fpage>48</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Arandjelovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gronat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pajdla</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Sivic,</surname>
          </string-name>
          <article-title>NetVLAD: CNN Architecture for Weakly Supervised Place Recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision</source>
          and Pattern Recognition,
          <source>CVPR '16</source>
          , IEEE Computer Society,
          <year>2016</year>
          , pp.
          <fpage>5297</fpage>
          -
          <lpage>5307</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Qiu</surname>
          </string-name>
          , et al.,
          <source>CLDA-YOLO: Visual Contrastive Learning Based Domain Adaptive YOLO Detector</source>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2412</volume>
          .11812. doi:
          <volume>10</volume>
          .48550/arXiv.2412.11812.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          , Y. Cheng, U. Manzoor,
          <string-name>
            <given-names>J.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <article-title>A review of UAV autonomous navigation in GPSdenied environments</article-title>
          , Robot. Auton. Syst.
          <volume>170</volume>
          (
          <year>2023</year>
          )
          <article-title>104533</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.robot.
          <year>2023</year>
          .
          <volume>104533</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <surname>A Comprehensive</surname>
          </string-name>
          <article-title>Survey on Graph Neural Networks</article-title>
          ,
          <source>IEEE Trans. Neural Netw. Learn. Syst</source>
          .
          <volume>32</volume>
          (
          <year>2021</year>
          )
          <fpage>4</fpage>
          -
          <lpage>24</lpage>
          . doi:
          <volume>10</volume>
          .1109/TNNLS.
          <year>2020</year>
          .
          <volume>2978386</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaffar</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>VPR-Bench</surname>
          </string-name>
          :
          <article-title>An Open-Source Visual Place Recognition Evaluation Framework with Quantifiable Viewpoint and Appearance Change</article-title>
          ,
          <source>Int. J. Comput. Vis</source>
          .
          <volume>129</volume>
          (
          <year>2021</year>
          )
          <fpage>2136</fpage>
          -
          <lpage>2174</lpage>
          . doi:
          <volume>10</volume>
          .1007/s11263-021-01469-5.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>O.</given-names>
            <surname>Rainio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Teuho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Klén</surname>
          </string-name>
          ,
          <article-title>Evaluation metrics and statistical tests for machine learning</article-title>
          ,
          <source>Sci. Rep</source>
          .
          <volume>14</volume>
          (
          <year>2024</year>
          )
          <article-title>6086</article-title>
          . doi:
          <volume>10</volume>
          .1038/s41598-024-56706-x.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schleiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rouatbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cremers</surname>
          </string-name>
          ,
          <article-title>VPAIR-Aerial Visual Place Recognition and Localisation in Large-scale Outdoor</article-title>
          <string-name>
            <surname>Environments</surname>
          </string-name>
          ,
          <year>2022</year>
          . URL: https://github.com/AerVisLoc/vpair.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>