<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Advanced AI in Explainability and Ethics for the Sustainable Development Goals, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>CNNs are explainable domain-specific visual embedders</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zakhar Ostrovsky</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrii Biloshchytskyi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmytro Uhryn</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Astana IT University</institution>
          ,
          <addr-line>Astana 010000</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Khmelnytskyi National University</institution>
          ,
          <addr-line>11, Institutes str., Khmelnytskyi, 29016</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Yuriy Fedkovych Chernivtsi National University</institution>
          ,
          <addr-line>Chernivtsi, 58012</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>07</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
<p>Autonomous aerial systems operating in GPS-denied environments require robust visual place recognition (VPR) to ensure safety and reliability. However, creating stable visual embeddings for complex aerial imagery remains challenging due to high dimensionality, viewpoint variability, and occlusion. In this work, we propose a domain-specific, explainable pipeline for UAV geo-localisation that leverages a pre-trained Convolutional Neural Network (CNN) to extract multi-level features of buildings, followed by unsupervised outlier detection to curate distinctive landmarks. By treating landmark selection as an anomaly detection problem, we automatically build a database of visually unique, geo-tagged structures without additional training. Experiments on the cross-view VPAIR benchmark demonstrate that our method substantially outperforms typical scene features, increasing Top-1 recall from 31% to over 53% and doubling precision in visual place recognition. The resulting embeddings are not only more accurate but also highly interpretable, emphasizing salient architectural features over background clutter. These findings suggest that integrating object-centric embeddings with outlier-based distinctiveness provides a lightweight, transparent path toward reliable autonomous navigation.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable AI</kwd>
        <kwd>visual embeddings</kwd>
        <kwd>UAV navigation</kwd>
        <kwd>visual place recognition</kwd>
        <kwd>outlier detection</kwd>
        <kwd>deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Autonomous aerial systems increasingly operate where GNSS is intermittent or denied, forcing reliance
on onboard vision for localisation and navigation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In such settings, visual place recognition (VPR),
matching the current view against a prior map, becomes critical for safety and reliability [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Yet, turning
raw pixels into stable, reusable visual embeddings is substantially harder than embedding text: images
are high-dimensional, viewpoint- and illumination-variant, and rife with occlusion. Effective image
embeddings must jointly encode geometry, appearance, and semantics while remaining discriminative
across scenes and robust within a scene [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>
        A practical route to trustworthy VPR is to ground decisions in explicit, human-interpretable landmarks
(e.g., distinctive buildings), as suggested in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Classical robotics emphasises that a good landmark
is unique and consistently observable [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, many pipelines either preselect key regions by
hand [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or treat all detections of a class (e.g., “building”) as equally valid, which fails in repetitive urban
layouts. Deep CNNs learn hierarchical features, from textures to parts to object structure, that can
support stronger embeddings [8, 9]. Global deep place recognition (e.g., NetVLAD) aggregates scene
evidence effectively [10], but still tends to behave like a black box and may be distracted by background
clutter. For explainable autonomy [11], we argue for object-centric, domain-specific embeddings paired
with an automated notion of distinctiveness prioritising the few landmarks that genuinely anchor
localisation.
      </p>
      <p>The main contribution of this research is a lightweight, explainable pipeline that: (i) extracts
multilayer, object-masked CNN embeddings for buildings without extra training, (ii) applies unsupervised
outlier detection to curate a compact set of distinctive, geo-tagged landmarks, and (iii) demonstrates
substantial VPR gains when queries are restricted to these landmarks. The approach yields interpretable
decisions (“matched this specific building”) while boosting accuracy.</p>
      <p>The structure of this paper is the following: Section 2 reviews related work on visual embeddings,
object-centric retrieval, and UAV VPR. Section 3 details our segmentation-to-embedding and
outlier-based landmark selection pipeline. Section 4 reports experiments and analysis. Section 5 concludes.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Early VPR relied on local handcrafted descriptors (e.g., SIFT) and global aggregation (e.g., BoVW) [8, 9].
These methods scale but struggle with large viewpoint changes and lack semantic abstraction. CNNs
remedied much of this by learning hierarchical features; NetVLAD integrated a learnable aggregation
to produce strong global scene descriptors [10], while DELF emphasised attentive local regions for
retrieval [12]. Still, global embeddings can entangle background and foreground, and do not inherently
tell which object grounded the match.</p>
      <p>
        Another approach is object-centric retrieval and distinctiveness. Detect-to-Retrieve (D2R) reduces
clutter by detecting objects first, then retrieving with object descriptors [13]. Yet most D2R variants
implicitly assume all instances of a class are equally informative. In UAV navigation, surveys highlight
deep learning methods tailored for GPS-denied operation and cross-view matching challenges [14],
with context-enhanced models improving aerial-to-oblique alignment [15]. For detection/segmentation
under compute constraints, YOLO-family models provide competitive accuracy-speed trade-offs on
aerial data [16]. Manually chosen key regions can help [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], but do not scale or adapt across locales.
      </p>
      <p>
        Better explainability can be achieved via multi-layer features and outliers. Isolation Forest efficiently
surfaces unusual items in high-dimensional spaces, making it a natural tool for landmark curation in
an embedding space [17]. Meanwhile, feature visualisation shows intermediate CNN layers capture
interpretable patterns (textures, parts), suggesting that multi-layer, object-masked embeddings can be
both discriminative and more transparent than global scene vectors [18]. Imaging XAI further supports
blending low- and high-level cues to improve interpretability without sacrificing performance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Guided by this landscape, we pursue the following research tasks:
1. Design domain-specific, multi-layer CNN embeddings for buildings that require no additional
training but retain semantic and fine-detail cues [8, 9, 10, 12, 18].
2. Select distinctive landmarks by unsupervised outlier detection in embedding space (Isolation
Forest) to down-weight repetitive structures [13, 17].
3. Evaluate the pipeline in UAV-relevant cross-view retrieval, comparing landmark-focused queries
versus typical buildings to quantify gains in accuracy and explainability [14, 15, 16].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and methods</title>
      <sec id="sec-3-1">
        <title>3.1. Problem setup and notation</title>
        <p>Let a geo-referenced collection of satellite images be \(\mathcal{I} = \{I_i\}_{i=1}^{N}\). A segmentation detector produces
a set of building instances
\[ \mathcal{B} = \bigcup_{i=1}^{N} \mathcal{B}(I_i), \qquad \mathcal{B}(I_i) = \{(m_j, s_j, g_j)\}, \tag{1} \]
where \(m_j \in \{0, 1\}^{H \times W}\) is a binary mask, \(s_j\) a confidence score, and \(g_j\) the geo-coordinate (e.g.,
centroid) of the object.</p>
        <p>We retain candidates with \(s_j \ge \tau_{\mathrm{conf}}\) and form \(\mathcal{B}' \subset \mathcal{B}\). Our goal is to learn an embedding map
\(f : \mathcal{B}' \to \mathbb{R}^{D}\) and curate a landmark subset \(\mathcal{L} \subset \mathcal{B}'\) whose embeddings are distinctive within the
operational area. The approach is summarised by the diagram in Figure 1.</p>
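        <p>For concreteness, the following minimal Python sketch shows one way to represent the per-building records of Eq. (1) and the confidence filtering; the names (BuildingInstance, tau_conf, filter_candidates) are illustrative and not part of the released pipeline.</p>
        <preformat>
# Minimal sketch of the building-instance record and confidence filter (names are illustrative).
from dataclasses import dataclass
import numpy as np

@dataclass
class BuildingInstance:
    mask: np.ndarray   # binary mask m_j, shape (H, W), values in {0, 1}
    score: float       # detector confidence s_j
    geo: tuple         # geo-coordinate g_j, e.g. the mask centroid (lat, lon)

def filter_candidates(instances, tau_conf=0.5):
    """Keep detections with confidence &gt;= tau_conf, i.e. the candidate set B'."""
    return [b for b in instances if b.score &gt;= tau_conf]
        </preformat>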
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Segmentation backbone and masking</title>
        <p>We employ a lightweight YOLO-family segmentation model to detect buildings and obtain pixel-accurate
masks suitable for object-level feature pooling [16]. For an input image \(I\), let the CNN backbone produce
feature tensors at selected layers \(\ell \in \mathcal{S} = \{1, \dots, L\}\):
\[ F^{(\ell)} \in \mathbb{R}^{C_{\ell} \times H_{\ell} \times W_{\ell}}, \qquad \ell \in \mathcal{S}. \tag{2} \]
To align the object mask with each feature map, we apply a resolution-aware projection
\[ m^{(\ell)} = \Pi_{\ell}(m) \in \{0, 1\}^{H_{\ell} \times W_{\ell}}, \tag{3} \]
where \(\Pi_{\ell}(\cdot)\) denotes down/up-sampling consistent with the backbone strides. The masked spatial
support for channel \(c\) is
\[ \Omega^{(\ell, c)} = \{(u, v) \mid m^{(\ell)}(u, v) = 1\}. \tag{4} \]</p>
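        <p>A minimal sketch of the resolution-aware projection \(\Pi_{\ell}(\cdot)\) follows; nearest-neighbour down-sampling is assumed here for simplicity, and any stride-consistent resampling would serve the same purpose.</p>
        <preformat>
# Sketch of Eq. (3): resize the full-resolution binary mask onto a feature map's grid.
# Nearest-neighbour sampling is one simple, stride-consistent choice (an assumption).
import numpy as np

def project_mask(mask, feat_h, feat_w):
    """Down-sample an (H, W) binary mask to (feat_h, feat_w) by nearest neighbour."""
    H, W = mask.shape
    rows = np.arange(feat_h) * H // feat_h
    cols = np.arange(feat_w) * W // feat_w
    return mask[np.ix_(rows, cols)].astype(np.uint8)
        </preformat>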
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Object-masked, multi-layer embedding</title>
        <p>For each selected layer \(\ell\) and channel \(c\), we aggregate activations over \(\Omega^{(\ell, c)}\) using a pooling operator
\(\phi(\cdot)\):
\[ e^{(\ell, c)} = \phi\big(\{ F^{(\ell)}(c, u, v) \mid (u, v) \in \Omega^{(\ell, c)} \}\big). \tag{5} \]
We consider three standard, training-free choices:
\[ \phi_{\max}(A) = \max_{a \in A} a, \qquad \phi_{\mathrm{mean}}(A) = \frac{1}{|A|} \sum_{a \in A} a, \qquad \phi_{\mathrm{sum}}(A) = \sum_{a \in A} a. \tag{6} \]
The per-layer descriptor is \(e^{(\ell)} = [e^{(\ell, 1)}, \dots, e^{(\ell, C_{\ell})}]^{\top}\). The final object-masked, multi-layer embedding
concatenates descriptors across layers:
\[ f(b) = e = \bigoplus_{\ell \in \mathcal{S}} e^{(\ell)} \in \mathbb{R}^{D}, \qquad D = \sum_{\ell \in \mathcal{S}} C_{\ell}. \tag{7} \]</p>
        <p>Max pooling emphasizes salient, viewpoint-tolerant responses, mean pooling captures average
appearance, and sum pooling is sensitive to object extent; we empirically compare them in Section
4, in line with prior observations about intermediate-layer semantics and interpretability [18], and
global/attentive retrieval practice [10, 12].</p>
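        <p>The pooling and concatenation of Eqs. (5)-(7) can be sketched as follows; feature maps are assumed to be NumPy arrays of shape (C, H, W) already extracted from the backbone, and project_mask is the helper sketched in Section 3.2.</p>
        <preformat>
# Sketch of the object-masked, multi-layer embedding (Eqs. 5-7).
import numpy as np

def pool_layer(feat, mask_l, mode="max"):
    """Aggregate activations inside the mask for every channel of one layer."""
    inside = mask_l.astype(bool)            # masked spatial support Omega
    if not inside.any():                    # degenerate mask: return zeros
        return np.zeros(feat.shape[0], dtype=feat.dtype)
    region = feat[:, inside]                # shape (C, |Omega|)
    if mode == "max":
        return region.max(axis=1)
    if mode == "mean":
        return region.mean(axis=1)
    if mode == "sum":
        return region.sum(axis=1)
    raise ValueError("unknown pooling mode: " + mode)

def object_embedding(feats_by_layer, mask, mode="max"):
    """Concatenate per-layer masked descriptors into one embedding (Eq. 7)."""
    parts = []
    for feat in feats_by_layer:             # each feat has shape (C_l, H_l, W_l)
        mask_l = project_mask(mask, feat.shape[1], feat.shape[2])
        parts.append(pool_layer(feat, mask_l, mode))
    return np.concatenate(parts)            # dimensionality D = sum of C_l
        </preformat>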
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Layer subset selection via proxy retrieval metrics</title>
        <p>We pick which CNN layers to use by measuring how well different layer combinations help us retrieve
the right building under two quick tests:
• S2S (satellite→satellite): queries and database are neighbouring/overlapping satellite crops;
• S2D (satellite→drone): queries are UAV/aircraft views; the database is satellite imagery.</p>
        <p>For any candidate set of layers: 1. Build embeddings with those layers for all database buildings and
for the query buildings. 2. Nearest neighbour match: for each query, find the most similar database
embedding (Euclidean or cosine distance in the embedding space). 3. Top-1 hit rate: count how often
the nearest neighbour is the correct building, and divide by the number of queries to get an accuracy
score.</p>
        <p>To select layers, we run a greedy forward search, summarised in Algorithm 1:
Algorithm 1 Greedy Algorithm for Optimal Layer Selection
• Start by evaluating every single layer and pick the one with the highest top-1 hit rate on the
chosen proxy (S2S or S2D).
• Then, try adding each remaining layer one at a time to the current set; keep the layer that
improves the score the most.</p>
        <p>• Stop if adding any new layer no longer improves the score.</p>
        <p>This simple procedure consistently prefers a mix of mid-level and deep layers: the former carry
texture/part details, while the latter encode object shape/semantics, yielding more discriminative and
robust embeddings for retrieval, in line with prior findings on global and attentive deep retrieval and
intermediate-layer interpretability [10, 12, 18].</p>
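        <p>A compact sketch of Algorithm 1 is given below, assuming an evaluate(layer_set) callback that builds embeddings for the chosen layers and returns the proxy Top-1 hit rate (S2S or S2D); the function names are ours.</p>
        <preformat>
# Sketch of Algorithm 1: greedy forward selection of layers against a proxy Top-1 hit rate.
import numpy as np

def top1_hit_rate(query_emb, db_emb, gt_index):
    """Fraction of queries whose nearest database neighbour (L2) is the correct building."""
    hits = 0
    for q, correct in zip(query_emb, gt_index):
        dists = np.linalg.norm(db_emb - q, axis=1)
        hits += int(np.argmin(dists) == correct)
    return hits / len(query_emb)

def greedy_layer_selection(candidate_layers, evaluate):
    """evaluate(layer_set) must return the proxy Top-1 hit rate for that layer set."""
    selected, best = [], -1.0
    while True:
        trials = [(evaluate(selected + [layer]), layer)
                  for layer in candidate_layers if layer not in selected]
        if not trials:
            break
        score, layer = max(trials)
        if score &lt;= best:                 # stop when no remaining layer improves the score
            break
        selected.append(layer)
        best = score
    return selected, best
        </preformat>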
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Distinctive landmark selection via Isolation Forest</title>
        <p>
          With embeddings \(\{e_b\}\), we estimate distinctiveness using Isolation Forest [17]. The model \(\mathcal{F}\) assigns
an anomaly score \(s_b = \mathcal{F}(e_b) \in [0, 1]\), where higher indicates easier isolation (rarity). We define the
landmark set by thresholding (or via a contamination-controlled quantile \(\tau\)):
\[ \mathcal{L} = \{ b \in \mathcal{B}' \mid s_b \ge \tau \}. \tag{8} \]
        </p>
        <p>This unsupervised criterion operationalizes the classical “uniqueness” property of landmarks without
manual labels, while remaining computationally light.</p>
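        <p>A minimal sketch of this step with scikit-learn's IsolationForest is shown below; the 500 trees and contamination rate follow the experiments in Section 4, while the mapping of scores to a quantile threshold is one convenient convention rather than the only option.</p>
        <preformat>
# Sketch of Section 3.5: flag distinctive buildings as outliers in embedding space.
import numpy as np
from sklearn.ensemble import IsolationForest

def select_landmarks(embeddings, contamination=0.2, seed=0):
    """Return indices of buildings forming the landmark set L (Eq. 8)."""
    forest = IsolationForest(n_estimators=500, contamination=contamination,
                             random_state=seed)
    forest.fit(embeddings)
    # scikit-learn's score_samples is the *negated* anomaly score: lower means rarer,
    # so we flip the sign to obtain "higher = easier to isolate".
    anomaly = -forest.score_samples(embeddings)
    tau = np.quantile(anomaly, 1.0 - contamination)   # contamination-controlled threshold
    return np.flatnonzero(anomaly &gt;= tau)
        </preformat>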
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Landmark database and UAV retrieval</title>
        <p>After selection, we keep a compact landmark database where each entry stores: (i) the landmark’s
embedding (the descriptor we computed), (ii) its geo-coordinate from the map (or an ID to the source
image). At runtime, the UAV captures a frame, runs the same building segmentation, and computes
query embeddings with the exact same masking-and-pooling steps as during mapping. We then search
only within the landmark set (not all buildings) for the nearest neighbours to each query embedding.
This restriction cuts down confusion among look-alike structures and accelerates matching, both of which are
crucial in GPS-denied flight scenarios [14, 15, 16].</p>
        <p>There are two natural evaluation modes:
• Top-1 match: if the best match is confident, we immediately use that landmark’s geo-coordinate
as the position hypothesis; this is simple and fast for real-time control loops.
• Top-K + reranker: if we want extra reliability, we take the top-K candidates (e.g., 5) and apply
lightweight checks (e.g., view consistency, simple geometry, or IMU priors) to pick the final
match. This cascaded setup trades a bit of latency for robustness, which is often desirable in
safety-critical flights [14, 15, 16].</p>
        <p>To evaluate both modes with one metric, we report Recall@K. It answers: “For how many queries
does the correct landmark appear within the top-K retrieved results?” Thus Recall@1 measures the
strict, single-guess case (top-1 updates), while Recall@K (&gt;1) captures pipelines that retrieve a small
candidate set and then confirm the winner via a reranker or sensor fusion. This aligns directly with how
the system is used in practice: instant updates when confident, or top-K shortlists when verification is
enabled.</p>
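        <p>The retrieval loop and the Recall@K metric described above can be sketched as follows; LandmarkDB and recall_at_k are illustrative names, and an exhaustive L2 search is assumed (an approximate index could be substituted for larger maps).</p>
        <preformat>
# Sketch of Section 3.6: query the landmark database and score Recall@K.
import numpy as np

class LandmarkDB:
    def __init__(self, embeddings, geo):
        self.embeddings = embeddings        # shape (N_landmarks, D)
        self.geo = geo                      # shape (N_landmarks, 2), geo-coordinates

    def query(self, q, k=5):
        """Indices of the k nearest landmarks to query embedding q (L2 distance)."""
        dists = np.linalg.norm(self.embeddings - q, axis=1)
        return np.argsort(dists)[:k]

def recall_at_k(db, queries, gt_index, k=1):
    """Fraction of queries whose correct landmark appears within the top-k results."""
    hits = sum(int(gt in db.query(q, k)) for q, gt in zip(queries, gt_index))
    return hits / len(queries)

# Top-1 mode: use db.geo[db.query(q, 1)[0]] directly as the position hypothesis;
# Top-K mode: pass db.query(q, 5) to a lightweight reranker before committing.
        </preformat>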
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>We evaluated the proposed method on a challenging cross-view dataset and through a series of
experiments to validate each component of the pipeline. The primary dataset used is VPAIR (Visual Place
Recognition, Aircraft Imagery) [19], which provides paired imagery of urban landscapes: high-altitude
aerial (satellite-like) photos and low-altitude oblique photos taken from a light aircraft (simulating UAV
camera views) [20].</p>
        <p>(Figure 2: example image pair, (a) UAV-view and (b) satellite view.)</p>
        <p>The area covered in VPAIR includes a mix of dense city blocks, industrial sites, and open areas,
making it a good testbed for landmark selection. We processed the aerial images to build a landmark
database, then tested localisation by using the corresponding low-altitude images as queries (Figure 2).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. The qualitative properties of aggregation functions and one-layer embeddings</title>
        <p>Why does the aggregation function matter? A building embedding is constructed by pooling activations
that fall inside the object mask. The choice of pooling function determines what statistical summary
of the activation multiset is preserved, in much the same way that read-out operators determine the
expressive power of Graph Neural Networks (GNNs). Drawing an analogy to the analysis of multiset
aggregators in the GIN framework [21], we can interpret the three functions used here as follows:
1. Sum-pooling retains multiplicities: larger or more strongly activated regions contribute
proportionally more to the resulting vector. For buildings, this means that overall mass and footprint,
for example, long façades or large warehouse roofs, dominate the descriptor. Thus, the obtained
building embedding will largely depend on the building size. Although sum pooling has the
potential to represent different buildings the most distinctly out of the trio, its dependence on the
building’s size may significantly influence the embedding when the building image is taken from
different altitudes.
2. Average-pooling captures proportions while discarding scale. It emphasises the distribution of
visual patterns and, as such, it is expected to be the most expressive in terms of texture or material,
making subtle roof details (solar panels, fine tiling) prominent even on compact structures.
3. Max-pooling reduces the multiset to a simple set that records only the strongest response per
channel. It highlights salient local cues, distinct corners, towers, colour patches, irrespective of
object size.</p>
        <p>Hence, we expect sum-pooling to favour geometrically atypical or very large buildings,
average-pooling to surface textural oddities, and max-pooling to benefit most once deeper, semantically rich
layers are consulted.</p>
        <p>(Figure 3: two-dimensional PCA and t-SNE projections of single-layer embeddings for each aggregation function, panels (a)-(f).)</p>
        <p>Thus, to make the initial assessment of different aggregation functions and feature layers, we followed
the experimental protocol described below. For each of six convolutional layers (indices 1, 3, 5, 7,
9, 11), we formed embeddings from that single layer only, applying the three aggregation functions
individually. Isolation Forest (500 trees, 1% contamination) marked outliers, our candidate landmarks.
The resulting embedding clouds were projected to 2-D via PCA and t-SNE to visually assess landmark
separability. Figure 3 compiles the six plots per aggregator.</p>
        <p>Average-pooling (Fig. 3a and 3d). Across all layers, the brown outlier points are interwoven with
blue inliers, forming no isolated clusters. Manual inspection of outliers reveals many small, otherwise
typical houses with atypical roof textures, precisely the fine-grained cues that average-pooling amplifies.
However, the weak structural signal makes these landmarks hard to separate automatically.</p>
        <p>Max-pooling (Fig. 3b and 3e). Layers 1–5 show partial overlap: deeper spectral features have not yet
matured into strong semantic detectors. Starting at layer 7, a clear split emerges, and by layers 9–11,
the outliers form a compact lobe on the right of the PCA plot and at the periphery of the t-SNE map.
The selected landmarks are visually striking buildings, unusual footprints, vivid colours, confirming
that max-pooling benefits from the higher-level abstractions encoded in late layers.</p>
        <p>Sum-pooling (Fig. 3c and 3f). Separation is already pronounced at layer 1 and grows steadily. Outliers
correspond to large floor-area constructions (shopping centres, factories) that accumulate high activation
mass even in shallower feature maps. The wedge-shaped PCA distribution indicates that embedding
magnitude acts as a proxy for object size. These behaviours support the theoretical expectations above
and validate our first sub-hypothesis: the embedding space does encode semantic and structural cues,
with the nature of those cues dependent on the aggregation operator and CNN depth.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Proxy criteria for the selection of the embeddings configuration</title>
        <p>To rigorously test embedding discriminativeness, we deliberately increased the difficulty of the
evaluation setup. The Isolation Forest algorithm, previously applied with a 1% contamination threshold,
was adjusted to a much higher threshold of 20%. This increased threshold ensured that the pool of
outliers considered as candidate landmarks was significantly larger, more diverse, and inherently less
distinctive. We also strictly evaluated Top-1 accuracy instead of a more forgiving Top-5 criterion used
in other experimental phases, thus enforcing stringent embedding quality demands and ensuring that
our selected embeddings truly represent buildings with highly distinctive characteristics.
</p>
        <p>(Figure 4: Top-1 accuracy as layers are added greedily, (a) S2D accuracy and (b) S2S accuracy.)</p>
        <p>The results from this quantitative evaluation clearly demonstrate meaningful distinctions among
aggregation strategies and layers. Fig. 4 illustrates the progression of Top-1 accuracy improvements
as layers are incrementally added, and Table 1 succinctly presents the final selected layers for each
combination of proxy metric and aggregation function.</p>
        <p>The analysis reveals several insights. The max aggregation function consistently benefits from
combining deeper semantic layers (9, 10) with intermediate structural layers (4, 6), yielding superior
generalization across domains (S2D) and consistency within domains (S2S). This aligns with our earlier
theoretical speculation that max pooling preserves distinctive local features and semantic signals
effectively, confirming its suitability for reliable landmark embeddings.</p>
        <p>Sum aggregation, while offering a robust baseline, showed limited incremental gains when adding
deeper layers, particularly evident in the S2S metric. Its strong initial performance, visible even with
shallow layers, confirms our qualitative observation that sum pooling naturally emphasises large-scale
structures. Nonetheless, additional layers proved beneficial in bridging the satellite-to-drone domain
gap.</p>
        <p>Average aggregation exhibited a striking discrepancy across the two metrics. It achieved the highest
accuracy in S2S but notably underperformed in the more challenging cross-domain S2D metric. We
assume this discrepancy arises because average pooling emphasises subtle textural and material cues,
such as roofing patterns, that remain stable within a domain but degrade significantly when viewpoints
and sensor characteristics differ dramatically, as is the case between satellite and drone imagery.</p>
        <p>
          Overall, considering both theoretical arguments and these quantitative outcomes, we conclude
that embeddings formed via the max aggregation method, specifically layers [9, 6, 10] for the S2D
task, provide the best balance of structural, semantic, and cross-domain discriminative capabilities.
Consequently, these embeddings are most suitable for robust landmark identification and subsequent
UAV navigation tasks, clearly supported by both our previous theoretical speculations and current
quantitative analyses.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Embedding space of the top-rated configuration</title>
        <p>
          The quantitative study has previously singled out the max-pooled embedding built from layers [9, 6, 10]
as the most reliable representation. We now inspect the geometry of this embedding space and verify
that landmark buildings occupy distinctive, well-separated regions.
        </p>
        <p>PCA and t-SNE dimensionality reduction algorithms were employed to inspect the global structure.
Figure 5 juxtaposes two-dimensional projections of all satellite-image embeddings produced by the best
configuration. In each plot, orange dots denote buildings flagged as landmarks by the Isolation Forest
(contamination = 0.2), while blue dots correspond to typical buildings.</p>
        <p>In the PCA view (Fig. 5a), the majority of embeddings form a compact cloud to the upper left. From
this dense core, a sparse, elongated branch extends down-right, ending in a clearly detached cluster of
orange points. The continuous transition from core to tail suggests a spectrum of visual distinctiveness:
small, repetitive residences populate the high-density nucleus, whereas progressively more unusual
structures migrate towards the periphery.</p>
        <p>The t-SNE map (Fig. 5b) echoes this picture with higher non-linear fidelity. It displays several tight
islands of nearly identical embeddings; most lie inside the blue core, but a pronounced orange enclave
appears on the right fringe. Because t-SNE preserves local neighbourhoods, such edge clustering
indicates that landmark embeddings are indeed far from typical ones in the high-dimensional space,
not merely artefacts of linear projection.</p>
        <p>To illustrate how these projections relate to concrete urban scenes, Figure 6 enlarges two annotated
areas from the PCA plot.</p>
        <p>Region a) sits deep inside the blue core. It contains a high concentration of points whose embeddings
are almost indistinguishable. Visual examination confirms that these correspond to small, rectangular
family houses with homogeneous grey roofs, by far the most frequent pattern in the dataset (examples
in Fig. 7).</p>
        <p>Region b) lies at the extreme tip of the orange branch. The highlighted cluster comprises only
landmark points. The corresponding buildings (Fig. 8) are large, architecturally irregular complexes,
shopping malls, sports halls, L-shaped blocks, whose footprints and textures deviate strongly from
suburban norms. Their separation validates the outlier-based landmark criterion.</p>
        <p>A similar analysis on the t-SNE embedding (Fig. 9) yields consistent conclusions. Within the blue
nucleus we find a miniature cluster, again marked a), that gathers houses partially cropped by image
borders (Fig. 10). On the opposite flank, marker b) points to the same striking structures already
observed in PCA (Fig. 11). The fact that both linear and non-linear projections isolate identical landmark
sets reinforces the robustness of the learned representation.</p>
        <p>An encouraging observation is that many orange points occur in small, self-contained groups of
two or three. Manual inspection shows that these groups are repeated detections of the same landmark
building in consecutive frames. The tight clustering of their embeddings indicates strong invariance
to minor changes in viewing angle (as investigated in [22]), illumination, and partial
occlusions, an essential property for reliable UAV localisation.</p>
        <p>Conversely, blue points that stray into low-density outskirts often correspond to buildings that are
visually similar yet geographically distant from each other. Their presence cautions that, although our
method suppresses most ambiguity, truly fool-proof disambiguation requires either a larger landmark
pool or an additional geometric consistency check, an avenue we explore in future work.</p>
        <p>In summary, the qualitative evidence aligns with earlier quantitative findings: the max-pooled,
multi-layer embedding carves out a well-structured space where visually distinctive buildings occupy
separable, easily identifiable regions.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Comparison of retrieval accuracy for typical and landmark buildings</title>
        <p>For quantitative evaluation, it was necessary to establish a manual benchmark set due to the lack of
ground truth correspondences in VPAIR. This set comprised 200 manually annotated buildings: 100
landmark and 100 typical buildings. To measure retrieval effectiveness from UAV-captured buildings
back to satellite imagery, embeddings for each of the 200 manually annotated UAV buildings were
compared to all satellite-derived embeddings using the L2 norm. Retrieval performance was quantified
using the metrics Recall@1 and Recall@5, computed independently for landmark and typical buildings,
thus objectively demonstrating the relative advantage of landmark selection. The results for the
best-performing embedding configuration are summarised in Table 2.</p>
        <p>Table 2. Recall@K for landmark versus typical buildings with the best embedding configuration: at K=1, landmarks reach 0.53 against 0.31 for typical buildings; at K=5, landmarks reach 0.70 against 0.51.</p>
        <p>A clear gap appears: searches that target the automatically selected landmark set succeed almost
twice as often as searches for ordinary buildings. In particular, Recall@1 rises from 0.31 for typical
structures to 0.53 for landmarks, while Recall@5 climbs from 0.51 to 0.70. The latter figure suggests that
a lightweight re-ranking of the top-5 candidates could push single-shot accuracy close to 0.70 without
altering the core pipeline.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Limitations</title>
        <p>The present study is confined to a single public dataset, VPAIR, whose drone imagery was captured
under favourable daylight and near-nadir conditions. Consequently, the learned embeddings have not
yet been stress-tested against seasonal changes, low-sun shadows, or highly oblique UAV views. A
second constraint is the reliance on YOLOv11-nano for building segmentation. Although qualitative
checks confirm good cross-dataset generalisation, occasional mask errors reveal that downstream
performance is ultimately bounded by segmentation quality. As the exact landmark embeddings are
model-dependent, it is important that the same CNN is used both for the UAV on-board camera and the
landmarks preparation pipeline.</p>
        <p>Evaluation, too, is approximate. In the absence of authoritative building-to-building correspondences,
we circumvent these limitations with (i) manually labelled pairs for the final benchmark and (ii) proxy
metrics that exploit index proximity. While the latter proved effective for layer selection, they assume
both sufficient image overlap and uniform flight speed during dataset construction and do not
guarantee a globally optimal solution. Finally, Isolation Forest employs a fixed contamination
rate; adapting this hyper-parameter to scenes with markedly different object density remains an open
problem.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>We introduced an end-to-end framework that automatically discovers visually distinctive urban
landmarks and harnesses them for UAV localisation when GNSS is unreliable. The core idea is simple yet
powerful: extract multi-layer CNN features for each object, aggregate them into a semantically rich
embedding, and treat the landmark objects as outliers in the resulting embedding space as a natural way
to select the most distinctive objects automatically. We proposed a lightweight approach, which uses a
greedy search and two proxy retrieval metrics, to guide the selection process of the optimal embedding
parameters without the requirement of ground-truth labels. The selected embedding, max pooling
over layers 9, 6, 10, doubles Recall@1 compared with typical buildings (0.53 vs 0.31) and achieves 0.70
Recall@5, demonstrating that true landmarks are indeed easier to recover. Qualitative visualisations
confirm that these embeddings carve out well-separated clusters for architecturally unique structures
while remaining stable under viewpoint shifts, segmentation noise, and large rotations. Taken together,
the results validate both the theoretical intuition that max pooling preserves salient cues and the
practical viability of outlier-based landmark selection. While future work will address dynamic graph
structures and geometric verification to filter residual false positives, the method currently stands as a
potent, drop-in module for robust GPS-denied navigation.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[8] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2004) 91–110. doi:10.1023/B:VISI.0000029664.99615.94.</p>
      <p>[9] J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large vocabularies and fast spatial matching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8. doi:10.1109/CVPR.2007.383172.</p>
      <p>[10] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, J. Sivic, NetVLAD: CNN architecture for weakly supervised place recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5297–5307. doi:10.1109/CVPR.2016.572.</p>
      <p>[11] P. Radiuk, O. Barmak, E. Manziuk, I. Krak, Explainable deep learning: a visual analytics approach with transition matrices, Mathematics 12 (2024) 1024. doi:10.3390/math12071024.</p>
      <p>[12] H. Noh, A. Araujo, J. Sim, T. Weyand, B. Han, Large-scale image retrieval with attentive deep local features, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3456–3465. doi:10.1109/ICCV.2017.373.</p>
      <p>[13] C. H. Song, J. Yoon, T. Hwang, S. Choi, Y. H. Gu, Y. Avrithis, On train-test class overlap and detection for image retrieval, arXiv preprint arXiv:2306.02484 (2024).</p>
      <p>[14] O. Y. Al-Jarrah, A. S. Shatnawi, M. M. Shurman, O. A. Ramadan, S. Muhaidat, Exploring deep learning-based visual localization techniques for UAVs in GPS-denied environments, IEEE Access 12 (2024) 113049–113071. doi:10.1109/ACCESS.2024.3440064.</p>
      <p>[15] Y. Xu, M. Dai, W. Cai, W. Yang, Precise GPS-denied UAV self-positioning via context-enhanced cross-view geo-localization, arXiv preprint arXiv:2502.11408 (2025).</p>
      <p>[16] C.-Y. Wang, A. Bochkovskiy, H.-Y. M. Liao, YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 7464–7475. doi:10.1109/CVPR52729.2023.00721.</p>
      <p>[17] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation forest, in: Proceedings of the IEEE International Conference on Data Mining (ICDM), 2008, pp. 413–422. doi:10.1109/ICDM.2008.17.</p>
      <p>[18] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Computer Vision – ECCV 2014, Springer, 2014, pp. 818–833. doi:10.1007/978-3-319-10590-1_53.</p>
      <p>[19] M. Schleiss, F. Rouatbi, D. Cremers, VPAIR – Aerial visual place recognition and localization in large-scale outdoor environments, 2022. doi:10.48550/arXiv.2205.11567.</p>
      <p>[20] S. Javaid, M. A. Khan, H. Fahim, B. He, N. Saeed, Explainable AI and monocular vision for enhanced UAV navigation in smart cities: prospects and challenges, Frontiers in Sustainable Cities 7 (2025) 1561404. doi:10.3389/frsc.2025.1561404.</p>
      <p>[21] K. Xu, W. Hu, J. Leskovec, S. Jegelka, How powerful are graph neural networks?, in: Proceedings of the International Conference on Learning Representations (ICLR), 2019, pp. 1–17. URL: https://openreview.net/forum?id=ryGs6iA5Km.</p>
      <p>[22] O. Barmak, I. Krak, E. Manziuk, Diversity as the basis for effective clustering-based classification, in: Proceedings of the 9th International Conference on Information Control Systems &amp; Technologies (ICST 2020), volume 2711, CEUR-WS.org, Aachen, 2020, pp. 53–67. URL: https://ceur-ws.org/Vol-2711/paper5.pdf.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Masone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          ,
          <article-title>A survey on deep visual place recognition</article-title>
          ,
          <source>IEEE Access 9</source>
          (
          <year>2021</year>
          )
          <fpage>19516</fpage>
          -
          <lpage>19547</lpage>
          . doi:10.1109/ACCESS.2021.3054937.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ayala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Portela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Buarque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. J. T.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <article-title>UAV control in autonomous objectgoal navigation: a systematic literature review</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>57</volume>
          (
          <year>2024</year>
          )
          125. doi:10.1007/s10462-024-10758-7.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Maurício</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Domingues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bernardino</surname>
          </string-name>
          ,
          <article-title>Comparing vision transformers and convolutional neural networks for image classification: a literature review</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          9. doi:10.3390/app13095521.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rundo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Militello</surname>
          </string-name>
          ,
          <article-title>Image biomarkers and explainable AI: handcrafted features versus deep learned features</article-title>
          ,
          <source>European Radiology Experimental</source>
          <volume>8</volume>
          (
          <year>2024</year>
          )
          130. doi:10.1186/s41747-024-00529-y.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Manziuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wojcik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. V.</given-names>
            <surname>Barmak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. V.</given-names>
            <surname>Krak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kulias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Drabovska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Puhach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sundetov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mussabekova</surname>
          </string-name>
          ,
          <article-title>Approach to creating an ensemble on a hierarchy of clusters using model decisions correlation</article-title>
          ,
          <source>Przegląd Elektrotechniczny</source>
          <volume>96</volume>
          (
          <year>2020</year>
          )
          <fpage>108</fpage>
          -
          <lpage>113</lpage>
          . doi:10.15199/48.2020.09.23.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Se</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Little</surname>
          </string-name>
          ,
          <article-title>Global localization using distinctive visual features</article-title>
          ,
          <source>in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>231</lpage>
          . doi:10.1109/IRDS.2002.1041393.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Karnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rifel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <article-title>Key-region-based UAV visual navigation</article-title>
          ,
          <source>International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLVIII-2</source>
          (
          <year>2024</year>
          )
          <fpage>173</fpage>
          -
          <lpage>179</lpage>
          . doi:10.5194/isprs-archives-XLVIII-2-2024-173-2024.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>