<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Fusion of Global and Local Descriptors with Feature Calibration for Multi-Species Animal Re-Identification: AnimalCLEF 2025</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dongyeon Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bohee Park</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanjun Bae</string-name>
          <email>hbae0830@mju.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sua Lee</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chaeyeon Lee</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Animal Re-ID, Multi-Species, Fusion, Global Descriptor, Local Matching, CEUR-WS</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Myongji University</institution>
          ,
          <addr-line>Seoul</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sogang University</institution>
          ,
          <addr-line>Seoul</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sookmyung Women's University</institution>
          ,
          <addr-line>Seoul</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents a multi-matcher fusion pipeline developed for the AnimalCLEF 2025 individual animal re-identification challenge. The task involves identifying distinct individuals within the same species, under constraints such as one-shot learning and open-set recognition for previously unknown individuals. To address these challenges, we integrate three complementary matchers: MegaDescriptor for global visual features, ALIKED for local keypoint-based matching, and EVA02 for semantic-level similarity. These components are fused using WildFusion-based score calibration and a simple weighted averaging scheme. Additionally, species-specific preprocessing, such as orientation normalization for salamanders and 5-crop Test-Time Augmentation, is applied to enhance robustness. Our final pipeline achieved a public score of 0.50708 and a private score of 0.53185, a 23.2 percentage point improvement over the baseline solution (public score 0.3002). According to the official leaderboard, our system ranks 44th out of 230 participating teams, placing in the top 19%. This outcome demonstrates the effectiveness of combining global, local, and semantic descriptors through calibrated fusion in a multi-species wildlife ReID context. The full implementation of our pipeline is available at https://github.com/dongyeon1031/AnimalCLEF2025.</p>
      </abstract>
      <kwd-group>
        <kwd>Animal Re-ID</kwd>
        <kwd>Multi-Species</kwd>
        <kwd>Fusion</kwd>
        <kwd>Global Descriptor</kwd>
        <kwd>Local Matching</kwd>
        <kwd>CEUR-WS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The global decline in biodiversity has intensified the demand for automated technologies capable of
supporting wildlife monitoring, including population tracking, migration analysis, and behavioral
studies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Individual-level animal re-identification plays a crucial role in enabling fine-grained
ecological analysis and supporting conservation strategies that go beyond species-level classification [
        <xref ref-type="bibr" rid="ref2 ref3">2,
3</xref>
        ]. However, most existing computer vision systems focus solely on species recognition and often fail
to discriminate between individuals within the same species.
      </p>
      <p>
        This study was conducted based on the individual identification task proposed in LifeCLEF 2025 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], specifically following the problem definition and data configuration of the AnimalCLEF track [5]. The AnimalCLEF 2025 competition [5] challenges participants to design robust systems for identifying individual animals across multiple species [
        <xref ref-type="bibr" rid="ref2">2, 6</xref>
        ]. The competition emphasizes open-set recognition, requiring models to generalize to unknown individuals captured under diverse conditions. To meet these demands, our approach prioritizes practical and modular design choices tailored for deployment in real-world scenarios.
      </p>
      <p>In contrast to species classification, which aims to distinguish between different animal types (e.g., lion vs. giraffe), the central task of individual re-identification involves differentiating among distinct individuals within the same species (e.g., Lion A vs. Lion B). This presents a more complex recognition problem, particularly under conditions of limited training data and varying pose, lighting, and background. The key evaluation criterion is the system’s ability to generalize beyond the training identities and accurately match unknown individuals based on visual cues alone.</p>
      <p>The AnimalCLEF 2025 competition [5] focuses on individual animal re-identification and introduces
several key technical challenges:
• One-shot learning: Many individuals in the reference set are represented by only one or two
images, requiring models to generalize with minimal supervision.
• Open-set recognition: The test set contains individuals not seen during training, necessitating
open-set recognition capabilities beyond standard closed-set classification.
• Pose and illumination variation: Images captured in unconstrained environments exhibit wide
variations in pose, lighting, and resolution, demanding robust feature extraction and invariant
representation learning.
• Data imbalance: The number of images per identity is highly imbalanced, which can introduce
training bias and reduce generalization performance.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Global Descriptor for Animal Re-Identification</title>
        <p>
          Global descriptors are widely used in animal re-identification to compute visual similarity by embedding
images into a feature space that captures overall appearance. MegaDescriptor [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is a Swin
Transformer [7]-based model trained on 29 wildlife datasets using metric learning with ArcFace loss [8]. It
serves as a foundation model for animal re-identification and has demonstrated superior performance
over other pretrained descriptors such as CLIP [9] and DINOv2 [10] across diverse species [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>To further enhance semantic representation, EVA02 [11], a vision transformer pretrained with CLIP-style supervision, is known to capture high-level semantics that are often overlooked by conventional CNN-based descriptors.</p>
        <p>WildFusion [12] performs calibrated fusion of similarity scores from heterogeneous descriptors, and has demonstrated effectiveness in improving robustness in open-set scenarios.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Local Matching-Based Complementary Methods</title>
        <p>While global descriptors capture holistic visual appearance, they often struggle in scenarios
involving occlusion, viewpoint variation, or partial visibility. To address these challenges, local matching
techniques have been proposed to provide fine-grained correspondence information.</p>
        <p>ALIKED [13] is a learning-based local feature extractor and matcher that balances matching accuracy with computational efficiency. It leverages patch-level correspondences to estimate image similarity and has shown strong performance in challenging visual re-identification tasks.</p>
        <p>LoFTR [14, 15] represents a dense matching approach that enables pixel-wise correspondence across
entire images without relying on keypoint detection. While LoFTR improves alignment robustness, its
high computational overhead often limits its practical use in large-scale or real-time systems.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Practical Strategies for Visual Re-identification</title>
        <p>Beyond descriptor design, several practical strategies have been explored to improve the robustness of
visual re-identification pipelines. For instance, geometric normalization techniques have been used to
standardize orientation in datasets with variable poses, and Test-Time Augmentation (TTA) has been
employed to improve generalization by averaging predictions over multiple image views [16].</p>
        <p>Furthermore, fusion strategies such as WildFusion [12] have demonstrated the effectiveness of combining global and local similarity scores via calibrated score-level fusion, enabling more flexible matching across representation levels.</p>
        <p>These prior studies on descriptor fusion, local-global integration, and augmentation strategies
collectively inspired the design of our pipeline, which integrates complementary modules to improve
robustness in open-set and fine-grained recognition settings.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset and Evaluation Metrics</title>
      <p>The AnimalCLEF 2025 competition [5] addresses the task of individual-level recognition across multiple
wildlife species. It evaluates models based on their ability to correctly match known individuals and to
generalize to previously unknown ones, under an open-set recognition setting.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset Overview</title>
        <p>
          The AnimalCLEF 2025 challenge [5] provides image datasets [
          <xref ref-type="bibr" rid="ref2">2, 6</xref>
          ] for three species: sea turtles
(SeaTurtleID2022), salamanders (SalamanderID2025), and lynxes (LynxID2025, derived from the CzechLynx dataset [17]). Each dataset is composed of a labeled database set, containing identity annotations for
known individuals (e.g., LynxID2025_lynx_17), and a query set, for which the model must either retrieve
the correct identity or predict new_individual if the target is not present in the database.
        </p>
        <p>Metadata for all images is provided in a unified metadata.csv file, which includes image paths,
individual identifiers, species labels, orientation information, and query/database status.</p>
        <p>Each species presents unique visual challenges that may hinder reliable individual identification:
• Sea turtles (SeaTurtleID2022): Images are often low in resolution and suffer from underwater distortions.
• Salamanders (SalamanderID2025): Images may include human hands or inconsistent orientation, leading to feature misalignment.
• Lynxes (LynxID2025) [17]: Images are captured by camera traps, some taken at night. These often include back-facing individuals or cases where facial features are not clearly visible.</p>
        <p>Beyond the provided challenge data, a large auxiliary dataset, WildlifeReID-10k, is available. It includes
images of 10,000 individuals from 36 animal species (marine mammals, birds, primates, livestock, etc.),
totaling around 140,000 images. This auxiliary set can be used for model pre-training.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation Metrics</title>
        <p>The competition evaluates model performance using two complementary metrics that account for both
known and unknown individuals:
• BAKS (Balanced Accuracy for Known Samples): Measures the balanced classification
accuracy for individuals in the reference set, adjusting for class imbalance across identities.
• BAUS (Balanced Accuracy for Unknown Samples): Measures the accuracy of detecting
unknown individuals by evaluating whether new_individual is correctly assigned to novel queries.
The final score is computed as the geometric mean of the two metrics, as shown in Equation (1).</p>
        <p>Final Score = √(BAKS × BAUS). (1)</p>
        <p>Equation 1 encourages a balanced treatment of both known identity classification and open-set
novelty detection.</p>
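        <p>For illustration, Equation (1) reduces to the following minimal Python sketch; this is not the official evaluation code, and the two balanced-accuracy inputs are assumed to be precomputed.</p>
        <preformat>
import math

def final_score(baks: float, baus: float) -> float:
    """Geometric mean of BAKS and BAUS, as in Equation (1)."""
    return math.sqrt(baks * baus)

# Example: balanced accuracy of 0.6 on known and 0.45 on unknown queries
print(final_score(0.6, 0.45))  # 0.5196...
        </preformat>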
        <p>[Figure 1: Overview of the proposed pipeline. Images are preprocessed (including orientation normalization and 5-crop TTA for salamander images), matched by MegaDescriptor, EVA02, and ALIKED, calibrated and fused via WildFusion, combined through simple weighted fusion into a final similarity, and classified by threshold-based identification.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>
        The pipeline begins with image preprocessing (resizing and normalization), followed by feature
extraction using three matchers: MegaDescriptor [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], EVA02 [11], and ALIKED [13]. Each matcher computes
similarity scores based on global, semantic, or local representations, which are then calibrated and fused
via the WildFusion module [12]. Final predictions are made through threshold-based classification on
the fused scores. The overall pipeline operates through these stages, as summarized in Figure 1.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing and Augmentation</title>
        <p>We applied several preprocessing steps to improve feature consistency and robustness.</p>
        <p>
          Image resizing and normalization ensure compatibility with the pretraining configuration of each model: MegaDescriptor [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] (384×384, ImageNet mean/std) and EVA02 [11] (336×336, CLIP normalization).
        </p>
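        <p>As a sketch of this step, the per-model input pipelines could be expressed with torchvision as follows; the normalization constants are the standard ImageNet and OpenCLIP values, assumed here to match each model's pretraining configuration.</p>
        <preformat>
from torchvision import transforms

# Sketch of the per-model input pipelines described above.
mega_tf = transforms.Compose([
    transforms.Resize((384, 384)),                      # MegaDescriptor input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet mean
                         std=[0.229, 0.224, 0.225]),    # ImageNet std
])

eva02_tf = transforms.Compose([
    transforms.Resize((336, 336)),                      # EVA02 ViT-L/14-336 input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],   # CLIP mean
                         std=[0.26862954, 0.26130258, 0.27577711]),  # CLIP std
])
        </preformat>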
        <p>Orientation Normalization for the Salamander Dataset addresses inconsistent poses in the SalamanderID2025 data. Images are rotated based on orientation metadata: right views are rotated by −90° and left views by +90° to standardize top-down alignment.</p>
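        <p>A minimal sketch of this rotation rule is shown below; the orientation values ("left"/"right") are assumptions about how the metadata.csv annotations are encoded.</p>
        <preformat>
from PIL import Image

def normalize_salamander_orientation(img: Image.Image, orientation: str) -> Image.Image:
    """Rotate side views to a common top-down alignment.

    The 'orientation' values are assumed for illustration; the real
    field comes from the provided metadata.csv.
    """
    if orientation == "right":
        return img.rotate(-90, expand=True)   # right view: rotate -90 degrees
    if orientation == "left":
        return img.rotate(90, expand=True)    # left view: rotate +90 degrees
    return img                                # other views left unchanged
        </preformat>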
        <p>5-Crop Test-Time Augmentation (TTA) [16] is used to improve robustness during inference. Five
crops include the center and four corners of the image, and their feature vectors are averaged, as shown
in Equation (2).</p>
        <p>v_final = (1/5) ∑_{i=1}^{5} v_i, where v_i is the feature vector of the i-th crop. (2)</p>
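        <p>Equation (2) corresponds to the following sketch, where embed stands in for any of the matchers' feature extractors and the crop size depends on the matcher's input resolution (both placeholders, not functions from our codebase).</p>
        <preformat>
import torch
from torchvision import transforms

def tta_embedding(img, embed, crop_size=336):
    """Average the embeddings of the five crops, as in Equation (2).

    `embed` is a placeholder for a feature extractor that maps a
    PIL image to a 1-D feature tensor.
    """
    crops = transforms.FiveCrop(crop_size)(img)        # four corners + center
    feats = torch.stack([embed(c) for c in crops], 0)  # shape (5, D)
    return feats.mean(dim=0)                           # v_final
        </preformat>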
        <p>Metadata Utilization supports both preprocessing and matcher calibration. It guides salamander
image orientation correction and helps distinguish query/database samples for calibration dataset
construction.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Matching Strategy and Score Fusion</title>
        <p>
          MegaDescriptor [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] extracts global embeddings using a Swin-Large backbone and computes cosine
similarity, as shown in Equation (3).
        </p>
        <p>sim(q, d) = (q ⋅ d) / (‖q‖ ‖d‖). (3)</p>
        <p>The similarity scores are normalized via isotonic regression to enable fusion with other matchers.</p>
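        <p>A minimal sketch of Equations (3) and (4) using scikit-learn's isotonic regression is given below; the toy calibration scores and labels are illustrative stand-ins for the balanced pairs described in Section 5.</p>
        <preformat>
import numpy as np
from sklearn.isotonic import IsotonicRegression

def cosine_similarity(q: np.ndarray, d: np.ndarray) -> float:
    """Equation (3): cosine similarity between query and database embeddings."""
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

# Equation (4): map raw similarities to probability-like outputs.
# Toy values standing in for the 1000 positive / 1000 negative pairs.
cal_scores = np.array([0.15, 0.30, 0.55, 0.70, 0.85])
cal_labels = np.array([0, 0, 1, 1, 1])                # 1 = same individual
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(cal_scores, cal_labels)
calibrated = calibrator.predict(np.array([0.6]))      # probability-like score
        </preformat>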
        <p>ALIKED [13] detects and matches keypoints to evaluate geometric consistency between image regions. It is particularly effective in challenging scenarios such as partial visibility, occlusion, and low illumination, where global descriptors often fail to provide reliable similarity estimates. Furthermore, it offers a favorable trade-off between matching accuracy and computational cost, making it suitable for large-scale deployment.</p>
        <p>EVA02 [11] uses a ViT-L/14-336 architecture to extract semantic-level embeddings. Cosine similarity is computed and calibrated for integration. This model is effective in species with low inter-individual variability.</p>
        <p>WildFusion [12] calibrates raw similarity scores into probability-like outputs using isotonic
regression, as shown in Equation (4).</p>
        <p>Cal(s) = P(y = 1 ∣ s), where s is a raw similarity score and y = 1 indicates a same-individual pair. (4)</p>
        <p>Final scores are first obtained through weighted fusion of the global and local descriptors, as described in Equation (5).</p>
        <p>SimWildFusion = Fusion(Mega, ALIKED). (5)</p>
        <p>Then, the final similarity score is computed by combining the WildFusion and EVA02 scores, as shown in Equation (6). We initially set α = 0.5 and subsequently conducted manual hyperparameter tuning to explore the effect of different α values.</p>
        <p>FinalScore = α ⋅ SimWildFusion + (1 − α) ⋅ SimEVA02, α = 0.5. (6)</p>
        <p>The final decision logic is based on a threshold-based rule, as shown in Equation (7).</p>
        <p>ŷ = Top-1 ID if FinalScore ≥ τ, and new_individual otherwise. (7)</p>
        <p>As seen in Equation (7), a threshold τ determines whether the prediction is assigned to an existing individual or labeled as unknown. We use an empirically tuned τ = 0.35 with species-specific adjustments: −0.02 for sea turtles and +0.02 for salamanders.</p>
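        <p>The fusion and decision rules of Equations (6) and (7) reduce to a few lines; this sketch assumes the WildFusion and EVA02 scores are already calibrated, with species keys mirroring the dataset names from Section 3.</p>
        <preformat>
def fuse_scores(sim_wildfusion: float, sim_eva02: float, alpha: float = 0.5) -> float:
    """Equation (6): weighted average of calibrated WildFusion and EVA02 scores."""
    return alpha * sim_wildfusion + (1.0 - alpha) * sim_eva02

# Base threshold tau = 0.35 with the species-specific adjustments above.
TAU = 0.35
SPECIES_OFFSET = {"SeaTurtleID2022": -0.02, "SalamanderID2025": 0.02, "LynxID2025": 0.0}

def decide(final_score: float, top1_identity: str, species: str) -> str:
    """Equation (7): accept the top-1 identity or predict a new individual."""
    tau = TAU + SPECIES_OFFSET.get(species, 0.0)
    return top1_identity if final_score >= tau else "new_individual"
        </preformat>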
        <p>We also experimented with LoFTR [14, 15]. While it showed robustness in aligning severely deformed instances, it was excluded from the final pipeline due to its high computational cost and limited performance gains relative to ALIKED [13] in our validation experiments.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <p>All experiments were conducted on a desktop PC equipped with an NVIDIA RTX 4080 Ti GPU running
Windows 11. The implementation was based on Python 3.12.7 and PyTorch 2.5.1, with CUDA 11.8 for
GPU acceleration.</p>
      <p>
        The EVA02 matcher was implemented using the OpenCLIP library (v2.32.0) and employed the
pretrained merged2b_s6b_b61k checkpoint for the ViT-L/14-336 model [18]. MegaDescriptor [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
ALIKED [13] were implemented using the wildlife-tools library (v1.0.1), which provides standardized
wrappers for global and local animal re-identification matchers.
      </p>
      <p>To ensure reproducibility, we fixed the global random seed to 42 across Python random, NumPy, and PyTorch (including CUDA). No additional random splitting was performed, as the competition provided fixed database and query splits, which we followed directly. For score calibration, we additionally selected the first 1000 samples from both the query and database sets to form calibration subsets. These were used exclusively for training the isotonic regression model and were not involved in evaluation or matching.</p>
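      <p>The seed-fixing described above amounts to the following sketch.</p>
      <preformat>
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix all random sources used in the pipeline for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
      </preformat>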
      <p>Score calibration was performed using isotonic regression, trained on a balanced set of 1000 positive and 1000 negative image pairs generated from the calibration subset. Positive pairs consisted of images from the same individual, while negative pairs were drawn from different individuals within the same species. These matching pairs were automatically constructed during the score calibration procedure, and the internal pairing criteria were handled by the calibration module rather than explicitly defined in our implementation. Species balance was maintained during this sampling process.</p>
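      <p>Although the actual pairing is handled internally by the calibration module, a balanced sampling procedure of this kind could look as follows; the metadata column names ('image_id', 'identity', 'dataset') are assumptions for illustration.</p>
      <preformat>
import random
import pandas as pd

def sample_calibration_pairs(df: pd.DataFrame, n_pos: int = 1000,
                             n_neg: int = 1000, seed: int = 42):
    """Balanced same-/different-individual pairs, drawn within species.

    Assumes metadata columns 'image_id', 'identity', and 'dataset'
    (species); the real pairing logic lives in the calibration module.
    """
    rng = random.Random(seed)
    by_identity = df.groupby("identity")["image_id"].apply(list).to_dict()
    by_species = df.groupby("dataset")["image_id"].apply(list).to_dict()
    identity_of = dict(zip(df["image_id"], df["identity"]))

    multi = [i for i, imgs in by_identity.items() if len(imgs) >= 2]
    pos = []
    for _ in range(n_pos):                               # positive pairs (label 1)
        a, b = rng.sample(by_identity[rng.choice(multi)], 2)
        pos.append((a, b, 1))

    neg = []
    while n_neg > len(neg):                              # negative pairs (label 0)
        a, b = rng.sample(by_species[rng.choice(list(by_species))], 2)
        if identity_of[a] != identity_of[b]:             # keep same-species, different-ID pairs
            neg.append((a, b, 0))
    return pos + neg
      </preformat>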
      <p>The final similarity score was computed using late fusion of the calibrated scores from WildFusion [12] and cosine similarity scores from EVA02 [11]. Fusion was implemented manually outside the core pipeline. We tested multiple candidates for the fusion weight α ∈ {0.3, 0.4, 0.5, 0.6} and selected α = 0.5 based on identification accuracy on the calibration set. A fixed decision threshold of 0.35 was used for determining identity matches in the final prediction step.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>To evaluate the effectiveness of our final re-identification pipeline, we compared it with the baseline solution, as summarized in Table 1.</p>
      <p>Table 1: Leaderboard scores of the compared configurations.
Configuration | Private score | Public score
Baseline (MegaDescriptor only) | 0.30898 | 0.3002
WildFusion (MegaDescriptor + ALIKED) | 0.44362 | 0.36555
Ours (Final) | 0.53185 | 0.50708</p>
      <p>
The Baseline solution refers to the official starter notebook released for the AnimalCLEF 2025 competition [5]. It extracts global features using the MegaDescriptor [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and computes cosine similarity
between feature vectors. A fixed threshold of 0.6 is used to determine whether the match corresponds
to a known or unknown individual. No local matcher or augmentation was applied [16]. This setup
achieved a private score of 0.30898 and a public score of 0.3002, highlighting the limitations of relying
solely on global embeddings for individual animal identification, especially under challenging conditions
like pose changes, occlusions, and background clutter.
      </p>
      <p>
The WildFusion configuration builds upon the baseline by incorporating the ALIKED [13] local
matcher to enhance spatial alignment. Similarity scores from MegaDescriptor [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and ALIKED [13]
are independently computed and calibrated using isotonic regression, and subsequently fused via WildFusion [12]. While semantic-level descriptors such as EVA02 [11] and Test-Time Augmentation [16]
were not applied, the integration of local keypoint information significantly improved matching accuracy.
This configuration achieved a private score of 0.44362 and a public score of 0.36555, demonstrating
the benefits of coarse-to-fine fusion using global and local features, particularly in scenarios with pose
variation and partial visibility.
      </p>
      <p>In contrast, the Ours (Final) configuration incorporates all proposed enhancements (ALIKED [13], EVA02 [11], and 5-Crop Test-Time Augmentation [16]) to improve re-identification performance. The EVA02 matcher [11] introduces semantic-level information, and its integration via weighted averaging with WildFusion [12] outputs led to notable improvements, as shown in Equation (6). In our experiments, we set α = 0.5 and subsequently conducted manual hyperparameter tuning to explore the impact of different α values. Our final configuration achieved a private score of 0.53185 and a public score of 0.50708, ranking 44th out of 230 teams (top 19%), a 23.2 percentage point improvement over the baseline.</p>
      <p>We applied Test-Time Augmentation [16] to enhance robustness against variations in pose and viewpoint. Five crops (center and four corners) were used to generate embeddings, which were then averaged. This method helped mitigate spatial misalignment and local occlusions, leading to improved recognition accuracy compared to single-view inference.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <sec id="sec-7-1">
        <title>7.1. Failure Cases and Analysis</title>
        <p>Several experimental strategies were tested to improve overall performance. However, some approaches resulted in suboptimal outcomes or failed to deliver the expected improvements. A summary of these strategies is presented in Table 2.</p>
        <p>Table 2: Strategies that did not make it into the final pipeline.
Strategy | Private score | Public score
Rerank Cascade | 0.45588 | 0.42590
LoFTR Matcher | 0.44131 | 0.41803
Segmentation Preprocessing | 0.49716 | 0.44136
DINOv2 Matcher Addition | 0.51545 | 0.49805
Late Fusion | 0.51124 | 0.49948
Fusion MLP | 0.52501 | 0.49906</p>
        <p>
          • Rerank Cascade (Private: 0.45588, Public: 0.42590): This method applied ALIKED matching
to the top-K candidates generated by MegaDescriptor [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and EVA02 [11] fusion. Although
intended to refine the ranking and improve runtime efficiency, this approach resulted in a slight
performance decline. One possible explanation is that the initial top-K candidates were already
accurate in many cases, and the additional re-ranking introduced noise that disrupted correct
top-1 predictions. This suggests that the benefits of re-ranking may be limited when the first-stage
retrieval is already well-calibrated.
• LoFTR Matcher [14] (Private: 0.44131, Public: 0.41803): Replacing ALIKED [13] with the dense
matcher LoFTR [14] resulted in degraded accuracy and slower inference. A plausible reason
is that LoFTR’s dense pixel-level correspondence may have captured irrelevant or background
features, thereby weakening the distinctiveness of local matches. Furthermore, its computational
overhead proved disadvantageous in a multi-matcher setting, making the method less practical
under our pipeline constraints.
• Segmentation Preprocessing (Private: 0.49716, Public: 0.44136): We applied the Segment
Anything Model (SAM) [19] to extract foreground-only regions for salamander and sea turtle
images (see Figure 2). However, this approach did not improve performance and in some cases
reduced it. This may be attributed to inconsistent mask quality and the unintended removal of
contextual background cues that could aid in individual discrimination.
• DINOv2 Matcher [10] Addition (Private: 0.51545, Public: 0.49805): We added DINOv2 [10]
as an additional matcher to the MegaDescriptor [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] + ALIKED [13] pipeline. Although it yielded
slightly higher scores than the baseline, the gains were not substantial. This suggests that simply
increasing the number of matchers does not necessarily translate to consistent performance
improvement.
• Late Fusion [20] (Private: 0.51124, Public: 0.49948): This approach simply averaged the similarity
scores from MegaDescriptor [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and EVA02 [11] without calibration or weighting. The resulting
fusion underperformed relative to our WildFusion [12] + EVA02 [11] combination. A likely reason
is that the lack of score normalization prevented the model from effectively leveraging
the complementarity between the two matchers.
• Fusion MLP [21] (Private: 0.52501, Public: 0.49906): A shallow MLP was introduced to learn
a non-linear combination of scores from MegaDescriptor [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and EVA02 [11]. Due to time
constraints, the MLP was submitted without training, using random weights. Consequently, the untrained model underperformed. With proper training on positive and negative match pairs, this learning-based fusion approach remains a promising direction for future work (a minimal sketch follows this list).
        </p>
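        <p>For reference, a shallow fusion MLP of the kind described in the last item could be sketched as follows; the hidden width and activation are assumptions, since the submitted model was untrained. With proper training, it would be fit with a binary cross-entropy loss on same-/different-individual pairs.</p>
        <preformat>
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Shallow MLP mapping two matcher scores to a fused score.

    The hidden width and activation are assumptions; the paper's MLP
    was submitted untrained, so no learned weights are implied here.
    """
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),           # probability-like fused score
        )

    def forward(self, mega_score: torch.Tensor, eva02_score: torch.Tensor) -> torch.Tensor:
        scores = torch.stack([mega_score, eva02_score], dim=-1)  # (..., 2)
        return self.net(scores).squeeze(-1)
        </preformat>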
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Limitations and Computational Considerations</title>
        <p>Although the proposed pipeline demonstrated strong performance in the AnimalCLEF 2025 challenge [5],
several practical limitations remain.</p>
        <p>
          First, the use of multiple independently operating matchers (MegaDescriptor [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], ALIKED [13],
and EVA02 [11]) and 5-crop Test-Time Augmentation [16] increases inference cost. While this design
contributes to robustness, it introduces computational overhead that may limit deployment in
resource-constrained settings.
        </p>
        <p>Second, the current system relies on manually tuned species-specific thresholds for decision-making.
This reduces its adaptability to new domains or unseen categories. Incorporating more adaptive
calibration techniques could help improve generalization in open-set scenarios.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>
In the context of the AnimalCLEF 2025 individual re-identification task [5], we developed a modular
Global–Local–Semantic fusion pipeline. The core structure integrates global descriptors
(MegaDescriptor [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), local keypoint matchers (ALIKED [13]), and semantic-level representations (EVA02 [11]),
further enhanced through 5-crop Test-Time Augmentation [16].
      </p>
      <p>This combination enabled the system to maintain robust performance under challenging conditions,
including variation in pose, lighting, resolution, and background. The final model achieved a private
score of 0.53185 and a public score of 0.50708, indicating consistent and improved performance through
matcher-level fusion and spatial augmentation.</p>
      <p>Future extensions of this research may include several directions aimed at improving generalization,
efficiency, and real-world applicability.</p>
      <p>• Adaptive Open-set Handling: The open-set nature of real-world re-identification demands further investigation into adaptive thresholding and novelty detection. Techniques such as confidence calibration or meta-recognition may offer more principled approaches than static thresholds.
• Computational Efficiency [22]: The current pipeline relies on multiple matchers and augmentation techniques, which increases computational cost. Optimizing for scalability via model pruning, knowledge distillation, or dynamic inference could enable deployment in large-scale or latency-sensitive settings.
• Edge Deployment [23]: Real-world applications often require mobile or embedded execution. Exploring quantization, lightweight model architectures, and platform-specific optimizations (e.g., TensorRT or mobile GPU support) would enhance portability for field use cases such as conservation monitoring or veterinary support.
• Multimodal Integration [24]: Incorporating auxiliary data such as timestamps, GPS locations, or environmental metadata could improve accuracy, particularly for visually ambiguous or low-variance species.
• Cross-domain Transferability [25]: The architecture’s ability to discriminate fine-grained visual differences may be extended to other domains. For example, the framework could support assistive technologies like medication recognition for visually impaired users, where distinguishing similar visual instances is safety-critical.
• Learning-based Fusion via MLP: Due to time constraints, we were unable to train the fusion MLP and had to apply it with randomly initialized weights, which resulted in suboptimal performance. In future work, we plan to train the MLP to enable more effective model integration.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work was carried out as an independent undergraduate team project, without external funding
or institutional affiliation. The authors would like to thank all team members for their dedicated
collaboration and contribution throughout the AnimalCLEF 2025 challenge.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4o and Grammarly in order to assist with
grammar, spelling, and clarity improvement. After using these tools, the author(s) reviewed and edited
the content as needed and take full responsibility for the publication’s content.
prediction and identification, and individual animal identification, in: International Conference of
the Cross-Language Evaluation Forum for European Languages (CLEF), Springer, 2025.
[5] L. Adam, K. Papafitsoros, R. Kovář, V. Čermák, L. Picek, Overview of AnimalCLEF 2025:
Recognizing individual animals in images, Working Notes of CLEF 2025 - Conference and Labs of the
Evaluation Forum (2025).
[6] L. Adam, V. Čermák, K. Papafitsoros, L. Picek, Wildlifereid-10k: Wildlife re-identification dataset
with 10k individual animals, arXiv preprint arXiv:2406.09211 (2024).
[7] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision
transformer using shifted windows, in: Proceedings of the IEEE/CVF international conference on
computer vision, 2021, pp. 10012–10022.
[8] J. Deng, J. Guo, N. Xue, S. Zafeiriou, Arcface: Additive angular margin loss for deep face recognition,
in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2019, pp. 4690–4699.
[9] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, et al., Learning transferable visual models from natural language supervision, in:
Proceedings of the International Conference on Machine Learning (ICML), PMLR, 2021, pp. 8748–8763.
[10] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza,
F. Massa, A. El-Nouby, et al., Dinov2: Learning robust visual features without supervision, arXiv
preprint arXiv:2304.07193 (2023).
[11] Y. Fang, Q. Sun, X. Wang, T. Huang, X. Wang, Y. Cao, Eva-02: A visual representation for neon
genesis, Image and Vision Computing 149 (2024) 105171.
[12] V. Cermak, L. Picek, L. Adam, L. Neumann, J. Matas, Wildfusion: Individual animal identification
with calibrated similarity fusion, in: European Conference on Computer Vision, Springer, 2025,
pp. 18–36.
[13] X. Zhao, X. Wu, W. Chen, P. C. Chen, Q. Xu, Z. Li, Aliked: A lighter keypoint and descriptor
extraction network via deformable transformation, IEEE Transactions on Instrumentation and
Measurement 72 (2023) 1–16.
[14] J. Sun, Z. Shen, Y. Wang, H. Bao, X. Zhou, Loftr: Detector-free local feature matching with
transformers, in: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, 2021, pp. 8922–8931.
[15] Y. Wang, X. He, S. Peng, D. Tan, X. Zhou, Eficient loftr: Semi-dense local feature matching with
sparse-like speed, in: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, 2024, pp. 21666–21675.
[16] D. Shanmugam, D. Blalock, G. Balakrishnan, J. Guttag, When and why test-time augmentation
works, arXiv preprint arXiv:2011.11156 1 (2020) 4.
[17] L. Picek, E. Belotti, M. Bojda, L. Bufka, V. Cermak, M. Dula, R. Dvorak, L. Hrdy, M. Jirik, V. Kocourek,
et al., Czechlynx: A dataset for individual identification and pose estimation of the eurasian lynx,
arXiv preprint arXiv:2506.04931 (2025).
[18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image
recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[19] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg,
W.-Y. Lo, et al., Segment anything, in: Proceedings of the IEEE/CVF international conference on
computer vision, 2023, pp. 4015–4026.
[20] J. Kittler, M. Hatef, R. P. Duin, J. Matas, On combining classifiers, IEEE transactions on pattern
analysis and machine intelligence 20 (1998) 226–239.
[21] N. Bodla, J. Zheng, H. Xu, J.-C. Chen, C. Castillo, R. Chellappa, Deep heterogeneous feature fusion
for template-based face recognition, in: 2017 IEEE winter conference on applications of computer
vision (WACV), IEEE, 2017, pp. 586–595.
[22] S. Han, H. Mao, W. J. Dally, Deep compression: Compressing deep neural networks with pruning,
trained quantization and hufman coding, arXiv preprint arXiv:1510.00149 (2015).
[23] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko, Quantization
and training of neural networks for eficient integer-arithmetic-only inference, in: Proceedings of
the IEEE conference on computer vision and pattern recognition, 2018, pp. 2704–2713.
[24] T. Baltrušaitis, C. Ahuja, L.-P. Morency, Multimodal machine learning: A survey and taxonomy,</p>
      <p>IEEE transactions on pattern analysis and machine intelligence 41 (2018) 423–443.
[25] K. Weiss, T. M. Khoshgoftaar, D. Wang, A survey of transfer learning, Journal of Big data 3 (2016)
1–40.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chalmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longmore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wich</surname>
          </string-name>
          ,
          <article-title>Harnessing artificial intelligence for wildlife conservation</article-title>
          ,
          <source>Conservation</source>
          <volume>4</volume>
          (
          <year>2024</year>
          )
          <fpage>685</fpage>
          -
          <lpage>702</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <string-name>
            <surname>Wildlifedatasets:</surname>
          </string-name>
          <article-title>An open-source toolkit for animal re-identification</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>5953</fpage>
          -
          <lpage>5963</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Ravoor</surname>
          </string-name>
          , T. Sudarshan,
          <article-title>Deep learning methods for multi-species animal re-identification and tracking-a survey</article-title>
          ,
          <source>Computer Science Review</source>
          <volume>38</volume>
          (
          <year>2020</year>
          )
          <fpage>100289</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janoušková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Martellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of lifeclef 2025:
          <article-title>Challenges on species presence prediction and identification, and individual animal identification</article-title>
          , in: International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF), Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] L. Adam, K. Papafitsoros, R. Kovář, V. Čermák, L. Picek, Overview of AnimalCLEF 2025: Recognizing individual animals in images, Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum (2025).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] L. Adam, V. Čermák, K. Papafitsoros, L. Picek, Wildlifereid-10k: Wildlife re-identification dataset with 10k individual animals, arXiv preprint arXiv:2406.09211 (2024).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Deng, J. Guo, N. Xue, S. Zafeiriou, Arcface: Additive angular margin loss for deep face recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4690–4699.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: Proceedings of the International Conference on Machine Learning (ICML), PMLR, 2021, pp. 8748–8763.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., Dinov2: Learning robust visual features without supervision, arXiv preprint arXiv:2304.07193 (2023).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Y. Fang, Q. Sun, X. Wang, T. Huang, X. Wang, Y. Cao, Eva-02: A visual representation for neon genesis, Image and Vision Computing 149 (2024) 105171.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] V. Cermak, L. Picek, L. Adam, L. Neumann, J. Matas, Wildfusion: Individual animal identification with calibrated similarity fusion, in: European Conference on Computer Vision, Springer, 2025, pp. 18–36.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] X. Zhao, X. Wu, W. Chen, P. C. Chen, Q. Xu, Z. Li, Aliked: A lighter keypoint and descriptor extraction network via deformable transformation, IEEE Transactions on Instrumentation and Measurement 72 (2023) 1–16.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. Sun, Z. Shen, Y. Wang, H. Bao, X. Zhou, Loftr: Detector-free local feature matching with transformers, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8922–8931.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Y. Wang, X. He, S. Peng, D. Tan, X. Zhou, Efficient loftr: Semi-dense local feature matching with sparse-like speed, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 21666–21675.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] D. Shanmugam, D. Blalock, G. Balakrishnan, J. Guttag, When and why test-time augmentation works, arXiv preprint arXiv:2011.11156 (2020).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] L. Picek, E. Belotti, M. Bojda, L. Bufka, V. Cermak, M. Dula, R. Dvorak, L. Hrdy, M. Jirik, V. Kocourek, et al., Czechlynx: A dataset for individual identification and pose estimation of the eurasian lynx, arXiv preprint arXiv:2506.04931 (2025).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., Segment anything, in: Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] J. Kittler, M. Hatef, R. P. Duin, J. Matas, On combining classifiers, IEEE transactions on pattern analysis and machine intelligence 20 (1998) 226–239.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] N. Bodla, J. Zheng, H. Xu, J.-C. Chen, C. Castillo, R. Chellappa, Deep heterogeneous feature fusion for template-based face recognition, in: 2017 IEEE winter conference on applications of computer vision (WACV), IEEE, 2017, pp. 586–595.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] S. Han, H. Mao, W. J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, arXiv preprint arXiv:1510.00149 (2015).</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko, Quantization and training of neural networks for efficient integer-arithmetic-only inference, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2704–2713.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] T. Baltrušaitis, C. Ahuja, L.-P. Morency, Multimodal machine learning: A survey and taxonomy, IEEE transactions on pattern analysis and machine intelligence 41 (2018) 423–443.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] K. Weiss, T. M. Khoshgoftaar, D. Wang, A survey of transfer learning, Journal of Big data 3 (2016) 1–40.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>