<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Meta‑Algorithm for Open‑Set Animal Re-ID: WildFusion, XGBoost, and Dual‑Backbone ArcFace</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Moscow Pedagogical State University (MPGU University)</institution>
          ,
          <addr-line>1/1 Malaya Pirogovskaya St., Moscow, 119435, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a three‑stage meta‑algorithm that addresses open‑set individual animal re-identification. The cascade first employs WildFusion to fuse calibrated global-local similarity scores, then feeds concatenated MegaDescriptor‑L and MIEW embeddings into an XGBoost classifier, and finally refines predictions with species‑specific Dual‑Backbone models fine‑tuned using an ArcFace angular‑margin loss. On the AnimalCLEF 2025 challenge, which includes loggerhead sea turtles, fire salamanders, and Eurasian lynxes and exhibits a pronounced long‑tail imbalance, the proposed method achieved a private score of 67.42% and a public score of 65.11%, ranking 2nd out of 172 teams. Ablation analysis shows cumulative improvements of +21 percentage points (pp) from WildFusion over a MegaDescriptor baseline, +2.4 pp from XGBoost, and +3 pp from the Dual‑backbone ArcFace stage. These results demonstrate that species‑aware stacking of heterogeneous cues (global descriptors, calibrated local matches, tabular neighbor context, and metric fine‑tuning) yields a robust and scalable solution for non‑invasive wildlife monitoring.</p>
      </abstract>
      <kwd-group>
        <kwd>animal re-identification</kwd>
        <kwd>wildfusion</kwd>
        <kwd>megadescriptor</kwd>
        <kwd>miew</kwd>
        <kwd>vision transformer</kwd>
        <kwd>efficientnet-v2</kwd>
        <kwd>arcface</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Individual animal re-identification (Animal Re-ID) is the task of recognizing specific individuals in
images. Accurate identity assignment is critical to ecology and wildlife conservation because it enables
monitoring of population size, migration routes, and behavioral patterns of rare species in situ [1].
In Human Re‑ID universal biometric cues such as faces or fingerprints are available, whereas these markers
are not directly applicable to animals. Instead, recognition relies on unique natural markings—spot
and stripe patterns, carapace mosaics, and similar traits—that vary markedly with viewpoint, pose, and
illumination [2]. The problem is compounded by a shortage of labeled data: collecting and annotating
photographs of individual animals is labor‑intensive, so Animal Re-ID datasets are several orders of
magnitude smaller than those used in Person Re‑ID [3].</p>
      <p>Traditional biological approaches—ringing, tagging, and DNA analysis—are reliable but invasive and
unsuitable for large‑scale monitoring [4]. Early computer‑vision algorithms addressed one species at a
time and relied on handcrafted features, which does not scale. With deep neural networks (first CNNs,
later Vision Transformers [5]) Human Re‑ID achieved a high level of accuracy, yet direct transfer to
animals proved ineffective: the class set is open, inter‑individual differences are subtle, and intra‑species
variability is high [6]. These factors motivated specialized methods for Animal Re-ID.</p>
      <p>The AnimalCLEF 2025 competition [7, 8] poses a multi‑species challenge: identifying loggerhead
sea turtles Caretta caretta (Greece), fire salamanders Salamandra salamandra (Czech Republic), and
Eurasian lynxes Lynx lynx (Czech Republic) [9]. For each input image the system must decide whether
the depicted animal belongs to a known individual in the training database (the database set, or "gallery")
or represents a new individual; if known, the correct ID must be returned. Consequently, the task
combines classical Re‑ID with an open‑set component. Performance is evaluated by BAKS (balanced
accuracy on known samples) and BAUS (balanced accuracy on unknown samples); the final score is
the geometric mean of these two metrics [7].</p>
      <p>The meta‑algorithm proposed in this study—combining WildFusion [1], XGBoost, and a
Dual‑backbone model with an ArcFace [10] head—ranked second among 172 teams, achieving a private score of
67.42% and a public score of 65.11% (team Webmaking). Subsequent sections describe the employed
methods in detail (Sec. 3) and present a step‑by‑step analysis of the contribution made by each component
to the final performance (Sec. 4).</p>
    </sec>
    <sec id="sec-2">
      <title>2. AnimalCLEF 2025 challenge characteristics</title>
      <sec id="sec-2-1">
        <title>2.1. Description and objectives</title>
        <p>The primary goal of AnimalCLEF 2025 is to advance automated biodiversity monitoring, in particular the
tracking of individual animals captured by camera traps and other imaging devices. Precise identification
of individuals is pivotal in ecology: it enables reliable estimates of population size, migration routes, and
behavioral profiles that underpin both scientific studies and conservation measures. Existing algorithms,
however, tend to overfit to background or illumination cues and lose accuracy when applied to novel
conditions. Consequently, the competition focuses on universal Re‑ID approaches that can generalize
across habitats and reliably recognize animals in a wide range of environments. Participants could either
rely solely on the limited competition data or improve their models by pre‑training on the large external
dataset WildlifeReID‑10k [11]. Overall, AnimalCLEF 2025 serves as a benchmark for state‑of‑the‑art
computer‑vision methods and continues the LifeCLEF series [8] that expands the role of AI in wildlife
monitoring.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data and evaluation metric</title>
        <p>For training and pre‑training, participants were provided with the large WildlifeReID‑10k dataset
containing roughly 140,000 images of more than 10,000 individual animals across many species [11].
This external resource can be regarded as an extended training set. The competition data were collected
specifically for AnimalCLEF 2025 and split into two parts: Gallery, which holds annotated images of
known individuals and simultaneously serves as the training set and the gallery for matching, and
Query (Tab. 1).</p>
        <p>LynxID2025. This subset comprises 2957 training images of Eurasian lynx and 946 query images
(3903 in total). The training split covers 77 unique individuals, with an average of 38 photographs
per individual; the distribution is unbalanced—some animals have only a single image, whereas one
individual appears in 353 shots. Image orientation is recorded for every picture (left, right, front, back,
or unknown). Capture dates are not provided (the date field is empty).</p>
        <p>SalamanderID2025. This subset contains 1388 training images of fire salamanders and 689 query
images (2077 in total). The training split includes 587 unique individuals; the average is ~2.4 images
per individual, the median is 1, and the maximum is 12. Orientation labels (top, bottom, left, right) are
available for all images. Capture dates span 2017–2023 in the training set and extend to 2024 in the
query set, enabling temporal analysis of the data collection period (for example, training pictures cover
2017–2023, while query images include shots made up to the end of 2024).</p>
        <p>SeaTurtleID2022. This is the largest subset: 8729 training photographs of loggerhead sea turtles and
500 query images (9229 in total). The training split represents 438 unique sea turtles (mean 19.9 images
per individual; median 13). Orientation labels include left, right, front, top, and composite directions
such as topleft or topright; orientation is missing for all 500 query images. Capture dates are present for
almost every photo, ranging from 2010 to 2024, reflecting the long‑term nature of data collection.</p>
        <p>The three subsets differ markedly. Lynx offers fewer individuals but more images per individual,
whereas Salamander provides many individuals yet mostly single‑image observations. Sea Turtle
occupies an intermediate position in terms of individual count, but its total image volume is the largest.
Such heterogeneity in size, orientation metadata, and temporal coverage underscores the need for
adaptive identification strategies tailored to each species.</p>
        <p>AnimalCLEF 2025 employs two metrics that jointly assess recognition quality on known and new
individuals. BAKS is the per‑class balanced accuracy over query images whose individuals are present
in the gallery. BAUS is the balanced accuracy over query images belonging to new individuals absent
from the gallery. The final ranking score is the geometric mean of BAKS and BAUS. The organizers split
the query set into an open (public) portion comprising about 31% of the images and a hidden (private)
portion comprising the remaining 69%. Only the private leaderboard determined the final standings,
preventing overfitting to the public subset.</p>
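        <p>For illustration, the scoring can be sketched in a few lines (a minimal Python sketch under stated assumptions: the array and function names are hypothetical, and this is not the organizers' evaluation code):</p>
        <preformat>
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall: every individual contributes equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    return float(np.mean([(y_pred[y_true == c] == c).mean() for c in classes]))

def final_score(y_true, y_pred, gallery_ids):
    """Geometric mean of BAKS (known individuals) and BAUS (new ones)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    known = np.isin(y_true, list(gallery_ids))
    baks = balanced_accuracy(y_true[known], y_pred[known])
    # For individuals absent from the gallery the correct answer is the
    # "new_individual" label; recall is averaged per hidden individual.
    unk_true, unk_pred = y_true[~known], y_pred[~known]
    baus = float(np.mean([(unk_pred[unk_true == i] == "new_individual").mean()
                          for i in np.unique(unk_true)]))
    return float(np.sqrt(baks * baus))
        </preformat>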
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Related and preceding competitions</title>
        <p>The task of individual animal identification had been explored before AnimalCLEF 2025. A notable
predecessor is the Happywhale—Whale &amp; Dolphin ID competition (Kaggle 2022), which required
distinguishing thousands of individuals from 24 marine‑mammal species using the mAP metric; the task
suffered from a strong class imbalance but lacked an open‑set component [12].</p>
        <p>Between 2022 and 2024 several species‑specific re‑ID datasets were released together with
mini‑competitions, including Leopard ID and Hyena ID from WildMe &amp; LILA Science [13, 14] and SeaTurtleID
[15]. SeaTurtleID first introduced time-aware closed- and open-set splits later adopted by AnimalCLEF.
An internal benchmark demonstrated 86.8% closed-set accuracy when using a Hybrid Task Cascade
equipped with an ArcFace encoder, highlighting the challenges of long-term individual tracking even
within a single species [15].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. State‑of‑the‑art approaches and models</title>
        <p>Modern Animal Re-ID methods rely on deep networks that extract image embeddings, i.e., compact
feature vectors unique to each individual. Two principal categories of such features exist: global
descriptors that summarize the entire image and local matches that align distinctive regions.</p>
        <p>A prominent global approach is MegaDescriptor [16]. This supervised model is trained on a
collection of many datasets (&gt;10k individuals, ~140k images). Its backbone is a Swin‑L Transformer
with 384 × 384 input and about 229 M parameters. Essentially a Vision Transformer tuned for Animal
Re-ID, it markedly outperforms generic models such as CLIP and DINOv2 [16].</p>
        <p>An alternative global encoder is MIEW (Multi‑species Individual Embeddings Wild, MiewID‑msv3).
This compact EfficientNet‑V2 [17] CNN (about 51 M parameters) is trained with a contrastive Sub‑center
ArcFace loss on a dedicated dataset of 64 species (225k photos, 37k individuals). Unlike MegaDescriptor,
which is trained per species, MIEW is optimized as a single multi‑species model. Experiments show that
this unified model surpasses species‑specific training by an average of 12.5% top‑1 and, more importantly,
generalizes better to unseen species: on unknown taxa MIEW outperforms MegaDescriptor by 19.2%
top‑1 accuracy [3], demonstrating its ability to capture universal cues useful for any animal.</p>
        <p>Global descriptors have limitations: they may miss fine‑grained individual patterns. To compensate,
local methods match image regions that carry unique markings. The modern WildFusion framework
combines global and local information efficiently [1]. It fuses (i) cosine similarity of global embeddings
(e.g., MegaDescriptor or DINOv2) and (ii) local keypoint correspondences obtained with matchers such
as LoFTR [18] or LightGlue [19]. After isotonic calibration, the two similarity sources are merged into a
single score. In a zero‑shot setting WildFusion exceeds the pretrained MegaDescriptor‑L, confirming
that hybrid cues can substantially improve Re‑ID performance [1].</p>
        <p>A common path to higher accuracy is ensemble learning. In this work several ways of combining
embeddings from MegaDescriptor and MIEW were explored. The best result was achieved by a
meta‑algorithm that blends predictions from (1) WildFusion (Sec. 3.3), (2a) XGBoost (Sec. 3.5), and (2b)
a Dual‑backbone network with an ArcFace head (Sec. 3.6). The reliable WildFusion, together with two
strong embedding streams, proved highly effective: the transformer‑based MegaDescriptor yields rich
global representations, whereas the CNN‑based MIEW provides features that remain robust on new
species (Sec. 4).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. MegaDescriptor‑L‑384 model</title>
        <p>MegaDescriptor‑L‑384 is a foundation model for Animal Re-ID introduced in [16] and released on
Hugging Face [20]. The backbone is swin_large_patch4_window12_384 with 384×384 px input and 228.8
M parameters; it outputs a 1536‑dimensional L2‑normalized embedding suitable for cosine comparison.</p>
        <p>The network was trained in a supervised manner with an ArcFace‑style margin loss on the aggregated
WildlifeDatasets corpus comprising 29 public datasets (~140k images, &gt;10k individuals, 23 species).
Merging such diverse sources exposes the model to wide variations in viewpoint, illumination, and
marking patterns, thereby improving embedding generality. The authors report that
MegaDescriptor‑L‑384 consistently outperforms CLIP (ViT‑L/336) and DINOv2 (ViT‑L/518) on all 29 benchmarks
[16].</p>
        <p>In practice, deployment requires only standard preprocessing: resize to 384×384, convert to a tensor,
and normalize to means (0.485, 0.456, 0.406) and standard deviations (0.229, 0.224, 0.225). A single
forward pass then produces the embedding [20]. The CC‑BY‑NC‑4.0 license permits non‑commercial
use and modification, making MegaDescriptor‑L‑384 a strong out‑of‑the‑box global descriptor within
the pipeline.</p>
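        <p>The deployment path can be sketched as follows (assuming the timm / Hugging Face hub integration referenced in the model card [20]; the file names are hypothetical):</p>
        <preformat>
import timm
import torch
from PIL import Image
from torchvision import transforms

# Pull the pretrained backbone from the HF hub (model name as in [20]).
model = timm.create_model("hf-hub:BVRA/MegaDescriptor-L-384",
                          pretrained=True, num_classes=0).eval()

preprocess = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    e = model(x)                                    # (1, 1536) global descriptor
    return torch.nn.functional.normalize(e, dim=1)  # L2-normalize for cosine search

# Cosine similarity is then a dot product of the normalized embeddings:
# sim = (embed("query.jpg") @ embed("gallery.jpg").T).item()
        </preformat>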
      </sec>
      <sec id="sec-3-3">
        <title>3.3. WildFusion similarity fusion method</title>
        <p>WildFusion [1] addresses a core limitation of purely global embeddings in Animal Re-ID—their
sensitivity to background and illumination. By combining global image similarity with precise local
keypoint verification, the method sharply reduces false matches between different individuals while
recovering true correspondences under strong viewpoint changes. A detailed analysis of its impact on
the final private score is presented in Sec. 4.</p>
        <p>The algorithm comprises two stages. First, a fast cosine search in the MegaDescriptor‑L‑384
embedding space (Sec. 3.2) selects the k = 300 most similar gallery images (candidates). Each "query /
candidate" pair is then evaluated by five independent local pipelines: LoFTR, SuperPoint [21], ALIKED
[22], DISK [23], and SIFT [24]. For LoFTR, images are converted to 192×192 px grayscale, whereas
the other pipelines operate on 512×512 px color inputs. The local scores and the normalized global
similarity undergo isotonic calibration and are subsequently fused linearly into a single probabilistic
score.</p>
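        <p>The calibrate-then-fuse idea can be illustrated with scikit-learn's IsotonicRegression (a simplified sketch, not the wildlife-tools API; the function names, calibration split, and fusion weights are assumptions):</p>
        <preformat>
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(raw_scores, same_identity):
    """Monotone map from a raw similarity to P(same individual), learned on
    labeled "query / candidate" pairs from a held-out calibration split."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(raw_scores, same_identity)   # same_identity: 0/1 labels
    return iso

def fused_score(calibrators, raw_cues, weights):
    """Calibrate each cue (global cosine + five local matchers) to a common
    [0, 1] scale, then merge them linearly into one probabilistic score."""
    probs = [c.predict(np.atleast_1d(s))[0] for c, s in zip(calibrators, raw_cues)]
    return float(np.average(probs, weights=weights))
        </preformat>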
        <p>An open‑source implementation is provided in the wildlife‑tools package [25] in the
wildlife‑datasets repository [16]. Released under the GPL‑3.0 license, the code requires no
additional training, making WildFusion easy to integrate into an existing pipeline.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. k-Reciprocal re-ranking strategy</title>
        <p>On top of WildFusion probabilities, k-reciprocal re-ranking [26] is applied. Three similarity matrices
are available: "query × gallery", "gallery × query", and "gallery × gallery". For each image a list of
the k₁ = 20 most similar neighbors is formed according to these matrices. A gallery image g retains only those
neighbors that simultaneously place g within their first k₂ = 6 positions; the resulting mutual set is
denoted R(g). An analogous procedure yields R(q) for every query, as the symmetric "gallery × query"
matrix enables reciprocity checks in the reverse direction.</p>
        <p>The final similarity between a query q and a gallery image g is calculated as
s*(q, g) = (1 − λ) · s_WF(q, g) + λ · s_J(q, g),
where s_WF is the WildFusion score, s_J is the Jaccard similarity of the mutual neighbor sets R(q) and R(g), and λ = 0.1 (Sec. 4, Step 3).</p>
        <p>This linear convex combination suppresses incidental matches caused by pose, masking, or background,
while requiring no additional model training.</p>
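        <p>A compact sketch of the mutual-neighbor filtering and the convex combination (array names are illustrative; [26] additionally applies local query expansion, omitted here):</p>
        <preformat>
import numpy as np

def mutual_sets(sim_ab, sim_ba, k1=20, k2=6):
    """R(a): top-k1 neighbors of each row a of sim_ab that also rank a
    among their own top-k2 according to sim_ba."""
    top_b = np.argsort(-sim_ab, axis=1)[:, :k1]
    top_a = np.argsort(-sim_ba, axis=1)[:, :k2]
    return [set(b for b in top_b[a] if a in top_a[b])
            for a in range(sim_ab.shape[0])]

def rerank(sim_qg, R_q, R_g, lam=0.1):
    """s*(q, g) = (1 - lam) * s_WF(q, g) + lam * Jaccard(R(q), R(g))."""
    out = np.empty_like(sim_qg)
    for q, Rq in enumerate(R_q):
        for g, Rg in enumerate(R_g):
            union = len(Rq.union(Rg))
            jac = len(Rq.intersection(Rg)) / union if union else 0.0
            out[q, g] = (1 - lam) * sim_qg[q, g] + lam * jac
    return out

# R_q comes from the "query x gallery" / "gallery x query" matrix pair;
# R_g from the "gallery x gallery" matrix used against itself.
        </preformat>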
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Gradient boosting on combined MegaDescriptor and MIEW embeddings</title>
        <p>Score‑CAM [27] heatmaps in Fig. 1 indicate that MegaDescriptor‑L‑384 focuses on compact texture
regions, whereas MIEW‑msv3 distributes attention across fine‑grained spots and extended contours.
The theoretical complementarity of these spatial patterns motivates a direct concatenation of the two
embeddings (ℝ^3688), with each component pre‑normalized by its L2 norm [3, 16].</p>
        <p>For every image, the feature vector includes (i) the concatenated global embedding, (ii) cosine
distances to the k = 10 nearest gallery images together with the corresponding neighbor identifiers passed as
categorical features¹, and (iii) one‑hot representations of view orientation and dataset membership (lynx,
salamander, sea turtle). The key idea is that gradient boosting can non‑linearly merge global descriptors,
local density information in the embedding space, and categorical data on the closest individuals. The
maximum depth was capped at 6, which prevents the model from memorizing category values via long
split chains.</p>
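        <p>A condensed sketch of this feature assembly (column and argument names are illustrative, not the exact competition code):</p>
        <preformat>
import pandas as pd

def build_features(emb, nn_sims, nn_ids, orientation, dataset):
    """One row per image: the 3688-D concatenated embedding, cosine
    similarities to the 10 nearest gallery images, those neighbors'
    identifiers as categoricals, and one-hot orientation/dataset flags."""
    df = pd.DataFrame(emb, columns=[f"e{i}" for i in range(emb.shape[1])])
    for j in range(10):
        df[f"nn_sim_{j + 1}"] = nn_sims[:, j]
        # Categorical dtype lets XGBoost learn subset splits over identities.
        df[f"nn_id_{j + 1}"] = pd.Categorical(nn_ids[:, j])
    return pd.concat([df,
                      pd.get_dummies(pd.Series(orientation), prefix="view"),
                      pd.get_dummies(pd.Series(dataset), prefix="ds")], axis=1)
        </preformat>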
        <p>Incorporating both global descriptors enhances feature diversity: the transformer‑based
MegaDescriptor captures coarse texture patterns, whereas the CNN backbone of MIEW remains sensitive to
point‑wise differences [3, 16]. XGBoost trained on this concatenation, augmented with local density
features and metadata, produces a consistent improvement in the private and public scores relative
to a single‑embedding baseline and to WildFusion alone (Sec. 4).</p>
        <p>¹ The columns nn_id_1…nn_id_10 are cast to pandas.Categorical and processed by XGBoost's enable_categorical option, which
learns optimal subset splits instead of numerical thresholds.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Step‑by‑step contribution of each pipeline component to the private score (%).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Step</th><th>Configuration</th><th>Private score</th><th>Δ</th></tr>
            </thead>
            <tbody>
              <tr><td>0</td><td>Baseline provided by the competition organizers: MegaDescriptor‑L, threshold 0.6 for all species</td><td>30.90</td><td>—</td></tr>
              <tr><td>1</td><td>Baseline: MegaDescriptor‑L cosine nearest neighbor with per‑species thresholds</td><td>40.59</td><td>+9.69</td></tr>
              <tr><td>2</td><td>WildFusion global + local similarity fusion with thresholds</td><td>61.72</td><td>+21.13</td></tr>
              <tr><td>3</td><td>k‑reciprocal re‑ranking (Lynx only) applied to WildFusion</td><td>62.09</td><td>+0.37</td></tr>
              <tr><td>4</td><td>XGBoost meta‑classifier on MegaDescriptor + MIEW embeddings; WildFusion confidence adjustment</td><td>64.44</td><td>+2.35</td></tr>
              <tr><td>5</td><td>Dataset‑specific meta‑algorithm that combines WildFusion, XGBoost and Dual‑backbone ArcFace</td><td>67.42</td><td>+2.98</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Dual‑backbone model with an ArcFace head</title>
        <p>The two previous ensemble components (WildFusion, Sec. 3.3, and XGBoost, Sec. 3.5) rely on fixed
embeddings obtained without species‑specific fine‑tuning. Although this delivers high baseline accuracy,
the capacity to adapt to species‑specific visual patterns remains limited. The Dual‑backbone model
addresses the opposite need: it refines features per species and thus complements the rigid matching
scheme of WildFusion and the tabular classifier XGBoost. Methodologically the model unites deep
metric optimization via ArcFace [10] with the direct feature focus provided by two heterogeneous
backbones.</p>
        <p>The first stream employs MegaDescriptor‑L‑384, reliable at capturing global textures; the second
employs MIEW‑msv3, sensitive to fine pointwise details. Both classification heads are removed, and
their outputs after individual BatchNorm layers are concatenated into a 3688‑dimensional vector.</p>
        <p>A compact ArcFace head is placed on top of the joint space. ArcFace maximises inter-class angular
margins in the embedding space, imposing a strict separability criterion. This margin‑based approach
is particularly effective under the small‑sample conditions characteristic of AnimalCLEF 2025 [10].</p>
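        <p>For concreteness, the additive angular-margin head of [10] can be written as follows (a minimal PyTorch sketch of the standard ArcFace formulation, not the exact training code; dimensions follow Sec. 3.6):</p>
        <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Additive angular margin [10]: scaled cos(theta + m) for the true class."""
    def __init__(self, in_dim=3688, n_classes=77, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_classes, in_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, features, labels):
        # Cosine of the angle between each embedding and each class center.
        cos = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        # Add the margin m only on the ground-truth class, then rescale by s.
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return self.s * logits       # feed to nn.CrossEntropyLoss

# Input: the 3688-D concatenation of the two BatchNorm-ed backbone outputs.
        </preformat>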
        <p>WildFusion depends on a calibrated “global + local” heuristic, and XGBoost on tabular aggregation
of fixed embeddings and metadata. Three independent Dual-backbone models, one trained for each
species, supply descriptors tailored to their respective AnimalCLEF 2025 subsets and thereby improve
the robustness of the ensemble.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <sec id="sec-4-1">
        <title>Step 1. Threshold selection for the global descriptor MegaDescriptor‑L‑384</title>
        <p>MegaDescriptor‑L‑384 (HF hub: BVRA/MegaDescriptor‑L‑384; batch 32, input 384×384) was evaluated with both shared and
per‑species thresholds (Sec. 3.2). Each gallery image (n = 13,074) and each query image (n = 2135) was encoded as a
1536‑dimensional vector. Cosine similarity was computed between every query vector and every gallery
vector; the most similar gallery image provided the candidate identity for the query. If the similarity
fell below the threshold, the label new_individual was assigned.</p>
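        <p>In code, the decision rule amounts to the following (a small sketch with illustrative names; the thresholds are the tuned per-species cosine values):</p>
        <preformat>
import numpy as np

def assign_labels(query_emb, gallery_emb, gallery_ids, species, thresholds):
    """Nearest-gallery match on L2-normalized embeddings; a similarity
    below the species threshold yields the open-set label."""
    sims = query_emb @ gallery_emb.T        # cosine similarity matrix
    best = sims.argmax(axis=1)
    labels = []
    for i, g in enumerate(best):
        t = thresholds[species[i]]          # e.g. {"lynx": 0.655, ...}
        labels.append(gallery_ids[g] if sims[i, g] >= t else "new_individual")
    return labels
        </preformat>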
        <p>A grid search over a single global threshold in the range 59.0–74.5% yielded the best result at 74.0%
(public 35.76%, private 37.32%). Separate thresholds were then explored for each taxon: Lynx 50–90%,
Salamander 60–90%, Sea turtle 74–90%. The optimal triplet (Lynx 65.5%, Salamander 77.0%, Sea turtle
74.5%) produced a public score of 37.92% and a private score of 40.59%, establishing the baseline for
subsequent steps.</p>
        <p>Preliminary analysis of Fig. 2 revealed that submissions cluster vertically: runs with similar
fractions of new_individual predictions tend to yield comparable public scores, and the highest‑scoring
points concentrate around species‑specific shares of 70% (lynx), 60% (salamander) and 60% (sea turtle).
Therefore, at all later steps (including the WildFusion, XGBoost confidence gate, and the final cascade)
thresholds were selected so as to preserve these empirically favorable ratios. This policy explains the
multiple vertical stripes visible in Fig. 2: each stripe marks a family of submissions that intentionally
maintain the same new_individual quota while refining other components of the pipeline.</p>
        <p>Possible future work includes replacing manual threshold search with probability‑calibration
techniques such as Platt scaling or temperature scaling, which learn a monotone mapping on the validation
split and may further stabilize the species‑specific new_individual ratios without exhaustive grid search.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Step 2. WildFusion: fusion of global and local features</title>
        <p>The open implementation of WildFusion [16, 25] is employed at this step (Sec. 3.3).</p>
        <p>First, MegaDescriptor‑L‑384 retrieves the k = 300 gallery candidates with the highest cosine
similarity; all 15,209 images (13,074 gallery + 2135 query) are encoded as 1536‑dimensional vectors. Each
"query / candidate" pair is then evaluated by five independent local matchers: SuperPoint–LightGlue,
ALIKED–LightGlue, DISK–LightGlue, SIFT–LightGlue (all at 512×512 px RGB) and LoFTR (192×192 px
grayscale). Figure 3 shows that the detectors yield a comparatively small overlap of correspondences;
this diversity underlies the gain obtained after isotonic calibration and score fusion. Calibrating the five
matchers on the validation split required 3 h on a single A100 GPU; the full query × gallery evaluation
took a further 27 h.</p>
        <p>After combining global and local signals, thresholds for the new_individual label were tuned separately
per species. The optimal values were Lynx 39.5%, Salamander 12.0%, and Sea turtle 16.0%.</p>
        <p>This configuration achieved a 58.98% public score and a 61.72% private score, yielding a +21 pp
improvement over the baseline in Tab. 2. Notably, the tuned WildFusion stage alone would already have
secured an 8th place finish on the final leaderboard, even before adding the later cascade components.</p>
        <p>The results confirm that merging global embeddings with complementary local keypoints is crucial for
the substantial performance gain observed on AnimalCLEF 2025.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Step 3. k-reciprocal re-ranking over WildFusion outputs</title>
        <p>A full "gallery × gallery" similarity matrix was computed for the Lynx subset and re‑ranking was applied according to the scheme of Z.
Zhong [26]. The method parameters were fixed to the first neighborhood radius k₁ = 20, the reciprocity
radius k₂ = 6, and the Jaccard weight λ = 0.1 in the linear combination with the original WildFusion
score (Sec. 3.4).</p>
        <p>Lynx was selected because its images share an artificially uniform black background, which increases
the risk of pose‑driven false matches; reciprocal filtering helps to attenuate this artifact. Building the
gallery × gallery matrix required an additional 38 GPU‑hours, so the procedure was not executed for
Salamander or Sea turtle.</p>
        <p>The gain, although modest, was positive: the public score rose from 58.98% to 59.09% (+0.11 pp), and
the private score from 61.72% to 62.09% (+0.37 pp). This confirms the value of mutual neighbor filtering,
yet the improvement did not justify the computational cost; all subsequent steps therefore relied on the
original WildFusion scores for Salamander and Sea turtle.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Step 4. Gradient‑boosted ensemble of MegaDescriptor and MIEW embeddings</title>
        <p>For every image a dense feature vector of 3718 dimensions was assembled: the 3688‑D concatenation of
MegaDescriptor‑L‑384 and MIEW‑msv3 embeddings, 10 cosine distances to the nearest gallery images, 10
categorical identifiers of those neighbors, and 10 one‑hot categories (7 orientation flags + 3 dataset flags).
A unified probability scale simplifies the tuning of the cascade.</p>
        <p>An XGBoost model was trained with max_depth = 6, η = 0.15, λ = 2.0, tree_method=gpu_hist and the
multi:softprob objective; the best iteration was reached at round 296. Validation followed a "single
image per individual" split (Sec. 3.5).</p>
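        <p>A configuration sketch of this classifier with the xgboost scikit-learn wrapper (the reading of 0.15 as η and 2.0 as the L2 weight λ, as well as the variable names, are assumptions):</p>
        <preformat>
import xgboost as xgb

def train_meta_classifier(X_train, y_train):
    """X_train: the 3718-column feature frame from Sec. 3.5."""
    clf = xgb.XGBClassifier(
        max_depth=6,
        learning_rate=0.15,        # eta
        reg_lambda=2.0,            # L2 regularization
        tree_method="gpu_hist",
        objective="multi:softprob",
        n_estimators=300,          # the best round was reached near 296
        enable_categorical=True,   # nn_id_* columns are pandas.Categorical
    )
    clf.fit(X_train, y_train)
    return clf                     # clf.predict_proba() supplies the confidence gate
        </preformat>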
        <p>During inference the posterior probabilities of XGBoost acted as a confidence gate on top of
WildFusion. Species‑specific thresholds were tuned empirically: for Salamander, the WildFusion label was
replaced when XGBoost confidence exceeded 20%; for Sea turtle, when it exceeded 95%. This cascaded
refinement raised the public score to 61.89% and the private score to 64.44%, adding +2.80 pp and +2.35
pp, respectively, over the pure (re‑ranked) WildFusion.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Step 5. Dual‑backbone model with an ArcFace head and its integration into the meta‑algorithm</title>
        <p>For each taxon an individual Dual‑backbone network was trained that combines
MegaDescriptor‑L‑384 with MIEW‑msv3. After separate BatchNorm layers the two vectors were concatenated into
a 3688‑dimensional feature, which was fed to a compact ArcFace head (Sec. 3.6). The head parameters
were fixed per dataset: (s, m) = (64, 0.5) for Lynx and Sea turtle, (30, 0.35) for Salamander.</p>
        <p>Augmentation pipelines were tailored to the visual specifics of each dataset. Lynx: background‑mask
removal followed by RandomResizedCrop with scale ≥ 0.9. Salamander: rotation according to the
orientation field and moderate cropping that preserves key anatomical regions. Sea turtle: moderate
cropping plus horizontal flip. All datasets additionally received ColorJitter and CoarseDropout (one
mask ≤ 10% of the image area).</p>
        <p>Data were split in a stratified fashion: 90% of images for training, 10% for validation. Class imbalance
over individuals was mitigated with a WeightedRandomSampler.</p>
        <p>Optimization employed SGD in three stages: (1) a two‑epoch initial training phase covering only the ArcFace
head and the uppermost 25% of layers at learning rate η = 10⁻²; (2) full backbone unfreezing with a
base step η₀ = 5×10⁻³ under a cosine‑annealing schedule; (3) a final fine‑tuning stage of the last two
epochs at η = 10⁻⁴.</p>
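        <p>The schedule can be sketched as follows (placeholder model and step-count names; the exact layer-freezing boundary is an assumption):</p>
        <preformat>
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

def make_stage_optimizers(model, total_steps):
    # Stage 1 (2 epochs): only the ArcFace head and the topmost ~25% of
    # backbone layers are left trainable; lr = 1e-2.
    opt1 = SGD((p for p in model.parameters() if p.requires_grad),
               lr=1e-2, momentum=0.9)
    # Stage 2: unfreeze the full backbones, cosine annealing from 5e-3.
    for p in model.parameters():
        p.requires_grad = True
    opt2 = SGD(model.parameters(), lr=5e-3, momentum=0.9)
    sched2 = CosineAnnealingLR(opt2, T_max=total_steps)
    # Stage 3 (last 2 epochs): constant fine-tuning step, lr = 1e-4.
    opt3 = SGD(model.parameters(), lr=1e-4, momentum=0.9)
    return opt1, opt2, sched2, opt3
        </preformat>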
        <p>After fine‑tuning, L2‑normalized gallery and query embeddings were indexed in a FAISS
IndexFlatIP [28, 29]; for each query the 50 nearest neighbors were retrieved, and confidence was
defined as (cos θ + 1)/2.</p>
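        <p>A minimal sketch of this retrieval step with the faiss library (array names assumed; both matrices must be float32 and row-wise L2-normalized):</p>
        <preformat>
import faiss
import numpy as np

def retrieve(gallery, query, k=50):
    """gallery, query: L2-normalized embedding matrices (n, 3688)."""
    index = faiss.IndexFlatIP(gallery.shape[1])  # inner product == cosine here
    index.add(np.ascontiguousarray(gallery, dtype=np.float32))
    cos, idx = index.search(np.ascontiguousarray(query, dtype=np.float32), k)
    return (cos + 1.0) / 2.0, idx                # confidence in [0, 1], neighbor rows
        </preformat>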
        <p>Descriptors from the three Dual‑backbone models complemented WildFusion and XGBoost inside
the final meta‑algorithm, increasing ensemble robustness on challenging and rare cases and yielding an
additional score gain (Tab. 2).</p>
      </sec>
      <sec id="sec-4-6">
        <title>Final meta‑algorithm and overall leaderboard performance</title>
        <p>The definitive submission followed a cascading scheme that invoked one to three models per species.</p>
        <p>Eurasian lynx. (1) WildFusion predictions after k-reciprocal re‑ranking (λ = 0.1); images with
confidence below 39.7% are assigned the label new_individual. (2) When XGBoost assigns a probability
≥ 99%, its class replaces the WildFusion label. (3) Dual‑backbone embeddings serve as the final filter:
similarity &lt; 64% converts the label to new_individual, whereas similarity &gt; 89.3% overwrites the class
with the Dual‑backbone prediction.</p>
        <p>Fire salamander. (1) WildFusion with a confidence threshold of 13.0%. (2) Dual‑backbone refines the
outcome: similarity &lt; 62% results in assigning the label new_individual, while similarity &gt; 80% accepts
the Dual‑backbone label.</p>
        <p>Loggerhead sea turtle. Dual‑backbone operates as the exclusive source: similarity &lt; 70.3% is
interpreted as new_individual; otherwise the identifier proposed by the model is retained.</p>
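        <p>As a compact illustration, the per-species decision logic reduces to a few lines (a sketch with illustrative argument names; the thresholds are the values quoted above):</p>
        <preformat>
def lynx_decision(wf_label, wf_conf, xgb_label, xgb_prob, db_label, db_sim):
    """Cascade for Eurasian lynx; Fire salamander and Loggerhead sea turtle
    follow the same pattern with their own stages and thresholds."""
    label = wf_label if wf_conf >= 0.397 else "new_individual"
    if xgb_prob >= 0.99:
        label = xgb_label             # high-confidence XGBoost override
    if db_sim > 0.893:
        label = db_label              # Dual-backbone overwrites the class
    elif db_sim &lt; 0.64:
        label = "new_individual"      # Dual-backbone rejects the match
    return label
        </preformat>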
        <p>This species‑specific combination of WildFusion, XGBoost, and Dual‑backbone merges models whose
errors exhibit low mutual correlation. The ensemble achieved a private score of 67.420% and a
public score of 65.114%, securing second place among 172 teams in AnimalCLEF 2025 (Tab. 2).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Practical significance and prospects for future work</title>
      <p>High‑precision Animal Re-ID methods open new opportunities in both scientific research and applied
domains. These techniques enable automated census work and substantially simplify field studies:
instead of capturing and tagging animals, researchers can deploy camera traps, drones, or underwater
cameras and then analyze the collected imagery algorithmically. Such approaches are already employed
to monitor endangered species: identifying snow leopards or whales, for example, allows estimating population
size, migration routes, and individual longevity. Accurate and scalable Animal Re‑ID thus constitutes a
key enabling factor of non‑invasive biodiversity monitoring.</p>
      <p>Future improvements are envisioned along four complementary directions.</p>
      <p>(i) Transductive graph‑based models. AnimalCLEF 2025 data exhibit a pronounced long‑tail
distribution in the number of images per individual. Under these circumstances, graph re‑ID strategies
such as GCN-based reranking may redistribute confidence from majority classes to minority classes
and raise recall in the tail of the distribution [26, 30]. Initial GCN experiments reduced the public score;
nevertheless, further exploration of neighborhood radii and regularization schemes remains promising.</p>
      <p>(ii) Tiling and localized matching. Dividing images into distortion‑free squares and performing
pairwise tile matching leads to a quadratic growth in computational complexity and GPU memory
consumption, yet can mitigate background influence and raise confidence for fine‑scale spotted patterns.</p>
      <p>(iii) Pseudo‑labeling and self‑training. Unlabeled images from GBIF (Global Biodiversity
Information Facility) or surveillance video streams can augment the training set. A confident set may be
formed with current models, followed by additional backbone fine‑tuning while strictly controlling
pseudo‑label accuracy.</p>
      <p>(iv) Automatic threshold tuning. Bayesian optimization or differentiable threshold tuning on the
validation score would eliminate manual adjustment of the 39.7%, 13%, and 70.3% thresholds and adapt
the meta‑algorithm to new species [31].</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The author is grateful to all individuals and organizations involved in the collection and annotation of
data that enable the training of models and the development of animal re‑identification tools.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the following generative AI tool was employed:
ChatGPT o3 (OpenAI, June 2025 model) — text translation of the paper from Russian to English,
grammar and spelling check, and improvement of the writing style.</p>
      <p>All AI-generated suggestions were reviewed and edited manually; the author assumes full
responsibility for the final content. No generative AI system was used to create original scientific ideas, analyze
data, or draw conclusions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Cermak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          , Wildfusion:
          <article-title>Individual animal identification with calibrated similarity fusion</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <article-title>Categorical keypoint positional embedding for robust animal re-identification</article-title>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2412.00818. arXiv:2412.00818.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Otarashvili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Holmberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Levenson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <article-title>Multispecies animal re-id using a large community-curated dataset</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2412.05602. arXiv:2412.05602.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. T.</given-names>
            <surname>Bolger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Morrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Farid</surname>
          </string-name>
          ,
          <article-title>A computer-assisted system for photographic mark-recapture analysis</article-title>
          ,
          <source>Methods in Ecology and Evolution</source>
          <volume>3</volume>
          (
          <year>2012</year>
          )
          <fpage>813</fpage>
          -
          <lpage>822</lpage>
          . URL: https://doi.org/10.1111/j.2041-210X.2012.00212.x. doi:10.1111/j.2041-210X.2012.00212.x.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16×16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>in: Proceedings of the 9th International Conference on Learning Representations (ICLR)</source>
          ,
          <source>ICLR</source>
          ,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=YicbFdNTTy.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <article-title>An individual identity-driven framework for animal reidentification</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2410.22927. arXiv:2410.22927.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kovář</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          , Overview of AnimalCLEF 2025:
          <article-title>Recognizing individual animals in images</article-title>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>