<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>DS@GT AnimalCLEF: Triplet Learning over ViT Manifolds with Nearest Neighbor Classification for Animal Re-identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anthony Miyaguchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chandrasekaran Maruthaiyannan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charles R. Clark</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>North Ave NW, Atlanta, GA 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Florida</institution>
          ,
          <addr-line>Gainesville, FL 32610</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper details the DS@GT team's entry for the AnimalCLEF 2025 re-identification challenge. Our key finding is that the effectiveness of post-hoc metric learning is highly contingent on the initial quality and domain-specificity of the backbone embeddings. We compare a general-purpose model (DINOv2) with a domain-specific model (MegaDescriptor) as a backbone. A K-Nearest Neighbor classifier with robust thresholding then identifies known individuals or flags new ones. While a triplet-learning projection head improved the performance of the specialized MegaDescriptor model by 0.13 points, it yielded minimal gains (0.03) for the general-purpose DINOv2 on averaged BAKS and BAUS. We demonstrate that the general-purpose manifold is more difficult to reshape for fine-grained tasks, as evidenced by stagnant validation loss and qualitative visualizations. This work highlights the critical limitations of refining general-purpose features for specialized, limited-data re-ID tasks and underscores the importance of domain-specific pre-training. The implementation for this work is publicly available at github.com/dsgt-arc/animalclef-2025.</p>
      </abstract>
      <kwd-group>
        <kwd>Animal Re-identification</kwd>
        <kwd>Open-Set Re-identification</kwd>
        <kwd>Triplet Learning</kwd>
        <kwd>Metric Learning</kwd>
        <kwd>Vision Transformer (ViT)</kwd>
        <kwd>DINOv2</kwd>
        <kwd>MegaDescriptor</kwd>
        <kwd>Nearest Neighbor Classification</kwd>
        <kwd>Kaggle</kwd>
        <kwd>LifeCLEF</kwd>
        <kwd>DS@GT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Individual animal identification is helpful for biologists studying animal populations in the wild. The
ability to track an animal over time in its natural environment gives insights into behaviors and
ecological interactions that cannot be captured by general census statistics.</p>
      <p>In this paper, we describe the solution developed by the DS@GT team for the AnimalCLEF 2025
challenge hosted on Kaggle. We utilize pre-trained, self-supervised vision transformers to embed animal
images into an embedding space and run a K-NN classifier with statistical thresholding to determine a
label for each image. We further refine the manifold learned by the vision transformer using a triplet
learning procedure that learns to map individuals in space more effectively, achieved by projecting
triplets of images from the ViT embedding space into a new metric space. We hypothesize that
a domain-specific backbone (MegaDescriptor) will provide a more suitable initial embedding manifold
for triplet-based refinement than a general-purpose backbone (DINOv2), leading to greater performance
gains on this specialized re-ID task. Our method outperforms a simple baseline provided by the
competition organizers, but further work is necessary.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Animal Re-identification</title>
        <p>
          Animal re-identification (re-ID) refers to a system’s ability to predict the identity of an individual animal
based on its unique physical traits [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. It is critical for biologists and ecologists to monitor populations,
track movement, and study social behavior [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Approaches differ in their descriptor strategies. Early
methods rely on local descriptors, such as SIFT, SURF, or contour features extracted from key points,
to identify individuals via match counts [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. More recent approaches use deep neural networks with
metric learning to generate feature embeddings for identity matching [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Hybrid pipelines combine
object detection and feature extraction to localize animals (or faces) before identifying them [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
          ].
Numerous datasets support benchmarking: ATRW includes 3,649 images of 92 Amur Tigers in zoos
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]; the zebrafish re-ID dataset offers 2,224 images of 6 zebrafish in lab settings [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]; and Cows2021 provides 13,784
images of 182 cows on a farm [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The WildlifeReID-10K dataset aggregates 36 wildlife re-ID datasets,
comprising approximately 140,000 images from over 10,000 individuals across multiple species [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
Although many studies frame re-ID as a closed-set task, this assumption often breaks down in ecological
settings where new individuals may appear.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Vision Transformers for Computer Vision</title>
        <p>
          Dosovitskiy et al. introduced the Vision Transformer (ViT), adapting the Transformer architecture to
image patches, which achieved competitive performance with less compute when pre-trained on large
datasets [
          <xref ref-type="bibr" rid="ref9">9, 10</xref>
          ]. However, quadratic scaling in image size led Liu et al. to propose the Swin Transformer,
which utilizes shifted-window attention for hierarchical, linear-complexity feature extraction and
achieves strong performance across dense vision tasks [11]. In re-ID, vision transformers have been
adopted to capture both short- and long-range features. TransReID was the first ViT-based method for
person re-ID [12], while GorillaVision applied a pre-trained ViT backbone to gorilla face recognition
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          More recently, self-supervised and multi-modal transformer models have become powerful tools for
vision tasks. DINOv2, a self-supervised ViT trained via self-distillation, performs well on fine-grained
tasks such as species recognition [13, 14]. CLIP, trained on image-text pairs, learns image embeddings
that generalize well across domains, including re-ID [15, 16]. MegaDescriptor, a Swin-based model
trained on a large multi-species re-ID dataset, achieves state-of-the-art performance across animal re-ID
benchmarks, outperforming models like DINOv2 and CLIP [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Metric Learning</title>
        <p>Metric learning underpins many re-ID approaches. Triplet loss trains models on
anchor-positive-negative triplets to ensure embeddings of same-identity pairs are closer than those of different identities
[17]. Popularized initially in face recognition, triplet learning is widely used in re-ID. To enhance
separability, angular and margin-based losses such as ArcFace introduce additive angular margins on
the hypersphere [18], with adaptations for re-ID. Recent methods, such as Matryoshka Representation
Learning (MRL) [21], produce hierarchical embeddings that encode coarse-to-fine details, enabling the model
to dynamically select embedding subspaces for efficient nearest-neighbor search, depending on the
retrieval task.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        Our experimentation is structured into two major phases, each serving a distinct purpose in our research.
Our first experiment validates the relative performance between DINOv2 [ 13] and MegaDescriptor [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
using a K-NN classifier. We use the pre-trained models in a zero-shot fashion and tune the threshold
distance for new identities in the model using our dataset split. In our second set of experiments,
we undertake the task of reshaping the manifold. This involves a deliberate effort to bring images of
the same animal closer together and push images of different animals further apart. The goal is to
disambiguate individuals and to make thresholds more robust.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Competition Evaluation Metric</title>
        <p>The competition employs two specialized evaluation metrics. The first is Balanced Accuracy on Known
Samples (BAKS), which measures the ability to identify individuals present in the training set. The
second is Balanced Accuracy on Unknown Samples (BAUS), which measures the ability to classify
individuals not present in the training set as unknown.</p>
        <p>More formally, we partition the set of individuals $\mathcal{I}$ into a known subset $\mathcal{K}$ and an unknown subset $\mathcal{U}$, and let $n_i$ denote the number of images of individual $i$ in the evaluation set.</p>
        <p>
$$\mathrm{BAKS}(y, \hat{y}) = \frac{1}{|\mathcal{K}|} \sum_{i \in \mathcal{K}} \frac{1}{n_i} \sum_{j=1}^{n_i} \mathbb{1}(y_j = i) \cdot \mathbb{1}(\hat{y}_j = i) \quad (1)$$
$$\mathrm{BAUS}(y, \hat{y}) = \frac{1}{|\mathcal{U}|} \sum_{i \in \mathcal{U}} \frac{1}{n_i} \sum_{j=1}^{n_i} \mathbb{1}(y_j = i) \cdot \mathbb{1}(\hat{y}_j = \mathrm{unknown}) \quad (2)$$
$$\mathrm{Score}(y, \hat{y}) = \sqrt{\mathrm{BAKS}(y, \hat{y}) \cdot \mathrm{BAUS}(y, \hat{y})} \quad (3)$$
        </p>
        <p>Our final score is the geometric mean of the two measures, as given in equation (3). Just as the F1-score
encourages models to balance precision and recall, the AnimalCLEF metric encourages models to identify
both known and unknown individuals.</p>
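        <p>As an illustration, the following sketch computes BAKS, BAUS, and the final geometric-mean score from per-image labels. The function name, data layout, and the "unknown" sentinel are our own conventions for this example rather than part of the competition toolkit.</p>
        <preformat>
import numpy as np


def animalclef_score(y_true, y_pred, known_ids):
    """Geometric mean of BAKS and BAUS; a sketch with illustrative names."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    known_ids = set(known_ids)
    present = np.unique(y_true)

    def balanced_accuracy(ids, is_correct):
        # Per-individual accuracy, averaged uniformly over individuals.
        accs = [np.mean(is_correct(y_pred[y_true == i], i)) for i in ids]
        return float(np.mean(accs)) if accs else 0.0

    baks = balanced_accuracy([i for i in present if i in known_ids],
                             lambda pred, i: pred == i)
    baus = balanced_accuracy([i for i in present if i not in known_ids],
                             lambda pred, _: pred == "unknown")
    return np.sqrt(baks * baus), baks, baus
        </preformat>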
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset Split for Open Set Classification</title>
        <p>The training images are stratified by individual into train, validation, and test sets. The training set is
used to fit a classification model, the validation set to observe progress across hyperparameter searches,
and the test set for objective evaluation. We organize the split so that we can optimize a machine-learning
algorithm against both the BAKS and BAUS objectives, i.e., being accurate about identifications of both
existing and new individuals. In both the validation and test splits, BAKS is calculated on the known
individuals and BAUS on the unknown individuals.</p>
        <sec id="sec-3-2-1">
          <title>Split</title>
          <p>Train
Validation
Test
Num
404
458
620
Num
3392
2575
6568</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Individuals Images Individuals Individuals</title>
        </sec>
        <sec id="sec-3-2-3">
          <title>Known Unknown</title>
          <p>404
404
404
0
54
216</p>
          <p>To predict the labels of existing individuals, we must ensure that there are known individuals shared
between the training, validation, and test sets. To predict unknown individuals, we select a set of
individuals that are excluded from the training set but are known in the validation and test sets. We
describe the statistics of the train-validation-test split in table 1. We use 60% of the training individuals
for training, 20% for validation, and the remaining 20% for testing. If an individual has only a single
image, it belongs to the training dataset by default.</p>
        </sec>
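        <p>A minimal sketch of how such an open-set split can be constructed is shown below; the 60/20/20 ratios and the singleton rule follow the text, while the metadata column name, the per-image assignment probabilities for known individuals, and the helper function are illustrative assumptions rather than the released pipeline.</p>
        <preformat>
import numpy as np
import pandas as pd


def open_set_split(metadata: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Assign each image a train/validation/test split for open-set evaluation (sketch)."""
    rng = np.random.default_rng(seed)
    counts = metadata.groupby("identity").size()
    singletons = set(counts[counts == 1].index)

    ids = counts[counts > 1].index.to_numpy()
    rng.shuffle(ids)
    n = len(ids)
    unknown_val = set(ids[int(0.6 * n): int(0.8 * n)])   # new individuals in validation
    unknown_test = set(ids[int(0.8 * n):])                # new individuals in test

    def assign(identity):
        if identity in unknown_val:
            return "validation"
        if identity in unknown_test:
            return "test"
        if identity in singletons:
            return "train"  # single-image individuals default to the training set
        # Known individuals keep images in every split so they can be re-identified.
        return rng.choice(["train", "train", "validation", "test"])

    return metadata.assign(split=metadata["identity"].map(assign))
        </preformat>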
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Transfer Learning via ViT Embedding Extraction</title>
        <p>We hypothesize that pre-trained self-supervised vision transformer models learn an adequate feature
space for distinguishing individuals. The images are projected onto a lower-dimensional manifold that
roughly preserves semantic distances found in the original space. The new points are called embeddings
and are vectors that capture the lower-dimensional latent space. Embeddings capture semantic
similarities between images through the inductive biases of the model and the distribution of its training
dataset. Vision transformers represent an image through a sequence of tokens derived from patches of
the original image, in addition to a special classification (CLS) token. We capture and transfer the
underlying knowledge by extracting the CLS token from the model as the image embedding.</p>
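        <p>As an example, the CLS embedding can be extracted from a pre-trained DINOv2 checkpoint on HuggingFace roughly as follows; the checkpoint name and local image path are illustrative rather than the exact configuration used in our pipeline.</p>
        <preformat>
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

image = Image.open("query.jpg").convert("RGB")  # any local image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The CLS token is the first token of the last hidden state.
cls_embedding = outputs.last_hidden_state[:, 0]  # shape: (1, hidden_dim)
        </preformat>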
        <p>We can demonstrate a degree of visual separation by projecting the embeddings onto a 2D manifold,
which can be visualized as a scatter plot. In figure 1, we embed the entire dataset using a pre-trained
DINOv2 model from HuggingFace. We then use principal component analysis (PCA) and pairwise
controlled manifold approximation (PaCMAP [19]) to visualize how the points cluster in space. PCA
normalizes the vectors into a zero-mean, unit-variance matrix and then finds the rotation of the matrix
that minimizes the reconstruction error of the projection in a lower-rank space. The first two dimensions
correspond to the principal axes, as determined through an eigen-decomposition. We find
that this projection can clearly separate lynxes from sea turtles, with some overlap between lynxes and
salamanders. We compare this approach with PaCMAP, which takes into account both local and global
geometry by constructing a graph sampled from the original data. This embedding better captures
nuances of the original space and separates the training dataset images into lynxes, sea turtles, and
salamanders.</p>
        <p>Figure 1: DINOv2 CLS embeddings projected to 2D with (a) PaCMAP and (b) PCA.</p>
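        <p>A corresponding sketch of the two 2D projections, assuming the CLS embeddings have been saved to disk; pacmap refers to the reference PaCMAP implementation [19], and the file name is illustrative.</p>
        <preformat>
import numpy as np
import pacmap
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# (n_images, d) matrix of CLS embeddings computed beforehand.
embeddings = np.load("cls_embeddings.npy")

# PCA on standardized vectors vs. PaCMAP on the raw embeddings.
pca_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(embeddings))
pacmap_2d = pacmap.PaCMAP(n_components=2).fit_transform(embeddings)
        </preformat>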
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Nearest Neighbor Classification</title>
        <p>The nearest neighbor classifier takes a query embedding and determines which individual is closest to
the point. Images in the training dataset are used as prototypes for making the classification. We use
Faiss [20] to index all of the training image embeddings for queries. We use the L2 distance to rank a
query vector against all of the training vectors and return the top K points. We then determine whether
the query is a new individual by applying a global threshold to the nearest point: if the nearest point is
farther than the threshold, the image is labeled as a new individual; otherwise, we return the mode of the
top K identities.</p>
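        <p>A minimal sketch of this classification rule with Faiss is given below; the array names, the value of K, and the threshold value are illustrative. Note that Faiss's flat L2 index returns squared L2 distances, so any threshold is applied in that space.</p>
        <preformat>
from collections import Counter

import faiss
import numpy as np


def knn_classify(train_emb, train_ids, query_emb, k=5, threshold=1.2):
    """L2 nearest-neighbor lookup with a global new-individual threshold (sketch)."""
    index = faiss.IndexFlatL2(train_emb.shape[1])
    index.add(train_emb.astype(np.float32))
    dists, idx = index.search(query_emb.astype(np.float32), k)  # squared L2 distances

    labels = []
    for d, neighbors in zip(dists, idx):
        if d[0] > threshold:
            labels.append("unknown")  # nearest prototype is too far away
        else:
            top_ids = [train_ids[j] for j in neighbors]
            labels.append(Counter(top_ids).most_common(1)[0][0])  # mode of top K
    return labels
        </preformat>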
        <p>We choose the threshold through a procedure that optimizes the competition metric. The threshold is
selected by searching 100 linearly spaced points within a range of 3 Median Absolute Deviations (MAD)
from the median distance between each image and its nearest neighbor of a different species. The
threshold that maximizes the competition score (geometric mean of BAKS and BAUS) on the validation
set is chosen. These statistics are robust indicators of the dataset and are less influenced by outliers.
The MAD is also independent of the scale of the distances, so thresholds can be expressed as modified
z-scores of the distance distribution.</p>
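        <p>A sketch of that search procedure, where nn_dist holds each image's nearest-neighbor distance and score_fn evaluates the competition metric on the validation set for a candidate threshold; both names are our own.</p>
        <preformat>
import numpy as np


def tune_threshold(nn_dist, score_fn, n_points=100, n_mads=3):
    """Grid-search a distance threshold within +/- 3 MADs of the median (sketch)."""
    median = np.median(nn_dist)
    mad = np.median(np.abs(nn_dist - median))
    candidates = np.linspace(median - n_mads * mad, median + n_mads * mad, n_points)
    scores = [score_fn(t) for t in candidates]
    return candidates[int(np.argmax(scores))]
        </preformat>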
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Triplet Learning</title>
        <p>We also applied the triplet learning paradigm in order to learn a better representation of the data.
The objective of the triplet loss is to ensure that same-identity pairs are closer than different-identity
pairs, using an anchor-positive-negative triplet.</p>
        <p>$$\mathcal{L}(a, p, n) = \max\big(d(a, p) - d(a, n) + \alpha,\ 0\big)$$</p>
        <p>In this formulation, $a$ represents the embedding of an anchor image, $p$ is the embedding of a
positive image from the same individual, and $n$ is the embedding of a negative image from a different
individual. The function $d$ calculates the L2 distance between two embeddings, and the hyperparameter
$\alpha$ represents the margin that enforces separation between the pairs. The loss is minimized only when
the distance between the anchor and positive pair is smaller than the distance between the anchor and
negative pair by at least the margin $\alpha$. For our experiments, we followed a standard approach and
utilized a unit margin, $\alpha = 1$.</p>
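        <p>In code, this corresponds to the standard triplet margin loss; the sketch below mirrors the formulation above and matches PyTorch's built-in torch.nn.TripletMarginLoss with a unit margin.</p>
        <preformat>
import torch
import torch.nn.functional as F


def triplet_loss(anchor, positive, negative, margin=1.0):
    # d(a, p) and d(a, n) are L2 distances between batches of embeddings.
    d_ap = F.pairwise_distance(anchor, positive, p=2)
    d_an = F.pairwise_distance(anchor, negative, p=2)
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

# Equivalent built-in: torch.nn.TripletMarginLoss(margin=1.0)
        </preformat>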
        <p>We pre-compute embeddings with DINOv2 and MegaDescriptor-L-384, derived from the CLS token of
the ViT. These CLS embeddings are then downsampled by a projection head consisting of two linear
layers, with GELU activation and dropout sandwiched between them, followed by L2 normalization
after the second linear layer. The first linear layer matches the size of the original embedding space,
and the second projects down to 256 dimensions. The model must be parameterized in such a way that
it can capture the relationship between triplets, given their locations in the original manifold, while
generalizing to new examples. This pipeline is depicted in Figure 2.</p>
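        <p>A sketch of such a projection head in PyTorch; the layer sizes follow the description above, while the dropout rate is an assumption.</p>
        <preformat>
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Two linear layers with GELU and dropout, followed by L2 normalization."""

    def __init__(self, embed_dim: int, out_dim: int = 256, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),  # first layer keeps the backbone width
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(embed_dim, out_dim),    # project down to 256 dimensions
        )

    def forward(self, x):
        return F.normalize(self.net(x), p=2, dim=-1)
        </preformat>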
        <p>The projection head was trained with a batch size of 200 on the CLS embeddings for the images in
the database, over 100 epochs in total. We experimented with the standard triplet loss [17] as
well as a modified triplet loss using Matryoshka Representation Learning [21], both with unit margins.
We also experimented with different online triplet mining techniques, specifically random selection and
semi-hard negative selection [17]. The Adam optimizer was used with a learning rate of $5 \times 10^{-4}$; a
linear scheduler was employed for warmup, followed by cosine annealing after the 10th epoch.</p>
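        <p>A sketch of the optimization setup described above, with the scheduler stepped once per epoch; the warmup start factor and the stand-in module are assumptions beyond what is stated in the text.</p>
        <preformat>
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# Stand-in for the projection head sketched above.
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 256))
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

# Linear warmup for the first 10 epochs, cosine annealing for the remaining 90.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.1, total_iters=10),
        CosineAnnealingLR(optimizer, T_max=90),
    ],
    milestones=[10],
)
        </preformat>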
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We report our final private leaderboard score on Kaggle. Our best model, which utilized MegaDescriptor
with triplet loss, achieved a score of 0.39 and ranked 103 out of 174. This is 0.09 points above the
MegaDescriptor baseline provided by the competition organizers and 0.05 points below the WildFusion
baseline.</p>
      <p>We train several models with our described methodology and report them in Table 4. Our baseline models
embed the data with either DINOv2 base or MegaDescriptor-L-384 and then apply our
K-NN classification procedure with thresholding. We train another model with the triplet learning
embedding head, using a linear projection from the original embedding space down to a dimension of
256. Finally, we compare a non-linear projection head learned by the final methodology.</p>
      <p>In addition to the final results, we also report some of the training dynamics of the triplet learning
procedure to illustrate the differences between DINOv2 and MegaDescriptor. In figure 3, we see that the
triplet loss objective is lower across all epochs for MegaDescriptor than for DINOv2. Both models decrease
in loss over the 100 epochs of training, meaning that fewer triplets violate the margin constraint as
training progresses, as observed in the count of valid triplets in figure 4. However, we note that while the
training loss continues to decrease, the validation loss converges quickly.</p>
      <p>Finally, we report the hyperparameter tuning of the k-NN classification threshold in figure 5. For
the triplet training epoch with the best validation loss, we run our tuning procedure to find the best
threshold.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>Through our experiments, we found that domain-specific training is crucial for achieving good
performance on re-identification tasks. We see this in our baseline k-NN behavior against the DINOv2 and
MegaDescriptor embeddings. We observe a 0.07-point difference on the private leaderboard between
these two models, with the MegaDescriptor model performing better.</p>
      <p>Figure: (a) DINO non-linear validation curve; (b) MegaDescriptor non-linear validation curve.</p>
      <p>Our procedure is 0.04 points lower than the competition baseline despite using the same model. This
may be caused by improper resizing before calling the model or by selecting an inappropriate metric.
The starter notebook uses cosine distance with a fixed threshold of 0.6, while we use Euclidean distance
with a threshold chosen by a hyperparameter search influenced by our dataset split. The cosine distance
is the more appropriate measure, and requires using an inner-product index with unit-normalized
vectors. It is also possible that the implicit assumptions in the dataset split had a substantial impact on
the distribution of outputs. We use 60% of the dataset as "known" in our experiments, but it is possible
that increasing the set of known images would lead to a different optimal hyperparameter score. While
we try to have an offline approximation of the private leaderboard for development, it has proven
challenging to find a suitable proxy for local development.</p>
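      <p>For reference, cosine similarity in Faiss amounts to unit-normalizing the vectors and querying an inner-product index; a minimal sketch with stand-in embeddings follows.</p>
      <preformat>
import faiss
import numpy as np

emb = np.random.rand(1000, 768).astype(np.float32)  # stand-in embeddings
faiss.normalize_L2(emb)                              # in-place unit normalization
index = faiss.IndexFlatIP(emb.shape[1])              # inner product == cosine on unit vectors
index.add(emb)
      </preformat>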
      <p>During development, we also found that setting up the triplet loss was particularly tricky. Although
we employed a train-validation-test split for our thresholding scheme, we opted for a distinct
train-validation split for the triplet learning pipeline. While we could continue to learn and reduce the triplet
loss, this also indicated that we were overfitting the geometry of the training dataset. When we accounted
for our split, we had a better chance of determining which parameterization of the triplet layer was
most effective for us.</p>
      <p>After refining our dataset split using our triplet pipeline, we found that reshaping DINOv2 was
significantly harder than MegaDescriptor. DINOv2 is a general-purpose feature extractor, and while it
performs well on generalized tasks, it is not optimized for fine-grained classification in this context.
We did not find a mining strategy or parameterization of the embedding head that would lower the
triplet loss on the validation set. Since the training and validation sets are non-overlapping, we can only
indirectly influence the triplet scores of the validation set by reshaping the manifold through triplet
mining on the training set. Individuals are not clustered as tightly on the DINOv2 manifold, and it
becomes difficult to move in a direction that is isomorphic between training and validation sets. The
MegaDescriptor triplet learning works comparatively well, increasing the model’s performance on the
task by 0.13, compared to a 0.03 improvement on DINOv2 triplet learning. This could be because the
MegaDescriptor model employs a metric objective that combines ArcFace and Triplet Loss, resulting in
a minor domain shift overall.</p>
      <p>Finally, we observe differences in the triplet learning in figure 6. We use PaCMAP, a graph-based
embedding method that takes into account both local and global geometry. The large cluster
thus signifies a cloud of points that are challenging to disambiguate. We see that over the process of
triplet learning, the DINO model learns to disambiguate clusters of individuals. In the MegaDescriptor
model, there is already a large number of clusters. Note that the number of clusters is visually larger
than in the DINO model, and this roughly correlates with performance in the final task.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>
        Reflecting on the poor performance achieved using DINO, it would likely have performed better if we
had supplemented training with the WildlifeReID-10K dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Previous experience with similarly
designed pipelines has made us aware of the data-hungry nature of the triplet learning paradigm. Due to
the small size of the provided dataset and the limitations imposed on our triplet mining implementation,
the number of unique triplets used during training was likely insufficient. Using the WildlifeReID-10K
dataset in conjunction with the provided data would likely alleviate these issues.
      </p>
      <p>Additionally, we would like to experiment with a larger number of backbones to ensure that results
are comparable. We enumerate a list of models in Table 5 that would provide a concrete starting point
for future experiments.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>We develop a transfer learning solution for the AnimalCLEF 2025 competition, leveraging inherent
visual knowledge encoded in vision transformers. We define a nearest neighbor classifier that tackles
the open-set nature of the competition through a rigorously defined thresholding procedure.
While our solution ranks higher than the baseline MegaDescriptor solution, there are limitations
to our methods that should be addressed by augmenting them with a larger individual dataset and
more careful hyperparameter tuning. Code for this paper can be found at https://github.com/dsgt-arc/
animalclef-2025.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>We thank the Data Science at Georgia Tech (DS@GT) CLEF competition group for their support. This
research was supported in part through research cyberinfrastructure resources and services provided by
the Partnership for an Advanced Computing Environment (PACE) at the Georgia Institute of Technology,
Atlanta, Georgia, USA [22].</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Gemini Pro and Grammarly for abstract
drafting, formatting assistance, and grammar and spelling checks. After using these tools/services, the
authors reviewed and edited the content as needed and take full responsibility for the publication’s
content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <article-title>Animalclef 2025</article-title>
          ,
          <year>2025</year>
          . URL: https://www.imageclef.org/AnimalCLEF2025.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <article-title>Wildlifedatasets: An open-source toolkit for animal re-identification</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>5953</fpage>
          -
          <lpage>5963</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Laskowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sawahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wasmuht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bermejo</surname>
          </string-name>
          , G. de Melo,
          <article-title>Gorillavision - open-set re-identification of wild gorillas</article-title>
          ,
          <year>2023</year>
          . URL: https://inf-cv.uni-jena.de/wordpress/wp-content/uploads/2023/09/Talk-12-Maximilian-Schall.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Miele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dussert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Spataro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chamaillé-Jammes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Allainé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bonenfant</surname>
          </string-name>
          ,
          <article-title>Revisiting animal photo-identification using deep metric learning and network analysis</article-title>
          ,
          <source>Methods in Ecology and Evolution</source>
          <volume>12</volume>
          (
          <year>2021</year>
          )
          <fpage>863</fpage>
          -
          <lpage>873</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Atrw: A benchmark for amur tiger re-identification in the wild</article-title>
          ,
          <year>2019</year>
          . URL: http://arxiv.org/abs/1906.05586.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Haurum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Bengtson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Moeslund</surname>
          </string-name>
          ,
          <article-title>Re-identification of zebrafish using metric learning</article-title>
          ,
          <source>in: 2020 IEEE Winter Applications of Computer Vision Workshops (WACVW), IEEE, Snowmass Village, CO, USA</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          . URL: https://ieeexplore.ieee.org/document/9096922/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Burghardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Andrew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Dowsey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. W.</given-names>
            <surname>Campbell</surname>
          </string-name>
          ,
          <article-title>Towards self-supervision for video identification of individual holstein-friesian cattle: The cows2021 dataset</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2105.01938.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <article-title>Wildlifereid-10k: Wildlife re-identification dataset with 10k individual animals</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2406.09211. arXiv:2406.09211.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=YicbFdNTTy.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992–10002. doi:10.1109/ICCV48922.2021.00986.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] S. He, H. Luo, P. Wang, F. Wang, H. Li, W. Jiang, Transreid: Transformer-based object re-identification, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 15013–15022.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., Dinov2: Learning robust visual features without supervision, arXiv preprint arXiv:2304.07193 (2023).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] A. Miyaguchi, M. Gustineli, A. Fischer, R. Lundqvist, Transfer learning with self-supervised vision transformers for snake identification, 2024. URL: https://arxiv.org/abs/2407.06178.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, 2021. URL: https://api.semanticscholar.org/CorpusID:231591445.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Y. Wu, D. Zhao, J. Zhang, Y. S. Koh, An individual identity-driven framework for animal re-identification, 2024. URL: https://arxiv.org/abs/2410.22927.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clustering, in: CVPR, IEEE Computer Society, 2015, pp. 815–823. URL: http://dblp.uni-trier.de/db/conf/cvpr/cvpr2015.html#SchroffKP15.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] J. Deng, J. Guo, N. Xue, S. Zafeiriou, Arcface: Additive angular margin loss for deep face recognition, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4685–4694. doi:10.1109/CVPR.2019.00482.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Y. Wang, H. Huang, C. Rudin, Y. Shaposhnik, Understanding how dimension reduction tools work: An empirical approach to deciphering t-sne, umap, trimap, and pacmap for data visualization, Journal of Machine Learning Research 22 (2021) 1–73. URL: http://jmlr.org/papers/v22/20-1061.html.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, H. Jégou, The faiss library (2024). arXiv:2401.08281.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] A. Kusupati, G. Bhatt, A. Rege, M. Wallingford, A. Sinha, V. Ramanujan, W. Howard-Snyder, K. Chen, S. Kakade, P. Jain, A. Farhadi, Matryoshka representation learning, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Curran Associates Inc., Red Hook, NY, USA, 2022.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] PACE, Partnership for an Advanced Computing Environment (PACE), 2017. URL: http://www.pace.gatech.edu.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>