<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Zero-Shot Segmentation through Prototype-Guidance for Multi-Label Plant Species Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luciano Araujo Dourado Filho</string-name>
          <email>lucianoadfilho@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Almir Moreira da Silva Neto</string-name>
          <email>almirneto338@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rodrigo Pereira David</string-name>
          <email>rpdavid@inmetro.gov.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rodrigo Tripodi Calumby</string-name>
          <email>rtcalumby@uefs.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ADAM - Advanced Data Analysis and Management, University of Feira de Santana (UEFS)</institution>
          ,
          <addr-line>Feira de Santana</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Metrology, Technology and Quality (Inmetro)</institution>
        </aff>
      </contrib-group>
      <author-notes>
        <fn>
          <p>∗Corresponding author. https://github.com/rpdavid78/ (R. P. David); https://www.rtcalumby.com.br (R. T. Calumby). ORCID: 0000-0002-0507-2201 (L. A. D. Filho); 0009-0008-5042-5556 (A. M. d. S. Neto); 0000-0001-9218-8191 (R. P. David); 0000-0001-8515-265X (R. T. Calumby)</p>
        </fn>
      </author-notes>
      <abstract>
        <p>This paper presents an approach developed to address the PlantCLEF 2025 challenge, which consists of fine-grained multi-label species identification over high-resolution images. Our solution employs class prototypes obtained from the training dataset as proxy guidance for training a segmentation Vision Transformer (ViT) on the test set images. To obtain these representations, the proposed method extracts features from the training dataset images and clusters them with K-Means, with K equal to the number of classes in the dataset. The segmentation model is a customized narrow ViT, built by replacing the patch embedding layer with a frozen DINOv2 pre-trained on the training dataset for individual species classification. This model is trained to reconstruct the class prototypes of the training dataset from the test dataset images. We then use the attention scores of this model to identify and localize areas of interest and, consequently, to guide the classification process. The proposed approach enables domain adaptation from multi-class identification of individual species to multi-label classification of high-resolution vegetation plots. Our method achieved fifth place in the PlantCLEF 2025 challenge on the private leaderboard, with an F1 score of 0.33331. In absolute terms, this is 0.03 below the top-performing submission, suggesting that the method is competitive on the benchmark task. Our code is available at https://github.com/ADAM</p>
      </abstract>
      <kwd-group>
        <kwd>clustering</kwd>
        <kwd>vision transformer</kwd>
        <kwd>multispecies classification</kwd>
        <kwd>plant identification</kwd>
        <kwd>fine-grained classification</kwd>
        <kwd>prototype learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The automated identification of plant species from images of complex vegetation plots presents a
significant challenge in ecological research and biodiversity monitoring. Initiatives like PlantCLEF
aim to advance this field by providing benchmark datasets and tasks, fostering innovation in
multi-label classification from high-resolution plot imagery [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. A primary challenge in this task is
the notable domain shift between the training data, which mostly consists of images of individual plants,
and the test data, which comprises dense, multi-species vegetation plots captured under varied environmental
conditions [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4, 5, 6</xref>
        ]. The task therefore demands robust methods that can effectively generalize and
differentiate target species within cluttered natural scenes [7].
      </p>
      <p>Vision Transformers (ViTs), particularly those pre-trained using self-supervised learning, have
demonstrated remarkable effectiveness in generating discriminative visual features for various tasks,
including large-scale plant identification [8, 9]. In the broad context of image classification, models
such as DINOv2 [10] have been shown to offer powerful backbones for feature extraction. Specifically,
in PlantCLEF 2025 the organizers supported participants by providing ViT models pre-trained on
relevant flora, based on a DINOv2 architecture [11]. Our proposed method goes beyond the
usual direct end-to-end classification of entire plot images (which can be confounded by background
noise and species overlap) by focusing on a preliminary segmentation-driven filtering step. We propose
a “decrease-and-conquer” strategy that relies on a dedicated ViT that first identifies plant-relevant
regions, thereby simplifying the subsequent classification task for specialized heuristics. This ViT
leverages its attention mechanisms as a proxy for spatial segmentation, aiming to isolate pertinent
areas from irrelevant background elements.</p>
      <p>https://github.com/FalsoMoralista (L. A. D. Filho); https://github.com/almirneeto99/ (A. M. d. S. Neto)</p>
      <p>A crucial aspect of our methodology is the training of this segmentation ViT without direct pixel-level
supervision. To overcome the absence of segmentation labels, we introduce a novel proxy task as our
core contribution. We trained a narrow, customized ViT to reconstruct a predetermined representation
of the training dataset, specifically class prototypes derived from K-Means clustering on DINOv2
embeddings of the original training set images. Uniquely, this ViT learns this reconstruction objective
using, as its input, feature embeddings extracted by a separate, frozen DINOv2 model from test images.
Our hypothesis is that for the ViT to successfully map features from unseen test plots back to these
known training set prototypes, it must learn to generate attention scores that highlight plant-relevant
patches while attenuating signals from the background (as illustrated in Figure 3).</p>
      <p>This paper provides a detailed exposition of this proxy task-driven ViT technique. The generation
of the target class prototypes, the specific architecture of the custom ViT, and the operational steps
of its training regimen are detailed in Section 3. Furthermore, we describe how the learned
attention maps are used within our inference pipeline to select regions of interest prior to applying
classification heuristics for final species identification. The experimental settings, key hyperparameters,
and the results achieved in the PlantCLEF 2025 challenge are presented and discussed, offering insights
into the effectiveness of this approach for multi-label classification in complex ecological imagery.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <p>The PlantCLEF 2025 challenge addresses the intricate problem of multi-label species identification
within high-resolution images of vegetation plots. The core objective is to predict all plant species
present in a given plot image, a task fundamental to ecological research, biodiversity assessment, and
long-term environmental monitoring.</p>
      <p>The primary difficulty of the challenge stems from a significant domain shift between the provided
datasets. The training dataset comprises roughly 1.4 million images representing 7,806
plant species. These training images typically feature single plant species from various perspectives.
In contrast, the test dataset consists of 2,105 high-resolution plot images from Mediterranean and
Pyrenean regions. These test images depict complex, multi-species scenes, often captured from a
top-down perspective within delimited sampling areas (quadrats), and under varying environmental
conditions that can introduce shadows or blurry areas.</p>
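      <p>Two pre-trained ViT models based on the DINOv2 architecture were provided along with the dataset.
These models were not only pre-trained on a massive dataset of images but were also specifically
fine-tuned on the PlantCLEF 2024 training data, which includes isolated plant samples covering
multiple organs and perspectives. The fine-tuned models available were: a) one where only the final
classification layer was trained; b) another where all model layers were unfrozen and fine-tuned for the
task. These models offer a powerful, ready-to-use foundation, sparing participants the extensive
computational resources needed to train such large-scale models from scratch.</p>
      <p>Classification effectiveness was evaluated using the macro F1 score averaged per plot, a metric
designed to balance precision and recall for each individual plot image, with results tracked on a public
leaderboard. The final score is computed as described in Equation 1,
        <disp-formula id="eq1"><label>(1)</label><tex-math>\mathrm{Score} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n_i} \sum_{j=1}^{n_i} F1_{i,j}</tex-math></disp-formula>
where <inline-formula><tex-math>N</tex-math></inline-formula> is the number of transects (plots),
<inline-formula><tex-math>n_i</tex-math></inline-formula> is the number of quadrats in transect
<inline-formula><tex-math>i</tex-math></inline-formula>,
<inline-formula><tex-math>\frac{1}{n_i} \sum_{j=1}^{n_i} F1_{i,j}</tex-math></inline-formula> is the macro-averaged F1-score per sample
of transect <inline-formula><tex-math>i</tex-math></inline-formula>, and
<inline-formula><tex-math>F1_{i,j}</tex-math></inline-formula> is the F1 score for test image
<inline-formula><tex-math>j</tex-math></inline-formula>. For each test image, the F1 score is given by Equation 2,
        <disp-formula id="eq2"><label>(2)</label><tex-math>F1_{i,j} = 2 \times \frac{P_{i,j} \times R_{i,j}}{P_{i,j} + R_{i,j}}</tex-math></disp-formula>
and the Precision (<inline-formula><tex-math>P_{i,j}</tex-math></inline-formula>) and Recall
(<inline-formula><tex-math>R_{i,j}</tex-math></inline-formula>) for each test image are given by Equations 3 and 4, respectively,
        <disp-formula id="eq3"><label>(3)</label><tex-math>P_{i,j} = \frac{TP_{i,j}}{TP_{i,j} + FP_{i,j}}</tex-math></disp-formula>
        <disp-formula id="eq4"><label>(4)</label><tex-math>R_{i,j} = \frac{TP_{i,j}}{TP_{i,j} + FN_{i,j}}</tex-math></disp-formula>
where <inline-formula><tex-math>TP_{i,j}</tex-math></inline-formula> is the number of species correctly predicted,
<inline-formula><tex-math>FP_{i,j}</tex-math></inline-formula> is the number of plant species incorrectly
predicted, and <inline-formula><tex-math>FN_{i,j}</tex-math></inline-formula> is the number of plant species missed.</p>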
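      <p>As an illustration of the evaluation metric described in this section, the following minimal Python sketch computes the per-image F1 score from sets of species identifiers, averages it within each transect, and then averages across transects. The function names, image identifiers and transect labels are hypothetical and chosen only for this example; the official evaluation code of the challenge may differ in details such as the handling of empty predictions.</p>
      <preformat preformat-type="code">
```python
from collections import defaultdict


def f1_per_image(predicted, actual):
    """F1 score of one plot image, given sets of species IDs."""
    tp = len(predicted.intersection(actual))  # species correctly predicted
    fp = len(predicted - actual)              # species incorrectly predicted
    fn = len(actual - predicted)              # species missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def challenge_score(predictions, ground_truth, transect_of):
    """Average the per-image F1 within each transect, then across transects."""
    per_transect = defaultdict(list)
    for image_id, actual in ground_truth.items():
        predicted = predictions.get(image_id, set())
        per_transect[transect_of[image_id]].append(f1_per_image(predicted, actual))
    transect_means = [sum(v) / len(v) for v in per_transect.values()]
    return sum(transect_means) / len(transect_means)


# Toy data with hypothetical image IDs, species IDs and transect labels
ground_truth = {"q1": {1, 2}, "q2": {3}, "q3": {4, 5}}
predictions = {"q1": {1, 2}, "q2": {9}, "q3": {4}}
transect_of = {"q1": "A", "q2": "A", "q3": "B"}
score = challenge_score(predictions, ground_truth, transect_of)
```
      </preformat>
      <p>In this toy case, transect A averages a perfect image and a fully wrong one, while transect B contains a single image with one of its two species missed, so both transects contribute equally to the final score regardless of how many quadrats they contain.</p>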
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method and Experiments</title>
      <p>The proposed zero-shot learning approach focuses on the segmentation of plants from irrelevant
background elements within test images. To achieve that, we trained a dedicated segmentation
model whose primary function is to identify plant regions. We theorize that a preliminary segmentation
step could allow for a more precise application of classification heuristics over pertinent regions in the
image, in a “decrease-and-conquer” fashion.</p>
      <p>For the segmentation task, we employed a ViT due to its capability to model contextual relationships
between image patches (tokens) according to an objective. We propose that, given an appropriate
objective, these relationships could be guided towards enabling the model to differentiate plant structures
from other objects. In other words, we anticipate that the ViT’s attention scores could provide a
mechanism for selecting relevant regions from test images.</p>
      <p>The caveat of our approach is that, for this model to learn semantically meaningful
patch-wise relationships, a reasonable objective has to be given, which we initially do not have. To tackle
this limitation, we designed a proxy task that consists of reconstructing an approximate representation
of the training dataset. We used the training dataset to obtain features representing each of its
individual species, generating cluster prototypes that we employ as the reconstruction objective for our
ViT. We considered that, in order to successfully reconstruct the training dataset representation from
test-set images, the model would have to learn patch-wise relationships (attention scores) that enable
the maximization of this objective. As a consequence, our assumption is that the model would learn
to assign low attention scores to irrelevant regions (background noise) and high scores to patches
containing plants (as illustrated in Figure 1).</p>
      <p>[Figure 1: training of the narrow ViT on the proxy task. Diagram labels: Narrow ViT (N blocks of Norm, Multi-Head Attention, +, Norm, MLP, +); Linear (num_tokens, 7806); Predicted Cluster Prototypes (7806, 768); Cluster Prototypes (7806, 768).]</p>
      <sec id="sec-3-1">
        <title>3.1. Clustering as proxy task</title>
        <p>To obtain the target representation that guides the ViT’s optimization, we applied K-Means clustering
to the image embeddings of the training dataset. These embeddings were extracted using a separate,
identical instance of the pre-trained DINOv2 model, from which the classification head had been removed.
Following embedding extraction, K-Means was performed with K set to the number of classes (7,806
species). The resulting cluster centroids, or “class prototypes”, constitute the predetermined
representation of the training dataset that the narrow ViT is trained to reconstruct. Figure 1 illustrates the
complete training process wherein the ViT learns this reconstruction.</p>
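        <p>To make the prototype construction concrete, the sketch below runs a plain Lloyd-style K-Means on a handful of toy two-dimensional points and returns the centroids as prototypes. It is only an illustration under simplifying assumptions: the initialization (first K points) and the fixed iteration count are choices made for this example, and the actual pipeline operates on 768-dimensional DINOv2 embeddings with K = 7806, for which an optimized library implementation would be needed.</p>
        <preformat preformat-type="code">
```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns k centroids (the prototypes)."""
    # Deterministic initialization for the example: the first k points
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


# Toy 2-D "embeddings" standing in for 768-D image features
embeddings = [
    (0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
    (1.0, 0.0), (10.0, 11.0), (11.0, 10.0),
]
prototypes = kmeans(embeddings, k=2)
```
        </preformat>
        <p>Stacking the resulting centroids row by row gives a matrix with one row per cluster, which is the shape of target (number of clusters by embedding dimension) that the narrow ViT is asked to reconstruct.</p>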
        <p>Figure 1 details the operational steps for training our narrow ViT on this proxy reconstruction
objective. The process begins by resizing each image from the test dataset to a predefined H × W
resolution, where H and W denote the image height and width. Each resized high-resolution image
is then divided into a grid of non-overlapping 64 × 64 pixel patches (crops). These patches are subsequently
resized to 518 × 518 pixels using bicubic interpolation to meet the input specifications of the pre-trained
DINOv2 model, which serves as the patch feature extractor. This procedure yields num_patches = (H × W) / 64<sup>2</sup>
tokens per image.</p>
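        <p>The patch arithmetic above can be checked with a short Python sketch. The helper names are hypothetical; the sketch only enumerates the crop boxes of a non-overlapping 64 × 64 grid and counts the resulting tokens, leaving the actual cropping and bicubic resizing to an image library.</p>
        <preformat preformat-type="code">
```python
def num_patches(height, width, patch=64):
    """Number of tokens for a non-overlapping patch grid."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch) * (width // patch)


def patch_boxes(height, width, patch=64):
    """Crop boxes as (left, top, right, bottom), in row-major order."""
    return [
        (x, y, x + patch, y + patch)
        for y in range(0, height, patch)
        for x in range(0, width, patch)
    ]


tokens_square = num_patches(2048, 2048)  # 2048 x 2048 input
tokens_wide = num_patches(2048, 3072)    # 3072 x 2048 input
boxes = patch_boxes(2048, 2048)
```
        </preformat>
        <p>For the two resolutions evaluated in this work, the sketch gives 1024 and 1536 tokens per image, respectively.</p>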
        <p>These resized 518 × 518 pixel crops from the test images are then processed by the aforementioned
frozen DINOv2 model to extract the corresponding 768-dimensional feature embeddings. Positional
encodings are subsequently added to these patch embeddings for the ViT processing. The resulting
sequence of augmented patch embeddings serves as the input to the narrow ViT. The ViT architecture
is a ViT-Base (ViT-B) variant configured with 6 transformer blocks, 12 attention heads per block, an
embedding dimension of 768, and a final linear layer. This model is optimized to reconstruct the predetermined
target matrix of class prototypes (dimensions 7806 × 768) from these input embeddings.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Inference</title>
        <p>Following the training procedure, the optimized ViT is employed to generate
attention maps for the test images. These maps facilitate the removal of background and other irrelevant
elements from the images prior to the classification stage. To accomplish this, an image is first processed
through the ViT. The attention scores from all transformer blocks are then aggregated by averaging
them across both blocks and attention heads. This averaged result is subsequently normalized to create
a final attention map. The attention map guides the selection of relevant image regions by filtering
tokens: only tokens whose normalized attention scores exceed a predefined attention threshold are retained.
After this filtering step, which discards potentially irrelevant elements, the process advances to the
species identification pipeline. For this identification task, we implemented two distinct procedures,
which are detailed below.</p>
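        <p>A minimal sketch of this aggregation and filtering step is given below, using nested Python lists in place of tensors. Two details are assumptions made for the example, since they are not specified in the text: the per-token score is taken as the mean attention that each token receives from all queries, and the normalization is a min-max rescaling to the [0, 1] range. The toy attention values and function names are likewise hypothetical.</p>
        <preformat preformat-type="code">
```python
def mean_attention(attn):
    """Average attn[block][head][query][key] over blocks and heads."""
    n_blocks, n_heads, n_tokens = len(attn), len(attn[0]), len(attn[0][0])
    total = n_blocks * n_heads
    return [
        [
            sum(attn[b][h][q][k] for b in range(n_blocks) for h in range(n_heads)) / total
            for k in range(n_tokens)
        ]
        for q in range(n_tokens)
    ]


def token_scores(avg):
    """Assumed reduction: mean attention received per token, min-max scaled."""
    n = len(avg)
    received = [sum(avg[q][k] for q in range(n)) / n for k in range(n)]
    low, high = min(received), max(received)
    span = (high - low) or 1.0  # avoid division by zero for a flat map
    return [(r - low) / span for r in received]


def select_tokens(scores, threshold):
    """Indices of tokens whose normalized score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy input: 2 blocks, 1 head, 3 tokens
attn = [
    [[[0.1, 0.2, 0.7], [0.1, 0.3, 0.6], [0.2, 0.2, 0.6]]],
    [[[0.1, 0.4, 0.5], [0.1, 0.3, 0.6], [0.0, 0.4, 0.6]]],
]
scores = token_scores(mean_attention(attn))
kept = select_tokens(scores, threshold=0.6)
```
        </preformat>
        <p>In this toy example only the third token survives the filter, because it is the only one whose rescaled score lies above the chosen threshold.</p>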
        <p>The first heuristic begins by resizing each relevant crop to conform to the input
dimensions required by the pre-trained DINOv2 model. Following this preprocessing, each resized crop
serves as input to the pre-trained DINOv2 (with classifier) to obtain the classification scores for each
crop. From these scores, the predicted species is determined by selecting the most confident prediction that
surpasses a predefined probability threshold. In instances where no prediction meets this criterion, the species
associated with the highest overall probability score is selected by default. The final list of predicted
species for a given quadrat is then compiled by aggregating the unique species IDs from all individual
crop-level predictions within that quadrat.</p>
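        <p>The selection rule of the first heuristic can be sketched as follows. The sketch reflects one possible reading of the description: each retained crop contributes its top-1 species if the corresponding probability exceeds the threshold, and, if no crop passes, the single most probable species over all crops is returned. The species IDs, probabilities and the function name are hypothetical.</p>
        <preformat preformat-type="code">
```python
def predict_quadrat(crop_scores, prob_threshold):
    """crop_scores: one {species_id: probability} dict per retained crop."""
    if not crop_scores:
        return set()
    # Top-1 prediction of every crop as (species_id, probability)
    top1 = [max(scores.items(), key=lambda item: item[1]) for scores in crop_scores]
    confident = {species for species, prob in top1 if prob > prob_threshold}
    if confident:
        return confident
    # Fallback: no crop passed the threshold, keep the most probable species
    best_species, _ = max(top1, key=lambda item: item[1])
    return {best_species}


crops = [
    {101: 0.80, 202: 0.20},
    {101: 0.55, 303: 0.45},
    {404: 0.30, 202: 0.70},
]
species_found = predict_quadrat(crops, prob_threshold=0.5)
species_fallback = predict_quadrat(crops, prob_threshold=0.9)
```
        </preformat>
        <p>Returning a set makes the aggregation of unique species IDs per quadrat implicit, since repeated predictions of the same species across crops collapse into a single entry.</p>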
        <p>The second heuristic also follows the application of the attention threshold to the normalized attention map,
and consists of constructing a composite image as follows. For each patch with an attention score above the threshold,
an image is assembled as a square grid of neighbouring patches, using the relevant patch as the central
element. This newly assembled image is then resized via bicubic interpolation to match the input
dimensions of the pre-trained DINOv2 model and processed for classification. Analogous to the first
heuristic, this approach assumes the classifier can inherently filter irrelevant or outlier regions by
prioritizing predictions that achieve a confidence score above a specified probability threshold. Figure 2
illustrates this procedure, which yielded the best performance of the proposed
method. During this inference stage, we experimented with the classification parameters listed in
Table 1 to optimize predictive performance.</p>
        <p>[Figure 2: inference pipeline. Diagram labels: DINOv2 crop embeddings; Positional Encoding; Attention Scores; AVG (Across Blocks); AVG (Across Heads); Average Attention Scores; Extract Regions of Interest (patches in which attention scores are above some threshold); Assembling Images from neighboring patches (using relevant patch as central); Interpolate (B, 3, 518, 518); DINOv2 + Classifier.]</p>
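        <p>The grid assembly of the second heuristic amounts to index bookkeeping over the patch grid, which the sketch below illustrates. It returns the row-major indices of the patches surrounding a relevant patch; how patches at the image border are handled is not described in the text, so marking out-of-image positions as None (to be padded) is an assumption of this example, as are the function name and the grid size of 3 used in the usage lines.</p>
        <preformat preformat-type="code">
```python
def neighbourhood(center, rows, cols, grid):
    """Row-major patch indices of the grid x grid window centred on `center`.

    Positions outside the image are returned as None (assumed to be padded).
    """
    if grid % 2 == 0:
        raise ValueError("grid size must be odd so the patch can be central")
    half = grid // 2
    row, col = divmod(center, cols)
    window = []
    for r in range(row - half, row + half + 1):
        line = []
        for c in range(col - half, col + half + 1):
            inside = r >= 0 and rows > r and c >= 0 and cols > c
            line.append(r * cols + c if inside else None)
        window.append(line)
    return window


# A 2048 x 2048 input gives a 32 x 32 grid of 64-pixel patches
window = neighbourhood(center=33, rows=32, cols=32, grid=3)
corner = neighbourhood(center=0, rows=32, cols=32, grid=3)
```
        </preformat>
        <p>With 64-pixel patches, a window of this kind covers a larger image region than a single patch before it is resized to the classifier input, which is the additional local context that this heuristic relies on.</p>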
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experimental Settings</title>
        <p>Our custom ViT implementation is based on the I-JEPA architecture [12], in which the [cls] token
is disregarded. We evaluated two input image resolutions: 2048 × 2048 pixels and 3072 × 2048 pixels.
When processed into a grid of non-overlapping 64 × 64 pixel crops, these resolutions correspond to 1024
and 1536 patches (tokens) per image, respectively.</p>
        <p>Model optimization was performed for up to 30 epochs using the AdamW optimizer. This included a
cosine learning rate decay schedule [12], typically applied over 10 to 20 epochs of the training period,
and a weight decay schedule that increased linearly from 0.04 to 0.4. A consistent learning rate profile
was adopted across all experiments, defined by a starting learning rate of 5.0 × 10<sup>−6</sup>, an effective
peak learning rate of 1.0 × 10<sup>−3</sup>, and a final learning rate of 1.0 × 10<sup>−6</sup>. Early stopping
was applied, triggered by observed degradation in the quality of the attention maps and the onset of
model overfitting. Ablation studies extended to 100 epochs confirmed that training for 30 epochs with
this early stopping criterion was sufficient to achieve optimal performance.</p>
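        <p>The learning rate and weight decay profile can be sketched as plain functions of the training step. The linear warm-up from the starting to the peak learning rate is an assumption borrowed from common practice with this kind of schedule, and the step counts are hypothetical; only the three learning rate values and the weight decay range come from the settings reported above.</p>
        <preformat preformat-type="code">
```python
import math


def learning_rate(step, total_steps, warmup_steps,
                  start_lr=5.0e-6, peak_lr=1.0e-3, final_lr=1.0e-6):
    """Linear warm-up (assumed) followed by cosine decay down to final_lr."""
    if warmup_steps > step:
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def weight_decay(step, total_steps, start_wd=0.04, end_wd=0.4):
    """Linear increase from start_wd to end_wd over training."""
    return start_wd + (end_wd - start_wd) * min(step / total_steps, 1.0)


total, warmup = 3000, 300  # hypothetical step counts
lr_first = learning_rate(0, total, warmup)
lr_peak = learning_rate(warmup, total, warmup)
lr_last = learning_rate(total, total, warmup)
wd_mid = weight_decay(total // 2, total)
```
        </preformat>
        <p>In a training loop, the two values would be written into the optimizer parameter groups at every step before the update is applied.</p>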
        <p>All experiments were conducted on a system equipped with two NVIDIA A100-SXM 80GB GPUs. A
per-GPU batch size of 128 was employed, achieved through gradient accumulation. Our observations
indicated that this VRAM capacity constrained the maximum number of tokens processed simultaneously
to approximately 2048 for the given model configuration.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Table 2 summarizes the results of all the configurations evaluated. Configurations 1-4 used our first
classification strategy, which consisted of performing direct patch-wise classification over resized crops.</p>
      <p>As shown in Table 2, the patch-wise classification strategy yielded the lowest performance
among all strategies evaluated. Specifically, after applying an attention threshold of 0.6 and classifying the
resulting patches without any probability constraints (Configuration 1), the model achieved an F1-score
of 0.02114 on the private set. While a marginal improvement to an F1-score of 0.09511 was obtained by
introducing a probability threshold of 0.5 (Submission 3), the overall strategy performed
poorly. We attribute these suboptimal results to the low resolution of the individual
patches, which likely lack the contextual information required for the model to identify a
complete plant specimen effectively.</p>
      <p>In contrast, the alternative strategy of assembling a contextual grid of
adjacent patches around each relevant token (as depicted in Figure 2) yielded a remarkable performance
improvement. Using the exact same model weights, this strategy increased the F1-score from 0.09511
to 0.33131. This outcome corroborates our hypothesis that individual, low-resolution patches lack
sufficient contextual information for effective classification. Subsequent experiments with a smaller
grid size (5 in submissions 15 and 16) provided further evidence, demonstrating that reducing
the amount of contextual information by using fewer adjacent patches led to a deterioration in
classification performance. More information regarding checkpoint epoch selection is given in
Section 4.1.</p>
      <p>The best results were obtained with configuration 14, where the segmentation model was trained with
an input resolution of 3072 × 2048. This submission achieved an F1-score of 0.33331 on the private leaderboard, which
represents an improvement of approximately 0.5% over our first submission with this strategy (with
an input resolution of 2048 × 2048 pixels). This finding indicates that increasing the input resolution
may not be worthwhile, especially considering the computational and memory overhead associated with token
processing in the ViT.</p>
      <p>In summary, this decrease-and-conquer strategy enabled the proposed method to reach 5th place
on the private leaderboard of PlantCLEF 2025. Moreover, our best configuration achieved an F1-score
of 0.33331, compared with 0.36479 for the best-performing proposal on the leaderboard. In
absolute terms, our strategy is roughly 0.03 below the first place, which
indicates that it is competitive. We believe that, with further refinements, this strategy holds
the potential to achieve even higher performance in future iterations.</p>
      <sec id="sec-4-1">
        <title>4.1. Training Dynamics and Model Selection</title>
        <p>plant areas began to decrease, while scores for background elements started to intensify. An analysis of
the training loss curve (Figure 5) revealed that this deterioration coincided precisely with the onset
of loss convergence. The same behavior was observed for the 3072 × 2048 resolution, although the
degradation began earlier, after Epoch 10. This analysis suggested that later checkpoints could represent
overfitted models with less meaningful semantic attention relationships. Because of that, we selected
the checkpoints from Epoch 15 (for the 2048 × 2048 model) and Epoch 10 (for the 3072 × 2048 model)
for all inference tasks presented in Table 2.</p>
        <p>In other words, we observed that although the model initially learns correct semantic relations,
focusing its attention on plant features to reconstruct the class prototypes, as training progresses it
collapses by focusing on semantically irrelevant features, such as the quadrat frame. Our hypothesis is
that this model collapse is somewhat expected, especially considering the objective of reconstructing a
constant target matrix.</p>
        <p>A possible explanation is a divergence between the optimization of an explicit reconstruction
objective and the implicit goal of semantic segmentation, which, without appropriate regularization,
leads to the trivial solution of focusing on semantically irrelevant features. As a matter of fact, we
anticipated that the objective of reconstructing a constant target matrix could lead to the straightforward
trivial solution of the model ignoring the inputs and producing outputs that perfectly reconstruct
the target matrix. We believe that this does not occur in the early stages of training due to the prior
information added by replacing the original patch-embedding layer with the pre-trained DINOv2 model.
Nevertheless, the use of earlier training checkpoints is essential, as no explicit regularization was
provided.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we introduced a decrease-and-conquer strategy for the PlantCLEF 2025 challenge, where
a Vision Transformer (ViT) was trained on a proxy task to perform segmentation prior to plant
species identification. This strategy demonstrated competitive effectiveness, securing 5th place on the
private leaderboard of the challenge with a final F1-score of 0.33331. Our experiments confirmed that
providing local context by assembling adjacent patches for classification is a crucial and highly effective
strategy, significantly outperforming direct patch-wise classification. Given the narrow performance
gap to the top-ranking methods on the leaderboard, we believe the proposed approach represents a
promising direction for multi-label classification in complex ecological imagery through
domain adaptation, with further refinements holding the potential to achieve superior results.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by UEFS-AUXPPG 2023/2024/2025, CAPES-PROAP 2023/2024/2025,
CAPES grant 88887.159255/2025-00 and 88887.594676/2020-00 and UEFS FINAPESQ (grant 047/2023).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Chat-GPT-4o in order to: Grammar and spelling
check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and
take(s) full responsibility for the publication’s content.</p>
      <p>[5] S. Chulif, H. A. Ishrat, Y. L. Chang, S. H. Lee, Patch-wise inference using pre-trained vision
transformers: Neuon submission to plantclef 2024, in: Working Notes of CLEF 2024 - Conference
and Labs of the Evaluation Forum, volume 3740 of CEUR Workshop Proceedings, 2024.
CEUR-WS.org/Vol-3740/paper-192.pdf.</p>
      <p>[6] S. Foy, S. McLoughlin, Utilising dinov2 for domain adaptation in vegetation plot analysis, in:
Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, volume 3740 of
CEUR Workshop Proceedings, 2024. CEUR-WS.org/Vol-3740/paper-196.pdf.</p>
      <p>[7] R. Bao, Y. Sun, Y. Gao, J. Wang, Q. Yang, H. Chen, Z.-H. Mao, Y. Ye, A survey of heterogeneous
transfer learning, arXiv preprint arXiv:2310.08459 (2023). URL: http://arxiv.org/abs/2310.08459.
doi:10.48550/arXiv.2310.08459, arXiv:2310.08459 [cs].</p>
      <p>[8] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, A. Joulin, Emerging properties
in self-supervised vision transformers, in: Proceedings of the IEEE/CVF International Conference
on Computer Vision (ICCV), 2021, pp. 9650–9660.</p>
      <p>[9] T. Kattenborn, J. Leitloff, F. Schiefer, S. Hinz, Review on convolutional neural networks (cnn)
in vegetation remote sensing, ISPRS journal of photogrammetry and remote sensing 173 (2021)
24–49.</p>
      <p>[10] M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza,
F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra,
M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, P. Bojanowski,
Dinov2: Learning robust visual features without supervision, arXiv preprint arXiv:2304.07193
(2023).</p>
      <p>[11] H. Goëau, J.-C. Lombardo, A. Afouard, V. Espitalier, P. Bonnet, A. Joly, Plantclef 2024 pretrained
models on the flora of the south western europe based on a subset of pl@ntnet collaborative
images and a vit base patch 14 dinov2, https://doi.org/10.5281/zenodo.10848263, 2024.
doi:10.5281/zenodo.10848263.</p>
      <p>[12] M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, N. Ballas,
Self-supervised learning from images with a joint-embedding predictive architecture, in: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15619–15629.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of PlantCLEF 2024: Multi-species plant identification in vegetation plot images</article-title>
          , in:
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          , CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          , et al.,
          <article-title>Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janoušková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Martellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of LifeCLEF 2025: Challenges on species presence prediction and identification, and individual animal identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Martellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of PlantCLEF 2025: Multi-species plant identification in vegetation quadrat images</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>