<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tile-Based ViT Inference with Visual-Cluster Priors for Zero-Shot Multi-Species Plant Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Murilo Gustineli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anthony Miyaguchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Cheung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Divyansh Khattak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>North Ave NW, Atlanta, GA 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe DS@GT's second-place solution to the PlantCLEF 2025 challenge on multi-species plant identification in vegetation quadrat images. Our pipeline combines (i) a fine-tuned Vision Transformer ViTD2PC24All for patch-level inference, (ii) a 4 × 4 tiling strategy that aligns patch size with the network's 518 × 518 receptive field, and (iii) domain-prior adaptation through PaCMAP + K-Means visual clustering and geolocation filtering. Tile predictions are aggregated by majority vote and re-weighted with cluster-specific Bayesian priors, yielding a macro-averaged F1 of 0.348 (private leaderboard) while requiring no additional training. All code, configuration files, and reproducibility scripts are publicly available at github.com/dsgt-arc/plantclef-2025.</p>
      </abstract>
      <kwd-group>
        <kwd>Computer Vision</kwd>
        <kwd>Vision Transformers</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Transfer Learning</kwd>
        <kwd>CEUR-WS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        A total of seven teams participated in the PlantCLEF 2024 challenge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], three of which shared their solutions as working-note papers. Most participants leveraged the fine-tuned
ViT provided by the organizers, as training models from scratch using the 1.4M single-label images in
the training set poses significant computational challenges. The three main methods used were: (1)
Tiling-based inference with false-positive reduction (best approach) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; (2) Embedding extraction and
dimensionality reduction for classification [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; and (3) Multi-label classification with composite training
images [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets and Models</title>
      <sec id="sec-3-1">
        <title>3.1. Training dataset</title>
        <p>
          The training dataset is a subset of the Pl@ntNet [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] training data, composed of single-label plant species,
focusing on southwestern Europe (Figure 1). As supplied by the organizers, the dataset comprises 7,806
plant species in 1.4 million images, totaling 281GB (Table 1). The high-resolution images have 800 pixels
on their longest side, allowing the use of classification models that can handle large resolution inputs
and facilitating the prediction of small plants in large vegetative plots. The images were organized
into subfolders by class (i.e., species) and split into predefined train/validation/test sets to facilitate the
training of classification models.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Test dataset</title>
        <p>The test set features image quadrats of many floristic environments, emphasizing Pyrenean and
Mediterranean flora. All datasets are curated by experts and include a total of 2,105 high-resolution quadrat
images (Figure 2). The shooting protocols can differ considerably depending on the context, with
variations such as using wooden frames or measuring tape to outline the plot, or capturing images
from angles that may not be perfectly perpendicular to the ground due to the site’s slope. Furthermore,
image quality can fluctuate based on weather conditions, leading to factors like pronounced shadows,
blurred areas, and other visual inconsistencies.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Fine-tuned models</title>
        <p>
          The ViTD2 and ViTD2PC24 models are Vision Transformers (ViTs) pretrained using the DINOv2
Self-Supervised Learning (SSL) approach on the LVD-142M dataset, which contains 142 million images [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
These models were fine-tuned on the PlantCLEF 2024 dataset to address plant species identification [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
The original ViTD2 model serves as the backbone, pretrained with DINOv2 without the classifier head,
and was not further fine-tuned on PlantCLEF data. This model is mainly used for extracting general
image embeddings. The ViTD2PC24 models, however, build on top of the backbone with additional
supervised training tailored for plant classification.
        </p>
        <p>To simplify their naming, ViTD2PC24OC refers to the version where only the classifier head was
fine-tuned, while ViTD2PC24All refers to the model where both the backbone and classifier head were
fine-tuned. The models were made available to participants to facilitate their experiments, particularly
those with limited computational resources, and played a key role in developing solutions for the
PlantCLEF 2024 challenge. We exclusively used the ViTD2PC24All model, as it was more effective
at extracting richer embedding representations and achieved higher classification scores than its
counterpart.</p>
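        <p>To make the inference setup concrete, the following sketch shows how a fine-tuned checkpoint of this kind could be loaded through the timm library and queried for both the [CLS] embedding and species probabilities. The checkpoint and image paths are placeholders, and the exact loading procedure may differ from the organizers' release.</p>
        <preformat>
# Minimal sketch (assumptions: placeholder checkpoint and image paths).
import timm
import torch
from PIL import Image

model = timm.create_model(
    "vit_large_patch14_dinov2.lvd142m",
    pretrained=False,
    num_classes=7806,                      # PlantCLEF 2024 species
    img_size=518,
    checkpoint_path="model_best.pth.tar",  # placeholder path to the fine-tuned weights
)
model.eval()

config = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**config)

image = Image.open("quadrat_tile.jpg").convert("RGB")   # placeholder image
batch = transform(image).unsqueeze(0)                   # shape: (1, 3, 518, 518)

with torch.no_grad():
    tokens = model.forward_features(batch)          # patch tokens plus [CLS] token
    cls_embedding = tokens[:, 0]                    # [CLS] embedding used for clustering
    probabilities = model(batch).softmax(dim=-1)    # species probabilities
        </preformat>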
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Evaluation metric</title>
        <p>The task is evaluated using the Macro-Averaged F1 score per sample, which provides a balance between
precision and recall for multi-label classification. We reproduce the formula for completeness (Eq. 1–2).
The goal is to predict the presence of one or more plant species in high-resolution quadrat images. This
evaluation metric takes the average of the F1 scores computed individually for each vegetation plot.
The F1 score for each quadrat image i is calculated as the harmonic mean of precision and recall:</p>
        <p>F1_i = (2 · Precision_i · Recall_i) / (Precision_i + Recall_i)   (1)</p>
        <p>where Precision_i = TP_i / (TP_i + FP_i) and Recall_i = TP_i / (TP_i + FN_i), with TP_i, FP_i, and FN_i denoting the
true positives, false positives, and false negatives for image i. To ensure fairness across ecological regions
(transects), macro-averaging is applied in a two-step process:
1. F1 scores are averaged across all quadrat images within each transect.
2. These per-transect averages are then averaged across all transects to yield the final score:</p>
        <p>Final Score = (1/T) · Σ_{t=1}^{T} [ (1/N_t) · Σ_{i=1}^{N_t} F1_{t,i} ]   (2)</p>
        <p>where T is the number of transects, N_t is the number of quadrats in transect t, and F1_{t,i} is the F1
score of image i in transect t.</p>
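        <p>A minimal sketch of this two-step metric is shown below, assuming predictions and ground truth are given as sets of species identifiers per quadrat image, grouped by transect; the variable names and toy inputs are illustrative only.</p>
        <preformat>
from statistics import mean

def image_f1(predicted, actual):
    """Harmonic mean of precision and recall for one quadrat image (Eq. 1)."""
    if not predicted or not actual:
        return 0.0
    tp = len(predicted.intersection(actual))   # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)            # TP / (TP + FP)
    recall = tp / len(actual)                  # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)

def final_score(transects):
    """transects: dict mapping transect id to a list of (predicted, actual) set pairs (Eq. 2)."""
    per_transect = [
        mean(image_f1(pred, act) for pred, act in images)
        for images in transects.values()
    ]
    return mean(per_transect)

# Toy example with two transects of one and two quadrat images each.
score = final_score({
    "transect-A": [({"sp1", "sp2"}, {"sp1", "sp3"})],
    "transect-B": [({"sp4"}, {"sp4"}), ({"sp5", "sp6"}, {"sp6"})],
})
        </preformat>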
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>
        Our approach leverages the embedding space learned by the ViTD2PC24All model as a generalized
feature representation of images, which is used for classification (Figure 3). ViTD2PC24All learns robust
feature representations by processing images as sequences of fixed-size patch tokens with an additional
[CLS] token for classification tasks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These tokens serve as low-dimensional representations of
the image patches, similar to words in a phrase for language models. The main challenge lies in
overcoming the domain shift between single-label training images and multi-label test images. We
perform a tiling approach, dividing each test image into a grid of n × n tiles. Our code is available at
github.com/dsgt-arc/plantclef-2025.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Tiled inference</title>
        <p>We use the fine-tuned ViTD2PC24All model as a baseline classifier. To bridge the gap between
single-label training images and multi-label test images, we apply a tiling-based classification strategy
at inference time, leveraging the fine-tuned classification head. Each high-resolution test image is
partitioned into a fixed-size grid of non-overlapping tiles (e.g.,
3 × 3 or 4 × 4). Each tile is independently classified using the fine-tuned ViT model. This method enables
localized prediction within the image and helps with the mismatch between the global multi-label test
images and the local single-label learning context of the model. To produce image-level predictions,
we aggregate tile-level predictions across the image and rank species based on their frequency of
occurrence among the top-K predictions per tile. The most frequent predicted species are selected as
the final image-level labels.</p>
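        <p>The sketch below illustrates this tile-and-vote procedure, assuming a classify function that returns a probability vector over species for a single tile (for instance, the timm model sketched earlier); the grid size and top-K follow the values reported below, while the number of final labels returned is illustrative.</p>
        <preformat>
from collections import Counter
import numpy as np

def tile_image(image, grid=4):
    """Split a quadrat image (PIL.Image) into a grid x grid set of non-overlapping tiles."""
    width, height = image.size
    tile_w, tile_h = width // grid, height // grid
    return [
        image.crop((col * tile_w, row * tile_h, (col + 1) * tile_w, (row + 1) * tile_h))
        for row in range(grid)
        for col in range(grid)
    ]

def predict_image(image, classify, grid=4, top_k=9, n_labels=5):
    """Aggregate per-tile top-K predictions by frequency of occurrence across tiles."""
    votes = Counter()
    for tile in tile_image(image, grid):
        probs = classify(tile)                    # probability vector, shape (num_species,)
        top_species = np.argsort(probs)[-top_k:]  # indices of the top-K classes for this tile
        votes.update(int(s) for s in top_species)
    # The most frequently predicted species become the image-level labels.
    return [species for species, _ in votes.most_common(n_labels)]
        </preformat>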
        <p>We empirically determined the optimal grid size to be 4 × 4 tiles. This choice aligns
with the input resolution of the fine-tuned model ViTD2PC24All, which is based on the
timm/vit_large_patch14_dinov2.lvd142m architecture and expects images resized to 518 × 518.
A typical high-resolution quadrat image is approximately 2000 pixels wide, so partitioning it into a 4 × 4
grid yields sub-images of roughly 500 pixels per side, closely matching the model’s expected input
size. Using smaller or larger grid sizes leads to image downscaling or upscaling during preprocessing,
degrading feature quality and decreasing classification performance.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Geolocation filtering</title>
        <p>To address the domain-shift problem between training and test data – where the training data has 7,806
species and the test data has roughly 800 species from Southwestern Europe – we used geolocation
metadata from the training images to narrow down likely species candidates. We defined a reference
point in Southern France (44°N, 4°E) and computed the squared Euclidean distance between this point
and each species observation. For each species, we selected the closest known geotagged observation.
We then filtered species whose nearest observation falls within the geographic boundaries of countries
relevant to the test set (France, Spain, Italy, and Switzerland), as shown in Figure 4. This geospatial
filtering reduced the search space from thousands of global species to a plausible subset of 4,981 species,
improving prediction relevance and mitigating the long-tailed class distribution (Table 3).</p>
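        <p>A minimal sketch of this filter is given below, assuming the training metadata is available as a pandas DataFrame with species_id, latitude, longitude, and country columns; the column names, and the use of a country column in place of explicit boundary polygons, are assumptions.</p>
        <preformat>
import pandas as pd

REFERENCE = (44.0, 4.0)   # reference point in Southern France (44°N, 4°E)
COUNTRIES = {"France", "Spain", "Italy", "Switzerland"}

def filter_species(metadata: pd.DataFrame):
    """Return the set of species whose closest observation lies in a test-relevant country."""
    df = metadata.copy()
    # Squared Euclidean distance (in degrees) between each observation and the reference point.
    df["sq_dist"] = (df["latitude"] - REFERENCE[0]) ** 2 + (df["longitude"] - REFERENCE[1]) ** 2
    # Keep, for each species, its single closest geotagged observation.
    nearest = df.loc[df.groupby("species_id")["sq_dist"].idxmin()]
    # Retain species whose nearest observation falls inside the relevant countries.
    keep = nearest[nearest["country"].isin(COUNTRIES)]
    return set(keep["species_id"])
        </preformat>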
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Visual-Cluster Bayesian prior adaptation</title>
        <p>To address the domain shift and class imbalance between training and test sets, we introduce a strategy
to prioritize likely species in the test set. We grouped test images by their corresponding region
identifiers, which are present in the quadrat_id field. These identifiers represent the origin of the
vegetation plots and were used to cluster images based on their location. We defined 13 regions based
on the test set quadrat_id naming format and assigned each image to its respective region (Table 2).</p>
        <p>
          We utilized the ViTD2PC24All model to extract the [CLS] token embeddings of the test set images and
projected the embeddings into two dimensions using PaCMAP [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to visually explore their structure (Figure 5a). By coloring points based on their region labels, we observed that quadrats from the
same region tend to cluster together, revealing three well-defined clusters. This suggests that certain
geographic or ecological similarities—such as altitude or vegetation type—may drive visual similarity
among regions, which can be leveraged to improve classification performance under domain shift.
        </p>
        <p>After visualizing the PaCMAP embeddings, we applied K-Means clustering to group the quadrats into
three unsupervised clusters based on visual similarity. We then assigned each region to its dominant
cluster by identifying where the majority of its images were grouped (Figure 5b). This provided a
meaningful stratification of the test set, allowing us to model regional variation in species composition
and incorporate cluster-specific priors for downstream classification. The final region distribution in
the test set is summarized in Table 2.</p>
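        <p>The clustering step can be sketched as follows, assuming embeddings is an (N, D) array of [CLS] token embeddings for the N test images and regions lists the region identifier parsed from each image's quadrat_id; running K-Means on the 2-D PaCMAP projection (rather than the raw embeddings) and the random seed are assumptions of this sketch.</p>
        <preformat>
from collections import Counter
import pacmap
from sklearn.cluster import KMeans

def cluster_quadrats(embeddings, regions, n_clusters=3, seed=42):
    # Project the high-dimensional [CLS] embeddings to 2-D for visualization.
    projection = pacmap.PaCMAP(n_components=2).fit_transform(embeddings)
    # Group the projected quadrats into three unsupervised visual clusters.
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(projection)
    # Assign each region to the cluster holding the majority of its images.
    region_to_cluster = {}
    for region in set(regions):
        members = [labels[i] for i, r in enumerate(regions) if r == region]
        region_to_cluster[region] = Counter(members).most_common(1)[0][0]
    return labels, region_to_cluster
        </preformat>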
        <p>We hypothesize that the three dominant K-Means clusters represent different altitude levels where
the test set images were taken:
• Cluster 1: “Coastal and Salt-Tolerant Plants” – Salt-tolerant and drought-resistant, coastal
dunes, salt marshes, and sandy habitats.
• Cluster 2: “Alpine and Sub-alpine Specialists” – Hardy, low-growing plants adapted to cold,
high-altitude environments (alpine meadows and rocky slopes).
• Cluster 3: “Alpine Grasses and Ferns” – Resilient grasses and ferns, this cluster thrives in
alpine grasslands and sub-alpine zones, often in rocky or well-drained soils.</p>
        <p>To assign the descriptive labels in the bullet list, we first identified the most frequent species within
each visual cluster by averaging the per-image class-probability vectors. The top species for every
cluster were then given to ChatGPT, which returned concise ecological summaries that we adopted as
cluster names.</p>
        <p>Figure 5: (a) PaCMAP projection colored by region; (b) PaCMAP projection colored by K-Means cluster.</p>
        <p>We subsequently incorporated region-specific Bayesian priors into the tile-based inference pipeline.
The PaCMAP + K-Means step yields, for every cluster c, an empirical prior distribution P(s | c) over species s, obtained
by averaging the model’s predicted probability vectors across all images in that cluster. During inference,
we re-weight each tile’s class probabilities by this prior, increasing the bias towards species that are
visually and geographically likely for that cluster. This approach helped narrow down the candidate
species space for each test image and improved robustness to underrepresented classes. This is particularly
important given the shift from single-label training images to multi-label plot images in the test set.</p>
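        <p>The re-weighting step reduces to an element-wise product between each tile's probability vector and its cluster prior, sketched below; the renormalization and the data structures holding the per-cluster probability vectors are assumptions.</p>
        <preformat>
import numpy as np

def cluster_priors(cluster_probs):
    """Empirical prior P(s | c): mean predicted probability vector within each cluster."""
    return {c: np.mean(np.stack(vectors), axis=0) for c, vectors in cluster_probs.items()}

def reweight(tile_probs, prior):
    """Re-weight one tile's class probabilities by its cluster prior and renormalize."""
    weighted = tile_probs * prior
    return weighted / weighted.sum()

# Usage: priors = cluster_priors(cluster_probs)
#        adjusted = reweight(tile_probs, priors[cluster_id])
        </preformat>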
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We evaluated our approaches on the hidden test set provided on the Kaggle competition leaderboard.
Table 3 presents an ablation study comparing different variants of our classification pipeline. The
naive baseline–selecting the top-K most frequent species in the training data–achieved negligible
performance across both public and private leaderboard splits. Introducing the fine-tuned ViT-based
classifier without tiling improved results only marginally, highlighting the difficulty of processing
high-resolution vegetation plots holistically. Tiling test images into a 4 × 4 grid (matching the input
resolution of the fine-tuned ViTD2PC24All model) led to a substantial performance gain. Specifically,
selecting the top-9 predictions per tile yielded a private leaderboard F1 score of 0.3442, representing a
strong baseline for multi-label classification using patch-level aggregation.</p>
      <p>To further mitigate the domain shift and long-tailed class distribution challenges, we incorporated two
complementary strategies: (1) cluster-based priors derived from PaCMAP+K-Means embeddings, and
(2) spatial filtering using geolocation priors. Applying cluster-specific Bayesian reweighting improved
the private leaderboard score to 0.3483. Alternatively, geolocation-based filtering—removing species
unlikely to occur near the test region—resulted in a private score of 0.3449 and the highest public
leaderboard score of 0.3160. These findings demonstrate that both spatially-aware inference and prior
reweighting provide valuable regularization, yielding competitive performance without modifying the
underlying model architecture.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>
        Our ablation study shows that simply forwarding the full-resolution quadrat image through the
fine-tuned ViT barely surpasses a frequency-based baseline (F1 ≈ 0.006; Table 3). Once the image is tiled
into 4 × 4 sub-images whose side length roughly matches the 518 px receptive field of ViTD2PC24All,
macro-F1 jumps by nearly two orders of magnitude (to 0.34). This finding echoes recent work on high-resolution
ViTs, where tile- or window-based inference is consistently reported as the most reliable way to preserve
fine-grained cues without exceeding GPU memory limits [
        <xref ref-type="bibr" rid="ref11">11, 12</xref>
        ].
      </p>
      <p>Adding visual-cluster Bayesian priors yields a further +0.004 improvement. By averaging the
model’s own probability vectors inside PaCMAP + K-Means clusters, we obtain an empirical prior
P(s | c) that captures region-specific floristic bias; re-weighting tile probabilities with this prior is related
to the "context-conditioned" re-ranking used in training-free zero-shot pipelines [13] and to Bayesian
reweighting strategies explored for low-shot recognition [14, 15]. The alternative geolocation filter
achieves the best public-leaderboard score (0.316) but only matches the prior-adapted model privately.
This suggests that purely spatial heuristics over-prune plausible long-tail species that remain detectable
when appearance cues and cluster priors are combined.</p>
      <sec id="sec-6-1">
        <title>6.1. Limitations and future work</title>
        <p>While our training-free pipeline demonstrates that tile-based ViT inference plus cluster-aware Bayesian
priors can reach competitive accuracy, several limitations remain that shape our next research steps. First,
the backbone we rely on is already fine-tuned on single-label PlantCLEF 2024 data, so our "zero-shot"
claim holds only for the 2025 task; extending this strategy to domains that lack such a pre-fine-tuned
model remains an open challenge. Second, non-overlapping square tiles risk bisecting plants at tile
boundaries; sliding-window inference [16], learned token merging [17], or adaptive receptive-field [18]
methods such as ViTAR [19] could recover boundary context without prohibitive compute. Finally,
a lightweight round of self-training on high-confidence tile pseudo-labels, or ensembling with CNN
backbones that capture texture cues absent in ViTs [20], could raise the current 0.348 macro-F1 ceiling
while keeping compute modest.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We presented a fully training-free pipeline that combines tile-based ViT inference, geolocation filtering,
and visual-cluster Bayesian priors to tackle the PlantCLEF 2025 multi-label plant identification challenge.
Starting from a publicly released, PlantCLEF-fine-tuned ViT, our method boosts macro-F1 from 0.006
to 0.348 on the private leaderboard—good for second place—without updating a single model weight.
The study confirms three take-aways: (1) matching the inference tile scale to the ViT’s receptive field
is critical for high-resolution plant imagery; (2) unsupervised visual clustering provides a cheap yet
powerful prior that complements spatial heuristics; and (3) zero-training adaptation is competitive
when domain-specific compute or labels are scarce. All code and artifacts are open-sourced to support
follow-up research on even more challenging biodiversity datasets.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>We thank the Data Science at Georgia Tech (DS@GT) CLEF competition group for their support. This
research was supported in part through research cyberinfrastructure resources and services provided by
the Partnership for an Advanced Computing Environment (PACE) at the Georgia Institute of Technology,
Atlanta, Georgia, USA [21].</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools for writing this working note paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Martellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of PlantCLEF 2025: Multi-species plant identification in vegetation quadrat images</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janoušková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Martellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of LifeCLEF 2025: Challenges on species presence prediction and identification, and individual animal identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF)</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of PlantCLEF 2024: Multi-species plant identification in vegetation plot images</article-title>
          ,
          <source>in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Foy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McLoughlin</surname>
          </string-name>
          ,
          <article-title>Utilising dinov2 for domain adaptation in vegetation plot analysis</article-title>
          ,
          <source>in: Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gustineli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miyaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stalter</surname>
          </string-name>
          ,
          <article-title>Multi-label plant species classification with self-supervised vision transformers</article-title>
          ,
          <source>arXiv preprint arXiv:2407.06298</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chulif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Ishrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. L.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Patch-wise inference using pre-trained vision transformers: Neuon submission to plantclef 2024</article-title>
          ,
          <source>in: Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bakić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Barbe</surname>
          </string-name>
          , I. Yahiaoui,
          <string-name>
            <given-names>S.</given-names>
            <surname>Selmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Barthélémy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Boujemaa</surname>
          </string-name>
          , et al.,
          <article-title>Pl@ntNet mobile app</article-title>
          ,
          <source>in: Proceedings of the 21st ACM international conference on Multimedia</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>423</fpage>
          -
          <lpage>424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oquab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darcet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Moutakanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Vo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szafraniec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khalidov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Haziza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Nouby</surname>
          </string-name>
          , et al.,
          <article-title>Dinov2: Learning robust visual features without supervision</article-title>
          ,
          <source>arXiv preprint arXiv:2304.07193</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , T. Jiang,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jin</surname>
          </string-name>
          , W. Zeng,
          <article-title>[CLS] token is all you need for zero-shot semantic segmentation</article-title>
          ,
          <source>arXiv preprint arXiv:2304.06212</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shaposhnik</surname>
          </string-name>
          ,
          <article-title>Understanding how dimension reduction tools work: An empirical approach to deciphering t-sne, umap, trimap, and pacmap for data visualization</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>73</lpage>
          . URL: http://jmlr.org/papers/v22/20-1061.html.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Leroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Revaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lucas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Weinzaepfel</surname>
          </string-name>
          ,
          <article-title>Win-win: Training high-resolution vision transformers from two windows</article-title>
          ,
          <source>arXiv preprint arXiv:2310.00632</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Z. Li, S. F. Bhat, P. Wonka, Patchfusion: An end-to-end tile-based framework for high-resolution monocular metric depth estimation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10016–10025.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] B. An, S. Zhu, M.-A. Panaitescu-Liess, C. K. Mummadi, F. Huang, Perceptionclip: Visual classification by inferring and conditioning on contexts, arXiv preprint arXiv:2308.01313 (2023).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Y. Miao, Y. Lei, F. Zhou, Z. Deng, Bayesian exploration of pre-trained models for low-shot image classification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 23849–23859.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Z. Ji, X. Chai, Y. Yu, Z. Zhang, Reweighting and information-guidance networks for few-shot learning, Neurocomputing 423 (2021) 13–23.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] A. Dede, H. Nunoo-Mensah, E. T. Tchao, A. S. Agbemenu, P. E. Adjei, F. A. Acheampong, J. J. Kponyo, Deep learning for efficient high-resolution image processing: A systematic review, Intelligent Systems with Applications (2025) 200505.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Y. Niu, Z. Song, Q. Luo, G. Chen, M. Ma, F. Li, Atmformer: An adaptive token merging vision transformer for remote sensing image scene classification, Remote Sensing 17 (2025) 660.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] M. Fayyaz, S. A. Koohpayegani, F. R. Jafari, S. Sengupta, H. R. V. Joze, E. Sommerlade, H. Pirsiavash, J. Gall, Adaptive token sampling for efficient vision transformers, in: European Conference on Computer Vision, Springer, 2022, pp. 396–414.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Q. Fan, Q. You, X. Han, Y. Liu, Y. Tao, H. Huang, R. He, H. Yang, Vitar: Vision transformer with any resolution, arXiv preprint arXiv:2403.18361 (2024).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] W. Hussain, M. F. Mushtaq, M. Shahroz, U. Akram, E. S. Ghith, M. Tlija, T.-h. Kim, I. Ashraf, Ensemble genetic and cnn model-based image classification by enhancing hyperparameter tuning, Scientific Reports 15 (2025) 1003.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] PACE, Partnership for an Advanced Computing Environment (PACE), 2017. URL: http://www.pace.gatech.edu.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>