<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mushroom for Improvement: Prototypical Few-Shot Learning with Multimodal Fungal Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tuan-Anh Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Quang Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>VNU-HCM University of Science</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>We present a multimodal few-shot classification pipeline for the FungiCLEF 2025 challenge, addressing the task of fine-grained fungal species recognition from sparse and heterogeneous observations. Our approach integrates visual features from three pretrained image encoders (BioCLIP, SigLIP ViT-B/16, and DINOv2) with textual descriptions and structured metadata using a unified multimodal embedding. The model is trained in two stages: initial supervised pretraining of a multimodal encoder followed by prototypical network fine-tuning under an episodic few-shot regime. We further apply an observation-level reranking strategy that aggregates predictions across multiple images per observation via a weighted voting scheme. Evaluation on the official FungiCLEF 2025 public and private test sets demonstrates strong performance, with Recall@5 scores of 0.57079 and 0.55498, respectively. Ablation results confirm the additive benefit of combining image, text, and metadata features. The code is available at https://github.com/YangTuanAnh/FungiCLEF2025.</p>
      </abstract>
      <kwd-group>
        <kwd>fine-grained visual categorization</kwd>
        <kwd>few-shot learning</kwd>
        <kwd>prototypical networks</kwd>
        <kwd>multimodal representation learning</kwd>
        <kwd>foundational models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The FungiCLEF 2025 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] Challenge addresses few-shot recognition of fungi species using real-world data
comprising multiple photographs, rich metadata (e.g., location, substrate, toxicity), satellite imagery,
and meteorological variables. The task requires models to return a ranked list of species predictions per
observation, despite the challenges of large class diversity and many rare or under-recorded species
with limited training data.
      </p>
      <p>The motivation behind this challenge lies in the need to support mycologists, citizen scientists, and
nature enthusiasts in species identification while contributing to biodiversity data collection. To be
practical for large-scale citizen science projects, models must efficiently handle numerous classes, including
those with scarce observations, and operate under limited computational resources. Importantly, rare
species are often excluded from training data, complicating AI models’ ability to recognize them.</p>
      <p>In this work, we introduce a robust multimodal classification framework that integrates image, text, and
metadata features for fungal species recognition, trained with a two-stage strategy of supervised pretraining
followed by prototypical few-shot fine-tuning, and refined with an observation-level reranking scheme that
aggregates predictions across the images of each observation.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Fine-grained visual categorization (FGVC) focuses on distinguishing between visually similar categories,
such as species or subspecies, and often requires models to learn subtle appearance differences under
limited supervision [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Few-shot learning methods address the scarcity of labeled data by enabling
generalization to novel classes with only a few examples; metric-based approaches like Prototypical
Networks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] learn class prototypes in an embedding space and classify queries based on their distances
to these prototypes. Multimodal representation learning has been advanced by models such as CLIP
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and SigLIP [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which align image and text embeddings via contrastive objectives, enabling robust
cross-modal understanding. These foundational models have been adapted to specialized domains,
including biology, through domain-specific fine-tuning [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        We used the official FungiCLEF 2025 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] dataset, which uses a sample of the FungiTastic dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
The training and validation sets include fungal observations from the Atlas of Danish Fungi submitted
before 2024. Each entry features expert-annotated images and rich metadata, including satellite data,
weather, timestamps, locations, substrate, habitat, and toxicity. Most entries are fully annotated.
      </p>
      <p>For the official evaluation, the test set comprises a separate collection of images that remained
unpublished until the challenge concluded. While partial test results were made publicly available
during the competition, the outcomes on the private subset were revealed after the submission deadline.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <sec id="sec-4-1">
        <title>4.1. Image Encoders</title>
        <p>
          The FungiTastic [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] paper evaluated the performance of BioCLIP [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], CLIP[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], and DINOv2[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] on
few-shot image classification tasks, identifying BioCLIP[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] as the strongest baseline due to its
domain-specific training on biological imagery. Building on this finding, we investigate whether combining
features from all three models can improve classification performance beyond the baseline.
        </p>
        <p>
          We extract visual features using three pretrained models. BioCLIP [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is a vision-language model
fine-tuned on biological image data, which enhances its ability to generalize across diverse species. SigLIP
ViT-B/16 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is a contrastive vision-language model based on the ViT-B/16 architecture and trained
with a sigmoid loss, offering improved embedding robustness over traditional CLIP-style objectives
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. DINOv2 Base [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is a self-supervised vision transformer that captures fine-grained spatial and
texture-level representations.
        </p>
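        <p>The sketch below illustrates how per-image features from the three backbones could be extracted;
the open_clip and torch.hub checkpoint identifiers and the preprocessing shown are assumptions of this
sketch and may differ from our exact implementation.</p>
        <preformat>
# Hedged sketch: per-image feature extraction with BioCLIP, SigLIP ViT-B/16, and DINOv2.
import torch
import open_clip
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP-style encoders via open_clip (public hub identifiers shown).
bioclip, _, bioclip_pre = open_clip.create_model_and_transforms("hf-hub:imageomics/bioclip")
siglip, _, siglip_pre = open_clip.create_model_and_transforms("ViT-B-16-SigLIP", pretrained="webli")
bioclip, siglip = bioclip.to(device).eval(), siglip.to(device).eval()

# DINOv2 Base via torch.hub; standard ImageNet preprocessing used as a simplification.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").to(device).eval()
dino_pre = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def encode_image(path):
    """Return the concatenated, per-encoder L2-normalized feature vector for one image."""
    img = Image.open(path).convert("RGB")
    feats = [
        bioclip.encode_image(bioclip_pre(img).unsqueeze(0).to(device)),
        siglip.encode_image(siglip_pre(img).unsqueeze(0).to(device)),
        dinov2(dino_pre(img).unsqueeze(0).to(device)),  # CLS-token embedding
    ]
    feats = [torch.nn.functional.normalize(f, dim=-1) for f in feats]
    return torch.cat(feats, dim=-1).squeeze(0).cpu()
</preformat>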
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Text and Metadata Features</title>
        <p>Textual descriptions and structured metadata are encoded using both rule-based and statistical methods.
For textual descriptions, we extract interpretable features such as color, shape, texture, growth pattern,
habitat, and size using regular expression matching. These rule-based attributes are complemented by a
TF-IDF vectorizer, which captures general linguistic patterns in the descriptions.</p>
        <p>Structured metadata, including categorical fields like month, habitat, and substrate, is encoded using
one-hot encoding. A fixed schema—defined based on the training set—ensures consistent encoding
across training, validation, and test splits. The text and metadata features are concatenated into a single
vector, which is optionally included in the classification model.
</p>
        <p>The rule-based vocabulary groups matched by the regular expressions are:
colors (white, cream, yellow, orange, red, pink, purple, blue, green, brown, black, gray/grey, tan, golden,
beige, buff, olive, rusty); shapes (spherical, round, globose, ball, cap, pileus, umbrella, conical, cone, flat,
convex, depressed, shelf, bracket, club, coral, fan, cup, disc, bell); textures (smooth, slimy, sticky, viscid,
rough, bumpy, warty, scaly, fibrous, hairy, velvety, fuzzy, ridged, wrinkled, grooved, pitted, powdery,
granular, cracked); growth patterns (cluster, clustered, group, gregarious, scattered, solitary, single,
individual, caespitose, troops, fairy ring, circle, row, line, tuft, dense, packed); substrate and habitat terms
(soil, ground, earth, wood, log, trunk, stump, branch, leaf, leaves, needle, needles, litter, moss, grass, dung,
manure, compost, mulch); and morphological terms (mycelium, hyphae, spore, fruiting body, basidium,
basidia, ascus, asci, cystidia, stipe, pileus, lamellae, gills, pores, annulus, volva, universal veil, partial veil,
hymenium, cap, stem, stalk).</p>
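        <p>A minimal sketch of this featurization is given below, assuming scikit-learn; the keyword lists are
abbreviated and the helper names (rule_based_features, fit, transform) are illustrative, not our exact code.</p>
        <preformat>
# Hedged sketch of the rule-based + TF-IDF text features and one-hot metadata encoding.
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Abbreviated excerpt of the keyword vocabularies listed above.
VOCAB = {
    "color":   ["white", "cream", "yellow", "brown", "black"],
    "shape":   ["convex", "conical", "flat", "cup", "bell"],
    "texture": ["smooth", "slimy", "scaly", "hairy", "wrinkled"],
}

def rule_based_features(description):
    """Binary indicator for each keyword matched in the free-text description."""
    text = description.lower()
    return np.array([1.0 if re.search(rf"\b{kw}\b", text) else 0.0
                     for kws in VOCAB.values() for kw in kws])

tfidf = TfidfVectorizer(max_features=512)                              # general linguistic patterns
onehot = OneHotEncoder(handle_unknown="ignore", sparse_output=False)   # fixed schema from training set

def fit(train_descriptions, train_metadata):
    tfidf.fit(train_descriptions)
    onehot.fit(train_metadata)          # e.g. columns: month, habitat, substrate

def transform(description, metadata_row):
    parts = [
        rule_based_features(description),
        tfidf.transform([description]).toarray()[0],
        onehot.transform([metadata_row])[0],
    ]
    return np.concatenate(parts)        # single text + metadata feature vector
</preformat>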
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Multimodal Feature Fusion</title>
        <p>To form a unified representation of each fungal observation, we concatenate embeddings from three
distinct modalities: image embeddings (from BioCLIP, SigLIP, and DINOv2), structured metadata (encoded
as one-hot vectors), and textual features (both rule-based and TF-IDF). The resulting multimodal feature
vector is L2-normalized and used as the input to the metric-based few-shot classification pipeline.</p>
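        <p>A minimal fusion sketch (the function name fuse is illustrative):</p>
        <preformat>
# Hedged sketch: concatenate image, metadata, and text features, then L2-normalize.
import torch
import torch.nn.functional as F

def fuse(image_feats, metadata_feats, text_feats):
    """image_feats: concatenated BioCLIP/SigLIP/DINOv2 embeddings for one image."""
    fused = torch.cat([image_feats, metadata_feats, text_feats], dim=-1)
    return F.normalize(fused, p=2, dim=-1)   # unit-length multimodal vector
</preformat>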
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Two-Stage Training Strategy</title>
        <p>
          Our training pipeline follows a two-stage approach. In the first stage, the multimodal encoder is
pretrained using standard supervised classification. In the second stage, we adapt the encoder for
few-shot learning using a prototypical network [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <sec id="sec-4-4-1">
          <title>4.4.1. Stage 1: Supervised Encoder Pretraining</title>
          <p>The encoder is initially trained in a fully supervised setting with a cross-entropy loss. We optimize for
200 epochs using AdamW (learning rate $10^{-3}$, weight decay $10^{-4}$) and a batch size of 256. During this
stage, all backbone encoders (e.g., BioCLIP, DINOv2) remain frozen. The goal is to initialize the encoder
with representations that are discriminative across the full training set.</p>
          <p>The multimodal input is passed through an MLP encoder comprising a hidden layer of size 512, batch
normalization, ReLU activation, and dropout (rate 0.1). The output is projected to a 512-dimensional
L2-normalized embedding. A classifier head maps this embedding to logits over the $C$ training classes via a linear
layer (256 units), batch normalization, ReLU, dropout (rate 0.2), and a final projection to $\mathbb{R}^{C}$.</p>
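          <p>The following PyTorch sketch mirrors the layer sizes described above; the class names and the
symbol $C$ for the number of classes are our own notation, not the exact implementation.</p>
          <preformat>
# Hedged sketch of the Stage-1 multimodal encoder and classifier head.
import torch.nn as nn
import torch.nn.functional as F

class MultimodalEncoder(nn.Module):
    def __init__(self, input_dim, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512),   # hidden layer of size 512
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, embed_dim),   # projection to the 512-d embedding
        )

    def forward(self, x):
        return F.normalize(self.net(x), p=2, dim=-1)   # L2-normalized embedding

class ClassifierHead(nn.Module):
    def __init__(self, embed_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, num_classes),  # final projection to logits over C classes
        )

    def forward(self, z):
        return self.net(z)
</preformat>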
        </sec>
        <sec id="sec-4-4-2">
          <title>4.4.2. Stage 2: Prototypical Network Fine-Tuning</title>
          <p>
            In the second stage, we adapt the encoder for few-shot classification using a prototypical network [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
For each $N$-way $K$-shot task, class prototypes are constructed by averaging the embeddings of the $K$
support examples per class:
          </p>
          <p>$$\mathbf{c}_k = \frac{1}{|S_k|} \sum_{x_i \in S_k} f_\theta(x_i),$$
where $S_k$ denotes the support set of class $k$ and $f_\theta$ the multimodal encoder.</p>
          <p>Given a query embedding $f_\theta(x)$, classification is performed by computing the squared Euclidean
distance to each prototype and applying a softmax over the negative distances:
$$p(y = k \mid x) = \frac{\exp\left(-\lVert f_\theta(x) - \mathbf{c}_k \rVert_2^2\right)}{\sum_{k'} \exp\left(-\lVert f_\theta(x) - \mathbf{c}_{k'} \rVert_2^2\right)}.$$</p>
          <p>The episodic training objective is then the average cross-entropy loss across all queries $Q$ in the task:
$$\mathcal{L}_{\text{proto}} = -\frac{1}{|Q|} \sum_{(\mathbf{x}, y) \in Q} \log p(y \mid \mathbf{x}).$$
The encoder parameters $\theta$ are jointly optimized using the AdamW optimizer [11] with a learning rate
of $10^{-3}$ and weight decay of $10^{-4}$. All pretrained backbones (e.g., BioCLIP, DINOv2) are kept frozen
during this phase.</p>
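          <p>A sketch of the episode loss implementing these equations (function and argument names are
illustrative):</p>
          <preformat>
# Hedged sketch of the prototypical loss for one episode.
import torch
import torch.nn.functional as F

def prototypical_loss(support_emb, support_y, query_emb, query_y, n_way):
    """support_emb: [N*K, D]; query_emb: [Q, D]; labels are episode-local ids 0..n_way-1."""
    # Class prototypes: mean of the K support embeddings per class.
    prototypes = torch.stack([support_emb[support_y == k].mean(dim=0)
                              for k in range(n_way)])          # [n_way, D]
    # Squared Euclidean distance between each query and each prototype.
    dists = torch.cdist(query_emb, prototypes).pow(2)          # [Q, n_way]
    log_p = F.log_softmax(-dists, dim=-1)                      # softmax over negative distances
    return F.nll_loss(log_p, query_y)                          # mean cross-entropy over queries
</preformat>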
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Episodic Training Procedure</title>
        <p>To enable few-shot generalization, we adopt an episodic training strategy that mirrors the test-time
setting. Each episode is a 5-way 5-shot task with 15 query examples per class (75 queries total),
encouraging adaptation to novel classes under limited supervision. Training spans 20 meta-epochs
of 200 episodes each. For every episode, the model computes class prototypes from the support set
and evaluates the query set, using a classification loss based on distances to prototypes. The shared
encoder and classifier are updated via AdamW with stage-1 hyperparameters, while the image and text
backbones remain frozen.</p>
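        <p>The sketch below illustrates how one such episode could be sampled, assuming per-class feature
tensors are precomputed; the dictionary embeddings_by_class and the function name are hypothetical.</p>
        <preformat>
# Hedged sketch: sampling one 5-way 5-shot episode with 15 queries per class.
import random
import torch

def sample_episode(embeddings_by_class, n_way=5, k_shot=5, n_query=15):
    """embeddings_by_class: dict mapping class id to an [M, D] tensor, M at least k_shot + n_query."""
    classes = random.sample(list(embeddings_by_class), n_way)
    support, support_y, query, query_y = [], [], [], []
    for episode_label, cls in enumerate(classes):
        feats = embeddings_by_class[cls]
        idx = torch.randperm(feats.size(0))[: k_shot + n_query]
        support.append(feats[idx[:k_shot]])
        query.append(feats[idx[k_shot:]])
        support_y += [episode_label] * k_shot
        query_y += [episode_label] * n_query
    return (torch.cat(support), torch.tensor(support_y),
            torch.cat(query), torch.tensor(query_y))
</preformat>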
        <p>
          The loss is the negative log-likelihood over a softmax of distances, pushing embeddings to be
discriminative and robust in data-scarce conditions. This episodic framework is inspired by prior
metric-based few-shot methods such as Matching Networks [12] and Prototypical Networks [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which
have shown strong performance in low-data regimes.
        </p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Observation-Level Reranking via Prediction Aggregation</title>
        <p>
          To evaluate the model at the level of fungal observations—sets of images of the same specimen—we
aggregate image-level predictions using a weighted reranking scheme. Each image yields a top-10 list of
predicted classes with confidence scores. For each observation, we collect all such predictions and apply
a rank-based voting scheme with weights $w = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]$, assigning higher scores to
higher-ranked classes. The aggregated scores across all images determine the observation-level top-10
prediction.
        </p>
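        <p>A minimal sketch of this aggregation step (the function rerank_observation and its input format are
illustrative):</p>
        <preformat>
# Hedged sketch: rank-weighted voting over per-image top-10 predictions of one observation.
from collections import defaultdict

RANK_WEIGHTS = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]

def rerank_observation(per_image_top10):
    """per_image_top10: list of ranked class-id lists, one list per image of the observation."""
    scores = defaultdict(float)
    for ranked_classes in per_image_top10:
        for rank, cls in enumerate(ranked_classes[:10]):
            scores[cls] += RANK_WEIGHTS[rank]
    # Aggregated scores determine the observation-level top-10 prediction.
    return sorted(scores, key=scores.get, reverse=True)[:10]
</preformat>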
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We assess model performance using Top-$k$ Accuracy (Recall@$k$), which measures the fraction of test
samples whose true label appears among the top-$k$ predictions:
$$\text{Recall@}k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(y_i \in \hat{Y}_i^{(k)}\right),$$
where $N$ is the number of samples, $y_i$ the true label, $\hat{Y}_i^{(k)}$ the top-$k$ predictions, and $\mathbb{1}(\cdot)$ the indicator
function. We report results with $k = 5$ on both the public and private test sets from FungiCLEF 2025.</p>
      <p>[Figure: (a) before prototypical training; (b) after prototypical training.]</p>
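      <p>A minimal sketch of this metric (the function name is illustrative):</p>
      <preformat>
# Hedged sketch of Recall@k; k = 5 in our experiments.
def recall_at_k(true_labels, topk_predictions, k=5):
    """true_labels: list of class ids; topk_predictions: list of ranked prediction lists."""
    hits = sum(1 for y, preds in zip(true_labels, topk_predictions) if y in preds[:k])
    return hits / len(true_labels)
</preformat>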
      <sec id="sec-5-1">
        <title>5.1. Hardware</title>
        <p>We conducted all experiments using the freely available NVIDIA Tesla P100 GPUs provided by the
Kaggle platform. This hardware configuration was sufficient for extracting features from large image
batches and training our few-shot models without requiring additional computational resources.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Main Evaluation</title>
        <p>Our best submission achieves Recall@5 scores of 0.57079 and 0.55498 on the public and private test
sets, respectively.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Impact of Multimodal Fusion</title>
        <p>To isolate the efect of diferent input modalities, Table 3a presents Recall@5 scores using the same
pretrained architecture with only a single feature modality at a time, or all three combined (image,
image-text, and metadata). These results demonstrate the additive benefit of incorporating textual and
contextual metadata beyond image features alone.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Reranked Ensemble Strategy</title>
        <p>The ensemble result is obtained by merging predictions from the pretrained and trained models at
the observation level using the reranking method described in Section 4.6. Specifically, predictions
from both models are aggregated per observation using rank-based voting, assigning higher weights
to top-ranked predictions. This fusion strategy exploits the complementary strengths of pretrained
generalization and fine-tuned specialization, yielding the best overall performance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Ablation Study</title>
      <sec id="sec-6-1">
        <title>6.1. Text Embeddings</title>
        <p>We conducted an ablation study to assess the effectiveness of learned representations for textual and
geographic information in our few-shot fungal classification pipeline.</p>
        <p>
          For textual features, we replaced handcrafted rule-based and TF-IDF features with pretrained language
models, including BGE-large-en [13], BioCLIP’s text encoder [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], and e5-large [14]. Despite their
strong performance on general language understanding tasks, all models underperformed when used as
sole text representations in our pipeline. These encoders failed to capture fine-grained domain-specific
cues such as structured habitat descriptions or morphological terms, which are explicitly encoded by
our handcrafted pipeline.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Geographic Embeddings</title>
        <p>We also evaluated domain-adapted geographic representation models, specifically GeoCLIP [15] and
TaxaBind [16], to embed geographic coordinates from the metadata. These models encode spatial and
ecological priors via geolocation-aware training, but when integrated into our multimodal classification
framework, they failed to improve performance. The observed degradation is likely due to the high
intra-class geographic variance and sparse sampling in the training data, which hinders the utility of
learned spatial embeddings. As a result, we decided not to include geographic coordinates in the final
submission.</p>
        <p>These results underscore the importance of carefully engineered feature representations in
low-resource ecological settings. While large-scale pretrained models offer generality, domain-specific
heuristics currently yield more discriminative power in our few-shot fungal classification task.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We proposed a multimodal framework combining pretrained image encoders, textual descriptions, and
metadata for fungal species classification on the FungiCLEF 2025 dataset. Our two-stage training with
prototypical fine-tuning and observation-level reranking significantly improved few-shot performance.
Multimodal fusion consistently outperformed image-only baselines, and the reranked ensemble achieved
the best results. Future work will focus on enhancing metadata integration and fusion strategies to
further boost accuracy in ecological image recognition.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT in order to improve writing style and to
paraphrase and reword text. After using this tool, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Klarka, picekl,
          <article-title>FungiCLEF25 @ CVPR-FGVC LifeCLEF</article-title>
          , https://kaggle.com/competitions/fungi-clef-2025,
          <year>2025</year>
          . Kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Berg</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Alexander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Jacobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Belhumeur</surname>
          </string-name>
          , Birdsnap:
          <article-title>Large-scale fine-grained visual categorization of birds</article-title>
          ,
          <source>in: 2014 IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>2019</fpage>
          -
          <lpage>2026</lpage>
          . doi:10.1109/CVPR.2014.259.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Van Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Branson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Farrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Haber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Barry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ipeirotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <article-title>Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>595</fpage>
          -
          <lpage>604</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Snell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Swersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <article-title>Prototypical networks for few-shot learning</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1703.05175. arXiv:1703.05175.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2103.00020. arXiv:2103.00020.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , L. Beyer,
          <article-title>Sigmoid loss for language image pre-training</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2303.15343. arXiv:2303.15343.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Campolongo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Carlyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Dahdul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Berger-Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>Bioclip: A vision foundation model for the tree of life</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2311.18803. arXiv:2311.18803.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janouskova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cermak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <article-title>Fungitastic: A multi-modal dataset and benchmark for image categorization</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2408.13632. arXiv:2408.13632.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oquab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darcet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Moutakanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Vo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szafraniec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khalidov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Haziza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Nouby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Assran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ballas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Galuba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Howes</surname>
          </string-name>
          , P.-Y. Huang,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rabbat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , G. Synnaeve,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mairal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Labatut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , P. Bojanowski,
          <article-title>Dinov2: Learning robust visual features without supervision</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2304.07193. arXiv:2304.07193.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Anthropic, Introducing Claude 3.5 Sonnet, 2024. URL: https://www.anthropic.com/news/claude-3-5-sonnet.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, 2019. URL: https://arxiv.org/abs/1711.05101. arXiv:1711.05101.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, D. Wierstra, Matching networks for one shot learning, 2017. URL: https://arxiv.org/abs/1606.04080. arXiv:1606.04080.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, C-Pack: Packaged resources to advance general Chinese embedding, 2023. arXiv:2309.07597.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, F. Wei, Text embeddings by weakly-supervised contrastive pre-training, 2024. URL: https://arxiv.org/abs/2212.03533. arXiv:2212.03533.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] V. V. Cepeda, G. K. Nayak, M. Shah, GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization, 2023. URL: https://arxiv.org/abs/2309.16020. arXiv:2309.16020.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] S. Sastry, S. Khanal, A. Dhakal, A. Ahmad, N. Jacobs, TaxaBind: A unified embedding space for ecological applications, 2024. URL: https://arxiv.org/abs/2411.00683. arXiv:2411.00683.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>