<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>I2C-UHU-PEGASUS at FungiCLEF 2025: Multimodal Pipeline for Rare Fungal Species Classification Using Fine-Tuned VLMs and Ecological Context</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fernando Carrillo García</string-name>
          <email>fernando.carrillo051@alu.uhu.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Victoria Pachón Álvarez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacinto Mata Vázquez</string-name>
          <email>mata@dti.uhu.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Guerrero García</string-name>
          <email>manuel.guerrero790@alu.uhu.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Huelva</institution>
          ,
          <addr-line>Andalusia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Automatic identification of rare fungal species represents one of the most complex challenges in computational mycology and biodiversity conservation. Analysis of collections like the Atlas of Danish Fungi reveals that approximately 20% of verified observations correspond to poorly documented species, highlighting the critical need for systems capable of accurately identifying these underrepresented taxa. The scarcity of labeled samples prevents inclusion of rare species in conventional training sets, severely limiting traditional AI approaches. This research was conducted within the FungiCLEF 2025 framework, an international challenge focused on automatic fungal species classification with particular emphasis on rare species identification. Our methodology combines Vision-Language Models (VLMs) with advanced transfer learning and few-shot learning techniques, integrating multimodal fine-tuning of BioCLIP, a multimodal ensemble with DINOv2, probabilistic ecological context modeling, and comprehensive textual description analysis. The resulting system achieves a Recall@5 of 0.57438 on the test set, placing 22nd among 74 participating teams in FungiCLEF 2025 and demonstrating the effectiveness of multimodal integration for few-shot scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>VLMs</kwd>
        <kwd>Few-shot learning</kwd>
        <kwd>Fungal classification</kwd>
        <kwd>Rare species</kwd>
        <kwd>FungiCLEF</kwd>
        <kwd>Multimodal AI</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Transfer Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Fungal species identification represents a historically complex field where automated solutions are
crucial to support experts and researchers in biodiversity conservation efforts. The challenge is particularly
acute for rare species, which are often underrepresented in training datasets yet critical for ecological
understanding and conservation planning. With an estimated 2.2 to 3.8 million fungal species
worldwide [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], most remaining largely undocumented, the development of accurate automated identification
systems becomes essential for accelerating biodiversity research and conservation initiatives.
      </p>
      <p>
        Unlike other biological domains with extensive labeled datasets, mycology presents a substantially
different reality characterized by extreme data scarcity. According to data from the FungiTastic dataset
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] used in the FungiCLEF 2025 few-shot recognition challenge, 84.6% of fungal species are represented
by five or fewer samples in the training set. This distribution reflects the real-world scenario where
rare species are significantly underrepresented in available datasets, creating a critical bottleneck for
traditional machine learning methods that require substantial amounts of labeled data per class.
      </p>
      <p>Conventional supervised learning methods face particular difficulties in this context, as they typically
demand extensive training examples to achieve reasonable performance. This limitation is especially
problematic in biodiversity applications, where taxonomic experts are scarce, field collection is
challenging, and the cost of obtaining high-quality annotations is prohibitive. Furthermore, the morphological
similarity between closely related species and the high intraspecific variability within species compound
the difficulty of accurate identification.</p>
      <p>
        The FungiCLEF 2025 competition [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], part of the LifeCLEF 2025 lab [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], specifically targets this
challenge by focusing on few-shot learning scenarios where models must generalize to rare species
with minimal training examples. This setup closely mirrors real-world applications where new species
discoveries or rarely encountered taxa must be identified based on very limited reference material.
Recent advances in Vision-Language Models (VLMs) and multimodal learning present new opportunities
to address these limitations by leveraging multiple information sources simultaneously.
      </p>
      <p>Our work addresses these challenges through a comprehensive multimodal fine-tuning strategy
that systematically adapts pre-trained Vision-Language Models to the specific requirements of fungal
classification. By this approach, we refer to a systematic methodology that jointly optimizes visual
and textual representations by fine-tuning both the visual encoder and text encoder of pre-trained
VLMs while incorporating structured prompts that combine morphological descriptions, ecological
metadata, and taxonomic hierarchies. This methodology differs from traditional fine-tuning by explicitly
incorporating domain-specific textual knowledge during the adaptation process, enabling the model to
leverage both visual features and structured biological knowledge simultaneously.</p>
      <p>This work presents three fundamental contributions to the field of biodiversity informatics and
few-shot learning:
1. We demonstrate how to effectively adapt pre-trained models from general biological domains to
specific mycology tasks through systematic multimodal fine-tuning that combines visual features
with structured textual prompts and ecological context.
2. We propose a multi-source integration framework that systematically combines visual, textual,
ecological, and hierarchical taxonomic information using probabilistic modeling and ensemble
strategies.
3. We experimentally validate that our methodology achieves reasonable performance in extreme
data scarcity scenarios, particularly for species with fewer than five available examples,
demonstrating the effectiveness of multimodal learning for biodiversity applications.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Recent advances in artificial intelligence have transformed biological species classification, particularly
through the integration of deep learning models and multimodal approaches. We examine key
developments in few-shot learning, vision-language models, and their applications to biodiversity challenges to
establish the theoretical foundation for our work.</p>
      <p>
        Biological species classification using artificial intelligence techniques has experienced significant
advances with the adoption of deep learning models [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Traditional approaches have relied heavily
on convolutional neural networks trained on large datasets, but these methods face limitations when
applied to long-tailed distributions typical of biological data. The challenge becomes particularly
pronounced in biodiversity applications where Zipfian distributions are common, with a few species
having abundant samples while the majority remain severely underrepresented.
      </p>
      <p>
        In the few-shot learning context, various approaches have been developed to address scenarios
with limited training data. Prototypical Networks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] learn to compute prototypes for each class and
classify based on distances to these prototypes, while Model-Agnostic Meta-Learning (MAML) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
learns initialization parameters that can be quickly adapted to new tasks with minimal examples. These
approaches have shown promise in computer vision tasks but require careful adaptation for biological
applications due to the domain-specific challenges involved.
      </p>
      <p>
        In FungiCLEF competitions, successful solutions have explored various architectural innovations
and training strategies tailored to the challenges of fungal identification. Recent work has combined
architectures like MetaFormer [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], implemented specialized loss functions such as Seesaw Loss for
long-tailed distributions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and explored ensemble approaches [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. More recently, Chiu et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
demonstrated the effectiveness of self-supervised models like DINOv2 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for feature extraction in
fungal classification tasks, highlighting the value of general-purpose visual representations in biological
domains.
      </p>
      <p>Vision-Language Models (VLMs) represent a paradigm shift in multimodal learning, integrating
both image and text processing capabilities in a unified representation space. These models, trained
using contrastive techniques on large datasets of image-text pairs, have emerged as powerful tools in
biodiversity applications due to their ability to overcome traditional supervised learning limitations
by leveraging textual descriptions and metadata. The key advantage of VLMs lies in their capacity
to understand relationships between visual features and textual descriptions, enabling zero-shot and
few-shot learning capabilities.</p>
      <p>
        BioCLIP [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], one of the base models used in our work, adapts the CLIP architecture specifically to the
biological context, significantly improving performance in organism classification tasks across the Tree
of Life. BioCLIP consists of a ViT-B/16-based visual encoder and an autoregressive text encoder, trained
jointly on TreeOfLife-10M, a dataset spanning over 450,000 taxa with approximately 10 million images.
This specialized training allows BioCLIP to capture taxonomic hierarchical relationships inherent
to biology, consistently outperforming general domain models by 17-20% in fine-grained biological
classification tasks.
      </p>
      <p>
        DINOv2 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] —another key component of our system— stands out for its ability to learn robust
visual representations without supervision. DINOv2 employs knowledge distillation and contrastive
techniques with a Vision Transformer-based architecture, generating high-quality visual embeddings
that encode rich semantic information even for classes not seen during training. This characteristic
makes it particularly valuable as a complement to domain-specific models like BioCLIP [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], especially
in few-shot scenarios where visual diversity is limited.
      </p>
      <p>Unlike previous works that focused mainly on traditional CNN architectures or self-supervised
models for visual feature extraction, our approach explores the comprehensive use of multimodal
capabilities of VLMs applied to fungal biodiversity, developing a systematic framework that combines
specific multimodal fine-tuning, probabilistic ecological context modeling, and hierarchical taxonomic
analysis.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset and Evaluation Metrics</title>
      <p>The FungiCLEF 2025 challenge utilizes the FungiTastic dataset, which presents unique characteristics
that make it particularly suitable for evaluating few-shot learning approaches in biological classification.
Understanding these dataset properties and the evaluation framework is essential for interpreting our
experimental results.</p>
      <p>
        The FungiCLEF 2025 dataset is based on FungiTastic [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and exhibits the following characteristics:
• Training: 4,293 observations distributed across 2,427 species.
• Validation: 1,099 observations across 570 species.
• Images: Available in multiple resolutions (300p, 500p, 720p, full size).
• Rich metadata: Complete taxonomic hierarchy, habitat information, substrate, biogeographic
region, and temporal data.
• Textual descriptions: Each observation includes a natural language description generated by the Malmo-7b VLM.
      </p>
      <p>The challenge requires Top-10 predictions, and the extreme long-tail distribution (84.6% of species
with five or fewer examples) creates an ideal scenario for evaluating few-shot learning techniques.</p>
      <sec id="sec-3-1">
        <title>3.1. Evaluation Metric: Recall@5</title>
        <p>The metric used in FungiCLEF 2025 is Recall@K, which evaluates the percentage of cases where the
correct class is found among the k most probable predictions:</p>
        <p>Recall@K = (1/N) · Σ_{i=1}^{N} 𝟙(y_i ∈ Ŷ_i^{(K)}),
where N is the number of observations, y_i is the true species of observation i, and Ŷ_i^{(K)} is the set of the
K most probable species predicted for that observation.</p>
        <p>This metric, unlike precision which measures the percentage of exact matches, allows for a more
realistic evaluation where a prediction is considered correct if the true class appears among the top k
predictions made by the model.</p>
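        <p>For illustration, the following Python sketch (not the official evaluation script; function and variable names are ours) shows how Recall@K can be computed from ranked predictions:</p>
        <preformat>
def recall_at_k(true_labels, ranked_predictions, k=5):
    """Fraction of observations whose true species appears among the top-k predictions."""
    hits = sum(1 for y, preds in zip(true_labels, ranked_predictions) if y in preds[:k])
    return hits / len(true_labels)

# Toy example: the correct species is in the top 5 for two of the three observations.
y_true = ["Amanita muscaria", "Boletus edulis", "Russula emetica"]
y_ranked = [
    ["Amanita muscaria", "Amanita pantherina", "sp3", "sp4", "sp5"],
    ["sp1", "sp2", "sp3", "sp4", "sp5"],
    ["Russula rosea", "Russula emetica", "sp3", "sp4", "sp5"],
]
print(recall_at_k(y_true, y_ranked, k=5))  # 0.666...
        </preformat>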
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Our approach combines multiple complementary strategies to address the fundamental challenges of
rare species identification through a comprehensive multimodal pipeline. The integration of
vision-language models, ecological context, and taxonomic knowledge forms the core of our system designed
to handle extreme data scarcity scenarios.</p>
      <sec id="sec-4-1">
        <title>4.1. Multimodal Pipeline Overview</title>
        <p>
          The developed system implements a comprehensive pipeline that leverages multiple information sources
for rare fungal species classification. The system architecture combines a multimodal feature extraction
backbone (primarily fine-tuned BioCLIP [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]) with: (1) domain-specific data augmentation for fungi, (2)
caption processing and prompt structuring, (3) ecological context modeling, (4) hierarchical taxonomic
analysis, and (5) efficient search through optimized HNSW [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] indices.
        </p>
        <p>The proposed methodology starts from the premise that visual information alone is insufficient for
precise identification of rare species with limited training data. Therefore, the pipeline systematically
integrates textual information (morphological descriptions), ecological context (habitat, substrate,
biogeographic region) and hierarchical taxonomic knowledge to enrich learned representations.</p>
        <p>As illustrated in Figure 1, the pipeline follows a systematic approach where input modalities (images,
metadata, and descriptions) are processed through specialized extraction modules before being
integrated in the processing layer, culminating in ensemble-based predictions through optimized indexing
strategies.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Multimodal Fine-tuning of BioCLIP</title>
        <p>
          Our approach to adapting BioCLIP [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for fungal classification leverages systematic multimodal
fine-tuning that preserves pre-trained knowledge while adapting to domain-specific patterns.
        </p>
        <p>The system’s core component is the multimodal fine-tuning of BioCLIP, specifically designed to
leverage both visual and textual information available in the dataset. The fine-tuned model architecture
introduces a multimodal classifier that operates on the pre-trained BioCLIP model. The BioCLIP
backbone remains largely frozen, except for the last 6 transformer layers, all normalization layers, and
the visual projection layer.</p>
        <p>The multimodal classifier implements a weighted fusion strategy:</p>
        <p>f_fused = α · f_visual + (1 − α) · f_text,
where α = 0.75 was optimized empirically, favoring visual information (75%) while retaining a
significant contribution from textual context (25%).</p>
        <p>The multimodal classifier implements a robust three-layer architecture:</p>
        <p>h₁ = Dropout(0.3, GELU(LayerNorm(Linear(f_fused, 2048))))   (1)</p>
        <p>h₂ = Dropout(0.4, GELU(LayerNorm(Linear(h₁, 1024))))   (2)</p>
        <p>o = Linear(h₂, num_classes)   (3)</p>
        <p>
          This deeper architecture enables more effective learning of multimodal representations. Training uses
Focal Loss [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] with alpha=0.3, gamma=1.5 and label smoothing=0.1 to handle extreme class imbalance.
Fine-tuning uses AdamW [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] with differentiated learning rates (3e-5 for backbone, 2e-4 for adapter)
and cosine annealing with warmup for 10 epochs.
        </p>
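        <p>The following PyTorch sketch illustrates the adapter and optimizer configuration described above. It is a minimal reconstruction under stated assumptions: module and variable names are ours, the embedding dimensionality (512) and the omission of warmup and of the Focal Loss term are simplifications, and the backbone parameter group is only indicated in a comment.</p>
        <preformat>
import torch
import torch.nn as nn

ALPHA = 0.75  # visual/text fusion weight reported above

class MultimodalClassifier(nn.Module):
    """Three-layer adapter over fused BioCLIP visual and text embeddings."""

    def __init__(self, embed_dim=512, num_classes=2427):  # assumed 512-d CLIP space, 2,427 species
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Linear(embed_dim, 2048), nn.LayerNorm(2048), nn.GELU(), nn.Dropout(0.3))
        self.block2 = nn.Sequential(
            nn.Linear(2048, 1024), nn.LayerNorm(1024), nn.GELU(), nn.Dropout(0.4))
        self.out = nn.Linear(1024, num_classes)

    def forward(self, f_visual, f_text):
        f_fused = ALPHA * f_visual + (1.0 - ALPHA) * f_text  # weighted fusion
        return self.out(self.block2(self.block1(f_fused)))

head = MultimodalClassifier()

# Differentiated learning rates: the partially unfrozen backbone (last 6 transformer
# blocks, normalization layers, visual projection) would use 3e-5; the adapter uses 2e-4.
optimizer = torch.optim.AdamW(
    [
        # {"params": unfrozen_backbone_params, "lr": 3e-5},
        {"params": head.parameters(), "lr": 2e-4},
    ],
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)  # warmup omitted
        </preformat>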
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Multimodal Feature Extraction</title>
        <p>The feature extraction process combines visual and textual information at multiple resolutions and
modalities to create robust representations for few-shot learning scenarios.</p>
        <p>The system extracts both pure image embeddings and fused multimodal embeddings. For each image,
the system processes multiple resolutions with specific weights favoring higher resolutions:</p>
        <p>e_final = Σ_{r ∈ R} w_r · e_r,
where w_300 = 0.4, w_500 = 0.9, w_720 = 1.3, w_fullsize = 1.5. These weights were determined
experimentally, starting from standard values reported in the literature and optimized for the mycological
domain.</p>
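        <p>A minimal sketch of this multi-resolution fusion, assuming per-resolution embeddings share a common dimensionality and that the fused vector is L2-normalized for similarity search (function names are ours):</p>
        <preformat>
import numpy as np

# Resolution weights reported above; higher resolutions contribute more.
RESOLUTION_WEIGHTS = {"300p": 0.4, "500p": 0.9, "720p": 1.3, "fullsize": 1.5}

def fuse_resolutions(embeddings_by_resolution):
    """Weighted sum of per-resolution embeddings, L2-normalized before retrieval."""
    fused = sum(RESOLUTION_WEIGHTS[r] * e for r, e in embeddings_by_resolution.items())
    return fused / np.linalg.norm(fused)
        </preformat>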
        <p>Structured prompts are generated systematically:</p>
        <p>"Identify this fungal species. Description: [detailed morphological description]. Ecological context --- habitat: [habitat]; substrate: [substrate]; region: [region]; collected in: [month]. Taxonomic information: genus: [genus], family: [family], order: [order], class: [class], phylum: [phylum]."</p>
        <p>This structure allows the model to leverage both visual information and multiple relevant context
sources for the mycological domain.</p>
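        <p>As an illustration, a prompt for a single observation could be assembled as follows (the observation dictionary keys are hypothetical, not the dataset's exact column names):</p>
        <preformat>
def build_prompt(obs):
    """Assemble the structured prompt from an observation's metadata (keys are illustrative)."""
    return (
        "Identify this fungal species. "
        f"Description: {obs.get('description', 'unknown')}. "
        "Ecological context --- "
        f"habitat: {obs.get('habitat', 'unknown')}; "
        f"substrate: {obs.get('substrate', 'unknown')}; "
        f"region: {obs.get('region', 'unknown')}; "
        f"collected in: {obs.get('month', 'unknown')}. "
        "Taxonomic information: "
        f"genus: {obs.get('genus', 'unknown')}, "
        f"family: {obs.get('family', 'unknown')}, "
        f"order: {obs.get('order', 'unknown')}, "
        f"class: {obs.get('class', 'unknown')}, "
        f"phylum: {obs.get('phylum', 'unknown')}."
    )
        </preformat>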
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Ecological Context Integration</title>
        <p>Our probabilistic approach incorporates ecological metadata into the classification process, leveraging
the strong ecological dependencies exhibited by fungal species to improve identification accuracy.</p>
        <p>Fungal species often exhibit strong ecological dependencies, which can provide valuable signals for
classification. We developed a probabilistic model across four dimensions: habitat types (31 categories),
substrate types (30 categories), biogeographic regions (7 main categories), and temporal patterns (12
monthly distributions).</p>
        <p>For each ecological factor, conditional probabilities are calculated using Laplace smoothing [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] to handle the long-tail distribution:</p>
        <p>P(species | c) = (count(species, c) + α) / (total(c) + α · N)   (4)</p>
        <p>where α = 0.1 is used to smooth rare combinations and N represents the total number of
species.</p>
        <p>During prediction, these probabilities are applied as multiplicative boost factors. The complete
ecological context integration is formalized as:</p>
        <p>P(species | context) = ∏_{c ∈ {habitat, substrate, region, month}} P(species | c)   (5)</p>
        <p>where the context weights were optimized empirically: w_habitat = 0.45, w_substrate = 0.35, w_region = 0.12, w_month = 0.25.</p>
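        <p>A minimal sketch of the smoothed conditional probabilities for one ecological factor, assuming observations are represented as dictionaries with species and metadata fields (key names are ours):</p>
        <preformat>
from collections import Counter, defaultdict

ALPHA = 0.1  # Laplace smoothing constant reported above

def ecological_probabilities(observations, num_species, factor="habitat"):
    """Return P(species | value of one ecological factor) with Laplace smoothing."""
    counts = defaultdict(Counter)   # factor value -> Counter of species
    totals = Counter()              # factor value -> total observations
    for obs in observations:
        value = obs.get(factor)
        counts[value][obs["species"]] += 1
        totals[value] += 1

    def prob(species, value):
        return (counts[value][species] + ALPHA) / (totals[value] + ALPHA * num_species)

    return prob
        </preformat>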
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Multimodal Ensemble Architecture</title>
        <p>
          The ensemble strategy combines BioCLIP [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and DINOv2 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] features to improve classification
robustness, particularly important in few-shot scenarios where individual models may struggle with
limited training data.
        </p>
        <p>The system implements an ensemble combining fine-tuned BioCLIP with DINOv2 to improve the
robustness of learned representations. Ensemble weights (BioCLIP: 1.4, DINOv2: 1.2) were optimized
through systematic experimentation, selecting the configuration that maximized Recall@5 on the
validation set:</p>
        <p>e_ensemble = w_BioCLIP · e_BioCLIP + w_DINOv2 · e_DINOv2</p>
        <p>This ensemble leverages the domain-specific knowledge of BioCLIP while benefiting from DINOv2’s
general-purpose and resilient visual representations, which are particularly valuable for distinguishing
morphologically similar species.</p>
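        <p>A minimal sketch of the weighted ensemble combination, assuming both embeddings have already been projected to a shared dimensionality and L2-normalized (the projection step is not shown and is an assumption):</p>
        <preformat>
import numpy as np

W_BIOCLIP, W_DINOV2 = 1.4, 1.2  # ensemble weights reported above

def ensemble_embedding(e_bioclip, e_dinov2):
    """Weighted sum of the two embeddings, re-normalized before indexing."""
    e = W_BIOCLIP * e_bioclip + W_DINOV2 * e_dinov2
    return e / np.linalg.norm(e)
        </preformat>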
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Textual Description Processing</title>
        <p>Our approach extracts and utilizes morphological information from textual descriptions, enabling the
system to leverage expert knowledge encoded in natural language descriptions.</p>
        <p>The morphological descriptions represent a rich source of information, requiring comprehensive
processing. The feature extractor identifies terminology specific to fungal morphology across multiple
categories: 16 color terms with location-specific descriptors, 16 shape descriptors for key structures,
and 14 texture characteristics. Feature similarity is calculated using a weighted Jaccard index for
each category, with higher weights for taxonomic information (0.7-0.8) compared to morphological
descriptors (0.2-0.35).</p>
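        <p>A minimal sketch of the weighted Jaccard similarity, assuming extracted features are grouped into per-category term sets (category names and the exact weights passed in are illustrative):</p>
        <preformat>
def weighted_jaccard(features_a, features_b, category_weights):
    """Per-category Jaccard indices combined with category weights.

    features_a / features_b map a category (e.g. 'genus', 'color', 'shape') to a set of
    terms; category_weights holds per-category weights (taxonomic categories around
    0.7-0.8, morphological categories around 0.2-0.35).
    """
    score, total_weight = 0.0, 0.0
    for category, w in category_weights.items():
        a = features_a.get(category, set())
        b = features_b.get(category, set())
        if a or b:
            score += w * len(a.intersection(b)) / len(a.union(b))
            total_weight += w
    return score / total_weight if total_weight else 0.0
        </preformat>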
      </sec>
      <sec id="sec-4-7">
        <title>4.7. Hierarchical Taxonomic Classification</title>
        <p>We leverage taxonomic hierarchy information for improved classification accuracy, exploiting the
well-defined biological classification structure to enhance species-level identification.</p>
        <p>Biological classification follows a well-defined hierarchy, providing valuable structural information.
We implemented an approach that exploits these relationships through taxonomic consensus voting
and verification of hierarchical consistency. Mappings are created between species and their taxonomic
classification at five levels: genus, family, order, class, phylum. The weights, optimized through
systematic experimentation, are as follows: genus (0.45), family (0.25), order (0.12), class (0.04), phylum
(0.01).</p>
        <p>Vote aggregations are built in which each species prediction contributes votes to all hierarchical
levels, with taxonomic boost mechanisms providing enhanced accuracy for taxonomically challenging
groups.</p>
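        <p>A minimal sketch of the consensus-voting idea, assuming per-species candidate scores and a species-to-taxonomy lookup are available (the exact aggregation used in our pipeline may differ in detail):</p>
        <preformat>
from collections import defaultdict

# Hierarchical level weights reported above.
LEVEL_WEIGHTS = {"genus": 0.45, "family": 0.25, "order": 0.12, "class": 0.04, "phylum": 0.01}

def taxonomic_consensus_boost(candidate_scores, taxonomy):
    """Boost per-species scores with weighted votes accumulated at each taxonomic level.

    candidate_scores: species -> base score; taxonomy: species -> {level: taxon name}.
    """
    level_votes = {level: defaultdict(float) for level in LEVEL_WEIGHTS}
    for species, score in candidate_scores.items():
        for level, weight in LEVEL_WEIGHTS.items():
            level_votes[level][taxonomy[species][level]] += weight * score

    boosted = {}
    for species, score in candidate_scores.items():
        consensus = sum(level_votes[level][taxonomy[species][level]] for level in LEVEL_WEIGHTS)
        boosted[species] = score + consensus
    return boosted
        </preformat>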
      </sec>
      <sec id="sec-4-8">
        <title>4.8. Optimized Multi-Strategy Fusion</title>
        <p>The final integration strategy combines multiple prediction approaches using optimized search
algorithms to create a robust ensemble system capable of handling the challenges of few-shot learning.</p>
        <p>The final prediction combines five complementary strategies with weights specifically optimized for
each strategy in the context of multimodal fusion:
• Multimodal k-NN (weight 0.5) - provides the main foundation through multimodal embedding
space search
• Centroid similarity (weight 0.3) - distances to class prototypes
• Medoid similarity (weight 0.2) - most representative example of each class
• Metadata matching (weight 0.45) - ecological context similarity
• Description similarity (weight 0.5) - evaluates similarity between textual morphological
descriptions</p>
        <p>
          Search is performed using FAISS [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] with HNSW [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] indexing, specifically optimized for handling
multimodal embeddings: M=160, efConstruction=400, efSearch=400, with k_neighbors=50.
        </p>
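        <p>A minimal sketch of the index construction with these parameters, using the FAISS Python API (the distance metric and any pre-normalization of the embeddings are assumptions):</p>
        <preformat>
import faiss
import numpy as np

def build_hnsw_index(embeddings):
    """HNSW index with the parameters reported above (M=160, efConstruction=efSearch=400)."""
    index = faiss.IndexHNSWFlat(embeddings.shape[1], 160)   # L2 metric by default
    index.hnsw.efConstruction = 400
    index.hnsw.efSearch = 400
    index.add(np.ascontiguousarray(embeddings, dtype=np.float32))
    return index

# Retrieval of the 50 nearest training observations for each query embedding:
# distances, neighbor_ids = build_hnsw_index(train_embeddings).search(query_embeddings, 50)
        </preformat>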
      </sec>
      <sec id="sec-4-9">
        <title>4.9. Class Imbalance Handling</title>
        <p>We address the extreme class imbalance present in the FungiCLEF 2025 dataset through specialized
techniques beyond traditional resampling approaches.</p>
        <p>The FungiCLEF 2025 dataset exhibits a high degree of class imbalance where 84.6% of species have
five or fewer samples. We addressed this challenge without resorting to traditional oversampling or
undersampling techniques, which are inadequate due to extreme data scarcity.</p>
        <sec id="sec-4-9-1">
          <title>4.9.1. Adaptive Weighted Sampling</title>
          <p>To ensure adequate representation of rare species in training batches and prevent the model from being
dominated by abundant classes, we implemented adaptive weighted sampling. This approach assigns
higher sampling probabilities to underrepresented species, maintaining a balanced learning process
across the extreme class imbalance present in the dataset.</p>
          <p>We implemented adaptive weighted sampling:</p>
          <p>w_c = max(0.05, (max_count / count_c)^(1/3))   (7)</p>
          <p>The exponent of 1/3 (the cubic root) provides a moderate compensation that ensures abundant species
remain represented during training.</p>
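          <p>A minimal sketch of the sampler using the weighting rule above (PyTorch's WeightedRandomSampler; variable names are ours):</p>
          <preformat>
from collections import Counter
from torch.utils.data import WeightedRandomSampler

def make_sampler(labels):
    """Per-sample weights following w_c = max(0.05, (max_count / count_c)^(1/3))."""
    counts = Counter(labels)
    max_count = max(counts.values())
    weights = [max(0.05, (max_count / counts[y]) ** (1 / 3)) for y in labels]
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
          </preformat>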
        </sec>
        <sec id="sec-4-9-2">
          <title>4.9.2. Specific Data Augmentation</title>
          <p>Domain-specific augmentation transformations generate realistic visual variations while preserving key morphological traits
essential for species identification: random rotations ( ±20°), horizontal/vertical flips, photometric
variations (±30%), and grayscale conversion (10% probability).</p>
          <p>For rare species, data augmentation is enhanced through frequency-based boosting to compensate for
limited training samples. We developed a custom rare species boost technique that adaptively increases
prediction weights based on ecological context and taxonomic hierarchy similarity:
boost = 1.0 + β · √((τ − f) / τ) for species with f &lt; τ (and boost = 1.0 otherwise),
where f is the class frequency, τ = 15 is the threshold to consider a species rare, and β = 0.5
is the boost factor. Test-time augmentation applies seven transformations while maintaining consistency
with the associated textual prompts.</p>
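          <p>A minimal sketch of the boost computation; only the threshold and boost factor are taken from the text above, and the exact decay inside the square root should be treated as an assumption:</p>
          <preformat>
import math

THRESHOLD = 15     # species with fewer training samples than this are treated as rare
BOOST_FACTOR = 0.5

def rare_species_boost(frequency):
    """Multiplicative boost applied to the prediction score of a rare species."""
    if frequency >= THRESHOLD:
        return 1.0
    # Decay form is an assumption: rarer species receive a larger boost, up to 1 + BOOST_FACTOR.
    return 1.0 + BOOST_FACTOR * math.sqrt((THRESHOLD - frequency) / THRESHOLD)
          </preformat>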
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments and Results</title>
      <p>We evaluate our multimodal pipeline through comprehensive experiments on the FungiCLEF 2025
challenge, analyzing both overall performance and individual component contributions to understand
the effectiveness of our approach for rare species classification.</p>
      <sec id="sec-5-1">
        <title>5.1. Overall Performance</title>
        <p>The final model achieves a Recall@5 of 0.57438 on the FungiCLEF 2025 private test set, resulting in 22nd
place out of 74 teams. Experiments were conducted using a 200-sample subset from the FungiCLEF
2025 validation set for ablation studies and component analysis.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Component Analysis</title>
        <p>As shown in Table 1, the ablation study quantifies the contribution of each component. Multimodal fine-tuning provides
a solid foundation (+4.0%), while ecological context emerges as the strongest individual contributor
(+7.5%). Notably, the DINOv2 ensemble shows a modest regression on validation (-3.1%), but it performs
better on the full test dataset, as the increased visual diversity benefits general-purpose visual features.</p>
        <p>[Table 1: Ablation study of pipeline components on the validation set. Rows: Base BioCLIP (frozen); + Multimodal Fine-tuning; + DINOv2 Ensemble; + Ecological Context; + Description Processing; + Multi-Strategy Fusion; Total Improvement (Val); Projected Test Performance. Columns: Recall@5, Improvement.]</p>
        <p>The final multi-strategy fusion achieves the strongest cumulative effect (+2.3%), validating the synergistic benefits of combining
multiple information sources.</p>
        <p>Table 2 shows that the rare species boost provides a meaningful improvement (5.0%) for species with
limited training data (5 training examples), while strategy fusion achieves progressive enhancements,
with the combined approach yielding an 8.1% improvement over the baseline k-NN method.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Validation vs. Test Performance Analysis</title>
        <p>The validation results reveal important insights about component behavior across different dataset
distributions. While the DINOv2 ensemble shows modest regression on the 200-sample validation
subset (-3.1%), our analysis reveals that DINOv2 outperforms BioCLIP in 18% of individual cases (36 out
of 200 validation samples), suggesting potentially stronger performance on the larger, more diverse
test dataset. This discrepancy between validation and test performance highlights the importance of
complementary feature representations in few-shot scenarios, where the limited validation subset may
not fully capture the diversity present in the complete test distribution.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Competitive Analysis</title>
        <p>Our system achieved 22nd place out of 74 teams, placing our approach within the top 30% of participants
and above the competition median (0.4489), demonstrating the effectiveness of our multimodal approach
while highlighting areas for improvement compared to top-performing methods.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Error Analysis</title>
      <p>Understanding the failure modes of our multimodal approach provides valuable insights into the
fundamental challenges of automatic fungal species identification and guides future research directions.
We examine both systematic taxonomic patterns and specific morphological factors that influence
classification accuracy.</p>
      <p>To understand the residual limitations of our approach, we conducted a detailed error analysis of the
final pipeline using the validation set. This analysis focused on identifying recurring error patterns
that highlight persistent challenges in automatic fungal species identification.</p>
      <sec id="sec-6-1">
        <title>6.1. Most Problematic Species</title>
        <p>Our analysis identified several species that were consistently misclassified (100% error rate), as presented
in Table 4. Notably, all samples from species belonging to the genera Diaporthe and Plagiostoma were
misclassified, indicating fundamental difficulties in distinguishing these taxa.</p>
        <p>The species listed in Table 4 represent the most challenging cases for automatic classification, with
perfect error rates indicating that current multimodal approaches struggle with these particular taxa,
highlighting the utility of taxonomic boosting strategies for enhancing species-level classification.</p>
        <p>Confusion patterns reveal frequent misclassifications between species within the same genus or
family, as shown in Figure 2. This structure indicates that confusions tend to occur mainly between
taxonomically related species, reflecting that the model learns to distinguish broad taxonomic groups
but has difficulties differentiating species within the same clade.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Taxonomic Error Patterns</title>
        <p>Error rates exhibit clear taxonomic patterns, with certain fungal groups consistently more difficult to
classify, as shown in Table 5.</p>
        <p>These patterns suggest that taxonomic relationships strongly influence model performance. As
evidenced in Table 5, orders like Russulales and Boletales, which contain species with more distinctive
morphological characteristics, are classified with much higher accuracy than orders containing species
exhibiting subtle morphological distinctions.</p>
        <p>Figure 3 illustrates the contrasting classification performance across different fungal genera,
highlighting the morphological factors that determine model success or failure. The top row shows the most
challenging genera for automatic classification: Diaporthe and Plagiostoma species exhibit atypical
fungal morphologies characterized by extremely small fruiting bodies (perithecia) embedded within
plant substrates. These genera present minute, dark structures that are difficult to distinguish even
for human experts, with microscopic diagnostic features and similar ecological preferences for woody
substrates creating fundamental challenges for visual feature extraction.</p>
        <p>In contrast, the bottom row demonstrates genera with significantly better classification performance.
Puccinia species display distinctive orange circular patterns with high visual contrast against plant
tissue, providing unique colorimetric and geometric features that facilitate accurate identification.
Cortinarius species exemplify conventional mushroom morphology with clearly defined caps, stems,
and gills that offer multiple discriminative visual characteristics. These morphological differences
explain the dramatic performance gap: while Diaporthe and Plagiostoma achieve 100% error rates,
Puccinia and Cortinarius maintain relatively low error rates of 44% and 47% respectively, demonstrating
that our multimodal approach performs effectively when distinctive visual features are present at
macroscopic scales.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Implications for Model Design</title>
        <p>These findings influenced our model design in several ways: (1) incorporation of hierarchical taxonomic
information to leverage better performance at higher taxonomic levels, (2) implementation of taxonomic
boost mechanisms that improved accuracy for difficult groups, (3) specific boost techniques for the most
problematic species, and (4) adaptive multimodal fusion giving more weight to discriminatory features
in genera with high error rates. This error analysis provides valuable information about current system
limitations and guides future improvements, especially in handling difficult genera like Diaporthe and
Plagiostoma.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>The multimodal approach developed in this work demonstrates the potential of integrating
vision-language models with domain-specific knowledge for addressing challenging biodiversity classification
tasks. Our results provide important insights for the broader application of AI systems in biological
conservation and species identification.</p>
      <p>
        The developed multimodal ensemble approach demonstrates reasonable performance in fungal
classification by systematically integrating BioCLIP-specific fine-tuning, multimodal ensemble with
DINOv2 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], probabilistic ecological context modeling, and comprehensive textual description analysis,
contributing valuable insights for multimodal learning in biodiversity applications.
      </p>
      <p>The success of our approach highlights several key lessons for biological classification.
Domain-specific pre-training provides crucial advantages over general models, although careful fine-tuning is
still required to adapt these models to specific tasks. Multimodal integration significantly improves
performance, particularly when systematically incorporating biological knowledge through ecological
and taxonomic modeling.</p>
      <p>Few-shot learning techniques prove essential for handling the long-tail distribution characteristic of
biological datasets, with the combination of test-time augmentation, rare species boost, and ensemble
methods providing robust performance across the full range of species in the dataset. However,
significant challenges remain in distinguishing closely related species, particularly within genera
like Diaporthe and Plagiostoma, indicating the need for more sophisticated approaches.</p>
      <p>The methodology demonstrates that Vision-Language Models, when appropriately adapted through
domain-specific fine-tuning and enriched with ecological, textual and taxonomic knowledge, can
effectively address the challenge of rare species classification in biological domains. While our approach
achieves reasonable performance compared to the median, the substantial gap with top-performing
methods indicates significant opportunities for improvement.</p>
      <p>Regarding computational considerations, our pipeline requires approximately 2.5 hours for training
the complete system on a single A100 GPU, with inference time of approximately 0.8 seconds per
sample when processing the multimodal ensemble. While this computational overhead is manageable
for research applications, practical deployment would benefit from model compression and optimization
techniques to reduce inference time and resource requirements.</p>
      <p>Future work should focus on several key areas: (1) expanding the ecological context modeling to
include more detailed habitat relationships and seasonal patterns, (2) exploring advanced few-shot
learning techniques specifically designed for extreme class imbalance scenarios, (3) investigating the
integration of phylogenetic information to better distinguish closely related species, and (4) developing
more efficient architectures that maintain performance while reducing computational requirements
for practical deployment. Additionally, the integration of citizen science data and active learning
approaches could help address data scarcity for rare species.</p>
      <p>The source code and trained models are available at https://github.com/cgarciafernando/fungiclef-2025-tfg to facilitate reproducibility and enable further research in multimodal
biodiversity informatics.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Claude (Anthropic) to improve writing style,
translate text, and generate scripts for creating tables and visualizations based on the authors’
experimental pipeline code and results. After using these tools, the author(s) reviewed and edited the content
as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Hawksworth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lücking</surname>
          </string-name>
          ,
          <article-title>Fungal diversity revisited: 2.2 to 3.8 million species</article-title>
          ,
          <source>Microbiology Spectrum</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janoušková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cermak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <article-title>Fungitastic: A multi-modal dataset and benchmark for image categorization</article-title>
          ,
          <source>arXiv preprint arXiv:2408.13632</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2408.13632.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Janouskova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          , Overview of FungiCLEF 2025:
          <article-title>Few-shot classification with rare fungi species</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janoušková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Martellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of lifeclef 2025:
          <article-title>Challenges on species presence prediction and identification, and individual animal identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF)</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Deep learning</article-title>
          ,
          <source>Nature</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Snell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Swersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <article-title>Prototypical networks for few-shot learning</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <article-title>Model-agnostic meta-learning for fast adaptation of deep networks</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Machine Learning</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ruan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>1st place solution for fungiclef 2022 competition: Fine-grained open-set fungi recognition</article-title>
          ,
          <source>in: CLEF Working Notes</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-S.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>A deep learning based solution to fungiclef2023</article-title>
          , in: CLEF Working Notes,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Desingu</surname>
          </string-name>
          , et al.,
          <article-title>Fungiclef: Deep-learning for the visual classification of fungi species using network ensembles</article-title>
          ,
          <source>in: CLEF Working Notes</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miyaguchi</surname>
          </string-name>
          ,
          <article-title>Fine-grained classification for poisonous fungi identification with transfer learning</article-title>
          ,
          <source>arXiv preprint arXiv:2407.07492</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2407.07492.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oquab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darcet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Moutakanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Vo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szafraniec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khalidov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Haziza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Nouby</surname>
          </string-name>
          , et al.,
          <article-title>Dinov2: Learning robust visual features without supervision</article-title>
          ,
          <source>arXiv preprint arXiv:2304.07193</source>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/abs/2304.07193.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Campolongo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Carlyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Dahdul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Berger-Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>Bioclip: A vision foundation model for the tree of life</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y. A.</given-names>
            <surname>Malkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Yashunin</surname>
          </string-name>
          ,
          <article-title>Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>42</volume>
          (
          <year>2018</year>
          )
          <fpage>824</fpage>
          -
          <lpage>836</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Focal loss for dense object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled weight decay regularization</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S. F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <article-title>Good-turing frequency estimation for feature selection</article-title>
          ,
          <source>Computer Speech &amp; Language</source>
          <volume>13</volume>
          (
          <year>2006</year>
          )
          <fpage>61</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Billion-scale similarity search with GPUs</article-title>
          ,
          <source>IEEE Transactions on Big Data</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>535</fpage>
          -
          <lpage>547</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>