<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Few-Shot Classification of Fungi Species Using Contrastive Representation Learning and Multimodal Fusion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lianping Lu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heng Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuo Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fang Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Puhua Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenping Ma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>FungiCLEF, Few-Shot Learning, Dynamic Weighting Contrastive Loss, Feature Fusion, Fine-grained Classification</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Perception and Image Understanding Lab, Xidian University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The FungiCLEF2025 challenge pioneers few-shot fungi species classification through the integration of multimodal observational data, specifically targeting the critical bottleneck of identifying rare and under-documented taxa in practical biodiversity conservation scenarios. In this work, we present a novel two-stage framework that synergizes (1) feature-space optimization via a Dynamic Weighting Contrastive Loss (DWCL), and (2) cross-modal fusion of visual characteristics with ecological metadata to achieve a joint representation of environmental context and fine-grained morphological patterns. Through these technical innovations, the framework secured 2nd place on the competition leaderboard. The code is publicly available at https://github.com/Looploop555/fungi.</p>
      </abstract>
      <kwd-group>
        <kwd>FungiCLEF</kwd>
        <kwd>Few-Shot Learning</kwd>
        <kwd>Dynamic Weighting Contrastive Loss</kwd>
        <kwd>Feature Fusion</kwd>
        <kwd>Fine-grained Classification</kwd>
        <kwd>Multimodal Fusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>encoded using BERT and subsequently fused with visual features through Q-Former [9] based cross-modal interaction.</p>
      <p>• Two-Stage Decoupled Pipeline: By separating feature extraction and contrastive learning from multimodal fusion and final classification, each phase can be optimized independently. The first stage focuses on crafting highly discriminative visual embeddings, and the second stage integrates complementary modal signals.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Fine-grained classification of Fungi</title>
        <p>The participating teams in FungiCLEF2023 [5, 10, 11, 12] primarily employed Transformer-based [13] architectures for multimodal data processing, effectively combining visual features with metadata through advanced fusion strategies. To address critical challenges in fungi classification, the solutions incorporated specialized techniques, including customized loss functions (such as the Seesaw loss [14] and a poisonous-classification loss) for handling class imbalance and long-tailed distributions.</p>
        <p>The methods in FungiCLEF2024 [4, 15, 16, 17] primarily focused on multimodal fusion of visual and metadata features using architectures like Swin Transformer V2 [18] and DINOv2, combined with dynamic MLPs [19] or attention mechanisms for fine-grained species classification. To handle open-set recognition, teams employed entropy-based rejection or generative adversarial approaches like OpenGAN [15] to detect unknown species. Safety-critical optimization was emphasized through poisonous-aware loss functions (e.g., heavily penalizing toxic misclassifications) and post hoc re-ranking to minimize dangerous errors. Auxiliary supervision (e.g., genus-level losses) and techniques like the Seesaw loss improved robustness against class imbalance.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Contrastive Learning</title>
        <p>In the field of fine-grained classification, contrastive learning loss functions demonstrate unique advantages. Triplet Loss [20] constructs anchor-positive-negative triplets and enforces the distance between the anchor and the positive example to be smaller than that between the anchor and the negative example plus a margin. It aims to bring samples of the same class closer while pushing apart those of different classes, but its sampling efficiency is constrained by the negative-sample selection strategy. N-pair Loss [21] extends Triplet Loss by adopting a multi-negative parallel optimization mechanism, establishing a “1-positive-N-negative” contrast relationship within a single batch. However, when certain fungi categories have too few samples, their contribution as negative samples diminishes. Supervised Contrastive Loss [22] leverages label information to treat multiple samples from the same class as positives and those from different classes as negatives. It pulls same-class samples closer in the embedding space while pushing apart different-class samples through contrastive learning. This approach is particularly suitable for supervised learning scenarios, excelling especially in few-shot learning and fine-grained classification tasks.</p>
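        <p>For illustration, a minimal PyTorch-style sketch of a supervised contrastive loss in the spirit of [22] is given below; the function name, temperature value, and masking details are illustrative assumptions rather than the exact formulation used by the works cited above.</p>
        <preformat>
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Minimal supervised contrastive (SupCon-style) loss sketch.

    features: (N, D) embeddings, labels: (N,) integer class ids.
    """
    z = F.normalize(features, dim=1)                       # unit-norm embeddings
    sim = (z @ z.T) / temperature                          # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)                 # exclude self-similarity

    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos_mask = torch.logical_and(same_class, ~self_mask)   # same-class pairs are positives
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                                 # anchors with at least one positive
    mean_log_prob_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()
        </preformat>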
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>We propose a two-stage framework for fine-grained fungi classification. In the first stage, foundational visual embeddings are extracted via DINOv2 and refined through a single-layer Transformer encoder, then optimized with our Dynamic Weighting Contrastive Loss, which incorporates entropy-based sample weighting and adaptive positive/negative pair construction to enhance intra-class compactness and inter-class separation even under scarce-data regimes. In the second stage, we generate structured text from each specimen’s metadata, encode it with BERT, and fuse the resulting text embeddings with the refined visual features using a Q-Former with a set of learnable queries q. This multimodal representation is trained with a cross-entropy loss to produce habitat-aware classification outputs, achieving competitive performance in FungiCLEF2025.</p>
      <sec id="sec-3-1">
        <title>DINOv2</title>
      </sec>
      <sec id="sec-3-2">
        <title>Vanilla ViT</title>
        <p>...</p>
      </sec>
      <sec id="sec-3-3">
        <title>Meta Data</title>
        <p>date: 2010-10-1
habitat: natural grassland
substrate: soil</p>
        <p>Template
“This fungi specimen was collected on
2020–10–1 in a natural grassland area,
growing on a soil substrate.”
g
n
i
an Fine-grained Feature Embedding
ir
T
g
n
ir
n
a
e
ievL ...
t
s
a
tr
n
o
C</p>
      </sec>
      <sec id="sec-3-4">
        <title>Q-Former</title>
        <p>...</p>
      </sec>
      <sec id="sec-3-5">
        <title>Queries</title>
        <p>q
Stage Ⅰ
Stage Ⅱ</p>
      </sec>
      <sec id="sec-3-6">
        <title>Classification</title>
      </sec>
      <sec id="sec-3-7">
        <title>Head</title>
        <sec id="sec-3-7-1">
          <title>3.1. Model Architecture</title>
          <p>We design a two-stage model as shown in Figure 1. In the first stage, we concentrate on extracting and refining visual features; in the second stage, we carry out multimodal fusion and classification.</p>
          <p>In the first stage, we extract initial visual features from each fungi image using DINOv2 and feed them into a Transformer-based contrastive learning framework. This framework operates on pre-extracted features from a standard ViT and employs a single-layer Transformer encoder with a 16-head self-attention mechanism to build a high-dimensional attention space, effectively capturing fine-grained visual cues. In the second stage, for every fungi image, we construct a structured textual description from its observation metadata (year, month, day, habitat, and substrate) using the following template: “This fungi specimen was collected on [year]–[month]–[day] in a [habitat] area, growing on a [substrate] substrate.”</p>
          <p>Subsequently, we employ BERT to encode the descriptions; the generated text embeddings and the first-stage visual features are then jointly fed into the Q-Former module. The Q-Former serves as the core component for cross-modal fusion, establishing semantic relationships between image features and ecological text descriptors. A set of learnable query tokens q is introduced to facilitate cross-modal interaction between textual and visual features. Through iterative updates via multi-head self-attention, the Q-Former generates query representations that fuse habitat semantics with visual information. These representations are then projected through a classification head and optimized with a cross-entropy loss to produce the final species classification results.</p>
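          <p>The following is a simplified sketch of this fusion step, using a single multi-head attention block with learnable queries as a stand-in for the full Q-Former; the dimensions, the number of queries, and the initialization are illustrative assumptions, while the number of classes (2,427) follows the training set described in Section 4.1.</p>
          <preformat>
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Simplified Q-Former-style fusion: learnable queries attend over text and visual tokens."""
    def __init__(self, dim=768, num_queries=32, heads=8, num_classes=2427):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)  # learnable query tokens q
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)                            # classification head

    def forward(self, text_emb, visual_emb):
        # text_emb: (batch, T, dim) BERT token embeddings; visual_emb: (batch, V, dim) Stage-I features
        context = torch.cat([text_emb, visual_emb], dim=1)
        q = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        fused, _ = self.attn(q, context, context)     # queries fuse habitat semantics with visual cues
        return self.head(fused.mean(dim=1))           # species logits, trained with cross-entropy

# training-step sketch (labels: (batch,) species ids)
# logits = model(text_emb, visual_emb)
# loss = nn.functional.cross_entropy(logits, labels)
          </preformat>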
        </sec>
        <sec id="sec-3-7-2">
          <title>3.2. Training Strategy</title>
          <p>In the first stage, we design the Dynamic Weighting Contrastive Loss, an enhanced supervised contrastive loss function [22] that incorporates an entropy-based uncertainty-weighted sampling mechanism to prioritize hard examples during training. Our improvements over the standard loss function are as follows. First, uncertainty-aware weighting: during loss calculation, samples with higher prediction uncertainty are assigned greater weights, ensuring the model focuses on ambiguous instances that are critical for fine-grained discrimination. Second, adaptive pair construction: positive pairs are formed by randomly sampling up to 4 instances per category, with a strict requirement of at least 2 samples per category to form valid pairs; for categories with fewer than 2 samples, new instances are generated via data augmentation to meet this constraint. Negative pairs are generated across distinct categories using a uniform class sampling strategy to avoid model bias. This design stabilizes the contrastive learning process by balancing positive and negative pairs while dynamically emphasizing the samples that contribute most to reducing model uncertainty.</p>
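          <p>The adaptive pair-construction strategy can be sketched as follows; the per-class cap of 4, the minimum of 2 samples per class, and the augmentation of sparse classes follow the description above, while the function names and data structures are illustrative assumptions.</p>
          <preformat>
import random
from collections import defaultdict

def build_contrastive_batch(samples, augment, max_per_class=4):
    """Adaptive positive/negative pair construction for the Stage-I contrastive objective.

    samples: list of (feature_or_image, label); augment: callable creating a new instance.
    Classes with a single sample are augmented so that every class can form valid positive pairs.
    """
    by_class = defaultdict(list)
    for x, y in samples:
        by_class[y].append(x)

    batch = []
    for y, items in by_class.items():
        if len(items) >= 2:
            chosen = random.sample(items, min(len(items), max_per_class))
        else:
            chosen = [items[0], augment(items[0])]   # synthesize a second instance for sparse classes
        batch.extend((x, y) for x in chosen)

    # negatives arise across distinct classes; uniform class sampling keeps them balanced
    random.shuffle(batch)
    return batch
          </preformat>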
          <p>Given a batch of N samples, let z_i denote the feature vector of the i-th sample (including augmented instances for sparse categories). We first normalize the features:</p>
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math>\hat{z}_i = \frac{z_i}{\lVert z_i \rVert_2}</tex-math>
          </disp-formula>
          <p>The pairwise similarity matrix is computed as:</p>
          <disp-formula id="eq2">
            <label>(2)</label>
            <tex-math>S = \hat{Z} \hat{Z}^{\top}, \quad s_{ij} = \hat{z}_i \cdot \hat{z}_j</tex-math>
          </disp-formula>
          <p>The enhanced loss function is defined as:</p>
          <disp-formula id="eq3">
            <label>(3)</label>
            <tex-math>\mathcal{L} = \frac{1}{\sum_{i \in A} w_i} \sum_{i \in A} \frac{w_i}{|P(i)|} \sum_{p \in P(i)} -\log \frac{\exp(s_{ip}/\tau)}{\sum_{a \notin \mathcal{I}(i)} \exp(s_{ia}/\tau)}</tex-math>
          </disp-formula>
          <p>where:</p>
          <p>• A = {i ∣ |P(i)| ≥ 2} is the set of valid anchors.</p>
          <p>• P(i) = {p ∣ y_p = y_i, p ≠ i} denotes the set of positive samples for anchor i, with |P(i)| ≥ 2 (augmented instances are included for sparse categories).</p>
          <p>• ℐ(i) = {i} ∪ {j ∣ mask_j = 0} represents the invalid indices excluded by the triple masking mechanism (self-similarity and invalid pairs).</p>
          <p>• w_i = σ(H(p_i)) is the uncertainty weight for anchor i, where H(p_i) = −∑_{c=1}^{C} p_{i,c} log p_{i,c} is the entropy of the predicted probabilities p_i and σ is the sigmoid function.</p>
          <p>• τ is the temperature parameter.</p>
          <p>In the second stage, the text embeddings and visual features are integrated and fed into the Q-Former module, and the learnable query tokens are initialized as q. The Q-Former performs interactive fusion between the textual and visual features through multi-head self-attention, progressively updating the query tokens across multiple layers and representation subspaces to capture the fused multimodal information. The output query representations from the Q-Former are then passed through a classification head for species prediction, and the final classification results are supervised with a cross-entropy loss.</p>
        </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Settings</title>
        <p>Dataset. The FungiCLEF2025 challenge dataset is built from fungi observations submitted to the Atlas of Danish Fungi before the end of 2023, with labels provided by mycologists. It includes not only multiple photographs of the same specimen but also a wealth of supplementary data such as satellite imagery, meteorological records, and structured metadata. The vast majority of observations have been annotated with most of these attributes. As shown in Table 1, the training set contains 4,293 observations, 7,819 images, and 2,427 classes, while the validation set has 1,099 observations, 2,285 images, and 570 classes. All of the images are also accompanied by tabular metadata and automatically generated text descriptions of the images. Each class in the training set has between 1 and 4 observations. Training uses a learning rate scheduler, with the initial learning rate set to 0.0002 and a batch size of 1024.</p>
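        <p>A minimal training-configuration sketch consistent with these settings is shown below; only the initial learning rate of 0.0002 and the batch size of 1024 are taken from the text, while the optimizer and scheduler choices are assumptions for illustration.</p>
        <preformat>
import torch
from torch.utils.data import DataLoader

def make_training_setup(model, train_dataset):
    """Training-configuration sketch: only lr=0.0002 and batch size 1024 come from the paper;
    the optimizer and scheduler choices below are assumptions, not the reported setup."""
    loader = DataLoader(train_dataset, batch_size=1024, shuffle=True, num_workers=8)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150)  # assumed schedule
    return loader, optimizer, scheduler
        </preformat>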
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Metric</title>
        <p>The evaluation metric for this competition is the standard Top-k accuracy, defined as the proportion of instances whose true label is within the top k predicted labels:</p>
        <disp-formula id="eq4">
          <label>(4)</label>
          <tex-math>\mathrm{Top\text{-}k\ Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\,(y_i \in \hat{Y}_i)</tex-math>
        </disp-formula>
        <p>where:</p>
        <p>• N is the total number of samples.</p>
        <p>• y_i is the true label for the i-th sample.</p>
        <p>• Ŷ_i is the set of top k predicted labels for the i-th sample.</p>
        <p>• 𝟙(⋅) is the indicator function.</p>
        <p>We set k = 5 for the main evaluation metric.</p>
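        <p>This metric can be computed directly from the model logits, as in the short sketch below; the tensor names are illustrative.</p>
        <preformat>
import torch

def topk_accuracy(logits, targets, k=5):
    """Top-k accuracy of Eq. (4): fraction of samples whose true label is among the k highest-scoring predictions."""
    topk = logits.topk(k, dim=1).indices               # (N, k) predicted label sets
    hits = (topk == targets.unsqueeze(1)).any(dim=1)   # indicator 1(y_i in Y_hat_i)
    return hits.float().mean().item()
        </preformat>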
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Fungi Dataset Experiments</title>
        <p>As detailed in Table 2, when using only DINOv2 pretrained visual features, the model achieves relatively low Top-5 accuracy, indicating that global visual features alone are insufficient for distinguishing morphologically similar fungi species. Incorporating the Transformer encoder led to a significant improvement in accuracy, primarily attributable to the self-attention mechanism’s dynamic focus on locally discriminative features. Further integration of habitat metadata boosted the model’s accuracy to 76.991%, as the metadata provided complementary ecological constraints on the visual features.</p>
        <p>As detailed in Table 3, our enhanced loss function ensures numerical robustness during training and delivers the strongest performance on the fine-grained fungi classification task. The Dynamic Weighting Contrastive Loss enhances the model’s discriminative capability by focusing on challenging samples near decision boundaries, thereby improving classification performance for ambiguous cases.</p>
        <p>As shown in Table 4, when training on small-scale datasets, excessively deep architectures may lead to overfitting, thereby reducing test-set performance. The multi-head attention mechanism, as a core component of the Transformer, captures richer feature information by simultaneously attending to different segments of the input sequence across multiple representation subspaces. In our experiments, the 16-head configuration outperformed the 32-head setup. The results in Table 4 also show that the model achieved high scores at 50, 100, and 150 training epochs. Building on these three best checkpoints, we adopted a weighted voting ensemble approach [25] to integrate their predictions as our final competition submission. The aggregated final score reached 78.137%.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The proposed two-stage framework secured 2nd place in the FungiCLEF2025 competition. This
achievement was accomplished through the integration of pretrained DINOv2 feature embeddings, a customized
Transformer architecture, Dynamic Weighting Contrastive Loss, and metadata fusion strategies. Future
research will focus on exploring satellite data augmentation and explainable attention mechanisms to
facilitate practical field applications.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Declaration on Generative AI</title>
      <p>During the preparation of this work, we did not use generative AI tools or services for writing assistance, figure generation, or data analysis. All text, figures, and results were produced solely by the authors.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[5] L. Picek, M. Sulc, R. Chamidullin, J. Matas, Overview of FungiCLEF 2023: Fungi recognition beyond 1/0 cost, in: CLEF (Working Notes), 2023, pp. 1943–1953.</p>
      <p>[6] L. Picek, M. Šulc, J. Heilmann-Clausen, J. Matas, Overview of FungiCLEF 2022: Fungi recognition as an open set classification problem, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, 2022.</p>
      <p>[7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).</p>
      <p>[8] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., DINOv2: Learning robust visual features without supervision, arXiv preprint arXiv:2304.07193 (2023).</p>
      <p>[9] J. Li, D. Li, S. Savarese, S. Hoi, BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, in: International Conference on Machine Learning, PMLR, 2023, pp. 19730–19742.</p>
      <p>[10] H. Ren, H. Jiang, W. Luo, M. Meng, T. Zhang, Entropy-guided open-set fine-grained fungi recognition, in: CLEF (Working Notes), 2023, pp. 2122–2136.</p>
      <p>[11] S. Wolf, J. Beyerer, Optimizing fine-grained fungi classification for diverse application-oriented open-set metrics, in: CLEF (Working Notes), 2023, pp. 2159–2167.</p>
      <p>[12] F. Hu, P. Wang, Y. Li, C. Duan, Z. Zhu, Y. Li, X.-S. Wei, A deep learning based solution to FungiCLEF2023, in: CLEF (Working Notes), 2023, pp. 2051–2059.</p>
      <p>[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[14] J. Wang, W. Zhang, Y. Zang, Y. Cao, J. Pang, T. Gong, K. Chen, Z. Liu, C. C. Loy, D. Lin, Seesaw loss for long-tailed instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9695–9704.</p>
      <p>[15] J. Etheredge, OpenWGAN-GP for fine-grained open-set fungi classification, Working Notes of CLEF (2024).</p>
      <p>[16] B.-F. Tan, Y.-Y. Li, P. Wang, L. Zhao, X.-S. Wei, Say no to the poisonous fungi: An effective strategy for reducing 0-1 cost in FungiCLEF2024, Training 1 (2024) 295–938.</p>
      <p>[17] S. Wolf, P. Thelen, J. Beyerer, Poison-aware open-set fungi classification: Reducing the risk of poisonous confusion, Working Notes of CLEF (2024).</p>
      <p>[18] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, et al., Swin Transformer V2: Scaling up capacity and resolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12009–12019.</p>
      <p>[19] L. Yang, X. Li, R. Song, B. Zhao, J. Tao, S. Zhou, J. Liang, J. Yang, Dynamic MLP for fine-grained image classification by leveraging geographical and temporal information, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10945–10954.</p>
      <p>[20] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A unified embedding for face recognition and clustering, IEEE (2015).</p>
      <p>[21] K. Sohn, Improved deep metric learning with multi-class n-pair loss objective, in: Advances in Neural Information Processing Systems, volume 29, Curran Associates, Inc., 2016, pp. 1857–1865. URL: https://proceedings.neurips.cc/paper/2016/file/6b180037abbebea991d8b1232f8a8ca9-Paper.pdf.</p>
      <p>[22] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, D. Krishnan, Supervised contrastive learning, Advances in Neural Information Processing Systems 33 (2020) 18661–18673.</p>
      <p>[23] A. Paszke, PyTorch: An imperative style, high-performance deep learning library, arXiv preprint arXiv:1912.01703 (2019).</p>
      <p>[24] I. Loshchilov, F. Hutter, et al., Fixing weight decay regularization in Adam, arXiv preprint arXiv:1711.05101 5 (2017) 5.</p>
      <p>[25] L. Breiman, Bagging predictors, Machine Learning 24 (1996) 123–140.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.-S.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-Z.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. Mac</given-names>
            <surname>Aodha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <article-title>Fine-grained image analysis with deep learning: A survey</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>44</volume>
          (
          <year>2021</year>
          )
          <fpage>8927</fpage>
          -
          <lpage>8948</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Janouskova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          , Overview of FungiCLEF 2025:
          <article-title>Few-shot classification with rare fungi species</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janoušková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Martellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of LifeCLEF 2025:
          <article-title>Challenges on species presence prediction and identification, and individual animal identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF)</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          , Overview of FungiCLEF 2024:
          <article-title>Revisiting fungi species recognition beyond 0-1 cost</article-title>
          , in:
          <source>CLEF 2024</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>