<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fine-Grained Classification for Poisonous Fungi Identification with Transfer Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christopher Chiu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maximilian Heil</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Teresa Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anthony Miyaguchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>North Ave NW, Atlanta, GA 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
<p>FungiCLEF 2024 addresses the fine-grained visual categorization (FGVC) of fungi species, with a focus on identifying poisonous species. This task is challenging due to the size and class imbalance of the dataset, subtle inter-class variations, and significant intra-class variability amongst samples. In this paper, we document our approach to tackling this challenge through the use of ensemble classifier heads on pre-computed image embeddings. Our team (DS@GT) demonstrates that state-of-the-art self-supervised vision models can be utilized as robust feature extractors for downstream computer vision tasks without the need for task-specific fine-tuning of the vision backbone. Our approach achieved the best Track 3 score (0.345), accuracy (78.4%), and macro-F1 (0.577) on the private test set in post-competition evaluation. Our code is available at https://github.com/dsgt-kaggle-clef/fungiclef-2024.</p>
      </abstract>
      <kwd-group>
        <kwd>Fine-Grained Visual Categorization (FGVC)</kwd>
        <kwd>Poisonous Fungi Identification</kwd>
        <kwd>Transfer Learning</kwd>
        <kwd>Vision Transformers</kwd>
        <kwd>CEUR-WS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Dataset Overview</title>
        <p>
          The featured dataset for the FungiCLEF competition [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is the Danish Fungi dataset [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This dataset
comprises a training set (DF20), which includes 356,770 images over 1,604 different classes of fungi, and a
validation / testing dataset (DF21), consisting of 60,832 images over 2,713 species of fungi, covering
a year’s worth of observations. Species in the validation dataset that were not present in the training
dataset were marked as an "unknown" class. The dataset provides both full-sized images (110GB)
and downsized images (300px max dimension, 5.6GB). It also provides metadata for the fungi images,
including date, location, substrate and metasubstrate of the fungi growth, and the full taxonomic
ranks of the classified fungi species, including phylum, class, order, family, and genus.
        </p>
        <p>These two datasets do not have the same distribution of classes (Figure 2). Moreover, there was
significant class imbalance in both datasets, with the most common class having 1,913 images, and
the least common class only ~30 images in DF20, and down to only one observation for some species
in DF21. There were also significant variations in terms of lighting, background, and clarity, due to
the real-world conditions under which fungi were photographed (Figure 1). This adds another layer of
complexity to the task: fungi classes are hard to distinguish not only due to subtle inter-class variations
and high intra-class variance, but also due to varying image quality and image features. This highlights the need
for a robust model to effectively perform fine-grained classification on this rich and varied dataset.</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Related Work</title>
        <p>
          State-of-the-art work on this dataset primarily utilizes models such as Swin Transformer [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and
MetaFormer [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. However, results from FungiCLEF 2023 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] underscore limitations in current research,
where the best accuracy among participants has not improved significantly since the competition’s
inception in 2022 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Last year’s winner incorporated metadata into the model with MetaFormer as
the vision model [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and utilized Seesaw Loss [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] to handle class imbalance. This led to a macro F1
of 0.571, with a poisonous and edible species confusion rate of 5.31% and 2.05% respectively [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. To
handle unknown classes, the team also introduced an entropy-based approach to identify unknown,
out-of-distribution species [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          Beyond FungiCLEF, Wei et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] provide a comprehensive examination of fine-grained visual
categorization (FGVC) challenges, such as accurately localizing object parts, selecting informative
features under varied conditions, and integrating segmentation with classification. They emphasize the
need for models to generalize across species, maintain efficiency, and handle real-world issues like
occlusions. Other directions that demonstrated promise on FGVC datasets such as CUB-200-2011 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
include Mask-CNN [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], which outperformed other methods by better capturing subtle differences
between species, and SR-GNNs [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which extracted context-aware features from relevant image regions
to discriminate between object classes.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>Our overall approach to this challenge of fine-grained classification of fungi species was to:
1. Incorporate metadata as additional inputs / prediction targets for the model.
2. Learn the concept of unknown classes by incorporating the validation dataset into training.
3. Experiment with objective functions to induce model capability on the fine-grained classification task.
4. Train only on metadata and image embeddings for rapid prototyping and model optimization.</p>
      <p>Cloud computing resources were supplied by Data Science @ Georgia Tech. Data was hosted on
Google Cloud Storage, and models were developed on virtual instances with NVIDIA L4 GPUs. For more
memory-intensive experiments, we also used an NVIDIA RTX 4090 and a distributed cluster with
2× NVIDIA V100 GPUs.</p>
      <p>
        Libraries used include pandas [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], PaCMAP [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], scikit-learn [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for data exploration; PySpark [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ],
PyArrow [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], Luigi [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] for data processing; PyTorch [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], timm [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], Lightning [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and transformers
[
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] for model development. Evaluation functions for the FungiCLEF competition were referenced in
the development of internal model benchmarks [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Dataset Preparation</title>
        <p>To improve the efficiency of experiments, we built a data preprocessing pipeline with PySpark (Figure 3).
We joined the 300-pixel and full-sized versions of the image data with their associated metadata, and stored them as
parquet files for faster I/O. Embeddings were also precomputed and stored separately as parquet files. A
custom PyTorch dataset object was created to serve the image and embedding data alongside metadata.</p>
        <p>
          The metadata columns were grouped based on their potential use as either model inputs or prediction
targets. For the validation set / public test set, only substrate, metasubstrate, habitat, date, and location
were provided [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. As such, these columns were used as additional inputs to the model. Categorical
columns were expanded into one-hot vectors. For date information, we converted the month and day
into cyclical encoding using sine / cosine transformation [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. For location data, we converted longitude
and latitude into a Geohash, which preserves spatial ordinality and unifies location inputs along a
Z-order curve [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. Levels 2-5 of the resultant Geohash were extracted and converted from base-32 to
normalized base-10 integers. The toxicity and one-hot vectors of the taxonomical levels of fungi classes
were included as additional prediction targets. Other metadata columns were excluded.
        </p>
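<p>As an illustrative sketch (the function names and exact normalization are our assumptions, not the competition code), the cyclical date encoding and the Geohash level extraction described above might look like:</p>

```python
import math

# Hypothetical sketch of the metadata encodings described above.

def encode_cyclical(value: int, period: int) -> tuple[float, float]:
    """Map a cyclic quantity (e.g. month 1-12, day 1-31) onto the unit circle,
    so the end of a cycle sits next to its beginning."""
    angle = 2 * math.pi * (value - 1) / period
    return math.sin(angle), math.cos(angle)

# Standard Geohash base-32 alphabet (digits, then letters minus a, i, l, o).
GEOHASH_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_levels(geohash: str, levels: range = range(2, 6)) -> list[float]:
    """Convert the Geohash character at each precision level (2-5) from
    base-32 to a normalized base-10 value in [0, 1)."""
    return [GEOHASH_BASE32.index(geohash[lvl - 1]) / 32 for lvl in levels]
```

<p>With this encoding, December and January map to adjacent points on the circle, so the model sees them as nearby dates rather than opposite ends of a numeric range.</p>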
        <p>Given that unknown classes were only present in the validation dataset, we divided DF21 into three
equal sections of 20,000 cases, stratified by species. One section was designated as the held-out
test set, and the remaining two sections were used as the validation set or added to the training set
in two-fold cross-validation. By percentage, this gives a ratio of 90.4%, 4.8%, and 4.8% for
training, validation, and testing over the entire dataset. Importantly, the test and validation sets in each
validation fold have the same class distribution.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Embeddings for Transfer Learning</title>
        <p>
          Embeddings are the learned intermediate representation of deep learning models that capture structure
about the input domain. We experimented with two models as the vision model backbone to generate
embeddings - DINOv2 [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] and ResNet [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. DINOv2 was chosen as a state-of-the-art
vision model, noted for its richness and robustness as a visual feature extractor [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. ResNet was chosen due
to its widespread adoption in downstream applications [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]; it serves as a representative of the CNN family,
in contrast to the transformer family from which DINOv2 originates. For ResNet18, we generated
embeddings by extracting the output features from the last hidden state before the classification head.
This resulted in embeddings of shape (1000,) per image. For DINOv2, we utilized the [CLS] token from
the last hidden state of the model output. In initial experiments and ablation studies, dinov2-small [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]
was used, which resulted in an embedding shape of (768,). In our optimized model for the competition
submission, dinov2-large with registers [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] was used, which had an embedding shape of (1024,).
        </p>
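<p>A shape-only illustration of the [CLS] extraction (no real model call; the token count and dimension below are assumptions for a ViT-style backbone, not measured values):</p>

```python
import numpy as np

# A ViT-style backbone such as DINOv2 returns a last hidden state of shape
# (batch, tokens, dim); the [CLS] token is the first token in the sequence.
batch_size, num_tokens, dim = 4, 257, 1024   # illustrative dinov2-large-sized output
last_hidden_state = np.random.rand(batch_size, num_tokens, dim)

# Take the [CLS] token per image as the embedding served to the classifier.
cls_embeddings = last_hidden_state[:, 0, :]
print(cls_embeddings.shape)  # (4, 1024)
```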
        <p>For training, the image embeddings were precomputed. For the testing set and for our competition
submission, the vision backbone model was frozen, and embeddings were generated during inference
and fed into our trained classifier heads.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Model Development</title>
        <p>
          We explored two separate approaches in model development: (1) training a computer vision model
end-to-end, and (2) training a classifier head only on precomputed embeddings. While approach (1) is
the more traditional method for computer vision tasks, it is much more compute-intensive due to the
number of parameters to be trained [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. In comparison, approach (2) had significantly lower memory
requirements and faster training times (Table 1). While using precomputed image embeddings implies
that the training data could not undergo traditional computer vision augmentation techniques such as
flipping and random cropping, we hypothesize that modern vision models encode a sufficient amount of
information in the feature representation for the downstream model to be robust and generalizable.
        </p>
        <sec id="sec-2-3-1">
          <title>2.3.1. Model Training</title>
          <p>
            For transfer learning with the embedding model, we use a traditional MLP classifier head with a hidden
dimension of 4096, with metadata directly concatenated to the embedding. Inspired by Diao et al. [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ],
we also experiment with using a transformer block for better integration of metadata into the classifier.
We transformed metadata into the same dimensions as the embedding with a separate MLP layer, and
added them to the image embeddings before feeding the combined data into a transformer block for image
classification. To leverage the benefits of cross-fold validation, we utilized an ensemble model approach
[
            <xref ref-type="bibr" rid="ref32">32</xref>
            ]. Output logits of our model are averaged over all the classifier heads.
          </p>
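<p>The ensemble step can be sketched as follows (a minimal illustration of logit averaging; the logit values are made up):</p>

```python
import numpy as np

# Minimal sketch of the ensemble: average the output logits of the classifier
# heads trained on each cross-validation fold, then pick the top class.
def ensemble_predict(logits_per_head: list[np.ndarray]) -> np.ndarray:
    """Average logits over heads, then take the argmax class per sample."""
    mean_logits = np.mean(logits_per_head, axis=0)
    return mean_logits.argmax(axis=1)

head_a = np.array([[2.0, 0.5, 0.1]])   # illustrative logits for one sample
head_b = np.array([[0.0, 2.5, 0.1]])
print(ensemble_predict([head_a, head_b]))  # → [1]
```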
          <p>All models were first trained on a smaller, exploratory development set, before undergoing training
runs on the full dataset. For experiments that appeared promising in their initial outcomes, training
parameters were further tuned using Optuna to generate a full model for benchmarking. Training
performance was logged on Weights &amp; Biases, with the top 2 performing models saved as checkpoints.
A two-fold cross-validation was used, where each fold had 1/3rd of DF21 dataset as the validation
set, and another 1/3rd incorporated into the training data. Our experiment logs can be viewed at
https://wandb.ai/chiu/FungiClef.</p>
          <p>
            All experiments were trained for 20 to 50 epochs, with batch sizes of 64 to 512. Initial learning rates
ranged from 1 × 10⁻⁵ to 1 × 10⁻³, with AdamW [
            <xref ref-type="bibr" rid="ref33">33</xref>
            ] as optimizer. Learning rate schedulers experimented
with include cosine scheduler with restarts [
            <xref ref-type="bibr" rid="ref34">34</xref>
            ], and ReduceLROnPlateau [
            <xref ref-type="bibr" rid="ref35">35</xref>
            ].
          </p>
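<p>As a simplified sketch of cosine annealing with warm restarts [34] (the actual runs used library schedulers; the cycle length and bounds here are illustrative assumptions):</p>

```python
import math

# Simplified cosine learning-rate schedule with warm restarts: the rate
# anneals from lr_max to lr_min within each cycle, then resets.
def cosine_restart_lr(step: int, cycle_len: int,
                      lr_max: float = 1e-4, lr_min: float = 1e-6) -> float:
    """Learning rate at a given step under a fixed-length restart cycle."""
    t = step % cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))
```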
          <p>
            Metrics recorded during training include training / validation loss, top-1, top-3 accuracy, macro F1
score, and accuracy for correct identification of poisonous species. Calculations for specific track scores
were adapted from the FungiCLEF competition [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] for model benchmark. This includes classification
error (Track 1), cost for poisonousness confusion (Track 2), and user-specific cost (Track 3) [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ].
          </p>
        </sec>
        <sec id="sec-2-3-2">
          <title>2.3.2. Loss Function</title>
          <p>
            The baseline loss function for model development was unweighted, multi-class cross entropy loss. We
also explored incorporating class weights in cross-entropy loss, and other loss functions such as focal
loss [
            <xref ref-type="bibr" rid="ref36">36</xref>
            ] and seesaw loss [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], which was used by last year’s winner [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] to overcome class imbalance.
Additionally, we experimented with using various metadata such as the higher level taxonomy of the
fungi class and the toxicity of the fungi class as additional prediction targets.
          </p>
          <p>Our benchmark model was trained with a custom loss function:</p>
          <p>ℒ<sub>composite</sub> = ℒ<sub>seesaw</sub> + λ · ℒ<sub>poison</sub></p>
          <p>where ℒ<sub>seesaw</sub> is the seesaw loss of the class prediction, ℒ<sub>poison</sub> is the binary cross-entropy loss of
the model’s prediction of whether the fungus is poisonous, and λ is an adjustable weighting factor for the
composite loss function.</p>
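<p>A minimal sketch of this composite loss (the seesaw term is abstracted as a given class-loss value, since its full definition is beyond this sketch; function names are illustrative):</p>

```python
import math

def binary_cross_entropy(p: float, y: int) -> float:
    """BCE for a predicted poisonous-probability p and a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def composite_loss(seesaw_loss: float, poison_prob: float,
                   is_poisonous: int, lam: float = 0.1) -> float:
    """L_composite = L_seesaw + lam * L_poison, as in the equation above."""
    return seesaw_loss + lam * binary_cross_entropy(poison_prob, is_poisonous)
```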
        </sec>
        <sec id="sec-2-3-3">
          <title>2.3.3. Weighted Sampling</title>
          <p>
            While a weighted sampler is usually used to overcome class imbalance [
            <xref ref-type="bibr" rid="ref37">37</xref>
            ], we utilized this technique
in our data loader to ameliorate the diference in class distribution between the training and validation
set. Instead of adjusting class weights such that each class is evenly represented, we derived the
per-sample weight by dividing the class frequency of the validation set over the training set:
 =
          </p>
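<p>This weighting scheme can be sketched as follows (function and variable names are illustrative, not the competition code):</p>

```python
from collections import Counter

# Weight each training sample by its class frequency in the validation set
# divided by its class frequency in the training set, so the sampler draws
# the training data to match the validation class distribution.
def sampling_weights(train_labels: list, val_labels: list) -> list[float]:
    train_freq = Counter(train_labels)
    val_freq = Counter(val_labels)
    n_train, n_val = len(train_labels), len(val_labels)
    return [
        (val_freq[c] / n_val) / (train_freq[c] / n_train)
        for c in train_labels
    ]
```

<p>Classes over-represented in the validation set relative to the training set receive weights above 1, so the sampler draws them more often.</p>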
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <sec id="sec-3-1">
        <title>3.1. Training Results</title>
        <p>
          Our best performing model was an ensemble model on DINOv2 embeddings consisting of two classifier
heads (180MB each) from the two-folds of cross-validation training. The model was trained on image
embeddings precomputed from DINOv2-large. The weighting λ for the poison loss was 0.1. The initial
learning rate was 1 × 10⁻⁴, with AdamW [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] as optimizer, and cosine learning rate scheduler with
warm restarts [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Leaderboard Results</title>
        <p>
          The results of our team’s experiments are outlined in Table 4. For our first submission during the competition,
we used a pre-trained MetaFormer model from the previous year’s competition as a baseline. In
post-competition evaluation, our best model achieved an accuracy of 78.4% and a macro F1 score of 0.577 on
the private test set. Our model’s performance was comparable to previous years’ winners [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and was
the best performing model in this year’s competition in terms of Track 1, Track 3, and accuracy. Our
Track 2 and F1 scores ranked 2nd among the competitors<sup>1</sup>. The inference time
across the full public test set (40,216 images) was 25:26 minutes, averaging 0.126s per image on an
RTX 4090.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>
        We initially experimented with vision models including EfficientNet [38], VisionTransformer [39],
and MetaFormer [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Due to training time and memory overhead, we opted to focus our efforts on
developing a lightweight classifier on precomputed embeddings instead.
      </p>
      <sec id="sec-4-1">
        <title>4.1. ResNet vs. DINOv2 as Vision Backbone for Embedding Generation</title>
        <p>
          Overall, while DINOv2 embeddings proved to be a good input for image classification, our embedding
model using ResNet embeddings did not perform well, with a best validation accuracy of 25%. This was
likely because DINOv2 is a class-agnostic, self-supervised model, whereas ResNet was trained on
ImageNet with specific classification targets. As such, the features extracted from ResNet would be
more tailored to its training dataset, whereas DINOv2 features were more representative of the underlying
image [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. To further investigate this, we visualized the embeddings with UMAP [40] in Figure 4,
which showed that ResNet embeddings did not separate well, whereas there was a clear separation in
the DINOv2 embeddings.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Incorporation of Metadata</title>
        <p>We experimented with using metadata as additional prediction targets, as seen in our ablation in Table
3. The inclusion of metadata appeared to provide some marginal benefits in validation accuracy and F1
score, echoing findings from previous research on this dataset, where incorporating metadata as input
contributed positively to model performance. However, these gains did not justify the additional
overhead required to tune the weighting of the various targets. As such, we did not utilize metadata in
our final model.</p>
        <p><sup>1</sup>Due to numerous issues with the HuggingFace platform, our best results were not recorded in the official competition. Our
post-competition evaluation was performed under the same constraints as the official competition. Post-competition results
were provided and verified by the organiser of FungiCLEF.</p>
        <p><sup>2</sup>Official competition results are from test submissions with an under-tuned vision model. These results are included for
completeness.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Future Work</title>
      <p>Whilst using embeddings allowed for much faster model development, there remains a
gap in performance between the embedding classifier and traditional image-based models. It
is likely that the information loss in the transformation of image to embeddings was too significant
for the simple classifier architecture to overcome. It would be interesting to further fine-tune DINOv2
on the DanishFungi dataset, and repeat our experiments. Moreover, a more rigorous incorporation of
metadata into our models could provide a more holistic understanding of the data, leading to more
accurate and reliable classification systems.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In summary, we addressed the complex task of fine-grained visual categorization (FGVC) for identifying
poisonous fungi using transfer learning and advanced deep learning methodologies. The Danish Fungi
2020 dataset presented significant challenges such as class imbalance, subtle inter-class variations,
and high intra-class variability, necessitating a comprehensive data preprocessing and augmentation
pipeline.</p>
      <p>Our experiments with various deep learning models, including vision transformers, convolutional
neural networks, and linear classifiers with embeddings, highlighted the potential of DINOv2
embeddings combined with a multi-layer perceptron. Integrating multimodal metadata further enhanced
classification performance, emphasizing the value of auxiliary information. Despite promising results,
embedding-based classifiers faced limitations due to potential information loss, suggesting the need for
fine-tuning self-supervised models on domain-specific datasets and improved metadata incorporation.
Overall, our research advances FGVC technical capabilities, providing valuable methodologies for
mycological safety and educational applications, and contributes to the broader field of fine-grained
classification tasks.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>We thank the DS@GT CLEF team for providing the development and research environment for our
machine learning experiments, as well as valuable comments and suggestions.</p>
      <p>[38] M. Tan, Q. V. Le, EfficientNetV2: Smaller models and faster training, CoRR abs/2104.00298 (2021).
URL: https://arxiv.org/abs/2104.00298. arXiv:2104.00298.
[39] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words:
Transformers for image recognition at scale, CoRR abs/2010.11929 (2020). URL: https://arxiv.org/abs/2010.11929.
arXiv:2010.11929.
[40] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for
dimension reduction (2018). arXiv:1802.03426.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          , Overview of FungiCLEF 2024:
          <article-title>Revisiting fungi species recognition beyond 0-1 cost</article-title>
          ,
          <source>in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          , et al.,
          <source>Overview of lifeclef</source>
          <year>2024</year>
          :
          <article-title>Challenges on species distribution prediction and identification</article-title>
          ,
          <source>in: International Conference of the CrossLanguage Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Jeppesen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Heilmann-Clausen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Laessøe</surname>
          </string-name>
          , T. Frøslev,
          <article-title>Danish fungi 2020 - not just another image recognition dataset</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1525</fpage>
          -
          <lpage>1535</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows (</article-title>
          <year>2021</year>
          ). arXiv:
          <volume>2103</volume>
          .
          <fpage>14030</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Si</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
<article-title>Metaformer is actually what you need for vision</article-title>
(
<year>2021</year>
). arXiv:2111.11418.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chamidullin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
,
<article-title>Overview of FungiCLEF 2023: Fungi recognition beyond 1/0 cost</article-title>
          ,
          <source>Proceedings of the Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum (CLEF)</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Heilmann-Clausen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
,
<article-title>Overview of FungiCLEF 2022: Fungi recognition as an open set classification problem</article-title>
          ,
          <source>in: CLEF 2022 Conference and Labs of the Evaluation Forum</source>
          , volume
          <volume>3180</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Meng</surname>
          </string-name>
,
<string-name>
  <given-names>T.</given-names>
  <surname>Zhang</surname>
</string-name>
,
          <article-title>Entropy-guided open-set fine-grained fungi recognition</article-title>
          ,
          <source>Proceedings of the Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum (CLEF)</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Loy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
<article-title>Seesaw loss for long-tailed instance segmentation</article-title>
(
<year>2020</year>
). arXiv:2008.10032.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Macêdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. I.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zanchettin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L. I.</given-names>
            <surname>Oliveira</surname>
          </string-name>
,
<string-name>
  <given-names>T.</given-names>
  <surname>Ludermir</surname>
</string-name>
,
<article-title>Entropic out-of-distribution detection: Seamless detection of unknown examples</article-title>
(
<year>2020</year>
). arXiv:2006.04005.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>X.-S.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-W.</given-names>
            <surname>Xie</surname>
          </string-name>
,
<string-name>
  <given-names>J.</given-names>
  <surname>Wu</surname>
</string-name>
,
<article-title>Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization</article-title>
(
<year>2016</year>
). arXiv:1605.06878.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Branson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Welinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
<string-name>
  <given-names>S.</given-names>
  <surname>Belongie</surname>
</string-name>
,
<article-title>The Caltech-UCSD Birds-200-2011 dataset</article-title>
,
<year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Gkioxari</surname>
</string-name>
,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
<article-title>Mask R-CNN</article-title>
(
<year>2017</year>
). arXiv:1703.06870.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wharton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bessis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Behera</surname>
          </string-name>
          ,
<article-title>SR-GNN: Spatial relation-aware graph neural network for fine-grained image categorization</article-title>
(
<year>2022</year>
). arXiv:2209.02109v1.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>McKinney</surname>
          </string-name>
          ,
          <article-title>Data structures for statistical computing in python</article-title>
          ,
<source>in: Proceedings of the 9th Python in Science Conference (SciPy 2010)</source>
,
<year>2010</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shaposhnik</surname>
          </string-name>
          ,
          <article-title>Understanding how dimension reduction tools work: An empirical approach to deciphering t-sne, umap, trimap, and pacmap for data visualization</article-title>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
,
<string-name>
  <given-names>É.</given-names>
  <surname>Duchesnay</surname>
</string-name>
,
          <article-title>Scikit-learn: Machine learning in python</article-title>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
<string-name>
  <surname>Apache Spark Developers</surname>
</string-name>
,
<article-title>PySpark: Python API for Apache Spark</article-title>
,
<year>2024</year>
. URL: https://spark.apache.org/docs/latest/api/python/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>McKinney</surname>
          </string-name>
          ,
<article-title>PyArrow: Python API for Apache Arrow</article-title>
,
<year>2024</year>
. URL: https://arrow.apache.org/docs/latest/api/python/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bernhardsson</surname>
          </string-name>
          , E. Freider,
          <article-title>Luigi: A python package for building complex pipelines of batch jobs</article-title>
          ,
          <year>2024</year>
          . URL: https://luigi.readthedocs.io.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Chanan</surname>
</string-name>
,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tejani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chilamkurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          ,
<article-title>PyTorch: An imperative style, high-performance deep learning library</article-title>
(
<year>2019</year>
). arXiv:1912.01703.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wightman</surname>
          </string-name>
,
<article-title>timm: PyTorch image models</article-title>
,
<year>2019</year>
. URL: https://github.com/rwightman/pytorch-image-models.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>W.</given-names>
            <surname>Falcon</surname>
          </string-name>
,
<string-name>
  <surname>The PyTorch Lightning team</surname>
</string-name>
,
<article-title>PyTorch Lightning</article-title>
,
<year>2024</year>
. URL: https://lightning.ai/docs/pytorch/stable/.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
,
<string-name>
  <given-names>P.</given-names>
  <surname>von Platen</surname>
</string-name>
,
<string-name>
  <given-names>C.</given-names>
  <surname>Ma</surname>
</string-name>
,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
,
<article-title>Transformers: State-of-the-art natural language processing</article-title>
          ,
          <source>in: 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Heilmann-Clausen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Jeppesen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Laessøe</surname>
          </string-name>
,
<string-name>
  <given-names>T.</given-names>
  <surname>Frøslev</surname>
</string-name>
,
          <article-title>Danish fungi 2020 - not just another image recognition dataset</article-title>
          ,
          <year>2021</year>
. arXiv:2103.10107.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
[26]
<string-name>
  <given-names>I.</given-names>
  <surname>London</surname>
</string-name>
,
<article-title>Encoding cyclical continuous features - 24-hour time</article-title>
,
<year>2016</year>
. URL: https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Suwardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Satya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Lestari</surname>
          </string-name>
          ,
          <article-title>Geohash index based spatial data model for corporate</article-title>
          ,
          <source>in: 2015 International Conference on Electrical Engineering and Informatics (ICEEI)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>478</fpage>
          -
          <lpage>483</lpage>
. doi:10.1109/ICEEI.2015.7352548.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oquab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darcet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Moutakanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Vo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szafraniec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khalidov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Haziza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Nouby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Assran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ballas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Galuba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Howes</surname>
          </string-name>
,
<string-name>
  <given-names>P.-Y.</given-names>
  <surname>Huang</surname>
</string-name>
,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
<string-name>
  <given-names>I.</given-names>
  <surname>Misra</surname>
</string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rabbat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sharma</surname>
          </string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Synnaeve</surname>
</string-name>
,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mairal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Labatut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
,
<string-name>
  <given-names>P.</given-names>
  <surname>Bojanowski</surname>
</string-name>
,
<article-title>DINOv2: Learning robust visual features without supervision</article-title>
          ,
          <year>2024</year>
. arXiv:2304.07193.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
,
<string-name>
  <given-names>S.</given-names>
  <surname>Ren</surname>
</string-name>
,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
<article-title>Deep residual learning for image recognition</article-title>
(
<year>2015</year>
). arXiv:1512.03385.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Z.-Y.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
,
<string-name>
  <given-names>L.</given-names>
  <surname>Yuan</surname>
</string-name>
,
          <string-name>
            <given-names>N.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
<article-title>An empirical study of training end-to-end vision-and-language transformers</article-title>
(
<year>2021</year>
). arXiv:2111.02387.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Diao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <article-title>MetaFormer: A unified meta framework for fine-grained recognition (</article-title>
          <year>2022</year>
          ). arXiv:
          <volume>2203</volume>
          .
          <fpage>02751</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Ganaie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tanveer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Suganthan</surname>
          </string-name>
          ,
          <article-title>Ensemble deep learning: A review (</article-title>
          <year>2021</year>
          ). arXiv:
          <volume>2104</volume>
          .
          <fpage>02395</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled weight decay regularization (</article-title>
          <year>2017</year>
          ). arXiv:
          <volume>1711</volume>
          .
          <fpage>05101</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>SGDR: Stochastic gradient descent with warm restarts (</article-title>
          <year>2017</year>
          ). arXiv:
          <volume>1608</volume>
          .
          <fpage>03983</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Kababji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bensaali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Dakua</surname>
          </string-name>
          ,
          <article-title>Scheduling techniques for liver segmentation: ReduceLROnPlateau vs OneCycleLR (</article-title>
          <year>2022</year>
          ). arXiv:
          <volume>2202</volume>
          .
          <fpage>06373</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Focal loss for dense object detection (</article-title>
          <year>2017</year>
          ). arXiv:
          <volume>1708</volume>
          .
          <fpage>02002</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <article-title>A new weighted sampling method to handle class imbalance</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>