<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hanna Herasimchyk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robin Labryga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomislav Prusina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Hamburg</institution>
          ,
          <addr-line>177 Mittelweg, Hamburg, 20148</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, addressing the PlantCLEF 2025 challenge. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction, utilizing taxonomic hierarchies. Key contributions include multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensemble strategies through bagging and Hydra model architectures. The approach incorporates various inference techniques, including image cropping to remove non-plant artifacts, top-n filtering for prediction constraints, and logit thresholding strategies. Experiments were conducted on approximately 1.4 million training images covering 7,806 plant species. Results demonstrate strong performance, with our submission ranking 3rd on the private leaderboard. Our code is available at https://github.com/geranium12/plant-clef-2025/tree/v1.0.0.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-Label Classification</kwd>
        <kwd>DINOv2</kwd>
        <kwd>Vision Transformer</kwd>
        <kwd>Species Identification</kwd>
        <kwd>Vegetation Plot Images</kwd>
        <kwd>Biodiversity</kwd>
        <kwd>PlantCLEF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Vegetation plot inventories are essential in ecological research, enabling the sampling and assessment
of biodiversity as well as the monitoring of environmental changes. They generate valuable data that
supports ecosystem analysis, biodiversity conservation, and evidence-based environmental
decision-making. A standard vegetation inventory examines small quadrats that are rectangular frames of about
half a square meter placed on the ground to define specific sampling areas. Trained botanists record all
plant species found and quantify their presence using metrics such as biomass, ecological scores, or
coverage observed in images.</p>
      <p>Integrating machine learning methods into this process could drastically enhance efficiency, enabling
broader ecological studies with reduced expert involvement. However, developing models capable of
identifying multiple plant species among thousands in a single image remains a significant technical
challenge.</p>
      <p>Having a quadrat image dataset annotated with all present plant species is crucial, yet expensive and
challenging to create due to the numerous species in a given area. In contrast, substantial collections of
images containing only single plant species already exist, making it much easier to train single-species
classification models.</p>
      <p>
        The PlantCLEF 2025 challenge [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ] seeks to address this gap by evaluating models designed to
predict the presence of multiple plant species in high-resolution quadrat images. In this competition,
models are trained using single-label images of individual plants but are tested on multi-label quadrat
images, highlighting the challenge of domain shift between training and test data.
      </p>
      <p>
        Our main approach utilizes a vision transformer architecture [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] equipped with multiple
classification heads, enabling the model to simultaneously predict species, genus, and family from a shared
feature extraction backbone. This multi-head design effectively integrates taxonomic knowledge and
leverages hierarchical relationships, significantly enhancing the robustness of species predictions in
complex vegetation plot images.
      </p>
      <p>Key contributions of our work towards improving multi-label classification of plant species in quadrat
images include:
• We use multi-head predictions and static knowledge of plant taxonomy to harness information
contained in the metadata of the training images.
• We introduce multi-scale tiling to improve the model’s ability to recognize plants at different
scales in quadrat images.
• We dynamically determine prediction thresholds by optimizing for the mean prediction length.
• We utilize bagging to enhance the model’s robustness and generalization capabilities.
Our code is available on GitHub.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-0">
        <title>2.1. Data</title>
        <p>The training dataset consists of approximately 1.4 million images (about 281 GB) of individual plants,
each accompanied by metadata. This large scale presents a significant computational challenge for
model training. The dataset, also used in the PlantCLEF 2024 competition, covers 7,806 plant species,
1,446 genera, and 181 families.</p>
        <p>The distribution of images across species is shown in Fig. 2, while the distribution of species across
genera and families is depicted in Fig. 3. Each image is labeled with a single plant species, a single genus,
and a single family, and includes metadata such as organ type and geographic location. A genus describes
a group of plant species, while a family describes a group of plant genera. Example training images are
shown in Fig. 1.</p>
        <p>[Fig. 2: Image count per species rank (sorted by image count).]</p>
      </sec>
      <sec id="sec-2-1">
        <title>2.2. Metric</title>
        <p>
          Unlike PlantCLEF 2024 [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ], this competition uses a modified evaluation metric. The final score is
the average of macro-averaged F1 scores, computed for each transect in the test set. A transect is a
sequence of vegetation plots (quadrats) placed along a defined path in the field to systematically record
species occurrences.
        </p>
        <p>Score = (1/T) Σ_{t=1}^{T} [ (1/Q_t) Σ_{q=1}^{Q_t} F1_{t,q} ], where
• T is the total number of transects,
• Q_t is the number of quadrats in transect t,
• F1_{t,q} is the macro-averaged F1-score for quadrat q in transect t.</p>
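<p>As a concrete sketch of this metric, assuming the per-quadrat macro-averaged F1 values have already been computed (the function name is ours):</p>

```python
def competition_score(transects):
    """Average, over transects, of the mean macro-averaged F1 of their quadrats.

    `transects` is a list of transects; each transect is a list of
    per-quadrat macro-averaged F1 scores (floats in [0, 1]).
    """
    per_transect = [sum(f1s) / len(f1s) for f1s in transects]
    return sum(per_transect) / len(per_transect)
```

<p>For example, a transect with quadrat F1 scores (1.0, 0.5) and a one-quadrat transect with 0.7 average to (0.75 + 0.7) / 2 = 0.725.</p>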
      </sec>
      <sec id="sec-2-2">
        <title>2.3. DINOv2 Model</title>
        <p>
          We used a DINOv2 model [
          <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
          ] provided by the PlantCLEF organizers, pre-trained on single-species
training images. The architecture is based on the distilled Vision Transformer Base (ViT-B/14) with
registers [9] serving as the backbone for feature extraction. For each input image, the model generates
an embedding that is then passed through a classification head consisting of one linear layer to predict
the species. Further details can be found in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          Our choice to use DINOv2 was based on empirical evidence from the PlantCLEF 2024 challenge, where
ViT-B architectures demonstrated superior performance compared to alternative model architectures
[
          <xref ref-type="bibr" rid="ref6">6, 10, 11, 12</xref>
          ]. Furthermore, given the computational constraints associated with the dataset (1.4
million images, 281 GB), training large-scale deep neural networks from scratch would have been
computationally prohibitive. Hence, we used the already pre-trained DINOv2 backbone provided by
the organizers without additional fine-tuning.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Training Data Preparation</title>
        <p>For several of our methods, it is necessary to train or retrain models, including the newly added genus
and family classifiers, as well as models for distinguishing between plant and non-plant samples. The
training procedure we use is described below.</p>
        <p>Data Augmentations During training, we employed a variety of data augmentation techniques to
enhance the model’s robustness and generalization capabilities. These augmentations included random
cropping, random horizontal and vertical flipping, perspective transformations, and random rotations.
Additionally, we applied color jittering to introduce variability in brightness, contrast, and saturation.</p>
        <p>We also applied standard normalization and resizing procedures to ensure that input images matched
the distribution and size expected by the DINOv2 architecture. This included subtracting the mean and
dividing by the standard deviation as well as resizing input images to 518 × 518.</p>
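<p>As a dependency-light sketch of these two steps (in practice a library such as torchvision handles them; the constants shown are the standard ImageNet statistics and are an assumption on our part, since the exact values expected by the pre-trained backbone may differ):</p>

```python
import numpy as np

# Assumed normalization constants (standard ImageNet statistics); the exact
# values expected by the pre-trained DINOv2 backbone may differ.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def normalize(image):
    """Channel-wise normalize an (H, W, 3) RGB array scaled to [0, 1]."""
    return (image - MEAN) / STD

def resize_nearest(image, size=518):
    """Nearest-neighbour resize to size x size (a stand-in for the bilinear
    interpolation typically used in practice)."""
    h, w, _ = image.shape
    rows = np.minimum((np.arange(size) * h) // size, h - 1)
    cols = np.minimum((np.arange(size) * w) // size, w - 1)
    return image[rows][:, cols]
```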
        <p>Data Split The provided training dataset was already pre-split. We decided to use all available data
for training, including images that were not originally used for pre-training. For internal evaluation, we
performed a stratified split of the training data to ensure a balanced representation of species.</p>
        <p>LUCAS Dataset The organizers provided an additional training dataset called LUCAS (Land
Use/Cover Area frame Survey) [13], comprising 212,782 unannotated ground vegetation images in a vertical
quadrat-like format, amounting to 170 GB of data. We explored continued pre-training of the DINOv2
model to incorporate this data, motivated by the idea that exposure to domain-specific vegetation
plot imagery during pre-training could enhance the model’s representational capacity for downstream
classification. However, this approach proved infeasible due to hardware constraints. As a result, we
proceeded with the original DINOv2 weights without additional pre-training on the LUCAS dataset.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Test Data Preprocessing</title>
        <p>Image Cropping Initial visual inspection of the vegetation plot imagery revealed the frequent
presence of non-plant artifacts, such as wooden plot frame edges, measuring tapes, and footwear,
usually located at the image borders (see Fig. 4). To reduce the influence of these non-plant objects
on the model, we experimented with centrally cropping 5% to 15% from all four image sides. The 10%
cropping strategy yielded the best results on the public leaderboard, while the 5% strategy was more
effective on the private one, suggesting that the 10% approach may have been excessive.</p>
        <p>Multi-Scale Tiling To address the challenge of varying plant sizes and densities within vegetation
plots, we implemented a multi-scale tiling approach. This involved splitting the image into a grid
of multiple tiles (2 × 2, 3 × 3, . . . ), allowing the model to capture both small and large plant species
effectively. Each tile is used as an input image for the model, and all pre-processing steps are applied to
each tile accordingly. We additionally experimented with overlapping tiles to ensure that plants on
the edges of tiles were not missed. However, we found that using multiple tiles without overlap was
sufficient, as the overlap did not lead to any improvement in the results.</p>
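<p>The grid split can be sketched as follows (crop boxes use the (left, upper, right, lower) convention of PIL's Image.crop; the helper name is ours):</p>

```python
def tile_grid(width, height, n):
    """Split an image of (width, height) into an n x n grid of crop boxes."""
    xs = [round(i * width / n) for i in range(n + 1)]
    ys = [round(j * height / n) for j in range(n + 1)]
    return [(xs[i], ys[j], xs[i + 1], ys[j + 1])
            for j in range(n) for i in range(n)]

# Multi-scale tiling: collect tiles from several grid sizes and run the
# model on every tile independently.
boxes = [b for n in (2, 3) for b in tile_grid(1000, 800, n)]  # 4 + 9 tiles
```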
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model Architecture and Training</title>
        <p>Multi-Head Classification To leverage taxonomic information, alongside the original species MLP
classification head, we incorporated additional MLP classification heads for genus and family prediction
on top of the DINOv2 ViT-B backbone. These heads utilized the taxonomic metadata associated with each image. We
also experimented with the number of layers in each classification head.</p>
        <p>Given the strict hierarchical relationship, where each species uniquely belongs to one genus and
each genus to one family, we multiplied the predicted probabilities for species, genus, and family,
discarding combinations that do not exist in the provided metadata. This ensured that only valid
taxonomic assignments were considered.</p>
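<p>A vectorized sketch of this fusion, assuming lookup arrays that map each species to its genus and each genus to its family (array and function names are ours):</p>

```python
import numpy as np

def fuse_taxonomy(p_species, p_genus, p_family, genus_of, family_of_genus):
    """Multiply probabilities along the species -> genus -> family path.

    genus_of[s] is the genus index of species s; family_of_genus[g] is the
    family index of genus g. Because each species contributes exactly one
    valid path, all other (species, genus, family) combinations are
    implicitly discarded.
    """
    g = np.asarray(genus_of)
    f = np.asarray(family_of_genus)[g]
    return np.asarray(p_species) * np.asarray(p_genus)[g] * np.asarray(p_family)[f]
```

<p>For instance, with two species in one genus and a third in another, each species score is scaled by exactly its own genus and family probabilities.</p>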
        <p>In addition to the taxonomic classification heads, we trained a dedicated classification head for organ
prediction, designed to identify the type of plant organ depicted in each image (e.g., leaf, flower, stem).
However, integrating organ-based information into the overall prediction pipeline proved challenging
due to the inherent variability in organ representation among different species.</p>
        <p>Furthermore, the dataset included a "scan" organ label indicating images obtained by scanning plants
rather than capturing them in natural settings. Since our primary focus is on vegetation plot analysis,
which relies on photos of plants in real settings, we hypothesize that removing such images from the
training dataset could improve final accuracy.</p>
        <p>Hydra Model Architecture We used independent classification heads that shared the same
embedding from a frozen backbone. Several versions of each head with different numbers of layers were
trained simultaneously. During testing, we could swap these pre-trained heads to create various model
versions from one main architecture. We refer to this ensemble approach as the Hydra model. The best
Hydra model we trained included a one-layer head for species classification and two-layer heads with
ReLU activation function in between for genus and family classification.</p>
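<p>The head-swapping idea can be illustrated with a small framework-agnostic sketch (class and method names are ours, not the actual implementation):</p>

```python
class Hydra:
    """A frozen shared backbone whose per-task heads can be swapped freely."""

    def __init__(self, backbone, heads):
        self.backbone = backbone  # callable: input -> shared embedding
        self.heads = dict(heads)  # task name -> callable on the embedding

    def swap_head(self, task, head):
        """Replace one pre-trained head to form a new model variant."""
        self.heads[task] = head

    def predict(self, x):
        z = self.backbone(x)  # embedding computed once, reused by all heads
        return {task: head(z) for task, head in self.heads.items()}
```

<p>A model variant is then, e.g., Hydra(backbone, {"species": h1, "genus": h2, "family": h3}), and swapping one head yields a new ensemble member at no extra backbone cost.</p>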
        <p>DINOv2 ViT-L We explored scaling the model architecture by training a DINOv2 implementation
based on the Vision Transformer Large (ViT-L/14 [<xref ref-type="bibr" rid="ref4">4</xref>]) backbone. While this architecture offers greater
representational capacity compared to smaller variants, preliminary experiments revealed significant
computational limitations. A single training iteration on the full PlantCLEF dataset required
approximately 30 hours on our GPU cluster (see Model Training in Section 3.3). Given that at least
roughly 50 iterations would be needed to achieve convergence, the total training time would exceed 1,500 hours (about
62.5 days), rendering this approach infeasible within the project’s resource constraints.</p>
        <p>Plant/Non-Plant Filtering To reduce false positives from irrelevant foreground clutter (e.g., rocks
or soil patches), we trained a binary classifier to distinguish between plant and non-plant regions. We
created a separate dataset of non-plant images (primarily rocks) from publicly available sources and
trained logistic regression, random forest, and ViT-based classifiers. Of these three approaches,
the Random Forest classifier achieved the highest overall accuracy, correctly identifying plant and
non-plant tiles 95% of the time on our validation data. As a result, we adopted the Random Forest model
for filtering non-plant objects in our primary pipeline. However, the model failed to generalize to
the vegetation plot images and did not improve the final prediction quality.</p>
        <p>Model Training We trained the described model architectures on our GPU cluster, utilizing 2×
NVIDIA A6000 GPUs for each experiment. Each ViT-based model was trained for approximately
three days, with the duration varying depending on the specific architecture. For detailed technical
specifications and code, please refer to our publicly available GitHub repository (see Section 1).</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Inference</title>
        <p>We implemented a multi-step prediction pipeline to adapt the single-species classifier to the
multi-species quadrat prediction task. Several strategies were empirically tested and integrated, with varying
levels of success across the public and private datasets.</p>
        <p>Top-n and Bottom-n Filtering Given that each vegetation plot image typically contains no more
than a dozen distinct plant species, we constrained the number of species predictions per image by
limiting the maximum (top-n) allowed predictions. Through experimentation, we found that tuning
this upper bound improved scores on the public leaderboard. The same experiments after the challenge
revealed that this often leads to worse performance on the private leaderboard. Additionally, enforcing
a minimum (bottom-n) of at least one species prediction per image proved beneficial.</p>
        <p>Logit Thresholding For each tile, we allowed at most one species contribution to the final prediction.
To ensure that only the most confident predictions were included, we applied a logit thresholding
strategy. One approach was to set a minimum logit value for species predictions, filtering out
low-confidence predictions. Another approach involved dynamically adjusting the logit threshold based
on the mean prediction length across all test images. To perform this dynamic adjustment efficiently,
we utilized pre-computed logits for each test image and tile, and found appropriate thresholds using a
bisection search algorithm. We ended up using dynamically adjusted thresholding with an average of
four species per image because of its simplicity of use and apparent performance.</p>
        <p>Metadata Merging A subset of the test set vegetation plot images included identifiers and dates
within their filenames. We investigated whether using image metadata, specifically merging predictions
across images taken in the same field and year, could enhance the score. For example, if a species was
identified more than three times across all images of the same plot, it was predicted in every image
of that plot. The idea was that such an approach might enhance recall by consolidating information
from related plots. However, we did not use this method because: first, it did not improve
our score; second, metadata was not available for the entire test set; third, it contradicts the
goal of the challenge, which is to discover changes in biodiversity from the vegetation plot.</p>
        <p>Bagging To further improve the robustness of our predictions, we implemented a bagging strategy
(see [14]). We combined multiple models by averaging their logits for each image tile before generating
the final prediction. This method helps reduce variability and increases the reliability of our results by
using information from different models.</p>
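<p>The dynamic threshold search can be sketched as a bisection over a global logit cutoff (function and argument names are ours):</p>

```python
import numpy as np

def threshold_for_mean_length(logits, target_mean=4.0, tol=1e-4):
    """Bisect a global logit threshold so that the mean number of species
    predicted per image approaches target_mean.

    logits: (num_images, num_species) array of per-image species logits,
    already aggregated over tiles.
    """
    lo, hi = float(logits.min()) - 1.0, float(logits.max()) + 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        mean_len = (logits > mid).sum(axis=1).mean()
        if mean_len > target_mean:
            lo = mid  # too many species pass: raise the cutoff
        else:
            hi = mid  # at or below target: try lowering it
    return hi  # smallest tested cutoff whose mean prediction length <= target
```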
        <p>Kernels We implemented a kernel-based smoothing approach applied to the logit outputs of each
image tile. Specifically, the logits of neighboring tiles were added to each tile’s prediction logits with a
weighting coefficient (e.g., 0.5), allowing the predictions of adjacent tiles to influence one another. The
idea was that plants might span across tile boundaries. However, initial experiments with kernel-based
smoothing did not yield improvements in the final evaluation scores. Consequently, we did not try any
alternative kernels. The lack of improvement is likely due to our use of multi-scale tiling,
which effectively served a similar purpose.</p>
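<p>The smoothing we tried can be sketched as a 4-neighbour update over the tile grid (a simplification; the function name and zero-padded border handling are our assumptions):</p>

```python
import numpy as np

def smooth_tile_logits(tile_logits, weight=0.5):
    """Add each tile's 4-neighbours' logits, scaled by weight, to its own.

    tile_logits has shape (rows, cols, num_species); the grid border is
    implicitly zero-padded.
    """
    out = tile_logits.copy()
    out[1:, :] += weight * tile_logits[:-1, :]   # neighbour above
    out[:-1, :] += weight * tile_logits[1:, :]   # neighbour below
    out[:, 1:] += weight * tile_logits[:, :-1]   # neighbour to the left
    out[:, :-1] += weight * tile_logits[:, 1:]   # neighbour to the right
    return out
```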
        <p>Other Techniques We explored several additional strategies, such as z-score normalization of logits
instead of thresholding or filtering out rare species, but observed no consistent improvements across
datasets. Due to marginal returns, these methods were ultimately not included in the final pipeline.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Table 2 presents our top-5 submissions on both the public and private PlantCLEF leaderboards, as
well as our five selected predictions. While all models achieve higher scores on the public leaderboard,
there is a consistent drop in performance on the private leaderboard across all submissions. This pattern
suggests that the public and private test sets are not well-balanced, and that models optimized for the
public set may not generalize well to the private set. The relatively small score diferences between
submissions on the private leaderboard, contrasted with larger variations on the public leaderboard,
further highlight this imbalance. These results indicate that leaderboard-driven optimization likely
led to overfitting on the public test set. In particular, we experienced the smallest drop on the private
leaderboard in comparison to the top-performing solutions on the public leaderboard.</p>
      <p>Due to a substantial domain shift between the training and test data, we were unable to validate our
approaches locally, which forced us to rely on the public leaderboard for model selection. Despite our
efforts to select a diverse set of models, none of our five chosen submissions appear among the top-5 on
the private leaderboard, highlighting the challenges presented by the test data split and the limitations
of leaderboard-based evaluation.</p>
      <p>Our primary multi-head classification approach achieved a substantial improvement over the baseline,
which relied on simple single-head plant species classification. As shown in Table 2, all reported results
utilize multi-head classification, highlighting this improvement.
2: https://www.kaggle.com/models/juliostat/dinov2_patch14_reg4_onlyclassifier_then_all/PyTorch/default</p>
      <p>We evaluated several hyperparameter configurations and observed that the 10% cropping strategy
yielded the most promising results on the public test set, while the 5% strategy performed better on
the private set, suggesting that the former likely resulted in excessive cropping of informative visual
regions. Top-9 and top-10 filtering did not improve the score, and top-n filtering generally decreased
performance on the private leaderboard. Always predicting at least one species positively improved the
score. Dynamically adjusting the threshold with an average of four species per image enhanced the
final score. Our best Hydra model featured a one-layer head for species classification and two-layer
heads with an activation function applied between layers for genus and family classification. Merging
metadata did not improve results, likely because metadata was not available for the entire test set and
because this approach contradicts the challenge’s goal of discovering changes in biodiversity from the
vegetation plot. For multi-scale tiling, we found that using multiple, non-overlapping tiles of sizes 4 and
5 was sufficient, as overlap did not offer any performance gains. Although plant/non-plant filtering via
a Random Forest achieved 95% validation accuracy on our separate dataset, it failed to generalize to the
vegetation plot images and did not enhance the final predictions. While bagging significantly improved
results on the public leaderboard, it had a negative effect on the private leaderboard score. However,
bagging did improve the private score when applied to models using different cropping parameters,
as seen in our second-best submission on the private leaderboard. Finally, initial experiments with
kernel-based smoothing did not improve the final evaluation scores, possibly because multi-scale tiling
already provided a similar effect.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>Deep learning and computer vision methods have been widely applied to plant species identification and
vegetation analysis. Early work focused on convolutional neural networks (CNNs) for remote sensing
and vegetation mapping, as reviewed by Kattenborn et al. [15]. More recently, transformer-based
architectures have shown promise for plant-related tasks, such as weed detection in UAV imagery [16],
and our work builds on this trend by utilizing a vision transformer backbone for multi-label plant
species prediction.</p>
      <p>Patch-based and multi-scale approaches have been explored to address the challenge of varying
object sizes in images. Adelson et al. [17] introduced image pyramid methods, which, similar to our use
of multi-scale tiling, capture information at different spatial resolutions.</p>
      <p>Hierarchical classification, which exploits, for example, taxonomic relationships, has been studied in
various domains. Silla and Freitas [18] provide a comprehensive survey of hierarchical machine learning.
An example of hierarchical classification in the context of taxonomy is the work by Colonna et al. [19],
who used a top-down approach to predict family, genus, and species in frogs. Several works [20, 21]
propose multiplying probabilities along the taxonomic hierarchy, some using one classifier per
hierarchical layer and some using one per inner node in the hierarchy. This is similar to our multi-head
architecture, which predicts species, genus, and family independently and fuses their outputs.</p>
      <p>Data augmentation remains a key technique for improving model robustness. Shorten and
Khoshgoftaar [22] provide a comprehensive survey of image augmentation methods, many of which we use in
our training pipeline.</p>
      <p>
        Previous work in the PlantCLEF 2024 challenge [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] featured diverse deep learning approaches for
plant species identification. Foy and McLoughlin [11] leveraged the vision transformer (ViT) architecture
together with the Segment Anything Model (SAM) to effectively suppress false positives in non-plant
image regions. Gustineli et al. [10] explored multiple embedding methods and classifier architectures
based on ViT, while Chulif et al. [12] combined CNNs and ViTs with Bayesian Model Averaging for
enhanced prediction. These approaches highlight a trend toward vision transformers and advanced
post-processing techniques for robust plant species identification.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We present a metadata-enhanced multi-head vision transformer for multi-label plant species prediction,
combining species, genus, and family outputs through taxonomic fusion. Using multi-scale tiling,
dynamic thresholding, and ensemble strategies (Hydra), our model achieved strong results on the public
leaderboard.</p>
      <p>However, performance dropped on the private test set, revealing sensitivity to domain shift and the
limitations of leaderboard-based tuning, while still remaining competitive.</p>
      <p>Future work should address domain adaptation, incorporate organ-specific cues, and explore
fine-tuning strategies to improve real-world robustness.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We want to thank the organizers of PlantCLEF 2025 and LifeCLEF 2025 for hosting the competition.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT, GitHub Copilot, and Grammarly in order
to: check grammar and spelling, and paraphrase and reword. After using these tools/services, the author(s)
reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      <p>[9] T. Darcet, M. Oquab, J. Mairal, P. Bojanowski, Vision Transformers Need Registers, International
Conference on Learning Representations (2024).
[10] M. Gustineli, A. Miyaguchi, I. Stalter, Multi-Label Plant Species Classification with Self-Supervised
Vision Transformers, Conference and Labs of the Evaluation Forum (2024).
[11] S. Foy, S. McLoughlin, Utilizing Dino V2 for Domain Adaptation in Vegetation Plot Analysis,
Conference and Labs of the Evaluation Forum (2024).
[12] S. Chulif, H. Ishrat, Y. Chang, S. Lee, Patch-wise Inference Using Pretrained Vision Transformers:
Neuon Submission to PlantCLEF 2024, Conference and Labs of the Evaluation Forum (2024).
[13] R. d’Andrimont, M. Yordanov, L. Martinez-Sanchez, P. Haub, O. Buck, C. Haub, B. Eiselt, M. van der
Velde, LUCAS Cover Photos 2006–2018 over the EU: 874,646 Spatially Distributed Geo-Tagged Close-Up
Photos with Land Cover and Plant Species Label, Earth System Science Data (2022).
[14] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2009.
[15] T. Kattenborn, J. Leitloff, F. Schiefer, S. Hinz, Review on Convolutional Neural Networks (CNN) in
Vegetation Remote Sensing, ISPRS Journal of Photogrammetry and Remote Sensing (2021).
[16] R. Reedha, E. Dericquebourg, R. Canals, A. Hafiane, Transformer Neural Network for Weed and
Crop Classification of High Resolution UAV Images, Remote Sensing (2022).
[17] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, J. M. Ogden, Pyramid Methods in Image
Processing, RCA Engineer (1984).
[18] C. N. Silla, A. A. Freitas, A Survey of Hierarchical Classification Across Different Application Domains,
Data Mining and Knowledge Discovery (2011).
[19] J. G. Colonna, J. Gama, E. F. Nakamura, A Comparison of Hierarchical Multi-Output Recognition
Approaches for Anuran Classification, Machine Learning (2018).
[20] J. N. Hernandez, L. E. Sucar, E. F. Morales, A Hybrid Global-Local Approach for Hierarchical
Classification, Florida Artificial Intelligence Research Society (2013).
[21] L. Fiaschi, M. Cococcioni, Informed Deep Hierarchical Classification: A Non-Standard Analysis
Inspired Approach, IEEE Transactions on Neural Networks and Learning Systems (2024).
[22] C. Shorten, T. M. Khoshgoftaar, A Survey on Image Data Augmentation for Deep Learning, Journal
of Big Data (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          , G. Martellucci,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , PlantCLEF2025 @ LifeCLEF &amp;
          <string-name>
            <surname>CVPR-FGVC</surname>
          </string-name>
          , https://kaggle.com/competitions/plantclef-2025,
          <year>2025</year>
          . Kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janoušková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Martellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of LifeCLEF 2025:
          <article-title>Challenges on species presence prediction and identification, and individual animal identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF)</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Janouskova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          , Overview of FungiCLEF 2025:
          <article-title>Few-shot classification with rare fungi species</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>International Conference on Learning Representations</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oquab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darcet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Moutakanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Vo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szafraniec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khalidov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Haziza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Nouby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Assran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ballas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Galuba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Howes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rabbat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Synnaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mairal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Labatut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <article-title>Dinov2: Learning robust visual features without supervision</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of PlantCLEF 2024: multi-species plant identification in vegetation plot images</article-title>
          ,
          <source>Conference and Labs of the Evaluation Forum</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Eggel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification</article-title>
          ,
          <source>Conference and Labs of the Evaluation Forum</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lombardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Afouard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>PlantCLEF 2024 pretrained models on the flora of the south western Europe based on a subset of Pl@ntNet collaborative images and a ViT base patch 14 dinoV2</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>