1. Introduction

Overview of PlantCLEF 2024: multi-species plant identification in vegetation plot images⋆

Hervé Goëau

0 1

Vincent Espitalier

Pierre Bonnet

Alexis Joly

1 0 CIRAD, UMR AMAP, Montpellier , Occitanie , France 1 Inria, LIRMM, Univ Montpellier , CNRS, Montpellier , France

2024

Plot images are essential for ecological studies, enabling standardized sampling, biodiversity assessment, longterm monitoring and remote, large-scale surveys. Plot images are typically fifty centimetres or one square meter in size, and botanists meticulously identify all the species found there. The integration of AI could significantly improve the eficiency of specialists, helping them to extend the scope and coverage of ecological studies. To evaluate advances in this regard, the PlantCLEF 2024 challenge leverages a new test set of thousands of multi-label images annotated by experts and covering over 800 species. In addition, it provides a large training set of 1.7 million individual plant images as well as state-of-the-art vision transformer models pre-trained on this data. The task is evaluated as a (weakly-labeled) multi-label classification task where the aim is to predict all the plant species present on a high-resolution plot image (using the single-label training data). In this paper, we provide an detailed description of the data, the evaluation methodology, the methods and models employed by the participants and the results achieved.

eol>LifeCLEF fine-grained classification species identification vegetation plot images multi-label classification biodiversity informatics evaluation benchmark

1. Introduction

Vegetation plot inventories are essential for ecological studies, as they enable standardized sampling, biodiversity assessment, long-term monitoring and large-scale, remote surveys. They provide valuable data on ecosystems, biodiversity conservation, and evidence-based environmental decision-making. Plot images, also known as "quadrat" are typically 0.5x0.5 meter size, and botanists meticulously identify all the species found there. In addition, they quantify species abundance using indicators such as biomass, qualification factors, and areas occupied in photographs. AI could significantly improve the eficiency of specialists, helping them to extend the frequency and coverage of ecological studies. It could also enable the engagement of non-expert citizen scientists in monitoring programs. Existing plant identification applications (such as Pl@ntNet or iNaturalist) are already a valuable tool in this respect. They can automatically identify specimens present in the plot by photographing them one by one. A much more efective approach, however, would be to be able to directly identify all the individuals in a quadrat by taking a single high-resolution photo of the it (such as the one shown in Figure 1). However, the development of AI models solving this multi-label classification task remains challenging. Ideally, we would need very large volumes of quadrat images labeled with all plants present but this would require a colossal amount of expert work to cover the thousands of species of an entire flora. On the other side, very large volumes of labeled images of individual plant specimens are now available and it is easy to (a) Vegetative plot (test) (b) Individual plant images (training) train eficient classification models on top of them [ 1, 2 ]. Thus, the PlantCLEF2024 challenge proposes to evaluate the task of identifying the species present in quadrat images based on a large training set of individual plant images. As illustrated in Figure 1, the main dificulty lies in the shift between both types of data. Whereas the test data is composed of high-resolution multi-label images of vegetation plots (with potentially many species), the training data is composed essentially of single-label images of individual plants or parts of individual plants (coming from the Pl@ntNet collaborative platform [ 3 ]). In addition to the diferent conditions and angles of the shots, the phenological stages can further increase the data disparity. The individual plant data is indeed over represented by close-up views of lfowers (facilitating the identification), whereas vegetation plots are most often areas observed multiple times over one or several years, without prior assumptions about the phenological stage of the plants (some may be flowering, others fruiting, some in seedling stage, and others senescent or even afected by a disease).

2. Dataset 2.1. Training dataset

The training dataset is composed of observations of individual plants, such as those used in previous editions of PlantCLEF. More precisely, it is a subset of the Pl@ntNet training data focusing on south western Europe and covering 7,806 plant species. It contains about 1.4 million images extended with some images with trusted labels aggregated from the GBIF platform to complete the less illustrated species. Links to original images are provided in the ’url’ column of the metadata csv file shared with the participants. The images have a relatively high resolution (minimum or maximum of 800 pixels on the biggest side) to allow the use of classification models that can handle relatively large resolution inputs and may reduce the dificulty of predicting small plants in large vegetative plot images. Images are pre-organized into subfolders by class (i.e., by species) and split into a predefined train-validation-test sets to facilitate the training of individual plant classification models.

2.2. Test dataset

The test set is a compilation of several image datasets of quadrats in diferent floristic contexts, including Pyrenean and Mediterranean floras. These datasets are all produced by experts and consist of a total of 1,695 high-resolution images. The shooting protocol can vary significantly from one context to another: the use of wooden frames or measuring tape to delimit the plot or not, angles of view more or less perpendicular to the ground due to the slope of the site photographed (see Figure 2). Additionally, the quality of the images may vary depending on the weather, which can result in more or less pronounced shadows, blurry areas, etc.

2.3. Pre-trained models

For participants who may have dificulty finding the computational power necessary to train a plant image identification model on such a large volume of data, or to enable direct work on top of a backbone model for image embedding extraction for instance, two pre-trained models are shared through Zenodo [ 4 ]. Both are based on a state-of-th-art Vision Transformer (ViT) architecture initially pretrained with the DinoV2 Self-Supervised Learning (SSL) approach [ 5, 6 ] and fine-tuned on PlantCLEF 2024 training data.

The vit_base_patch14_reg4_dinov2_lvd142m_onlyclassifier model relies on the original SSL pretrained weights from the LVD-142M dataset using the DinoV2 SSL method, where only the classification head was trained in a conventional supervised manner, with the backbone being frozen. The second model named vit_base_patch14_reg4_dinov2_lvd142m_onlyclassifier_then_all adjusted the backbone weights in a classical supervised manner with the classifier head, resulting in a backbone that no longer aligns with an SSL pre-trained model.

The models have been pre-trained via the timm library [7] and if we follow the way models are named in this library the second model vit_base_patch14_reg4_dinov2_lvd142m_onlyclassifier_then_all could be named vit_base_patch14_reg4_dinov2_lvd142m_ft_pc24, expressing the fact that the initial DinoV2 ViT pre-trained model is subsequently finetuned (ft) on the PlantCLEF2024 dataset in a usual supervised way (all layers are fine-tuned). The vit_base_patch14_reg4_dinov2_lvd142m corresponds to the backbone only, without the head, meaning the original pre-trained model with DinoV2 on the huge generalist LVD142M dataset of 142 million images. But for ease of reading, we will subsequently refer to these two models as ViTD2PC24OC and ViTD2PC24All for vit_base_patch14_reg4_dinov2_lvd142m_onlyclassifier and vit_base_patch14_reg4_dinov2_lvd142m_onlyclassifier_then_all, respectively. ViTD2 will stand for the original pre-trained backbone vit_base_patch14_dinov2_lvd142m but without the four registers features.

The training set was split into three sub-directories, subdividing the data into (sub) training, validation, and test sets (see Table 1 for the statistics) to facilitate the training of individual plant identification models, for the organizers and the participants. It is important not to confuse this test set dedicated to individual plant identification with the challenge test set, which contains large multi-species images.

The two pre-trained models were both trained on an A100 octo-GPU node server using the timm library (version 0.9.16) under torch (version 2.2.1+cu121). The ViTD2PC24OC model, with only the classifier head being fine-tuned, was trained for approximately 17 hours over 92 epochs with a batch size of 1280 per GPU and a high learning rate of 0.01 (hyperparameters for data augmentation techniques and other features are detailed in the Zenodo package [ 4 ]). The second model, ViTD2PC24All, was initialized with the weights of the previously trained ViTD2PC24OC model and was then fully fine-tuned (all layers) for approximately 36 hours over 92 epochs with a significantly lower batch size per GPU (144) and learning rate (0.00008). Both models were trained with a cross-entropy loss since the objective of the training is to predict a single species per image.

Table 2 shows the performances of the two pretrained models on the subset excluded from the training and used as a "Test" set for evaluating performances and model’s generalization capacity to identify on individual plants (note again that this "Test" set of individual plants should not be confused with the challenge test set of vegetation plot images).

3. Task description

The aim of the challenge is to detect the presence of every plant species on each high-resolution vegetation plot image, from among more than 7,800 species, bearing in mind that plots are generally 50x50cm in size, and that it’s rare for there to be dozens and dozens of species simultaneously. Formally, this is a weakly-supervised multi-label classification task for which we only have single-label training data and a significant distribution shift.

3.1. Metric

The metric chosen to evaluate the methods of the participants was the F1 score type adapted to finding a good compromise between recall and precision, i.e. not proposing too many species at the risk of being imprecise, but at the same time not proposing too few species at the risk of being incomplete. Among the several variations of F1 score like the macro averaged per class or the micro averaged, the macro-averaged per sample was selected, i.e. the average of the F1 scores calculated individually for each vegetation plot. The other two F1 scores were shown to the participants during the challenge for information purposes, although the macro average calculated per class is no so relevant since not all species are present in the test set, and the micro average is well known to sufer from data imbalances. More formally, the macro-averaged per sample F1 score is: • is the number of images in the test set. • 1 is the F1 score for test image , calculated as: • Precision and Recall are the precision and recall for test image , defined as: 1 = 2 · Precision · Recall

Precision + Recall Precision =

Recall =

TP TP + FP

TP TP + FN (1) (2) • where for test image : – TP is the true positives (the number of plant species correctly predicted). – FP is the false positives (the number of plant species incorrectly predicted).

– FN is the false negatives (the number of plant species missed ).

3.2. Rules about metadata and additional data

The use of the metadata (licenses, exif) was authorised provided that, for each run using metadata, an equivalent run using only the visual information without metadata in submitted in order to assess the raw contribution of a purely visual analysis.

The use of additional data is permitted provided that an equivalent run with only the data provided is submitted to enable more accurate and fair comparisons.

Number of submitted runs

Working notes

3.3. Challenge platform

The PlantCLEF2024 challenge was hosted on the Hugging Face platform, providing an opportunity for researchers and enthusiasts to contribute to the development of plant recognition in such original context of plot analysis.

4. Participants and methods

Of the 83 teams oficially registered on the CLEF registration system for LifeCLEF, 34 registered specifically for the PlantCLEF 2024 challenge. On the Hugging Face platform hosting the challenge, 9 teams attempted to submit runs, and in the end 7 teams were able to submit a total of 181 runs (see Table 3). Three teams produced and shared working notes, Atlantic [8], DS@GT-LifeCLEF [9] and NEUON AI [10], all detailed briefly below.

Before going into the diferences between the approaches of the 3 teams who shared a working note, it’s important to note that all have developed a tiling inference pipeline making predictions on sub-images, or "tiles", of a high resolution test image. Indeed, as the training set focuses on individual plant in an image, it is quite intuitive to sub-sample a high-resolution test image in the hope of obtaining tiles focusing each more on a single species. Most have used the pre-trained ViTD2PC24All model, but DS@GT-LifeCLEF and NEUON AI teams have also trained other models that can also be used in their own tiling inference pipeline.

Subsequently, the tiling inference pipelines developed by the participants difered in terms of the initial resolution of the test images and some geometrical parameters to control tiles extraction: tile size, step size, and occasionally a border ofset in pixels, which could be useful for excluding the potential presence of wooden frames or measuring tapes in many test images.

Finally, since each tile contributes its set of species predictions, various approaches have been developed to limit (top k), threshold (probabilities), and combine all predictions at the quadrat image scale.

The following subsections describe in more detail what distinguishes the approaches. This team has invested the most time and number of runs over the 2-month competition period (see Table 3), enabling them to gradually gain a good knowledge of the test set data and to better understand and optimize the tile prediction process.

Like most other teams, their tested approaches rely on the use of the second pre-trained model ViTD2PC24All provided by the challenge organizers and integrated into a tiling inference pipeline. Observing that the resolution of the test images was of the order of 3000 pixels per side, that the plants present could be very small, and that the model took relatively large image sizes (518x518 pixels) as input, this team preferred to extract relatively small 224x224 pixel tiles, which were then upscaled to 518x518 pixels for inference. Several combinations of geometric parameters were tested: with successive tiles overlapping by a third or a quarter, with or without border ofset, with or without +/- 90 degree rotation. Subsequently, the 5 species with the highest probabilities are retained for each tile, and a region proposal process groups neighboring tiles that received the same species into potentially larger bounding boxes.

An important observation made by these participants is that the pre-trained model ViTD2PC24All lacks the ability to reject parts of the test images that do not show plants, such as stones, sand, various types of soil, and manufactured objects like measuring tapes, wooden frames, camera bags, pens, notebooks, and shoes. Indeed, this type of data is not present in the training set, and it turns out that high confidence scores occurred on a high percentage of non-plant parts of the test images. To significantly reduce the number of false positives, they combined the powerful Segment-Anything Model (SAM) [11] with color analysis and geometric rules applied to detected masks, excluding non-plant parts of the test images from the inferences.

This teams also attempted to group predictions from test images taken from the same plot but observed multiple times over time (sometimes over several years). They mentioned that their highest score was achieved by combining three tiling configurations, utilizing SAM for false positive removal, and aggregating predictions from images of the same plot taken over time.

The team made their code available here.

4.2. DS@GT-LifeCLEF, Georgia Institute of Technology, USA, 16 runs, [9]

. This team explored diferent ways of embedding the individual images from the training set to train and compare several classifiers. They used the ViTD2 model, the Vision Transformer pre-trained with DinoV2 on the large generic dataset LVD142M [ 5 ] in its base version, and the fine-tuned model provided by the organizers, ViTD2PC24All. They first drastically reduced the training dataset by lowering the image resolution to 128x128 pixels while applying a center crop to accelerate the embedding extraction process and the training of the models in general.

The first embedding method is basically and directly the cls_token, typically used for classification, extracted with the original ViTD2 model. The second embedding method involves taking all outputs of the ViTD2 model for each image, including both the cls_token and all other tokens, resulting in an output shape of 257x768, which is then reduced to a 64-dimensional vector using a Discrete Cosine Transform (DCT). The third and last embedding method is directly the cls_token from each image provided by the fine-tuned ViTD2PC24All model.

They trained then three classifiers, three linear layers, one for each embedding dataset, using the Negative Log-Likelihood (NLL) loss. However, this loss, while very similar to the usual cross-entropy, is not specifically designed for multi-label classification; therefore, they used a tiling-based method to address this issue, (the full-image approach in inference was also tested as baselines). They compared 2x2, 3x3, and 5x5 grid approaches, evaluating diferent top-k species per tile.

4.3. NEUON AI, NEUON AI & Swinburne University of Technology Sarawak, Malaysia, 13 runs, [10]

This team tested two primary deep learning architectures, Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), to achieve the best possible results. They experimented with a multilabel classification approach using a ViT model, inspired by the RICAP method [ 12]. This method involves assembling multiple images of diverse species side by side in various sizes to create large composite images. These new training composites images help to train a models to predict multiple species simultaneously. For that, they replace the head of the provided ViTD2PC24All model by a MultiLayer Perceptron (MLP) of 3 layers. a sequence of 3 linear layers specifically designed for multi-label classification, assuming that a single layer would not be suficient to capture the complexity of multiple species prediction combinations to be made.

Like the other teams, a tiling pipeline was used for inference, specifically employing a 4x4 or 8x8 grid approach.

A "top genus" filter was also used to eliminate false positives when several species of the same genus are present within a tile.

They also incorporated an evaluation of prediction variance within a tile to assess the model’s confidence in its predictions, and potentially eliminate false positives.

Finally, they adapted the Bayesian Model Averaging (BMA) technique to aggregate all scores provided by species predictions for all tiles, as an alternative to pooling or voting.

5. Results 6. Discussion

Beyond the best performances per team, we can make the following general observations and attempt to analyze in more detail the methodological approaches that seem to have most contributed to these performances.

A dificult task : overall performance is quite low and doesn’t exceed a macro-averaged F1 score per sample of 29%, even by using pre-trained state-of-the-art models on a large volume of data specifically on the flora studied. This demonstrates the significant negative impact of the shift in distribution between the quadrat images and the images of individual plants.

The highest purely visual performance, intrinsically linked to the model’s predictions without the use of metadata or rejection mechanisms, is 19.75% and was obtained by NEUON AI. This run is based on the ViTD2PC24All pre-trained model, using an 8x8 tile grid, aided by the proposed ifltering approach on the confidence of prediction scores with the Bayesian Model Averaging. It seems that the Atlantic team achieved a score of 18.47% by also relying on the ViTD2PC24All model and optimizing the parameters of their inference pipeline. The Atlantic team would therefore have gained almost 10 points (reaching 28.73%) by excluding predictions for areas that do not contain plants (rocks, sand, soil, non-natural elements like wooden frames or tape measures) with detection based on SAM [11] and by combining predictions from plots in the same location monitored over time. Rejection: Modeling the rejection of non-plant areas is therefore critical, as demonstrated by the Atlantic team, and to a lesser extent by the NEUON AI team with their confidence thresholding approach. However, this can result in significant computing time costs, taking more than two days to process the entire test dataset of about 1600 images. Integrating true rejection classes (rocks, sand, soil, non-natural elements) in addition to individual plant species into the training dataset could have avoided the need for an additional network. The pre-trained models could have had better performance from the outset.

Fake quadra generation and multi-label loss. The use of a method inspired by RICAP (Random Image Cropping And Patching) to train a multi-label deep learning model on a mono-label dataset (as per [12]) tried by NEUON AI is an interesting idea and yielded a respectable result of 19.08%, close to their best run with the ViTD2PC24All model (19.75%). The generation of these fake quadrat images would benefit from better control (avoid close-ups, realistic species combinations, attenuation of collage edge efects, etc). However, like the DS@GT-LifeCLEF team that also learned new classifiers, these two teams may have missed the opportunity to train the new models with a binary cross-entropy loss or asymmetric loss for multi-label classification (ASL) mentioned by the DS@GT-LifeCLEF, which are better suited to the multi-label classification problem, rather than using cross-entropy or Negative Log-Likelihood (NLL) loss.

About monitoring a site photographed over several years. Metadata can reveal plots photographed repeatedly over the years, enabling combined predictions for better accuracy. This approach reflects botanists’ method of refining identifications through ongoing photo series analysis. It seems that here the gain was high as shown by the Atlantic team but it’s risky. In principle, one might expect a plot photographed multiple times in a year over several years to show the same species, but each species has its own growth rhythm, which can be annual or perennial, and the set of species visible can vary from month to month. Identifying them can be more or less dificult depending on growth stages (seedling, flowering, fruiting, senescence). Lastly, consider two images taken in diferent years but showing the same species at similar growth stages; the lighting and shooting conditions also pose challenges for accurate predictions.

Multi-scale tiling: in a quadrat there may be plants of very diferent sizes, from tiny seedlings to bushes or young shrubs. Multi-scale tiling to adapt to inference or even learning could be interesting for analyzing plants of diferent sizes, but no team seems to have explored this avenue.

7. Conclusion

The results of PlantCLEF 2024 show that the problem of multi-label classification of vegetation quadrat HD images is still challenging. Purely visual performance, achieved without metadata or rejection mechanisms, demonstrated that pre-trained models can perform well but benefit greatly from strategies that exclude non-plant areas and combine predictions from plots over time. Modeling the rejection of non-plant areas is crucial, although it can incur substantial computing costs. Integrating true rejection classes into the training dataset could enhance the performance of pre-trained models from the outset. Innovative approaches, such as using RICAP-inspired methods for multi-label deep learning on mono-label datasets, show promise but require better control over fake quadrat generation and appropriate loss functions for multi-label classification. Additionally, utilizing metadata to identify plots photographed over multiple years can improve prediction accuracy, although this approach carries risks due to variations in species visibility and growth stages. Finally, considering multi-scale tiling could be beneficial for analyzing plants of diferent sizes within quadrats, an avenue yet to be explored.

Beyond its prospects for improvement, the root of the problem remains mainly the shift in distribution between the quadrat images and the images of individual plants in the training set that has a significant negative impact and it is dificult to improve performance even with advanced tiling and/or segmentation approaches. In order to reduce this shift in the future, one solution could be to use unlabelled images of quadrats which are cheaper to produce in large quantities. This would enable the use of unsupervised or self-supervised methods on this type of data. More generally, it is noteworthy that no self-supervised learning (SSL) method has been experimented in the context of PlantCLEF2024 whereas this could have been potentially beneficial even on the individual plants training data. For the next edition of the challenge, we will try to encourage experimentation with self-supervised methods by sharing new data and pre-trained models.

Acknowledgments

The research described in this paper was partly funded by the European Commission via the GUARDEN and MAMBO projects, which have received funding from the European Union’s Horizon Europe research and innovation program under grant agreements 101060693 and 101060639. The opinions expressed in this work are those of the authors and are not necessarily those of the GUARDEN or MAMBO partners or the European Commission.

This project was provided with computer and storage resources by GENCI at IDRIS thanks to the grant 2023-AD010113641R1 on the supercomputer Jean Zay’s the V100 partition. [7] R. Wightman, Pytorch image models, https://github.com/rwightman/pytorch-image-models, 2019.

doi:10.5281/zenodo.4414861. [8] S. Foy, S. McLoughlin, Utilizing dino v2 for domain adaptation in vegetation plot analysis, in:

Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024. [9] M. Gustineli, A. Miyaguchi, I. Stalter, Transfer learning for multi-label plant species classification with self-supervised vision transformers, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024. [10] S. Chulif, H. A. Ishrat, Y. L. Chang, S. H. Lee, Patch-wise inference using pre-trained vision transformers: Neuon submission to plantclef 2024, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024. [11] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., Segment anything, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026. [12] R. Takahashi, T. Matsubara, K. Uehara, Data augmentation using random image cropping and patching for deep cnns, IEEE Transactions on Circuits and Systems for Video Technology 30 (2020) 2917–2931. URL: http://dx.doi.org/10.1109/TCSVT.2019.2935128. doi:10.1109/tcsvt. 2019.2935128.

[1]

Goëau ,

Bonnet ,

Joly , Overview of PlantCLEF 2022: Image-based plant identification at global scale , in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum , 2022 .

[2]

Goëau ,

Bonnet ,

Joly , Overview of PlantCLEF 2023: Image-based plant identification at global scale , in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum , 2023 .

[3]

Afouard ,

Goeau ,

Bonnet ,

J.-C.

Lombardo ,

Joly , Pl@ntnet app in the era of deep learning , in: 5th International Conference on Learning Representations (ICLR 2017 ), April 24-26 2017 , Toulon, France, 2017 .

[4]

Goëau ,

J.-C.

Lombardo ,

Afouard ,

Espitalier ,

Bonnet , A . Joly, PlantCLEF 2024 pretrained models on the flora of the south western Europe based on a subset of Pl@ntNet collaborative images and a ViT base patch 14 dinoV2, 2024 . URL: https://doi.org/10.5281/zenodo.10848263. doi: 10 .5281/zenodo.10848263.

[5]

Oquab ,

Darcet ,

Moutakanni ,

Vo ,

Szafraniec ,

Khalidov ,

Fernandez ,

Haziza ,

Massa ,

El-Nouby , et al., Dinov2: Learning robust visual features without supervision , arXiv preprint arXiv:2304.07193 ( 2023 ).

[6]

Darcet ,

Oquab ,

Mairal ,

Bojanowski , Vision transformers need registers, 2024 . arXiv: 2309 . 16588 .