Patch-wise Inference using Pre-trained Vision Transformers: NEUON Submission to PlantCLEF 2024
Notebook for the LifeCLEF Lab at CLEF 2024

Sophia Chulif1,2,*, Hamza Ahmed Ishrat1, Yang Loong Chang2 and Sue Han Lee1
1 Swinburne University of Technology Sarawak Campus, 93350, Sarawak, Malaysia
2 Department of Artificial Intelligence, NEUON AI, 93350, Sarawak, Malaysia

Abstract
This paper describes our submissions to the PlantCLEF 2024 challenge, which involved multi-label plant classification of vegetation plot images. Vegetation plots describe the plant species composition within a fixed area, typically ranging from a few square centimetres to hectares. Such surveys are vital for ecological and environmental research and remain challenging, given that over 300,000 plant species exist globally. The training data for the challenge include over 1 million images of 7,806 South Western European species, while the test data comprise 1,695 high-resolution plot images in different floristic contexts. The main difficulty of the challenge is the data distribution shift between the training images of individual plants and the test images of multi-species vegetation plots. Given the size of the test images, we opted for a patch-wise inference method that reduces the problem from multi-class to single- or few-class identification. We implemented patch-wise inference on the test set using an Inception-ResNet-v2-based convolutional neural network (CNN) and DINOv2-based vision transformers (ViT) with a three-layer multilayer perceptron (MLP) classification head. In particular, we trained our ViT-based MLP classifier on custom-generated patched training images derived from the provided training dataset. However, this MLP classifier did not outperform the base pre-trained DINOv2 ViT: the base ViT ranked highest, followed by the ViT-based MLP, and finally the CNN. Our best run, an ensemble of the base ViT and the ViT-based MLP, achieved a public macro F1 score of 23.01 and a private macro F1 score of 21.31 per vegetation plot, making ours the second-best submission.

Keywords
vegetation classification, multi-species identification, vision transformer, multilayer perceptron, convolutional neural network

1. Introduction
Vegetation plots are plant observations made at a particular site within a delimited area, recording the plant species composition. Plot sizes vary from a few square centimetres to several hundred square metres or even hectares. Generally, the plot size depends on the vegetation type, i.e., smaller plots are used for small, low-grown herbaceous vegetation, and larger plots ranging over a hundred square metres are used for woodlands [1]. Since the late 19th century, researchers have been studying the ideal plot size for species sampling [2], and various standards have been introduced [3, 4, 5]. The resulting documentation of species diversity enables habitat management [6], ecological research [7], conservation planning [8], and vegetation change monitoring [9]. With the advancement of technology and digitisation, the number of electronic databases of vegetation plots is increasing, predominantly in Europe. According to the Global Index of Vegetation-Plot Databases (GIVD) [10, 11], which stores the metadata of vegetation plots, over 300 databases and 3 million vegetation plots are recorded worldwide. The challenge lies in analysing and interpreting these data. Nevertheless, continuous description of vegetation composition is essential for environmental research.
Over the years, various methods, from manual annotations requiring human experts to machine learning, have been adopted for vegetation classification. Customarily, extensive field surveys are conducted involving experts and classification systems [12]. Subsequently, statistical modelling systems involving environmental data such as elevation and terrain roughness have been employed [13]. With the help of remote sensing technologies such as aerial imagery and Light Detection and Ranging (LiDAR) systems, vegetation characteristics can not only be identified [14] but also quantified [15]. More recently, deep learning methods such as convolutional neural networks (CNNs) and vision transformers (ViTs) have been applied to various vegetation tasks like weed and crop classification [16, 17] using remote-sensing imagery at high spatial resolution. Deep learning has enabled the automatic abstraction of vegetation features, such as leaf texture and fruit colour, that formerly required manual description. As a result, end-to-end training is achieved, whereby the training stages are reduced and a classification result can be obtained directly. Overall, vegetation classification requires careful examination and often involves a lengthy process.

The PlantCLEF 2024 challenge [18, 19] aims to tackle the multi-species image classification task from vegetation plots with the help of AI. The test set comprises Pyrenean and Mediterranean species, while the training data focus on over seven thousand South Western European species. The main difficulty of the challenge is the shift in data distribution between the training data (single individual plant images) and the test data (multi-species vegetation plot images). This paper describes our approach and results based on our submissions to PlantCLEF 2024.

In the previous PlantCLEF 2022 and 2023 challenges, which involved large-scale plant identification of 80,000 species, we witnessed the superiority of ViTs over CNNs. This year, two pre-trained ViTs based on the self-supervised DINOv2 architecture [20] were provided by the organisers [21]. These pre-trained models were finetuned on the PlantCLEF 2024 dataset and can be used directly by the participants. We therefore employed the given pre-trained ViT for this task. Our approach mainly comprises a multilayer perceptron (MLP) classifier built on feature embeddings from the pre-trained ViT mentioned above. It was trained on custom-generated patched training images derived from the provided training dataset to improve its robustness in handling multi-class predictions; each custom training image consists of a maximum of four species. We also experimented with an Inception-ResNet-v2 CNN (not trained on custom-generated patched training images) and the base pre-trained ViT as minimum benchmarks. In addition, we adopted patch-wise inference on the test set with these models. In this context, we split the test images into non-overlapping patches instead of feeding the images as a whole to the networks. The numbers of patches experimented with were 1 (the whole image), 4, 16, and 64.
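For illustration, the sketch below shows one way such a non-overlapping split can be performed (assuming a PIL image; the function and file names are ours, not from our actual pipeline):

```python
from PIL import Image

def split_into_patches(image: Image.Image, grid: int) -> list:
    """Split an image into grid x grid equal, non-overlapping patches.

    grid = 1, 2, 4 or 8 yields 1, 4, 16 or 64 patches respectively.
    """
    width, height = image.size
    patch_w, patch_h = width // grid, height // grid
    patches = []
    for row in range(grid):
        for col in range(grid):
            left, top = col * patch_w, row * patch_h
            patches.append(image.crop((left, top, left + patch_w, top + patch_h)))
    return patches

# Example: a 3024 x 3024 plot image split into 64 patches of 378 x 378 pixels.
plot = Image.open("plot_image.jpg")  # hypothetical file name
patches_64 = split_into_patches(plot, grid=8)
```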
Our best submission, which is an ensemble of aggregated predictions from the pre-trained ViT and the MLP with inference patch sizes of 64 + 16, obtained an average public macro F1 score per plot of 23.01 and an average private macro F1 score per plot of 21.31, making it the second-best submitted run of the challenge.

2. Datasets

2.1. Training set
The training set comprises 1,408,033 training images covering 7,806 individual plant species from Southwestern Europe. In contrast with previous editions of the PlantCLEF challenges, the training set provided in PlantCLEF 2024 is relatively higher in resolution, both to cater for models with larger input resolution and to reduce the difficulty of classifying small-sized plants in the vegetation plots. Participants are able to train with two types of training sets. The first type contains images with a minimum of 800 pixels on their biggest side (width or height), denoted as TrainMinSide, while the second type includes images with a maximum of 800 pixels on their biggest side, denoted as TrainMaxSide. Both datasets contain the same 1,408,033 training images and differ only in image resolution. The data are also pre-organised into three subsets (train, validation, and test); however, participants can choose to use their own train/validation/test splits. As in past editions of PlantCLEF, the training dataset is imbalanced: some species have hundreds of image samples, while others have only one. Figure 1 shows several training images from the dataset. It illustrates the challenges in identifying small-sized plant species, particularly those sharing the same genus and similar visual characteristics.

Figure 1: Observations of four individual species sharing the same genus, Carex: Carex capillaris L., Carex demissa Hornem., Carex depressa Link, and Carex extensa Gooden., from the PlantCLEF 2024 dataset. The photos show visual similarities between the species.

Table 1: Details of the datasets used in our CNN model training.
Dataset     Subset folder  No. of images  No. of families  No. of genera  No. of species
Train       train + test   1,356,839      181              1,446          7,806
Validation  val            51,194         181              1,415          5,912

Table 2: Details of the datasets used in our ViT model training.
Dataset     Subset folder  No. of images  No. of species
Train       train          1,356,839      7,806
Validation  test           47,940         6,670

In our experiments, we used the TrainMaxSide dataset and partitioned our own training/validation splits for our CNN and ViT-based MLP models, as detailed in Tables 1 and 2.

2.2. Test set
The test set includes 1,695 high-resolution (over 2,000 pixels in width and height) vegetation plot images comprising Mediterranean and Pyrenean species. The plots were produced by experts, and the shooting protocol varied depending on the floristic context. Some plots are delimited by wooden frames, others by measuring tape. The images also vary in viewing angle (more or less perpendicular to the ground) and in quality, with differing environmental conditions producing shadows and blurry areas. The main difficulty of the challenge is the domain shift between the training and test datasets. Unlike the training images, which consist of individual plant observations, the test images comprise multiple species within a single image. Figure 2 shows some samples of the vegetation plot images from the test set. Other than the difficulty of identifying multiple species within a plot, the challenge also lies in the different variations of the images, as shown in Figure 3.
The shadows and blurry areas of some photos reduce the species' visibility. In addition, dried plants, due to weather or growth stage, result in a further domain shift from healthy/green plants in the training set to withering/brown plants in the test set.

Figure 2: Samples of the vegetation plot images from the PlantCLEF 2024 test dataset. The photos show multiple species in each plot.

Figure 3: Examples of challenges in the PlantCLEF 2024 test dataset: shadows, blurry areas, and the shift in domain (healthy/green to withering/brown) of the plant images.

3. Methodology
Based on the results from PlantCLEF 2022 [22] and PlantCLEF 2023 [23], we have seen the superiority of ViTs over standard CNNs in large-scale plant classification involving 80,000 species. In this year's challenge, the organisers provided two pre-trained vision transformer models based on the self-supervised DINOv2 learning method [20]. Both models were finetuned on the PlantCLEF 2024 dataset and use an image input patch size of 14. The first pre-trained model had its backbone layers frozen, and only the classifier layer was finetuned. In contrast, the second pre-trained model was finetuned on all the model layers. We experimented with a standard convolutional neural network and the latter vision transformer for this task. We employed an Inception-ResNet-v2-based CNN model and a DINOv2-based ViT model, which differ as described in Table 3. Essentially, the self-supervised ViT model does not require manually labelled data like the CNN model and comprises 80 million parameters, making it more resource-intensive.

Table 3: The differences between the trained Inception-ResNet-v2 CNN and DINOv2 ViT (ViT-L/14) models.
                      Inception-ResNet-v2 CNN  DINOv2 ViT
Learning method       Supervised               Self-supervised
Number of parameters  56 M                     80 M

3.1. Convolutional Neural Network (CNN)
Our CNN model was trained with the TF-Slim [24] library under TensorFlow 1.12, based on the Inception-ResNet-v2 [25] architecture, using the hyperparameters described in Table 4. Starting from weights initialised on the ImageNet dataset, it was first finetuned on the PlantCLEF 2017 [26] dataset and then on the PlantCLEF 2024 dataset. We also trained another CNN model without first finetuning on the PlantCLEF 2017 dataset; however, since it performed worse on our validation set, we employed the model that was finetuned on the PlantCLEF 2017 dataset. The model was trained as a multi-task classifier over three taxonomic levels: species, genus, and family. In addition, we trained our CNN model with increased data augmentation (random bi-cubic resizing, hue, contrast, horizontal, and up-down flipping) and balanced batching by limiting the number of training images to 50 samples per epoch.

Table 4: Details of our CNN model hyperparameters.
Hyperparameter         Value
Image input size       299 × 299 × 3
Batch size             128
Initial learning rate  0.0001
Weight decay           0.00004
Dropout rate           0.2
Optimiser              Adam
Loss function          Softmax cross entropy

3.1.1. Inference methods
We employed two types of inference methods to obtain our CNN model predictions. The first is the traditional softmax function, in which the class probabilities are computed from the final output layer of the model. The second is a feature embedding similarity comparison method, whereby the layer before the final fully connected layer is used as a feature extractor to represent the embeddings of the trained species. Each species embedding is a vector of 1,536 features. An embedding dictionary of the training dataset is then constructed, and the test embeddings are compared against it using cosine similarity to obtain the closest species prediction.
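As a rough sketch of this similarity-based lookup, assuming one averaged reference embedding per species (the original dictionary may store embeddings differently; the names and shapes here are illustrative):

```python
import numpy as np

def build_embedding_dictionary(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    """Average the 1536-d training embeddings per species label (one reference vector per species)."""
    return {label: embeddings[labels == label].mean(axis=0) for label in np.unique(labels)}

def closest_species(test_embedding: np.ndarray, dictionary: dict):
    """Return the species whose reference embedding has the highest cosine similarity."""
    best_label, best_score = None, -1.0
    for label, reference in dictionary.items():
        score = np.dot(test_embedding, reference) / (
            np.linalg.norm(test_embedding) * np.linalg.norm(reference) + 1e-12
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```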
Considering that our CNN model is trained on single-label images and that the test images are provided in high resolution, we split each test image into 4, 16, and 64 patches for the model's single-class predictions. The patches are of equal size and do not overlap one another. During inference, we also applied image cropping on each patch. Two types of cropping were adopted: firstly, whole cropping (which uses the entire patch), and secondly, ten croppings (the centre, top-left, top-right, bottom-left, and bottom-right crops of the patch, plus all five of these croppings horizontally flipped, resulting in ten croppings).

3.1.2. Final predictions
We applied several computations to obtain the final predictions from our CNN model. The order of these computations varies between the submission runs (further detailed in Section 4.2); however, the general processes are described here. First, we computed a composite score to rank each prediction, using the predicted label's occurrence frequency and probability score. By accumulating the predicted class labels and probabilities over the test image (from each image patch), we assigned specific weights to the class occurrences and probabilities; in our case, the class occurrence and probability score were weighted with either 0.3 or 0.7, depending on the run. The final composite score of each test plot was also filtered by applying a threshold of 2.5, determined from the histogram of composite scores over the entire set of test plot images (this threshold was applied only in submission Run 12). In addition, we removed duplicate species sharing the same genus from our list of predictions, keeping only the species with the highest probability score. In submission Runs 6 and 7, we also incorporated the model's genus label predictions before obtaining the species predictions.

3.2. Vision transformer (ViT)
Likewise, for the vision transformer models, we performed patch-wise inference on the test set and then aggregated the results from each patch to obtain the predicted image labels. Patch-wise classification using CNNs has proven useful on whole-slide tissue images for differentiating cancer subtypes [27]. However, due to resource limitations, we utilised the vision transformer provided by the organisers [21] as our feature extractor, training only a multilayer perceptron (MLP) to form the classification head. Specifically, we used the "vit_base_patch14_reg4_dinov2_lvd142m_pc24_onlyclassifier_then_all" checkpoint, which involved a two-step training: first, its backbone weights were frozen and only the classification head was trained; then, the whole model was further finetuned from that head-only-trained model.

3.2.1. Data preparation
Using the given training dataset, we used the "vit_base_patch14_reg4_dinov2_lvd142m_pc24_onlyclassifier_then_all" model as a feature extractor by removing the last classifier linear layer, extracting embeddings of size 768 from the training set to create an appropriate dataset for the MLP. For each class, at most 100 embeddings were extracted to reduce the disparity in the number of samples per class, as data exploration showed that the number of samples per class ranged from fewer than 10 to more than 400.
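A minimal sketch of this feature-extraction step, assuming the organisers' checkpoint is available as a local file and is loaded through timm; the file name and the 518-pixel input size are assumptions of this sketch:

```python
import timm
import torch
from PIL import Image

# Recreate the backbone, load the finetuned weights, then drop the classification
# head so that forward() returns the 768-d embedding of an image.
model = timm.create_model(
    "vit_base_patch14_reg4_dinov2.lvd142m",
    pretrained=False,
    num_classes=7806,
    checkpoint_path="pc24_onlyclassifier_then_all.pth",  # hypothetical local path
    img_size=518,                                        # assumed input resolution
)
model.reset_classifier(0)  # remove the final linear layer; outputs are now embeddings
model.eval()

config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

@torch.no_grad()
def extract_embedding(image_path: str) -> torch.Tensor:
    image = Image.open(image_path).convert("RGB")
    return model(transform(image).unsqueeze(0)).squeeze(0)  # shape: (768,)
```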
To simulate the test images and train our MLP on multi-class images, we also employed a method similar to RICAP [28] to form patched training images, improving the classifier's robustness in handling both multi-class and single-class predictions. Whereas RICAP uses a grid with only one intersection point (i.e., a cross whose centre can lie in different regions of the image), we opted for a staggered grid with a varying number of tiles (1 to 4 patches per image) to simulate the random distribution we observed in the provided test images. The tiles were determined by a selection step: first deciding whether the initial split would be horizontal or vertical, then selecting a point within a small section around the midpoint of the image so that the image is not split exactly in half. The orientation of the splits alternates: if the first split was vertical, the sub-patches are split horizontally, then each of those is split vertically, and so on. These steps could be extended to as many as 8 sub-patches; however, 4 was chosen as the maximum as it yielded the best-looking patched images. Blending [29] was applied to the edges of the patches to smooth the transition from patch to patch, a data augmentation method that has proven effective on histopathological images [30]. Examples of our generated training images for our MLP classifier are shown in Figure 4.

Figure 4: Examples of our generated training images for our MLP classifier. Training images comprising varying patches of 1 to 4 (shown from left to right) were created. Note that these generated multi-species training images are only used for our vision transformer models with an MLP classifier head.

3.2.2. Model setup
The model developed is a classification head consisting of three layers of sizes 768, 1024, and 7,806. The large final layer corresponds to one-hot encoded labels, allowing multi-class classification and showing which other classes are activated when an image is presented. Batch normalisation was used after the 1024-unit layer, along with the ReLU activation function. As part of our experimentation, the final layer had either no activation function or a sigmoid activation, and the training data consisted of embeddings of either single-class or multi-class images generated from the training set. The tunable hyperparameters were the learning rate, the weight decay, and the number of epochs the model was trained for. Cross entropy was used as our loss function and SGD as the optimiser. We also experimented with binary cross entropy and sigmoid cross entropy as the loss function; however, only the cross entropy model was submitted due to the unfavourable results obtained from the other two (we had limited time to rectify the issue, so they were scrapped). Each epoch was constructed so that every class was present once, giving a dataset size of 7,806 per epoch (one sample for each of the 7,806 classes in the training set), with a batch size of 128.
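A minimal PyTorch sketch of this classification head, using the layer sizes and optimiser stated above; the exact layer ordering, weight decay value, and training loop are assumptions:

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """Three-layer classification head trained on frozen 768-d ViT embeddings."""

    def __init__(self, num_classes: int = 7806, use_sigmoid: bool = False):
        super().__init__()
        self.hidden = nn.Linear(768, 1024)
        self.norm = nn.BatchNorm1d(1024)   # batch normalisation after the 1024-unit layer
        self.act = nn.ReLU()
        self.out = nn.Linear(1024, num_classes)
        self.use_sigmoid = use_sigmoid     # the final activation was varied between runs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.norm(self.hidden(x)))
        x = self.out(x)
        return torch.sigmoid(x) if self.use_sigmoid else x

head = MLPHead()
criterion = nn.CrossEntropyLoss()  # cross entropy was used for the submitted runs
optimizer = torch.optim.SGD(head.parameters(), lr=0.05, weight_decay=1e-4)  # weight decay value assumed
```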
3.2.3. Bayesian Model Averaging
Bayesian Model Averaging (BMA) is a technique used in ensemble learning to aggregate predictions produced by multiple models, weighted by each model's individual performance on the given data [31]. Although our method does not use ensemble learning, we adapted its averaging scheme to aggregate our scores as an alternative to other methods such as pooling or voting. BMA is traditionally calculated from two major components: the likelihood and the prior of a model M_k given the data D. The likelihood expresses how well model M_k explains the observed data D. Typically, it is computed by integrating over the weights of model M_k, often approximated with Markov Chain Monte Carlo methods. The prior encapsulates the pre-existing knowledge and assumptions about M_k before observing any data. It can be initialised either uniformly (the same prior across all models, letting the data dictate the posterior probability) or through other statistical means based on the data. The posterior probability of M_k given D is calculated as:

\[ P(M_k \mid D) = \frac{P(D \mid M_k) \times P(M_k)}{\sum_{l=1}^{K} P(D \mid M_l) \times P(M_l)} \]

where P(D | M_k) is the likelihood of M_k given the data D, P(M_k) is the prior probability of M_k, and the denominator is the sum over all K models of the product of their respective likelihoods and priors.

3.2.4. Inference method
Using the patch-wise classification approach, we split each test image into 4, 16, and 64 patches by dividing the width and height by 2, 4, and 8, respectively, with no overlap between patches. Below are the steps, with worked examples, of how we arrived at our results.

Figure 5: The overall inference method for an image.

First, we split the image into 64 individual, non-overlapping patches; for example, an image originally 3024 by 3024 pixels yields patches of 378 by 378 pixels. The embedding is extracted for each patch and passed through the trained MLP model to obtain an array of predictions, one per class. Figure 5 illustrates the overall inference method for an image.

BMA likelihood
Given our case of a single model producing 64 predictions from the 64 patches of one image, we adapted the BMA equation to our needs. The easiest component to determine was the prior: since only one model is used, we based the prior on the number of patches, N, so the prior for each 'model' (patch) is 1/N. This term was later removed during the calculation of each posterior probability, as it cancels out. The likelihood was based on the variance of the model's predictions. An alternative was the performance of the model on a validation set; however, this would make the likelihood identical across all predictions, so it was scrapped. We chose variance as a measure of how 'confident' the model is in its prediction: the higher the variance, the bigger the spread of values. If a patch has a particularly high variance, we assume that the top scores are significantly higher than the rest of the results, and so the model is more 'confident' that it sees something in the patch. The sample variance is computed as in Equation 1, where s^2 is the sample variance, N is the number of samples, and x̄ is the mean of the samples:

\[ s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1} \qquad (1) \]

We had initially planned to train a secondary model to discriminate between patches of plants and non-plants, as we wanted to give the patches containing plants more of a say in the average. After training a simple MobileNet classifier on these two sets, with data sourced from the patches and sorted manually, the classifier was too extreme in its predictions; it was 'too correct', through no fault of its own, simply demonstrating its ability as a classifier, and so this weighting was not pursued.
To calculate the variance of each prediction, we first take the top 1,000 scores and apply the softmax function to each prediction set to limit the range to 0–1, then multiply by 100 to shift the range to 0–100. The variance alone, however, was not enough; it was simply too small a value over too broad a range, usually between 10^-6 and 10^-1. To overcome this, we take the absolute log of the variance, giving us a better basis for comparing variances:

\[ var = \left| \log_{10}(s^2) \right| \]

From this new range of values, we needed a way to convert them to a scale of 0 to 1. We considered two functions, where a is a predetermined constant that shifts the x-intercept:

\[ \text{(A)} \quad y = 1 - \frac{x}{a - 0.5} \qquad \text{(B)} \quad y = \sqrt{\frac{a + 0.5 - x}{a + 0.5}} \]

Figure 6: Graphs of functions A and B.

The graphs of these functions are illustrated in Figure 6. Ultimately, we decided on function (B) due to its bias towards higher confidence scores. Our confidence score for patch i is then calculated as:

\[ confidence_i = \sqrt{\frac{var_{max} + 0.5 - var_i}{var_{max} + 0.5}} \]

where var_max is the highest recorded log of variance from the pool of 64 patches; 0.5 is added as a buffer so that the patch with the largest log of variance does not receive a confidence of exactly 0. The confidence scores therefore became a means of weighting each prediction based on how varied its prediction set was, under the assumption that a higher variance means a more certain prediction. This is admittedly not a perfect method, as evidenced by the darker (hence less confident) patches over regions that should obviously be identifiable, as illustrated in Figure 7. We attribute this to misclassification caused by high probabilities spread across a range of classes. The method does discriminate against the edges and some non-plant structures as intended; however, some patches show up clear (i.e., with higher confidence) despite the clear absence of plants.

Lastly, with the likelihood represented by confidence_i and the prior by 1/N, the posterior probability of patch i is:

\[ P(i \mid D) = \frac{confidence_i \times \frac{1}{N}}{\sum_{k=1}^{N} confidence_k \times \frac{1}{N}} \]

where N is the number of patches. Since the 1/N terms cancel, this simplifies to:

\[ P(i \mid D) = \frac{confidence_i}{\sum_{k=1}^{N} confidence_k} \]

Figure 7: Overlay of the confidence heatmap on an original test image. The image is split into 64 patches for inference. Darker patches indicate lower confidence; clearer patches indicate higher confidence.

Once obtained, a patch's posterior probability is multiplied with each of its class predictions, followed by a class-wise summation across all patches to obtain the averaged prediction. Note that only the top 100 classes of each patch are considered for the averaged prediction.

Threshold
As the final averages still contain a large number of classes, we opted to use z-scores to determine the threshold on our predictions. The z-score, as represented in Equation 2, is a measure of how many standard deviations a point lies from the mean of a set:

\[ z = \frac{x - \mu}{\sigma} \qquad (2) \]

where x is the probability in question, μ is the mean, and σ is the standard deviation of the set. We use this calculation to find the outliers in the prediction, which are considered the more probable predictions in the set. Our threshold therefore retains any class with a z-score of 2 or more, i.e., 2 standard deviations or more above the mean. Figure 9 shows the z-scores of several test images.
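Putting these steps together, the following is a condensed sketch of the per-image aggregation (variance-based confidence, normalised patch weights, weighted class-wise averaging, and the z-score cut-off). The helper name and array shapes are ours, and the sketch only approximates the submitted implementation:

```python
import numpy as np

def aggregate_patch_predictions(patch_scores: np.ndarray, z_threshold: float = 2.0) -> np.ndarray:
    """patch_scores: (num_patches, num_classes) raw classifier outputs for one test image."""
    # Softmax per patch, rescaled to 0-100.
    shifted = patch_scores - patch_scores.max(axis=1, keepdims=True)
    probs = 100.0 * np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

    # Confidence from the absolute log of the sample variance of the top-1000 scores.
    top1000 = np.sort(probs, axis=1)[:, -1000:]
    log_var = np.abs(np.log10(top1000.var(axis=1, ddof=1)))           # var = |log10(s^2)|
    var_max = log_var.max()
    confidence = np.sqrt((var_max + 0.5 - log_var) / (var_max + 0.5))  # function (B)

    # Posterior weight per patch (the 1/N prior cancels out).
    posterior = confidence / confidence.sum()

    # Keep only each patch's top-100 classes, then average class-wise with the weights.
    cutoff = np.sort(probs, axis=1)[:, [-100]]
    averaged = (posterior[:, None] * np.where(probs >= cutoff, probs, 0.0)).sum(axis=0)

    # z-score threshold: retain classes at least z_threshold standard deviations above the mean.
    z_scores = (averaged - averaged.mean()) / averaged.std()
    return np.where(z_scores >= z_threshold)[0]   # indices of predicted species
```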
4. Submissions

4.1. Evaluation metric
The evaluation metric used for the PlantCLEF 2024 challenge is the macro-averaged F1 score per sample, or plot. It is the average of the F1 scores calculated individually for each vegetation plot. The F1 score considers both precision and recall: precision measures the fraction of true positives among the instances predicted as positive by the model, while recall measures the fraction of actual positive instances the model correctly predicted. The F1 score can be represented as:

\[ F1 = 2 \times \frac{precision \times recall}{precision + recall} \]

Figure 8: Mathematical workflow of the averaging method.

Figure 9: Graphs showing the z-scores of the first 6 test images. The red lines indicate a threshold of z-score = 2. Any class exceeding this limit is considered in the final result.

4.2. Models and runs
We submitted thirteen runs to PlantCLEF 2024: 5 involved the CNN model, while 8 involved the ViT models. The 5 CNN runs used the same Inception-ResNet-v2 model and differ only in their inference methods. Meanwhile, among the 8 ViT runs, 4 included a trained MLP classifier head to obtain the initial predictions, 3 used the pre-trained vision transformer provided by PlantCLEF, and the last was an aggregation of the ViT and MLP models. We experimented with several patch counts: 1 (the whole image), 4, 16, 64, and 64 + 16 patches, where the predictions across the 64 and 16 sub-patches were combined for the calculations. The submitted models are detailed in Table 5 and summarised as follows. The number in parentheses in the run names indicates the number of patches on which inference was performed; for example, in Run 1, the "ViT (1) model" refers to patch-wise inference on a single patch, which is the entire image.

4.2.1. Run 1: ViT (1) model
This run directly used the pre-trained ViT model provided by PlantCLEF. Inference was made on the whole test image without splitting it into patches, with a z-score of more than 2 used as the threshold.

4.2.2. Run 2: MLP v1 (1) model
This run is similar to Run 1, except that the model used was the ViT with the MLP v1 classifier head; the threshold was the same.

4.2.3. Run 3: CNN (16) model
This run used the trained CNN model. Inference on the test image was made from 16 patches. We applied ten croppings on each patch, obtained the top-10 species predictions of each image crop, aggregated the results for the composite score calculations, and finally removed the species sharing the same genus.

4.2.4. Run 4: CNN (4) model
This run is the same as Run 3, the only difference being that 4 patches instead of 16 were used during inference.

4.2.5. Run 5: MLP v1 (64) model
This run used the trained MLP classifier along with the ViT as the feature extractor. Inference was made on 64 sub-patches, and the predictions were aggregated using the averaging method discussed previously.

4.2.6. Run 6: CNN genus (16) model
This run used the trained CNN model. Inference on the test image was made from 16 patches. We applied ten croppings on each patch, obtained the top-10 species and top-1 genus predictions of each image crop, and aggregated the results into composite scores for both the species and genus predictions. From the total species prediction list, we removed the species not under the genus prediction list. Then, we removed the species sharing the same genus. Finally, we picked the top-10 species from the final prediction list.

4.2.7. Run 7: CNN genus (64) model
This run used the trained CNN model. Inference on the test image was made from 64 patches.
We applied ten croppings on each patch, obtained the top-5 species and top-1 genus predictions of each image crop, and aggregated the results into composite scores for both the species and genus predictions. From the total species prediction list, we removed the species not under the genus prediction list. Then, we removed the species sharing the same genus. Additionally, we applied a threshold of 2.5 on the species composite score before selecting the top-10 species from the final prediction list.

Table 5: Details of our submitted models.
Run  Model               Architecture                   Learning rate  Epochs trained  Patches used
1    ViT (1)             Pre-trained ViT                -              -               1
2    MLP v1 (1)          Pre-trained ViT + 3-layer MLP  0.05           100             1
3    CNN (16)            Inception-ResNet-v2            0.0001         91              16
4    CNN (4)             Inception-ResNet-v2            0.0001         91              4
5    MLP v1 (64)         Pre-trained ViT + 3-layer MLP  0.05           100             64
6    CNN genus (16)      Inception-ResNet-v2            0.0001         91              16
7    CNN genus (64)      Inception-ResNet-v2            0.0001         91              64
8    ViT ten crops (64)  Pre-trained ViT                -              -               64
9    ViT (64)            Pre-trained ViT                -              -               64
10   MLP v2 (64 + 16)    Pre-trained ViT + 3-layer MLP  0.07           300             64 + 16
11   MLP v2 (64)         Pre-trained ViT + 3-layer MLP  0.07           300             64
12   CNN emb (64 + 16)   Inception-ResNet-v2            0.0001         91              64 + 16
13   Ensemble (64 + 16)  -                              -              -               64 + 16

4.2.8. Run 8: ViT ten crops (64) model
This run is similar to Run 1, except that inference on the test image was made on 64 patches. In addition, ten croppings were applied to each patch, and no z-score threshold was applied. We selected the top-10 species as the final species prediction list.

4.2.9. Run 9: ViT (64) model
This run is similar to Run 5, except that the base ViT is used for the inference on 64 patches.

4.2.10. Run 10: MLP v2 (64 + 16) model
This run is similar to Run 5, except that MLP v2 is used as the classifier for the inference on 64 and 16 patches.

4.2.11. Run 11: MLP v2 (64) model
This run is similar to Run 10, except that the inference is made on only 64 patches.

4.2.12. Run 12: CNN emb (64 + 16) model
Unlike the previous CNN runs, this run applied the feature embedding similarity comparison method by comparing the embedding dictionary of the training dataset with the patches of test images. Inference on the test image was made from the ensembled embeddings of 64 + 16 patches. Additionally, instead of ten croppings, whole cropping of each patch was used. From each patch, we obtained the top-5 species predictions using cosine similarity and then computed the species composite score. From the list of species predictions, we removed the species sharing the same genus and applied a threshold of 2.5 on the species composite score before picking the top-10 species from the prediction list.

4.2.13. Run 13: Ensemble (64 + 16) method
This run combined the predictions of Runs 5, 9, and 11, where each uses 64 + 16 patches to obtain its predictions. Once aggregated using the averaging method, the top-n classes from each prediction for each image were combined, removing duplicates.

4.3. Results and discussion
The evaluation metric used by the organisers was the macro-averaged F1 score per sample, or more specifically, MacroF1ScoreAveragedPerPlot; however, MacroF1ScoreAveragedPerSpecies and MicroF1Score were also shown for information purposes on the challenge's HuggingFace platform: https://huggingface.co/spaces/BVRA/PlantCLEF2024. The results were divided into public and private scores, and the final rankings were based on the private splits.
Since we do not have the complete results of the private split scores, we only display our public scores in this section. The results of our submitted runs are detailed in Table 6. Overall, our ensemble of predictions from the pre-trained ViT and the MLP classifier placed second on the public leaderboard, with the base ViT run third. We noticed a boost in performance when, for the same model, the number of patches was varied from 64 to 64 + 16, as in the two runs involving MLP v2, whose MacroF1ScoreAveragedPerPlot went from 19.57 to 20.16. However, we did not expect MLP v2 to perform worse than the MLP v1 model, considering it was trained for longer with a slightly higher learning rate. The MLP v1 model came as a surprise: we had purposely cut its training short as the deadline approached and we needed a benchmark, yet it ultimately turned out to be one of our top methods. The base ViT model was also used to provide a benchmark for models using the same averaging method and outperformed both MLP models with a MacroF1ScoreAveragedPerPlot of 22.19. As a final attempt, we gathered the predictions of all ViT-based models using 64 + 16 patches and combined them, giving us our best run with a MacroF1ScoreAveragedPerPlot of 23.01.

Further research
In summary, ViT remains superior to and outperforms the CNN in this multi-species identification task, as expected. Additionally, comparing the sole use of the base pre-trained ViT (64) model (Run 9, 22.19 F1 score) and the MLP v1 (64) model (Run 5, 21.65 F1 score), we find that our MLP model did not improve performance. This could be due to the different data distributions between the test set (vegetation plots) and the generated patched training images, as illustrated in Figure 4. A more realistic generation of patched images could be experimented with to tackle this issue. Data augmentation techniques like CutMix [32] and ReMix [33] can also be implemented and studied. Additionally, using a larger number of inference patches helped the classification of all models. With high-resolution plots, splitting the image into patches is favourable for reducing the problem from a multi-class to a single- or few-class identification. The largest number of patches we tried was 64; therefore, larger patch counts can be examined to determine a suitable split for vegetation plot classification. Furthermore, the adoption of a higher taxonomic level of classification (genus) implemented in the CNN run, CNN genus (16) (Run 6, 10.91 F1 score), showed improvements over the results from the CNN without genus classification, CNN (16) (Run 3, 9.7 F1 score). Due to constraints, genus classification was not implemented in our ViT models, but it could be employed in further research. Besides, binary cross entropy or sigmoid cross entropy loss functions could be adopted to create a better or more accurate multi-species classification model. Lastly, more studies can be carried out to analyse the misclassifications of the models, specifically the ViT-based ones, as shown in Figure 7, and more experiments can be explored to determine and eliminate the cases where the absence of plants produced high confidence scores.

Table 6: Results of our submitted runs based on the public leaderboard.
Run/Model               MacroF1ScoreAveragedPerPlot  MacroF1ScoreAveragedPerSpecies  MicroF1Score
(13) Ensemble (64 + 16)  23.01                        46.2                            20.84
(9) ViT (64)             22.19                        51.57                           19.75
(5) MLP v1 (64)          21.65                        46.41                           18.92
(10) MLP v2 (64 + 16)    20.16                        42.89                           17.94
(11) MLP v2 (64)         19.57                        42.04                           17.38
(8) ViT ten crops (64)   16.99                        39.16                           17.3
(7) CNN genus (64)       12.81                        41.32                           12.63
(6) CNN genus (16)       10.91                        36.02                           10.85
(3) CNN (16)             9.7                          32.95                           9.52
(1) ViT (1)              9.14                         38.68                           7.63
(12) CNN emb (64 + 16)   6.9                          32.18                           6.51
(2) MLP v1 (1)           5.16                         23.99                           4.51
(4) CNN (4)              3.56                         34.85                           3.57

5. Conclusion
This paper presents our multi-species vegetation plot classification work in conjunction with PlantCLEF 2024. Since the challenge involves high-resolution test plot images, we implemented patch-wise inference whereby each plot is split into a varying number of patches, reducing the task from a multi-class to a single- or few-class identification. Our best submission, an ensemble of DINOv2-based ViT and MLP classifiers, scored an average public macro F1 score per plot of 23.01 and a private macro F1 score per plot of 21.31, ranking us second in the challenge. Since our ViT model with the MLP classifier head did not perform better than the base pre-trained ViT benchmark, further research is needed to improve the classification. ViT has once again shown itself to be a competitive choice for plant identification in the context of PlantCLEF. Nonetheless, more research is required to tackle the resource constraints and to leverage this powerful architecture in vegetation tasks like plot classification.

Acknowledgments
The resources of this project are supported by NEUON AI SDN. BHD., Malaysia.

References
[1] M. Chytrỳ, Z. Otỳpková, Plot sizes used for phytosociological sampling of European vegetation, Journal of Vegetation Science 14 (2003) 563–570.
[2] J. Moravec, The determination of the minimal area of phytocenoses, Folia Geobotanica & Phytotaxonomica 8 (1973) 23–47.
[3] M. Poore, The use of phytosociological methods in ecological investigations: I. The Braun-Blanquet system, Journal of Ecology 43 (1955) 226–244.
[4] J. S. Rodwell, J. N. C. C. (GB), National vegetation classification: Users' handbook, Joint Nature Conservation Committee, Peterborough, 2006.
[5] S. K. Wiser, N. Spencer, M. De Caceres, M. Kleikamp, B. Boyle, R. K. Peet, Veg-X – an exchange standard for plot-based vegetation data, Journal of Vegetation Science 22 (2011) 598–609.
[6] C. Y. Manullang, et al., Distribution of plastic debris pollution and it is implications on mangrove vegetation, Marine Pollution Bulletin 160 (2020) 111642.
[7] W. Hansen, J. Wollny, A. Otte, R. L. Eckstein, K. Ludewig, Invasive legume affects species and functional composition of mountain meadow plant communities, Biological Invasions 23 (2021) 281–296.
[8] Y. R. F. Nunes, C. S. Souza, I. F. P. de Azevedo, O. S. de Oliveira, L. A. Frazão, R. S. Fonseca, R. M. dos Santos, W. V. Neves, Vegetation structure and edaphic factors in veredas reflect different conservation status in these threatened areas, Forest Ecosystems 9 (2022) 100036.
[9] A. H. Halbritter, V. Vandvik, S. H. Cotner, W. Farfan-Rios, B. S. Maitner, S. T. Michaletz, I. Oliveras Menor, R. J. Telford, A. Ccahuana, R. Cruz, et al., Plant trait and vegetation data along a 1314 m elevation gradient with fire history in puna grasslands, Perú, Scientific Data 11 (2024) 225.
[10] J. Dengler, F. Jansen, F. Glöckler, R. K. Peet, M. De Cáceres, M. Chytrỳ, J. Ewald, J. Oldeland, G. Lopez-Gonzalez, M. Finckh, et al., The global index of vegetation-plot databases (GIVD): a new resource for vegetation science, Journal of Vegetation Science 22 (2011) 582–597.
[11] J. Dengler, et al., Global index of vegetation-plot databases (GIVD), 2011. https://www.givd.info/ [Accessed: 2024-06-06].
[12] I. Noble, The role of expert systems in vegetation science, Vegetatio 69 (1987) 115–121.
[13] A. E. Gelfand, M. Holder, A. Latimer, P. O. Lewis, A. G. Rebelo, J. A. S. Jr., S. Wu, Explaining species distribution patterns through hierarchical modeling, Bayesian Analysis 1 (2006) 41–92. URL: https://doi.org/10.1214/06-BA102. doi:10.1214/06-BA102.
[14] C. X. Garzon-Lopez, E. Lasso, Species classification in a tropical alpine ecosystem using UAV-borne RGB and hyperspectral imagery, Drones 4 (2020) 69.
[15] L. Huo, E. Lindberg, J. Holmgren, Towards low vegetation identification: A new method for tree crown segmentation from LiDAR data based on a symmetrical structure detection algorithm (SSD), Remote Sensing of Environment 270 (2022) 112857.
[16] T. Kattenborn, J. Leitloff, F. Schiefer, S. Hinz, Review on convolutional neural networks (CNN) in vegetation remote sensing, ISPRS Journal of Photogrammetry and Remote Sensing 173 (2021) 24–49.
[17] R. Reedha, E. Dericquebourg, R. Canals, A. Hafiane, Transformer neural network for weed and crop classification of high resolution UAV images, Remote Sensing 14 (2022) 592.
[18] A. Joly, L. Picek, S. Kahl, H. Goëau, V. Espitalier, C. Botella, B. Deneu, D. Marcos, J. Estopinan, C. Leblanc, T. Larcher, M. Šulc, M. Hrúz, M. Servajean, J. Matas, et al., Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2024.
[19] H. Goëau, V. Espitalier, P. Bonnet, A. Joly, Overview of PlantCLEF 2024: Multi-species plant identification in vegetation plot images, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.
[20] T. Darcet, M. Oquab, J. Mairal, P. Bojanowski, Vision transformers need registers, arXiv preprint arXiv:2309.16588 (2023).
[21] H. Goëau, J.-C. Lombardo, A. Affouard, V. Espitalier, P. Bonnet, A. Joly, PlantCLEF 2024 pretrained models on the flora of the south western Europe based on a subset of Pl@ntNet collaborative images and a ViT base patch 14 dinoV2, 2024. URL: https://doi.org/10.5281/zenodo.10848263. doi:10.5281/zenodo.10848263.
[22] H. Goëau, P. Bonnet, A. Joly, Overview of PlantCLEF 2022: Image-based plant identification at global scale, in: CLEF (Working Notes), 2022, pp. 1916–1928.
[23] H. Goëau, P. Bonnet, A. Joly, Overview of PlantCLEF 2023: Image-based plant identification at global scale, in: 24th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF-WN 2023, volume 3497, 2023, pp. 1972–1981.
[24] Sergio Guadarrama, Nathan Silberman, TensorFlow-Slim: A lightweight library for defining, training and evaluating complex models in TensorFlow, 2016. URL: https://github.com/google-research/tf-slim. [Online; accessed 29-June-2019].
[25] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[26] A. Joly, H. Goëau, H. Glotin, C. Spampinato, P. Bonnet, W.-P. Vellinga, J.-C. Lombardo, R. Planqué, S. Palazzo, H. Müller, LifeCLEF 2017 lab overview: multimedia species identification challenges, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11–14, 2017, Proceedings 8, Springer, 2017, pp. 255–274.
[27] L. Hou, D. Samaras, T. M. Kurc, Y. Gao, J. E. Davis, J. H. Saltz, Patch-based convolutional neural network for whole slide tissue image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2424–2433.
[28] R. Takahashi, T. Matsubara, K. Uehara, RICAP: Random image cropping and patching data augmentation for deep CNNs, in: Asian Conference on Machine Learning, PMLR, 2018, pp. 786–798.
[29] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, J. M. Ogden, Pyramid methods in image processing, RCA Engineer 29 (1984) 33–41.
[30] S. T. M. Ataky, J. De Matos, A. d. S. Britto, L. E. Oliveira, A. L. Koerich, Data augmentation for histopathological images based on Gaussian-Laplacian pyramid blending, in: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–8.
[31] J. A. Hoeting, D. Madigan, A. E. Raftery, C. T. Volinsky, Bayesian model averaging: a tutorial (with comments by M. Clyde, David Draper and E. I. George, and a rejoinder by the authors), Statistical Science 14 (1999) 382–417.
[32] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, Y. Yoo, CutMix: Regularization strategy to train strong classifiers with localizable features, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6023–6032.
[33] J. Cao, L. Hou, M.-H. Yang, R. He, Z. Sun, ReMix: Towards image-to-image translation with limited data, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15018–15027.