<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hanna Herasimchyk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robin Labryga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomislav Prusina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Hamburg</institution>
          ,
          <addr-line>177 Mittelweg, Hamburg, 20148</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, addressing the PlantCLEF 2025 challenge. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction, utilizing taxonomic hierarchies. Key contributions include multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensemble strategies through bagging and Hydra model architectures. The approach incorporates various inference techniques, including image cropping to remove non-plant artifacts, top-n filtering for prediction constraints, and logit thresholding strategies. Experiments were conducted on approximately 1.4 million training images covering 7,806 plant species. Results demonstrate strong performance, with our submission ranking 3rd on the private leaderboard. Our code is available at https://github.com/geranium12/plant-clef-2025/tree/v1.0.0.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-Label Classification</kwd>
        <kwd>DINOv2</kwd>
        <kwd>Vision Transformer</kwd>
        <kwd>Species Identification</kwd>
        <kwd>Vegetation Plot Images</kwd>
        <kwd>Biodiversity</kwd>
        <kwd>PlantCLEF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Vegetation plot inventories are essential in ecological research, enabling the sampling and assessment
of biodiversity as well as the monitoring of environmental changes. They generate valuable data that
supports ecosystem analysis, biodiversity conservation, and evidence-based environmental
decision-making. A standard vegetation inventory examines small quadrats that are rectangular frames of about
half a square meter placed on the ground to define specific sampling areas. Trained botanists record all
plant species found and quantify their presence using metrics such as biomass, ecological scores, or
coverage observed in images.</p>
      <p>Integrating machine learning methods into this process could drastically enhance efficiency, enabling
broader ecological studies with reduced expert involvement. However, developing models capable of
identifying multiple plant species among thousands in a single image remains a significant technical
challenge.</p>
      <p>Having a quadrat image dataset annotated with all present plant species is crucial, yet expensive and
challenging to create due to the numerous species in a given area. In contrast, substantial collections of
images containing only single plant species already exist, making it much easier to train single-species
classification models.</p>
      <p>
        The PlantCLEF 2025 challenge [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ] seeks to address this gap by evaluating models designed to
predict the presence of multiple plant species in high-resolution quadrat images. In this competition,
models are trained using single-label images of individual plants but are tested on multi-label quadrat
images, highlighting the challenge of domain shift between training and test data.
      </p>
      <p>
        Our main approach utilizes a vision transformer architecture [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] equipped with multiple
classification heads, enabling the model to simultaneously predict species, genus, and family from a shared
feature extraction backbone. This multi-head design effectively integrates taxonomic knowledge and
leverages hierarchical relationships, significantly enhancing the robustness of species predictions in
complex vegetation plot images.
      </p>
      <p>Key contributions of our work towards improving multi-label classification of plant species in quadrat
images include:
• We use multi-head predictions and static knowledge of plant taxonomy to harness information
contained in the metadata of the training images.
• We introduce multi-scale tiling to improve the model’s ability to recognize plants at different
scales in quadrat images.
• We dynamically determine prediction thresholds by optimizing for the mean prediction length.
• We utilize bagging to enhance the model’s robustness and generalization capabilities.
Our code is available on GitHub.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-0">
        <title>2.1. Data</title>
        <p>The training dataset consists of approximately 1.4 million images (about 281 GB) of individual plants,
each accompanied by metadata. This large scale presents a significant computational challenge for
model training. The dataset, also used in the PlantCLEF 2024 competition, covers 7,806 plant species,
1,446 genera, and 181 families.</p>
        <p>The distribution of images across species is shown in Fig. 2, while the distribution of species across
genera and families is depicted in Fig. 3. Each image is labeled with a single plant species, a single genus,
and a single family, and includes metadata such as organ type and geographic location. A genus describes
a group of plant species, while a family describes a group of plant genera. Example training images are
shown in Fig. 1.</p>
        <p>[Fig. 2: Image count per species rank (sorted by image count).]</p>
      </sec>
      <sec id="sec-2-1">
        <title>2.2. Metric</title>
        <p>
          Unlike PlantCLEF 2024 [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ], this competition uses a modified evaluation metric. The final score is
the average of macro-averaged F1 scores, computed for each transect in the test set. A transect is a
sequence of vegetation plots (quadrats) placed along a defined path in the field to systematically record
species occurrences.
        </p>
        <p>Score = (1/T) Σ_{t=1}^{T} [ (1/Q_t) Σ_{q=1}^{Q_t} F1_{t,q} ], where
• T is the total number of transects,
• Q_t is the number of quadrats in transect t,
• F1_{t,q} is the macro-averaged F1-score for quadrat q in transect t.</p>
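<p>As a concrete sketch of this metric, assuming the per-quadrat macro-averaged F1 values have already been computed (the function name is ours):</p>

```python
def competition_score(transects):
    """Average, over transects, of the mean macro-averaged F1 of their quadrats.

    `transects` is a list of transects; each transect is a list of
    per-quadrat macro-averaged F1 scores (floats in [0, 1]).
    """
    per_transect = [sum(f1s) / len(f1s) for f1s in transects]
    return sum(per_transect) / len(per_transect)
```

<p>For example, a transect with quadrat F1 scores (1.0, 0.5) and a one-quadrat transect with 0.7 average to (0.75 + 0.7) / 2 = 0.725.</p>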
      </sec>
      <sec id="sec-2-2">
        <title>2.3. DINOv2 Model</title>
        <p>
          We used a DINOv2 model [
          <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
          ] provided by the PlantCLEF organizers, pre-trained on single-species
training images. The architecture is based on the distilled Vision Transformer Base (ViT-B/14) with
registers [9] serving as the backbone for feature extraction. For each input image, the model generates
an embedding that is then passed through a classification head consisting of one linear layer to predict
the species. Further details can be found in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          Our choice to use DINOv2 was based on empirical evidence from the PlantCLEF 2024 challenge, where
ViT-B architectures demonstrated superior performance compared to alternative model architectures
[
          <xref ref-type="bibr" rid="ref6">6, 10, 11, 12</xref>
          ]. Furthermore, given the computational constraints associated with the dataset (1.4
million images, 281 GB), training large-scale deep neural networks from scratch would have been
computationally prohibitive. Hence, we used the already pre-trained DINOv2 backbone provided by
the organizers without additional fine-tuning.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Training Data Preparation</title>
        <p>For several of our methods, it is necessary to train or retrain models, including the newly added genus
and family classifiers, as well as models for distinguishing between plant and non-plant samples. The
training procedure we use is described below.</p>
        <p>Data Augmentations During training, we employed a variety of data augmentation techniques to
enhance the model’s robustness and generalization capabilities. These augmentations included random
cropping, random horizontal and vertical flipping, perspective transformations, and random rotations.
Additionally, we applied color jittering to introduce variability in brightness, contrast, and saturation.</p>
        <p>We also applied standard normalization and resizing procedures to ensure that input images matched
the distribution and size expected by the DINOv2 architecture. This included subtracting the mean and
dividing by the standard deviation as well as resizing input images to 518 × 518.</p>
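<p>As a dependency-light sketch of these two steps (in practice a library such as torchvision handles them; the constants shown are the standard ImageNet statistics and are an assumption on our part, since the exact values expected by the pre-trained backbone may differ):</p>

```python
import numpy as np

# Assumed normalization constants (standard ImageNet statistics); the exact
# values expected by the pre-trained DINOv2 backbone may differ.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def normalize(image):
    """Channel-wise normalize an (H, W, 3) RGB array scaled to [0, 1]."""
    return (image - MEAN) / STD

def resize_nearest(image, size=518):
    """Nearest-neighbour resize to size x size (a stand-in for the bilinear
    interpolation typically used in practice)."""
    h, w, _ = image.shape
    rows = np.minimum((np.arange(size) * h) // size, h - 1)
    cols = np.minimum((np.arange(size) * w) // size, w - 1)
    return image[rows][:, cols]
```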
        <p>Data Split The provided training dataset was already pre-split. We decided to use all available data
for training, including images that were not originally used for pre-training. For internal evaluation, we
performed a stratified split of the training data to ensure a balanced representation of species.</p>
        <p>LUCAS Dataset The organizers provided an additional training dataset called LUCAS (Land
Use/Cover Area frame Survey) [13], comprising 212,782 unannotated ground vegetation images in a vertical
quadrat-like format, amounting to 170 GB of data. We explored continued pre-training of the DINOv2
model to incorporate this data, motivated by the idea that exposure to domain-specific vegetation
plot imagery during pre-training could enhance the model’s representational capacity for downstream
classification. However, this approach proved infeasible due to hardware constraints. As a result, we
proceeded with the original DINOv2 weights without additional pre-training on the LUCAS dataset.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Test Data Preprocessing</title>
        <p>Image Cropping Initial visual inspection of the vegetation plot imagery revealed the frequent
presence of non-plant artifacts, such as wooden plot frame edges, measuring tapes, and footwear,
usually located at the image borders (see Fig. 4). To reduce the influence of these non-plant objects
on the model, we experimented with centrally cropping 5% to 15% from all four image sides. The 10%
cropping strategy yielded the best results on the public leaderboard, while the 5% strategy was more
effective on the private one, suggesting that the 10% approach may have been excessive.</p>
        <p>Multi-Scale Tiling To address the challenge of varying plant sizes and densities within vegetation
plots, we implemented a multi-scale tiling approach. This involved splitting the image into a grid
of multiple tiles (2 × 2, 3 × 3, . . . ), allowing the model to capture both small and large plant species
effectively. Each tile is used as an input image for the model, and all pre-processing steps are applied to
each tile accordingly. We additionally experimented with overlapping tiles to ensure that plants on
the edges of tiles were not missed. However, we found that using multiple tiles without overlap was
sufficient, as the overlap did not lead to any improvement in the results.</p>
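<p>The grid split can be sketched as follows (crop boxes use the (left, upper, right, lower) convention of PIL's Image.crop; the helper name is ours):</p>

```python
def tile_grid(width, height, n):
    """Split an image of (width, height) into an n x n grid of crop boxes."""
    xs = [round(i * width / n) for i in range(n + 1)]
    ys = [round(j * height / n) for j in range(n + 1)]
    return [(xs[i], ys[j], xs[i + 1], ys[j + 1])
            for j in range(n) for i in range(n)]

# Multi-scale tiling: collect tiles from several grid sizes and run the
# model on every tile independently.
boxes = [b for n in (2, 3) for b in tile_grid(1000, 800, n)]  # 4 + 9 tiles
```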
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model Architecture and Training</title>
        <p>Multi-Head Classification To leverage taxonomic information, alongside the original species MLP
classification head, we incorporated additional MLP classification heads for genus and family prediction
on top of the DINOv2 ViT-B backbone. These heads utilized the taxonomic metadata associated with each image. We
also experimented with the number of layers in each classification head.</p>
        <p>Given the strict hierarchical relationship, where each species uniquely belongs to one genus and
each genus to one family, we multiplied the predicted probabilities for species, genus, and family,
discarding combinations that do not exist in the provided metadata. This ensured that only valid
taxonomic assignments were considered.</p>
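<p>A vectorized sketch of this fusion, assuming lookup arrays that map each species to its genus and each genus to its family (array and function names are ours):</p>

```python
import numpy as np

def fuse_taxonomy(p_species, p_genus, p_family, genus_of, family_of_genus):
    """Multiply probabilities along the species -> genus -> family path.

    genus_of[s] is the genus index of species s; family_of_genus[g] is the
    family index of genus g. Because each species contributes exactly one
    valid path, all other (species, genus, family) combinations are
    implicitly discarded.
    """
    g = np.asarray(genus_of)
    f = np.asarray(family_of_genus)[g]
    return np.asarray(p_species) * np.asarray(p_genus)[g] * np.asarray(p_family)[f]
```

<p>For instance, with two species in one genus and a third in another, each species score is scaled by exactly its own genus and family probabilities.</p>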
        <p>In addition to the taxonomic classification heads, we trained a dedicated classification head for organ
prediction, designed to identify the type of plant organ depicted in each image (e.g., leaf, flower, stem).
However, integrating organ-based information into the overall prediction pipeline proved challenging
due to the inherent variability in organ representation among different species.</p>
        <p>Furthermore, the dataset included a "scan" organ label indicating images obtained by scanning plants
rather than capturing them in natural settings. Since our primary focus is on vegetation plot analysis,
which relies on photos of plants in real settings, we hypothesize that removing such images from the
training dataset could improve final accuracy.</p>
        <p>Hydra Model Architecture We used independent classification heads that shared the same
embedding from a frozen backbone. Several versions of each head with different numbers of layers were
trained simultaneously. During testing, we could swap these pre-trained heads to create various model
versions from one main architecture. We refer to this ensemble approach as the Hydra model. The best
Hydra model we trained included a one-layer head for species classification and two-layer heads with
ReLU activation function in between for genus and family classification.</p>
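<p>The head-swapping idea can be illustrated with a small framework-agnostic sketch (class and method names are ours, not the actual implementation):</p>

```python
class Hydra:
    """A frozen shared backbone whose per-task heads can be swapped freely."""

    def __init__(self, backbone, heads):
        self.backbone = backbone  # callable: input -> shared embedding
        self.heads = dict(heads)  # task name -> callable on the embedding

    def swap_head(self, task, head):
        """Replace one pre-trained head to form a new model variant."""
        self.heads[task] = head

    def predict(self, x):
        z = self.backbone(x)  # embedding computed once, reused by all heads
        return {task: head(z) for task, head in self.heads.items()}
```

<p>A model variant is then, e.g., Hydra(backbone, {"species": h1, "genus": h2, "family": h3}), and swapping one head yields a new ensemble member at no extra backbone cost.</p>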
        <p>DINOv2 ViT-L We explored scaling the model architecture by training a DINOv2 implementation
based on the Vision Transformer Large (ViT-L/14 [<xref ref-type="bibr" rid="ref4">4</xref>]) backbone. While this architecture offers greater
representational capacity compared to smaller variants, preliminary experiments revealed significant
computational limitations. A single training iteration on the full PlantCLEF dataset required
approximately 30 hours on our GPU cluster (see Model Training in Section 3.3). Given that at least
roughly 50 iterations would be needed to achieve convergence, the total training time would exceed 1,500 hours (about
62.5 days), rendering this approach infeasible within the project’s resource constraints.</p>
        <p>Plant/Non-Plant Filtering To reduce false positives from irrelevant foreground clutter (e.g., rocks
or soil patches), we trained a binary classifier to distinguish between plant and non-plant regions. We
created a separate dataset of non-plant images (primarily rocks) from publicly available sources and
trained logistic regression, random forest, and ViT-based classifiers. Of these three approaches,
the Random Forest classifier achieved the highest overall accuracy, correctly identifying plant and
non-plant tiles 95% of the time on our validation data. As a result, we adopted the Random Forest model
for filtering non-plant objects in our primary pipeline. However, the model failed to generalize to
the vegetation plot images and did not improve the final prediction quality.</p>
        <p>Model Training We trained the described model architectures on our GPU cluster, utilizing 2×
NVIDIA A6000 GPUs for each experiment. Each ViT-based model was trained for approximately
three days, with the duration varying depending on the specific architecture. For detailed technical
specifications and code, please refer to our publicly available GitHub repository (see Section 1).</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Inference</title>
        <p>We implemented a multi-step prediction pipeline to adapt the single-species classifier to the
multi-species quadrat prediction task. Several strategies were empirically tested and integrated, with varying
levels of success across the public and private datasets.</p>
        <p>Top-n and Bottom-n Filtering Given that each vegetation plot image typically contains no more
than a dozen distinct plant species, we constrained the number of species predictions per image by
limiting the maximum (top-n) allowed predictions. Through experimentation, we found that tuning
this upper bound improved scores on the public leaderboard. The same experiments after the challenge
revealed that this often leads to worse performance on the private leaderboard. Additionally, enforcing
a minimum (bottom-n) of at least one species prediction per image proved beneficial.</p>
        <p>Logit Thresholding For each tile, we allowed at most one species contribution to the final prediction.
To ensure that only the most confident predictions were included, we applied a logit thresholding
strategy. One approach was to set a minimum logit value for species predictions, filtering out
low-confidence predictions. Another approach involved dynamically adjusting the logit threshold based
on the mean prediction length across all test images. To perform this dynamic adjustment efficiently,
we utilized pre-computed logits for each test image and tile, and found appropriate thresholds using a
bisection search algorithm. We ended up using dynamically adjusted thresholding with an average of
four species per image because of its simplicity of use and apparent performance.</p>
        <p>Metadata Merging A subset of the test set vegetation plot images included identifiers and dates
within their filenames. We investigated whether using image metadata, specifically merging predictions
across images taken in the same field and year, could enhance the score. For example, if a species was
identified more than three times across all images of the same plot, it was predicted in every image
of that plot. The idea was that such an approach might enhance recall by consolidating information
from related plots. However, we did not use this method because: first, it did not improve
our score; second, metadata was not available for the entire test set; third, it contradicts the
goal of the challenge, which is to discover changes in biodiversity from the vegetation plot.</p>
        <p>Bagging To further improve the robustness of our predictions, we implemented a bagging strategy
(see [14]). We combined multiple models by averaging their logits for each image tile before generating
the final prediction. This method helps reduce variability and increases the reliability of our results by
using information from different models.</p>
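<p>The dynamic threshold search can be sketched as a bisection over a global logit cutoff (function and argument names are ours):</p>

```python
import numpy as np

def threshold_for_mean_length(logits, target_mean=4.0, tol=1e-4):
    """Bisect a global logit threshold so that the mean number of species
    predicted per image approaches target_mean.

    logits: (num_images, num_species) array of per-image species logits,
    already aggregated over tiles.
    """
    lo, hi = float(logits.min()) - 1.0, float(logits.max()) + 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        mean_len = (logits > mid).sum(axis=1).mean()
        if mean_len > target_mean:
            lo = mid  # too many species pass: raise the cutoff
        else:
            hi = mid  # at or below target: try lowering it
    return hi  # smallest tested cutoff whose mean prediction length <= target
```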
        <p>Kernels We implemented a kernel-based smoothing approach applied to the logit outputs of each
image tile. Specifically, the logits of neighboring tiles were added to each tile’s prediction logits with a
weighting coefficient (e.g., 0.5), allowing the predictions of adjacent tiles to influence one another. The
idea was that plants might span across tile boundaries. However, initial experiments with kernel-based
smoothing did not yield improvements in the final evaluation scores. Consequently, we did not try any
alternative kernels. The lack of improvement is likely due to our use of multi-scale tiling,
which effectively served a similar purpose.</p>
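<p>The smoothing we tried can be sketched as a 4-neighbour update over the tile grid (a simplification; the function name and zero-padded border handling are our assumptions):</p>

```python
import numpy as np

def smooth_tile_logits(tile_logits, weight=0.5):
    """Add each tile's 4-neighbours' logits, scaled by weight, to its own.

    tile_logits has shape (rows, cols, num_species); the grid border is
    implicitly zero-padded.
    """
    out = tile_logits.copy()
    out[1:, :] += weight * tile_logits[:-1, :]   # neighbour above
    out[:-1, :] += weight * tile_logits[1:, :]   # neighbour below
    out[:, 1:] += weight * tile_logits[:, :-1]   # neighbour to the left
    out[:, :-1] += weight * tile_logits[:, 1:]   # neighbour to the right
    return out
```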
        <p>Other Techniques We explored several additional strategies, such as z-score normalization of logits
instead of thresholding or filtering out rare species, but observed no consistent improvements across
datasets. Due to marginal returns, these methods were ultimately not included in the final pipeline.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Table 2 presents our top-5 submissions on both the public and private PlantCLEF leaderboards, as
well as our five selected predictions. While all models achieve higher scores on the public leaderboard,
there is a consistent drop in performance on the private leaderboard across all submissions. This pattern
suggests that the public and private test sets are not well-balanced, and that models optimized for the
public set may not generalize well to the private set. The relatively small score diferences between
submissions on the private leaderboard, contrasted with larger variations on the public leaderboard,
further highlight this imbalance. These results indicate that leaderboard-driven optimization likely
led to overfitting on the public test set. In particular, we experienced the smallest drop on the private
leaderboard in comparison to the top-performing solutions on the public leaderboard.</p>
      <p>Due to a substantial domain shift between the training and test data, we were unable to validate our
approaches locally, which forced us to rely on the public leaderboard for model selection. Despite our
efforts to select a diverse set of models, none of our five chosen submissions appear among the top-5 on
the private leaderboard, highlighting the challenges presented by the test data split and the limitations
of leaderboard-based evaluation.</p>
      <p>Our primary multi-head classification approach achieved a substantial improvement over the baseline,
which relied on simple single-head plant species classification. As shown in Table 2, all reported results
utilize multi-head classification, highlighting this improvement.
2: https://www.kaggle.com/models/juliostat/dinov2_patch14_reg4_onlyclassifier_then_all/PyTorch/default</p>
      <p>We evaluated several hyperparameter configurations and observed that the 10% cropping strategy
yielded the most promising results on the public test set, while the 5% strategy performed better on
the private set, suggesting that the former likely resulted in excessive cropping of informative visual
regions. Top-9 and top-10 filtering did not improve the score, and top-n filtering generally decreased
performance on the private leaderboard. Always predicting at least one species positively improved the
score. Dynamically adjusting the threshold with an average of four species per image enhanced the
final score. Our best Hydra model featured a one-layer head for species classification and two-layer
heads with an activation function applied between layers for genus and family classification. Merging
metadata did not improve results, likely because metadata was not available for the entire test set and
because this approach contradicts the challenge’s goal of discovering changes in biodiversity from the
vegetation plot. For multi-scale tiling, we found that using multiple, non-overlapping tiles of sizes 4 and
5 was sufficient, as overlap did not offer any performance gains. Although plant/non-plant filtering via
a Random Forest achieved 95% validation accuracy on our separate dataset, it failed to generalize to the
vegetation plot images and did not enhance the final predictions. While bagging significantly improved
results on the public leaderboard, it had a negative effect on the private leaderboard score. However,
bagging did improve the private score when applied to models using different cropping parameters,
as seen in our second-best submission on the private leaderboard. Finally, initial experiments with
kernel-based smoothing did not improve the final evaluation scores, possibly because multi-scale tiling
already provided a similar effect.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>Deep learning and computer vision methods have been widely applied to plant species identification and
vegetation analysis. Early work focused on convolutional neural networks (CNNs) for remote sensing
and vegetation mapping, as reviewed by Kattenborn et al. [15]. More recently, transformer-based
architectures have shown promise for plant-related tasks, such as weed detection in UAV imagery [16],
and our work builds on this trend by utilizing a vision transformer backbone for multi-label plant
species prediction.</p>
      <p>Patch-based and multi-scale approaches have been explored to address the challenge of varying
object sizes in images. Adelson et al. [17] introduced image pyramid methods, which, similar to our use
of multi-scale tiling, capture information at different spatial resolutions.</p>
      <p>Hierarchical classification, which exploits, for example, taxonomic relationships, has been studied in
various domains. Silla and Freitas [18] provide a comprehensive survey of hierarchical machine learning.
An example of hierarchical classification in the context of taxonomy is the work by Colonna et al. [19],
who used a top-down approach to predict family, genus, and species in frogs. Several works [20, 21]
propose multiplying probabilities along the taxonomic hierarchy, some using one classifier per
hierarchical layer and some using one per inner node in the hierarchy. This is similar to our multi-head
architecture, which predicts species, genus, and family independently and fuses their outputs.</p>
      <p>Data augmentation remains a key technique for improving model robustness. Shorten and
Khoshgoftaar [22] provide a comprehensive survey of image augmentation methods, many of which we use in
our training pipeline.</p>
      <p>
        Previous work in the PlantCLEF 2024 challenge [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] featured diverse deep learning approaches for
plant species identification. Foy and McLoughlin [11] leveraged the vision transformer (ViT) architecture
together with the Segment Anything Model (SAM) to effectively suppress false positives in non-plant
image regions. Gustineli et al. [10] explored multiple embedding methods and classifier architectures
based on ViT, while Chulif et al. [12] combined CNNs and ViTs with Bayesian Model Averaging for
enhanced prediction. These approaches highlight a trend toward vision transformers and advanced
post-processing techniques for robust plant species identification.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We present a metadata-enhanced multi-head vision transformer for multi-label plant species prediction,
combining species, genus, and family outputs through taxonomic fusion. Using multi-scale tiling,
dynamic thresholding, and ensemble strategies (Hydra), our model achieved strong results on the public
leaderboard.</p>
      <p>However, performance dropped on the private test set, revealing sensitivity to domain shift and the
limitations of leaderboard-based tuning, while still remaining competitive.</p>
      <p>Future work should address domain adaptation, incorporate organ-specific cues, and explore
fine-tuning strategies to improve real-world robustness.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We want to thank the organizers of PlantCLEF 2025 and LifeCLEF 2025 for hosting the competition.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT, GitHub Copilot, and Grammarly in order
to: check grammar and spelling, and paraphrase and reword. After using these tools/services, the author(s)
reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      <p>[9] T. Darcet, M. Oquab, J. Mairal, P. Bojanowski, Vision Transformers Need Registers, International
Conference on Learning Representations (2024).
[10] M. Gustineli, A. Miyaguchi, I. Stalter, Multi-Label Plant Species Classification with Self-Supervised
Vision Transformers, Conference and Labs of the Evaluation Forum (2024).
[11] S. Foy, S. McLoughlin, Utilizing Dino V2 for Domain Adaptation in Vegetation Plot Analysis,
Conference and Labs of the Evaluation Forum (2024).
[12] S. Chulif, H. Ishrat, Y. Chang, S. Lee, Patch-wise Inference Using Pretrained Vision Transformers:
Neuon Submission to PlantCLEF 2024, Conference and Labs of the Evaluation Forum (2024).
[13] R. d’Andrimont, M. Yordanov, L. Martinez-Sanchez, P. Haub, O. Buck, C. Haub, B. Eiselt, M. van der
Velde, LUCAS Cover Photos 2006–2018 over the EU: 874,646 Spatially Distributed Geo-Tagged Close-Up
Photos with Land Cover and Plant Species Label, Earth System Science Data (2022).
[14] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2009.
[15] T. Kattenborn, J. Leitloff, F. Schiefer, S. Hinz, Review on Convolutional Neural Networks (CNN) in
Vegetation Remote Sensing, ISPRS Journal of Photogrammetry and Remote Sensing (2021).
[16] R. Reedha, E. Dericquebourg, R. Canals, A. Hafiane, Transformer Neural Network for Weed and
Crop Classification of High Resolution UAV Images, Remote Sensing (2022).
[17] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, J. M. Ogden, Pyramid Methods in Image
Processing, RCA Engineer (1984).
[18] C. N. Silla, A. A. Freitas, A Survey of Hierarchical Classification Across Different Application Domains,
Data Mining and Knowledge Discovery (2011).
[19] J. G. Colonna, J. Gama, E. F. Nakamura, A Comparison of Hierarchical Multi-Output Recognition
Approaches for Anuran Classification, Machine Learning (2018).
[20] J. N. Hernandez, L. E. Sucar, E. F. Morales, A Hybrid Global-Local Approach for Hierarchical
Classification, Florida Artificial Intelligence Research Society (2013).
[21] L. Fiaschi, M. Cococcioni, Informed Deep Hierarchical Classification: A Non-Standard Analysis
Inspired Approach, IEEE Transactions on Neural Networks and Learning Systems (2024).
[22] C. Shorten, T. M. Khoshgoftaar, A Survey on Image Data Augmentation for Deep Learning, Journal
of Big Data (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          , G. Martellucci,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , PlantCLEF2025 @ LifeCLEF &amp;
          <string-name>
            <surname>CVPR-FGVC</surname>
          </string-name>
          , https://kaggle.com/competitions/plantclef-2025,
          <year>2025</year>
          . Kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janoušková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Martellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of LifeCLEF 2025:
          <article-title>Challenges on species presence prediction and identification, and individual animal identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF)</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Janouskova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          , Overview of FungiCLEF 2025:
          <article-title>Few-shot classification with rare fungi species</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>International Conference on Learning Representations</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oquab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darcet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Moutakanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Vo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szafraniec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khalidov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Haziza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Nouby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Assran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ballas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Galuba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Howes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rabbat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Synnaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mairal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Labatut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <article-title>Dinov2: Learning robust visual features without supervision</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of PlantCLEF 2024: multi-species plant identification in vegetation plot images</article-title>
          ,
          <source>Conference and Labs of the Evaluation Forum</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Eggel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification</article-title>
          ,
          <source>Conference and Labs of the Evaluation Forum</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lombardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Afouard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>PlantCLEF 2024 pretrained models on the flora of the south western Europe based on a subset of Pl@ntNet collaborative images and a ViT base patch 14 dinoV2</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>