<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Overview of PlantCLEF 2023: Image-based Plant Identification at Global Scale</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hervé Goëau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre Bonnet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexis Joly</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIRAD, UMR AMAP, Montpellier</institution>
          ,
          <addr-line>Occitanie</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Inria, LIRMM, Univ Montpellier</institution>
          ,
          <addr-line>CNRS, Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>The world is estimated to be home to over 300,000 species of vascular plants. In the face of the ongoing biodiversity crisis, expanding our understanding of these species is crucial for the advancement of human civilization, encompassing areas such as agriculture, construction, and pharmacopoeia. However, the labor-intensive process of plant identification undertaken by human experts poses a significant obstacle to the accumulation of new data and knowledge. Fortunately, recent advancements in automatic identification, particularly through the application of deep learning techniques, have shown promising progress. Despite challenges posed by data-related issues such as a vast number of classes, imbalanced class distribution, erroneous identifications, duplications, variable visual quality, and diverse visual contents (such as photos or herbarium sheets), deep learning approaches have reached a level of maturity which gives us hope that in the near future we will have an identification system capable of accurately identifying all plant species worldwide. The PlantCLEF2023 challenge aims to contribute to this pursuit by addressing a multi-image (and metadata) classification problem involving an extensive set of classes (80,000 plant species). This paper provides an overview of the challenge's resources and evaluations, summarizes the methods and systems employed by participating research groups, and presents an analysis of key findings.</p>
      </abstract>
      <kwd-group>
        <kwd>LifeCLEF</kwd>
        <kwd>fine-grained classification</kwd>
        <kwd>species identification</kwd>
        <kwd>biodiversity informatics</kwd>
        <kwd>evaluation</kwd>
        <kwd>benchmark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The world is home to an estimated 300,000 species of vascular plants, and the discovery and
description of new plant species continue to occur each year [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The remarkable diversity of
plants has played a pivotal role in the advancement of human civilization, providing resources
such as food, medicine, building materials, recreational opportunities, and genetic reservoirs
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Moreover, plant diversity plays a crucial role in maintaining the functioning and stability
of ecosystems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, our understanding of plant species remains limited. For the
majority of species, we lack knowledge about their specific roles within ecosystems and their
potential utility to humans. Additionally, information regarding the geographic distribution
and population abundance of most species remains scarce [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Over the past two decades, the biodiversity informatics community has made significant
efforts to develop global initiatives, digital platforms, and tools to facilitate the organization,
sharing, visualization, and analysis of biodiversity data [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Nonetheless, the process of
systematic plant identification poses a significant obstacle to the aggregation of new data
and knowledge at the species level. Botanists, taxonomists, and other plant experts spend
substantial time and energy on species identification, which could be better utilized in analyzing
the collected data.
      </p>
      <p>
        As previously discussed by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the routine identification of previously described species
shares similarities with other human activities that have successfully undergone automation.
In recent years, automated identification has made significant advances, driven in particular by
the development of deep learning techniques and the rise of Convolutional Neural
Networks (CNNs) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The long-term evaluation of automated plant identification, conducted as
part of the LifeCLEF initiative [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], demonstrates the impact of CNNs on performance within a
few years. In 2011, the best evaluated system achieved a mere 57% accuracy on a straightforward
classification task involving only 71 species captured under highly uniform conditions (scans
or photos of leaves on a white background). In contrast, by 2017, the best CNN achieved an
88.5% accuracy on a far more complex task encompassing 10,000 plant species, characterized
by imbalanced, heterogeneous, and noisy visual data [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Moreover, in 2018, the best system
outperformed five out of nine specialists in re-identifying a subset of test images [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Existing plant identification applications, due to their growing popularity, present
opportunities for high-throughput biodiversity monitoring and the accumulation of specific knowledge
[
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ]. However, they often face the challenge of being restricted to specific regional floras
or limited to the most common species. With an increasing number of species exhibiting a
transcontinental range, such as naturalized alien species [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] or cultivated plants, relying on
regional floras for identification becomes less reliable. Conversely, focusing solely on the most
prevalent species disregards the broader implications for biodiversity.
      </p>
      <p>To address these challenges, during two years of competition, the PlantCLEF 2022 and
2023 challenges introduced a multi-image (and metadata) classification problem involving an
extensive number of classes, specifically 80,000 plant species. Convolutional Neural Networks
(CNNs) and the recent Vision Transformers (ViTs) techniques emerge as the most promising
solutions for tackling such large-scale image classification tasks. However, previous studies
had not reported image classification results of this magnitude, regardless of whether the
entities were biological or not. This paper presents the challenge’s resources and evaluations, and
summarizes the approaches and systems employed by participating research groups.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>
        To thoroughly evaluate the aforementioned scenario on a large scale and in realistic conditions,
two distinct training datasets were developed and shared: the "trusted" dataset and the "web"
dataset. Together, these datasets encompassed a total of 4 million images across 80,000 plant species,
sourced from various origins.
      </p>
      <p>
        "Trusted" training set: this training dataset is based on a carefully curated selection
of more than 2.9M images covering 80k plant species, aggregated and shared mainly through
GBIF (Global Biodiversity Information Facility). This type of data comes from academic
sources such as museums, universities, and national institutions, as well as from collaborative
platforms like iNaturalist and Pl@ntNet, implying a fairly high level of identification
quality. We initially formed an extensive dataset using the GBIF portal, which includes nearly
16 million occurrences of vascular plants (Tracheophyta), comprising ferns, conifers, and
modern flowering plants [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. This initial selection, however, exhibited significant imbalance,
with some species having tens of thousands of images while others had only one. To ensure
class equilibrium and prevent dataset inflation, we limited the number of images per species to
approximately 100. The selected images focus on views that are optimal for plant identification,
such as close-ups of flowers, fruits, leaves, and trunks.
      </p>
      <p>
        "Web" training set: in contrast, the "web" training dataset was compiled from a
collection of web images obtained from search engines such as Google and Bing. This initial collection
contained millions of images, but it suffered from significant errors in species identification,
a high presence of duplicate images, and a large number of images that were less suitable
for visual plant identification, such as herbarium images, landscapes, microscopic views, and
unrelated subjects. To address these issues, a semi-automatic revision process was conducted to
minimize the number of irrelevant images and maximize the inclusion of close-ups of relevant
plant features. The "web" dataset ultimately consisted of approximately 1.1 million images,
covering around 57,000 plant species.
      </p>
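      <p>To make the per-species cap concrete, the following minimal sketch (not the organizers' actual pipeline; the file and column names are hypothetical) subsamples a GBIF-style occurrence table to at most 100 images per species:</p>
      <preformat>
# Minimal sketch (not the organizers' actual pipeline): capping the number of
# images per species at ~100 to limit class imbalance. The file and column
# names ("species", "gbif_vascular_plants.csv") are hypothetical.
import pandas as pd

MAX_IMAGES_PER_SPECIES = 100

occurrences = pd.read_csv("gbif_vascular_plants.csv")  # hypothetical export

balanced = (
    occurrences
    .groupby("species", group_keys=False)
    .apply(lambda g: g.sample(n=min(len(g), MAX_IMAGES_PER_SPECIES), random_state=0))
)
balanced.to_csv("trusted_training_set.csv", index=False)
      </preformat>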
      <p>Test set: For the evaluation of the models, a separate test set was constructed using
multi-image plant observations collected on the Pl@ntNet platform throughout 2021, ensuring
that they were not present in the training datasets. Only observations with a high confidence
score, determined through the collaborative review process on Pl@ntNet, were selected for
the challenge, ensuring a high level of determination quality. The review process involved
individuals with varying levels of expertise, ranging from beginners to world-leading experts,
with different weights given to their judgments. The test set consisted of approximately 27,000
plant observations, comprising around 55,000 images related to approximately 7,300 plant
species.</p>
      <p>Table 1 presents various statistics about the three datasets. One notable observation is
the significant difference in the number of species between the training sets and the test
set. This difference primarily stems from the difficulty of collecting a large amount of
expert-verified data from botanists at such a scale. However, this difference aligns with the
realistic scenario faced by automatic identification systems like Pl@ntNet and iNaturalist. These
systems need to be capable of recognizing a wide range of species without prior knowledge of
which species will be frequently requested or completely overlooked. This characteristic reflects
the goal of these systems to identify as many species as possible and adapt to unpredictable
user requests.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Description</title>
      <p>The challenge was hosted over two years, as two rounds, on the AICrowd platform¹. The task
was evaluated as a plant species retrieval task based on multi-image plant observations from
the test set. The goal was to retrieve the correct plant species among the top results of a ranked
list of species returned by the evaluated system. During the first year of competition in 2022,
the participants had access to the training set in mid-February 2022, the test set was published 6
weeks later in early April, and the round of submissions was then open for 5 weeks. During
the second round in 2023, the training and test data remained exactly the same (the ground
truth on the test set being kept secret). The submission system remained open from mid-March
to mid-May.</p>
      <p>The metric used for the evaluation of the task is the Macro-Averaged (by species) Mean
Reciprocal Rank (MA-MRR). The Mean Reciprocal Rank (MRR) is a statistical measure for evaluating any
process that produces a list of possible responses to a sample of queries, ordered by probability
of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank
of the first correct answer. The MRR is the average of the reciprocal ranks over the whole test set:
\[ \mathrm{MRR} = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{\mathrm{rank}_q} \quad (1) \]
where \(Q\) is the total number of plant observations (query occurrences) in the test set and \(\mathrm{rank}_q\)
is the rank of the correct species for plant observation \(q\).</p>
      <p>However, the Macro-Averaged version of the MRR (average MRR per species in the test set)
was used because of the long tail of the data distribution, in order to rebalance the results between
under- and over-represented species in the test set:
\[ \text{MA-MRR} = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{Q_s} \sum_{q=1}^{Q_s} \frac{1}{\mathrm{rank}_{s,q}} \quad (2) \]
where \(S\) is the total number of species in the test set and \(Q_s\) is the number of plant observations
related to species \(s\).
¹ https://www.aicrowd.com/challenges/lifeclef-2022-23-plant</p>
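      <p>As an illustration, the following minimal sketch (assuming a list of (species, rank) pairs, where rank is the position of the correct species in the ranked list returned for each test observation) computes the MA-MRR defined in Equation (2):</p>
      <preformat>
# Minimal sketch of the MA-MRR metric defined above, assuming a list of
# (species_id, rank_of_correct_species) pairs for the test observations.
from collections import defaultdict

def ma_mrr(observations):
    """observations: iterable of (species_id, rank), with rank starting at 1."""
    per_species = defaultdict(list)
    for species_id, rank in observations:
        per_species[species_id].append(1.0 / rank)
    # Mean reciprocal rank per species, then macro-average over species.
    return sum(sum(rr) / len(rr) for rr in per_species.values()) / len(per_species)

# Example: two species, three observations in total.
print(ma_mrr([("sp_a", 1), ("sp_a", 3), ("sp_b", 2)]))  # 0.5833...
      </preformat>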
    </sec>
    <sec id="sec-4">
      <title>4. Participants and methods</title>
      <p>
        During the two years of the challenge, a total of 195 people expressed an interest in signing
up for the challenge. Among this large raw audience, 8 research groups finally succeeded in
submitting run files (8 the first year and 3 the second year). Details of the methods and
systems evaluated during the 2023 round are synthesized below and further developed in the
working notes of the participants (Mingle Xu [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], Neuon AI [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]). Table 2 reports the results
while describing in various columns the main characteristics that distinguish each method from
the others: type of architecture, training set used, pre-training method, and taxonomic levels used.
In complement, the following paragraphs give a few more details about the methods and the
overall strategy employed by each participant (the paragraphs are sorted in descending order of
the best score obtained by each team).
      </p>
      <p>
        Mingle Xu, South Korea, 9 runs, [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]: the team’s work is founded on the
utilization of a Vision Transformer (ViT) that has been pre-trained using a Self Supervised Learning
(SSL) technique, which is a recent and increasingly popular approach in the field of computer
vision. This approach is quite disruptive as it deviates from the traditional Supervised Transfer
Learning (STL) method. Typically, in an STL approach, a neural network is initially trained from
scratch using labeled data for a classification task on a generic dataset such as ImageNet 1k
or 22k, and the network is then subsequently fine-tuned on a specific dataset that possesses
a distinct set of labels. In contrast, Self Supervised Learning (SSL) methods operate without
the need for labeled data. The premise is that a network pre-trained with an SSL method can
extract superior features that exhibit improved generalization abilities. These extracted features
can subsequently be fine-tuned in a supervised manner, enabling their effective utilization for
diverse downstream tasks, including image classification and object detection. Nowadays,
the primary focus of state-of-the-art methods lies not in devising new architectures, as
ViT has emerged as the default choice. Instead, the emphasis is on determining the optimal SSL
approach for pre-training a ViT model. As an example, in the previous year, the team led by
Mingle Xu achieved the best results using a pre-trained ViT-large model based on a Masked
Auto-Encoder (MAE) approach [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. They obtained a remarkable MA-MRR score of 0.64079
(post-challenge). The concept of MAE draws inspiration from the successful masked language
modeling technique commonly used in Natural Language Processing, notably popularized by
BERT [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The process of masking data was challenging to apply to CNN-based architectures,
whereas it becomes relatively straightforward with vision transformers since they operate
internally using visual patches or "tokens" along with positional embedding. MAE shares
similarities with BEIT [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], wherein the self-supervised task involves training a backbone
vision transformer to predict missing tokens from partially masked images.
      </p>
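      <p>To illustrate the MAE pretext task described above, here is a minimal sketch (not the participants' code) of random patch masking; the 75% masking ratio follows the original MAE paper, while the tensor shapes and the surrounding encoder/decoder are left as illustrative placeholders:</p>
      <preformat>
# Illustrative sketch (not the participants' code) of the MAE-style pretext task:
# randomly mask a large fraction of the patch tokens so that the model must
# reconstruct the missing ones. The 0.75 masking ratio follows the original MAE
# paper; tensor shapes are illustrative.
import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    """patch_tokens: (batch, num_patches, dim). Returns the visible tokens and
    the random permutation needed to restore the original patch order."""
    batch, num_patches, dim = patch_tokens.shape
    num_keep = int(num_patches * (1 - mask_ratio))
    noise = torch.rand(batch, num_patches)        # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)     # random permutation of patches
    ids_keep = ids_shuffle[:, :num_keep]          # patches kept visible
    visible = torch.gather(
        patch_tokens, 1, ids_keep.unsqueeze(-1).repeat(1, 1, dim)
    )
    return visible, ids_shuffle

tokens = torch.randn(2, 196, 768)   # e.g. a 14x14 grid of 768-d patch embeddings
visible, ids = random_masking(tokens)
print(visible.shape)                # torch.Size([2, 49, 768])
      </preformat>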
      <p>
        During this year’s participation, Mingle Xu explored several variations of runs based on the
vision-centric foundation model EVA [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], which held the state-of-the-art position at the time of the challenge,
in the first quarter of 2023. EVA is a pretraining strategy that combines CLIP[
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and MVP[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
CLIP maximizes the relationships between paired text and images, while MVP integrates MAE
and CLIP to enhance pretraining. MVP freezes the CLIP image encoder and trains the vision
part using a loss function that minimizes the distance between frozen CLIP features and vision
model features. EVA scales up MVP by using larger models and more datasets, resulting in
improved performance across various tasks. Overall, EVA leverages multimodal information
and scalability for better semantic learning.
      </p>
      <p>Mingle Xu conducted an investigation into the finetuning of pre-trained EVA models using
various approaches. This included species ablations with limited image data (runs 1, 2, 4, 6),
augmenting the "trusted" training set with additional images from the "web" training set (runs
8, 9, 10), starting from a self-supervised learning (SSL)-only pre-trained model (run 3), or
employing intermediate supervised finetuning on ImageNet 22k (all other runs). The best run
(MingleXuRun 8) reached an impressive MA-MRR of 0.67395.</p>
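      <p>As a rough illustration of such supervised fine-tuning (a minimal sketch rather than the participants' actual training code; the model name and hyper-parameters below are assumptions, and the EVA checkpoints used by the team are not reproduced here), a pre-trained ViT can be fine-tuned on the 80k-class task as follows:</p>
      <preformat>
# Minimal sketch, not the participants' training code: fine-tuning a pre-trained
# ViT on the 80k-class species classification task with timm. The model name and
# hyper-parameters are illustrative assumptions.
import timm
import torch

NUM_SPECIES = 80000

model = timm.create_model(
    "vit_large_patch16_224",   # placeholder; the team used EVA checkpoints
    pretrained=True,
    num_classes=NUM_SPECIES,   # replaces the classification head
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = torch.nn.CrossEntropyLoss()

def training_step(images, labels):
    # One supervised fine-tuning step on a batch of plant images.
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
      </preformat>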
      <p>
        Neuon AI, Malaysia, 10 runs, [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]: this participant used various ensembles of
models, most of the time fine-tuned on all the available training data (the "trusted" and "web"
training sets), mainly based on the Inception-ResNet-v2 architecture [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] (and on
Inception-v4 to a lesser extent). All the models are CNNs directly fine-tuned, but as a multi-task
classification over five taxonomic levels (species, genus, family, order and "class" in the
botanical sense) instead of the default species level. All the runs were fine-tuned. The team then
explored various ways to improve performance: more data augmentation, a balanced
batching method, a multi-organ and single-organ training scheme, and finally a feature
embedding comparison instead of the traditional softmax function. The same data augmentation
techniques were applied for all runs and included random cropping, horizontal flipping, color
distortion, bi-cubic resizing, random hue and random contrast. The balanced batching method
consisted in limiting the selection of training images for a species in an epoch to a maximum
of 16 samples, in order to avoid any bias towards a particular species and to prevent poor
performance on underrepresented species (runs 2 and 4). A multi-organ training scheme
was used for run 4: the approach involved training multiple models on smaller sub-datasets
that exclusively consisted of images tagged with either the Flower, Bark, Fruit, Habit, or Leaf
tag. The feature embedding comparison was used in runs 3, 6 and 8. It relies on calculating
distances between test and training images, using cosine similarity applied to the feature
vectors of a single test image and all the training images. The feature embeddings are directly
the features extracted from the model before the fully-connected last layer. Distance scores are
transformed into probabilities using Inverse Distance Weighting, allowing for class ranking,
and the class with the highest probability finally represents the most confident prediction.
      </p>
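      <p>The feature embedding comparison can be sketched as follows (a minimal illustration rather than Neuon AI's implementation; the pre-extracted embeddings, array names, and the inverse-distance weighting exponent are assumptions):</p>
      <preformat>
# Rough sketch (assumptions: embeddings are pre-extracted from the layer before
# the fully-connected head, and the IDW exponent is 1) of the feature-embedding
# comparison described above: compute cosine similarity between one test image
# embedding and all training embeddings, turn distances into per-species scores
# with inverse distance weighting, and rank the species.
import numpy as np

def predict_by_embedding(test_vec, train_vecs, train_labels, eps=1e-8):
    """test_vec: (d,); train_vecs: (n, d); train_labels: (n,) species ids."""
    test_vec = test_vec / (np.linalg.norm(test_vec) + eps)
    train_norm = train_vecs / (np.linalg.norm(train_vecs, axis=1, keepdims=True) + eps)
    cos_sim = train_norm @ test_vec                  # cosine similarity per training image
    cos_dist = 1.0 - cos_sim                         # cosine distance
    weights = 1.0 / (cos_dist + eps)                 # inverse distance weighting
    scores = {}
    for label, w in zip(train_labels, weights):
        scores[label] = scores.get(label, 0.0) + w   # accumulate weight per species
    total = sum(scores.values())
    probs = {label: w / total for label, w in scores.items()}
    # Species ranked by decreasing probability; the first one is the prediction.
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
      </preformat>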
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>
        We report in Figure 1 the performance achieved by the collected runs. Table 2 provides the
results achieved by each run as well as a brief synthesis of the methods used in each of them.
      </p>
      <p>
        ViT SSL is better than CNN STL: the most impressive outcomes were achieved
by vision transformer-based approaches, particularly the vision-centric foundation model EVA
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], which held the state-of-the-art position at the time of the challenge, in the first quarter of 2023.
While CNN-based approaches also produced respectable results, with a maximum MA-MRR of
0.61813 (NeuonAIRun9), they still fell notably short of the highest score attained by an EVA
approach. The best EVA approach achieved a remarkable MA-MRR of 0.67395 (MingleXuRun8).
      </p>
      <p>
        The noisy web training dataset helps: incorporating the comprehensive PlantCLEF
training dataset, which includes both the trusted and web datasets, yielded notable benefits
despite the extended training duration and the inherent residual noise present in the web dataset.
The inclusion of the web training dataset led to a significant improvement in performance, as
evidenced by the MA-MRR reaching 0.67395 (MingleXuRun8), surpassing the maximum of
0.65035 (MingleXuRun5) achieved without its incorporation. However, it is possible that the
web dataset has been well curated and that the noise level is not as high as one might think.
      </p>
      <p>
        Species ablation was not relevant: reducing the training set by removing
the classes with the fewest images (MingleXuRun1-4-2-6 vs 5) implies a significant drop in
performance. This observation highlights a crucial point: the presence of a direct correlation
between training and test data is not always guaranteed. It underlines the importance of
including all classes, including those associated with uncommon species, to meet the challenge
of monitoring plant biodiversity. By including a diverse range of classes, even those associated
with less common species, we can better grasp the true extent and variability of plant life. This
holistic approach ensures a more complete understanding of plant biodiversity for effective
monitoring and conservation efforts in the future. It is a reminder that comprehensive and
inclusive datasets are essential for accurate and reliable analysis in the field of plant biodiversity.
      </p>
      <p>Combining models dedicated to specific organs deteriorates results: intuitively, one
might expect it to be beneficial to specialize models on organ-based learning subsets, as botanists
eventually learn to analyze organ structure and appearance independently.
However, a noteworthy observation is that one of the poorest results in the challenge emerged
when combining models trained on specific organ sub-datasets (NeuonAIRun4 with a MA-MRR
of 0.33926). A possible explanation for this outcome can be attributed to the loss of a significant
number of species per organ. Statistics reported by the authors indicate that by focusing solely
on fruits, for instance, the resulting dataset encompasses only approximately 27,000 species,
significantly lower than the total 80,000 species available. This reduction in species coverage
likely hampers the model’s ability to generalize and accurately classify a broader range of plant
organisms.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper presented the overview and the results of the LifeCLEF 2023 plant identification
challenge, following the 12 previous ones conducted within the CLEF evaluation forum. This year
the task was performed for the second year on the largest plant image dataset ever published
in the literature. This dataset was composed of two distinct sources, a trusted set built from
GBIF and a noisy web dataset, together totaling 4M images and covering 80k species.</p>
      <p>The main conclusion of our evaluation is that vision transformers performed definitely better
than convolutional neural networks, especially when such models are pre-trained with
Self-Supervised Learning. Furthermore, an important lesson we have learned is the significance
of maximizing the number of images, including those obtained from the web, despite the
possibility of errors. It is crucial not to limit the size of the dataset based on organ types or to
assume that a species with few training images is too rare to appear in the test set. By
incorporating a larger and more diverse set of images, we enhance the model’s ability to capture
a wider range of plant variations and improve its overall performance.</p>
      <p>However, training those models requires computational resources that only participants
with access to large computational clusters can afford. For instance, the winning team, Mingle
Xu, indicates that they needed to use 16 RTX 3090 GPUs for almost three months to train all
the models. We are aware that this is not fair to other teams who do not have enough GPUs, and
that it considerably limits the participation of other teams. However, we hope that the challenge
and results presented in this article will highlight future research directions for solving key
species identification problems across all kingdoms and advancing AI in general for biodiversity.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
      <p>The research described in this paper was partly funded by the European Commission via the
GUARDEN and MAMBO projects, which have received funding from the European Union’s
Horizon Europe research and innovation program under grant agreements 101060693 and
101060639. The opinions expressed in this work are those of the authors and are not necessarily
those of the GUARDEN or MAMBO partners or the European Commission.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Christenhusz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Byng</surname>
          </string-name>
          ,
          <article-title>The number of known plants species in the world and its annual increase</article-title>
          ,
          <source>Phytotaxa</source>
          <volume>261</volume>
          (
          <year>2016</year>
          )
          <fpage>201</fpage>
          -
          <lpage>217</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Naeem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Bunker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hector</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Loreau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Perrings</surname>
          </string-name>
          ,
          <article-title>Biodiversity, ecosystem functioning, and human wellbeing: an ecological and economic perspective</article-title>
          ,
          <source>OUP Oxford</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cronk</surname>
          </string-name>
          ,
          <article-title>Plant extinctions take time</article-title>
          ,
          <source>Science</source>
          <volume>353</volume>
          (
          <year>2016</year>
          )
          <fpage>446</fpage>
          -
          <lpage>447</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Parr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Leary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Lans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Walley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Hammock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Goddard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Rice</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Studer</surname>
          </string-name>
          , et al.,
          <article-title>The encyclopedia of life v2: providing global access to knowledge about life on earth, Biodiversity data journal (</article-title>
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Q. D.</given-names>
            <surname>Wheeler</surname>
          </string-name>
          , What if gbif?,
          <source>BioScience</source>
          <volume>54</volume>
          (
          <year>2004</year>
          )
          <fpage>717</fpage>
          -
          <lpage>717</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Gaston</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. A. O'Neill</surname>
          </string-name>
          ,
          <source>Automated species identification: why not?</source>
          ,
          <source>Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences</source>
          <volume>359</volume>
          (
          <year>2004</year>
          )
          <fpage>655</fpage>
          -
          <lpage>667</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , Y. Bengio, G. Hinton,
          <article-title>Deep learning</article-title>
          , nature
          <volume>521</volume>
          (
          <year>2015</year>
          )
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Spampinato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Lombardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Palazzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Biodiversity information retrieval through large scale content-based identification: a long-term evaluation</article-title>
          ,
          <source>in: Information Retrieval Evaluation in a Changing World</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>389</fpage>
          -
          <lpage>413</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Plant identification based on noisy web data: the amazing performance of deep learning (lifeclef 2017), in: CLEF task overview 2017, CLEF: Conference and Labs of the Evaluation Forum</article-title>
          , Sep.
          <year>2017</year>
          , Dublin, Ireland.,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of expertlifeclef 2018:
          <article-title>how far automated identification systems are from the best experts? lifeclef experts vs</article-title>
          .
          <source>machine plant identification task</source>
          <year>2018</year>
          , in: CLEF task overview
          <year>2018</year>
          ,
          <article-title>CLEF: Conference and Labs of the Evaluation Forum</article-title>
          , Sep.
          <year>2018</year>
          , Avignon, France.,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Nugent</surname>
          </string-name>
          , inaturalist,
          <source>Science Scope</source>
          <volume>41</volume>
          (
          <year>2018</year>
          )
          <fpage>12</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wäldchen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rzanny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seeland</surname>
          </string-name>
          , P. Mäder,
          <article-title>Automated plant species identification-trends and future directions</article-title>
          ,
          <source>PLoS computational biology 14</source>
          (
          <year>2018</year>
          )
          <article-title>e1005993</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Afouard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Lombardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Pl@ ntnet app in the era of deep learning</article-title>
          ,
          <source>in: ICLR: International Conference on Learning Representations</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>van Kleunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pyšek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dawson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kreft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pergl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Weigelt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dullinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>König</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lenzner</surname>
          </string-name>
          , et al.,
          <article-title>The global naturalized alien flora (glonaf) database</article-title>
          ,
          <source>Ecology</source>
          <volume>100</volume>
          (
          <issue>1</issue>
          ) (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          GBIF.org, Occurrence download,
          <year>2022</year>
          . URL: https://www.gbif.org/occurrence/download/0105549-210914110416597. doi:10.15468/DL.EJ7KN5.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Baek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <article-title>Plantclef2023: A bigger training dataset contributes more than advanced pretraining methods for plant identification</article-title>
          ,
          <source>in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chulif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. L.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Deep learning for large-scale plant classification: Neuon submission to plantclef 2023</article-title>
          ,
          <source>in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Masked autoencoders are scalable vision learners</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>16000</fpage>
          -
          <lpage>16009</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          , Beit:
          <article-title>Bert pre-training of image transformers</article-title>
          ,
          <source>arXiv preprint arXiv:2106.08254</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>Eva: Exploring the limits of masked visual representation learning at scale</article-title>
          ,
          <source>arXiv preprint arXiv:2211.07636</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          , Mvp:
          <article-title>Multimodality-guided visual pre-training</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>337</fpage>
          -
          <lpage>353</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ioffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alemi</surname>
          </string-name>
          ,
          <article-title>Inception-v4, inception-resnet and the impact of residual connections on learning</article-title>
          ,
          <source>arXiv preprint arXiv:1602.07261</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>