<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Preprocessing Is All You Need: TheHeartOfNoise Submission to PlantCLEF 2025</article-title>
      </title-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The PlantCLEF 2025 competition promotes scientific research in biodiversity monitoring. Participants are invited to analyze high-resolution images, known as quadrats, to identify the plant species present. This task poses several challenges. For example, plants may occupy only a small proportion of the total image area, or they may have a morphology that makes them difficult to identify due to seasonal variations. Every detail counts, and the entire processing chain must be optimized, from loading the high-definition image to compiling the list of predicted species. Preprocessing, which transforms the source image into a tensor for the machine learning model, is the focus of this article's study and leads to the best classification performance in the PlantCLEF 2025 competition. As part of this participation, three approaches to analyzing the quadrat images were explored. The first method, used as a reference, simply reduces the images to the expected model resolution, allowing the research to focus on optimizing preprocessing. The research results were then applied to the two other approaches: tiling the images, and analyzing them with a single-shot high-resolution inference. The Rust programming language was primarily used. The associated source code is available in a public repository.</p>
      </abstract>
      <kwd-group>
        <kwd>multilabel classification</kwd>
        <kwd>high-resolution image</kwd>
        <kwd>image preprocessing</kwd>
        <kwd>plant identification</kwd>
        <kwd>vision transformer</kwd>
        <kwd>computer vision</kwd>
        <kwd>tiling method</kwd>
        <kwd>high-resolution inference</kwd>
        <kwd>Rust</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Environmental issues and biodiversity conservation are becoming increasingly important, particularly
for monitoring the decline of native species and the emergence of invasive ones [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In botany, the
quadrat method [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is used to assess biodiversity indices. This method involves counting plant species in
standardized areas, typically 50 cm by 50 cm, at regular intervals to estimate changes in plant diversity.
The PlantCLEF 2025 competition [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], part of the LifeCLEF lab [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] under the Conference and Labs of the
Evaluation Forum (CLEF), follows on from PlantCLEF 2024 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and aims to facilitate this monitoring by
simplifying the work of botanists through artificial intelligence.
      </p>
      <p>
        The Pl@ntNet team, which has been active for fifteen years in research related to the visual
identification of plant species, has developed a free application [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] using computer vision models to quickly
identify plant species. The team is also working to improve the efficiency of quadrat analysis [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which
is a complex task due to the massive amount of data that must be processed. Often, several hundred
VisionTransformer inferences are required for a single high-definition image. This makes it essential to
optimize computing power and the data processing pipeline.
      </p>
      <p>
        New automation capabilities for image analysis have been made possible by artificial intelligence,
which is gradually being integrated into various aspects of everyday life, such as medicine and
autonomous vehicles. In 2012, the ImageNet competition, which involved classifying images into 1,000
categories, was won by the AlexNet model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a convolutional neural network with 60 million
parameters, trained using the backpropagation algorithm proposed by Yann LeCun in 1989 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This
breakthrough was made possible by increased computing power and the availability of large amounts
of data.
      </p>
      <p>
        In 2017, convolutional neural network technology was surpassed by Transformers [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], initially
designed for natural language processing and adapted in 2020 for image analysis as VisionTransformers
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. These models often contain hundreds of millions or even billions of parameters and require
large amounts of annotated data to avoid overfitting. This has popularized self-supervised learning
methods ([
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ]). These methods enable the construction of representation latent spaces with
strong generalization capabilities and pre-trained models that can be easily adapted to various use
cases. Examples include VisionTransformer DINOv2 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and its improved version with registers,
DINOv2Reg4 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], which were used in this competition.
      </p>
      <p>
        Certain use cases of artificial intelligence require specific reliability and robustness criteria regarding
the quality of predictions and proper software functioning. Research is underway to develop explainable
AI to increase confidence in the technology [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In the short term, risks of cyber and IT failures must
also be considered, especially for critical applications such as autonomous cars and space technologies
(e.g., satellites). In this context, the Rust programming language [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] stands out by offering performance
comparable to hardware-close languages, such as C/C++ [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], while ensuring greater safety, particularly
in memory management [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
        ].
      </p>
      <p>
        Participation in PlantCLEF 2025 aims to advance scientific research in image analysis and demonstrate
the potential of the Rust language, through the Candle library [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], for computer vision research.
Although the Python/PyTorch stack [
        <xref ref-type="bibr" rid="ref23 ref24">23, 24</xref>
        ] is widely adopted, Rust offers unique benefits that enhance
both reliability and efficiency in computer vision tasks. These advantages include improved performance,
enhanced safety, concurrency support, and seamless integration with other languages such as C and
CUDA.
      </p>
      <p>
        Implementing image analysis methods in Rust required reproducing the official code of the reference
method (Python script "Official Starter Notebook | Inference on Test Data") and processing the image
directly at the expected 518x518 resolution. This process highlighted the essential role of preprocessing,
which was then investigated further. Next, two approaches to high-resolution image processing were
explored, each leading to different technical challenges: the tiling method, used in the previous edition
of the competition [
        <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
        ], consists of extracting square samples from the original image at different
scales, potentially with overlap. Then, these tens or hundreds of tiles are inferred individually before
the predictions are aggregated.
      </p>
      <p>
        A second approach, VaMIS (Variable Model Input Size) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], is based on high-resolution inference.
This approach allows the model to adapt to analyzing large images with a single inference, reducing
the computational footprint compared to tiling. All tokens in the image can interact with each other
via global attention. To further reduce computational complexity, the image can be segmented into
windows, allowing tokens to interact solely within their respective windows [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. Finally, a hybrid
attention method alternates between the previous two techniques to combine their advantages [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], as
does the SegmentAnything image encoder [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
      </p>
      <p>The goal of this competition is to improve the performance of plant classification, i.e., to identify
the species visible within each quadrat based primarily on visual data rather than image metadata.
Therefore, these metadata have been utilized to a limited extent.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Overview</title>
      <sec id="sec-2-1">
        <title>2.1. The PlantCLEF2025 competition</title>
        <p>
          The annual PlantCLEF competition aims to promote scientific research in plant classification. Since
last year [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], the objective has been to encourage quadrat research, or the multi-label classification
of high-definition images. To this end, various single-label and multi-label datasets are provided for
training and evaluation purposes. Two deep learning models trained for plant classification are made
available to participants to avoid the need for significant computing resources.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Research avenues explored</title>
        <p>A significant part of the research was devoted to data preprocessing. Various avenues of research have
made it possible to optimize technical choices, such as the interpolation used during resizing, as well as
the chroma subsampling scheme and the JPEG compression quality applied when recompressing the image
after resizing, which can surprisingly be seen as a regularization factor.</p>
        <p>During the five-week participation in the PlantCLEF 2025 competition, two approaches were explored
for analyzing high-definition images with a VisionTransformer implemented in Rust: the classic
tiling method and high-resolution inference methods known as VaMIS, for VAriable Model Input Size.
VaMIS uses either global or window attention, or both, within the same model, referred to as hybrid
attention. Since VaMIS-adapted models are highly demanding in terms of GPU memory, it was necessary
to reimplement the attention module calculation in CUDA in a more VRAM-efficient way.</p>
        <p>
          A brief collaboration with the company Quandela should also be mentioned. The goal was to conduct
preliminary tests to evaluate and enhance the performance of multilabel classification using quantum
photonics and demonstrate the potential of this technology. The Python libraries Perceval, for designing
quantum circuits, and MerLin [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], for easily integrating these circuits into a PyTorch model, enable such
experiments. This research took place over the course of three days in the final week of the competition.
As the preliminary results were inconclusive, no submission was made during the competition. However,
these methods could be further developed with more reasonable research deadlines. More details can
be found in the appendix.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. The 2025 PlantCLEF challenge</title>
        <p>The PlantCLEF 2025 competition promotes innovation in ecology, focusing on biodiversity monitoring
and plant species evolutionary dynamics. Based on the standardized quadrat sampling protocol,
the challenge motivates participants to develop automated approaches for analyzing high-resolution
images of plant quadrats. The main objective is to identify various plant species among over 7,800.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Single-label dataset</title>
          <p>The PlantCLEF 2025 monolabel training dataset remains the same as in the previous edition. It consists
of observations of individual plants in southwestern Europe and covers 7,806 species. The dataset
includes approximately 1.4 million images, which have been supplemented with additional images from
the GBIF platform to include less represented species. The images are pre-organized into subfolders by
species and divided into training, validation, and test sets to facilitate model training.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Multi-label datasets</title>
          <p>The PlantCLEF 2025 test dataset is a compilation of quadrat image datasets in various floristic contexts.
It contains a total of 2,105 high-resolution images. The shooting protocols vary, including different
angles and weather conditions. These images, produced by experts, make it possible to evaluate the ability of
models to accurately identify plant species under various conditions, thereby testing their robustness.</p>
          <p>
            A complementary dataset is also available [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ]. It contains over 200,000 images from the LUCAS
Cover Photos 2006-2018 collection, including a large number of unannotated pseudo-quadrat images.
These additional data are intended to adapt the models to process images of multi-species vegetation
quadrats. These data were not used in this study.
          </p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Technical characteristics</title>
          <p>The images are provided in JPEG format, which can introduce compression artifacts that modify the
represented data by reducing its entropy, according to the principle of lossy compression. Two main
parameters control compression: JPEG compression quality and the YCbCr subsampling scheme. The
latter relies on the underlying color space used to separate the luminance channel from the chroma channels. These
compression parameters are discussed in a dedicated section below.</p>
          <p>Tables 1 and 2 summarize the JPEG technical characteristics of the training dataset, which contains images with a maximum size of 800px. Similarly, Tables 3 and 4 report the JPEG compression parameters of the quadrat test set.</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>3.1.4. Models</title>
          <p>
            To facilitate access to the competition, two VisionTransformer [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] models were made available to participants. These models were pre-trained using the DINOv2 self-supervised learning (SSL) method [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. The final DINOv2Reg4 architecture uses registers [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] as temporary memory to aggregate information at the image level. The chosen architecture size is "ViT-Base", which consists of 12 successive VisionTransformer blocks with 12 attention heads and a latent space size of 768. The models take an RGB image as input. The image is partitioned into 14x14 pixel patches, leading to 37x37 = 1369 local tokens. Including the CLS token and the four registers, there are 1369 + 1 + 4 = 1374 tokens (i.e., vectors) of size 768. These tokens represent the informational state space in which the VisionTransformer operates.
          </p>
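          <p>For illustration, the short plain-Rust sketch below (a hypothetical helper, not taken from the competition code) reproduces this token count for the 518x518 input and for the 1554x1554 resolution used later for high-resolution inference.</p>
          <preformat><![CDATA[
/// Number of tokens a DINOv2Reg4 ViT-Base produces for a square RGB input of
/// `side` pixels, given 14x14 patches, one CLS token and four registers.
fn token_count(side: usize) -> usize {
    let patch = 14;
    let grid = side / patch;  // 518 / 14 = 37 patches per side
    grid * grid + 1 + 4       // 37*37 local tokens + CLS + four registers
}

fn main() {
    assert_eq!(token_count(518), 1374);
    // At a VaMIS resolution of 1554x1554 (scale three), 111*111 + 5 = 12326 tokens.
    println!("518 -> {} tokens, 1554 -> {} tokens", token_count(518), token_count(1554));
}
]]></preformat>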
          <p>The first model uses public DINOv2 weights pre-trained by SSL for the image encoder. Only the
classification head was trained in a supervised manner. The second model was trained entirely in a
supervised manner using the first model as initial checkpoint. Both models were trained on a server
with A100 GPUs using the Timm library with Torch. The first model was trained for approximately
17 hours over 92 epochs, with a batch size of 1,280 images per GPU, and a learning rate of 0.01. The
second model was trained for about 36 hours over 92 epochs, with a batch size of 144 images per GPU
and a learning rate of 0.00008.</p>
          <p>For the PlantCLEF 2025 competition, only the second model was used. Since all of its parameters were fine-tuned on the plant training dataset, it can be expected to perform better than the first model.</p>
        </sec>
        <sec id="sec-3-1-5">
          <title>3.1.5. F1 metric</title>
          <p>The metric chosen to rank participants’ submissions is the macro-averaged F1 score per sample, which strikes a good balance between statistical recall and precision. The 2,105 images are grouped into transects, which represent samples from specific areas within selected sites. To mitigate biases related to oversampled areas, the score is first calculated for each transect in the test set and then averaged across transects to obtain the final score.</p>
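          <p>As an illustration of this metric, the plain-Rust sketch below computes a per-sample F1 (the Dice-style 2·TP / (|predicted| + |true|)) and averages it first within and then across transects. It is a minimal reconstruction of the scoring principle described above, not the official evaluation code, and its handling of edge cases such as empty prediction lists is an assumption.</p>
          <preformat><![CDATA[
// Per-sample F1 (equivalently a Dice score) between predicted and true species id lists.
fn sample_f1(pred: &[u32], truth: &[u32]) -> f64 {
    if pred.is_empty() && truth.is_empty() {
        return 1.0; // assumption: an empty-vs-empty pair counts as a perfect match
    }
    let tp = pred.iter().filter(|&&s| truth.contains(&s)).count() as f64;
    2.0 * tp / (pred.len() + truth.len()) as f64
}

// Challenge metric: average the per-sample F1 within each transect, then across transects.
// Each transect is a list of (predicted species, true species) pairs, one per quadrat image.
fn macro_f1_per_transect(transects: &[Vec<(Vec<u32>, Vec<u32>)>]) -> f64 {
    let per_transect: Vec<f64> = transects
        .iter()
        .map(|imgs| imgs.iter().map(|(p, t)| sample_f1(p, t)).sum::<f64>() / imgs.len() as f64)
        .collect();
    per_transect.iter().sum::<f64>() / per_transect.len() as f64
}
]]></preformat>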
        </sec>
        <sec id="sec-3-1-4">
          <title>3.1.6. Challenge posed by the competition</title>
          <p>Several machine learning challenges must be solved to enable effective classification. The training
data consists of images of individual plants or parts of plants with a single label, while the test data
consists of images of vegetation quadrats with multiple labels. Therefore, the monolabel model must be
adapted to perform multi-label classification. Additionally, the test images have fairly high resolutions,
frequently ranging from eight to ten million pixels, compared to the model’s input size of approximately
250,000 pixels. The next section details different approaches to address this discrepancy.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Research avenues explored</title>
        <p>Quadrat image analysis involves processing a substantial amount of information, including images
containing nearly 10 million pixels and a tiling process that can multiply the number of inferences
by 100 compared to the initial dataset. Additionally, the analysis predicts between one and 15 species
out of 7,806 possibilities, generating a large number of potential combinations. In accordance with the
official public Python script, the decision was made to limit the maximum number of predicted species
to 15 in order to avoid substantially reducing statistical accuracy.</p>
        <p>A clear pipeline for processing this data must be established. This pipeline can be described in four
successive steps:
1. Preprocessing: conversion of a 3000x3000x3 quadrat image stored on hard disk in JPEG format to
518x518x3 tensor(s) in RAM (or VRAM).</p>
        <p>First, the high-definition image file is loaded into RAM. Then, the JPEG format is decoded. Next,
the image is resized and cropped. JPEG decoding can take into account the final position of the
pixels. In other words, it is not necessary to decode all 10 million pixels of the initial quadrat
image because this process is relatively intensive. One or more tiles (i.e., square areas within the
original image) are extracted and resized to the model’s input resolution of 518x518 pixels or a
higher resolution for the high-resolution inference approach.
2. The image encoder of the DINOv2 plant model: transformation of the 518x518x3 image tensor(s)
into a vector(s) of size 768.</p>
        <p>In a standard deep learning approach, the reduced image is provided as input to the model in
the form of a tensor. The VisionTransformer image encoder calculates "deep features," which are
floating-point vectors that summarize all the plant information contained in the original image.
Then, it performs linear classification. It was decided that these deep features (more precisely,
only the CLS token, which is used for classification) would be stored on a hard disk.
The classification head is usually calculated immediately after the image encoder. However, it is
preferable to divide the model inference into two steps. This is because calculating deep features
is more resource-intensive than calculating the classification head. The latter can be recalculated
quickly after loading the deep features from the hard disk without significant delay. Additionally,
storing predictions by tile and species would require much more space than storing the respective
deep features. Finally, testing other classification heads or fine-tuning only the head may be
desirable, if necessary.
3. Linear classification and prediction aggregation: transformation of the 768-size vector(s) into a
7806-size score vector.</p>
        <p>
          Each deep feature vector (CLS token) is input into a linear classifier to obtain a vector of logits, and
then the SoftMax function is applied to get probabilities by species. Predictions are aggregated
from the tile level to the high-definition image level using the chosen method: maximum pooling
by species, average pooling, a given quantile [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], or any method summarizing information at
the quadrat image level.
        </p>
        <p>To limit confusion and improve statistical precision, it is also possible to retain only one or two
species per tile, as is done in the official code provided for the tiling method.
4. Submission calculation: transformation of the 7806-size score vector into a list of species.</p>
        <p>Only the 15 most probable species at the high-definition image scale are retained (e.g., Top15),
and only those with a score above the detection threshold are selected. The final list comprises
the species chosen by the model as predictions for the given quadrat image. Other methods of
calculating submissions are possible.</p>
        <p>As part of the PlantCLEF 2025 competition, the research primarily focused on the initial stages of the
pipeline: image preprocessing and selecting an image encoder process to analyze high-definition images.
Two approaches were implemented for the latter: the classic tiling technique and the VaMIS approach,
which analyzes each image with a single high-resolution inference. Due to insufficient computing
power, these approaches were only tested in inference mode, i.e., without fine-tuning the deep learning
model. Finally, prediction aggregation and submission calculation were briefly explored without specific
analysis.</p>
        <p>To implement these models and calculate inferences and submissions, Python and Rust code were
produced.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Preprocessing</title>
        <p>The development of the Rust pipeline started with implementing the reference approach, which solves
the problem of analyzing high-definition images by simply reducing them to the resolution expected by
the VisionTransformer model (518 x 518 pixels). The initial goal was to reproduce the results of the
official competition script ("Official Starter Notebook | Inference on Test Data"). However,
multiple tests and the launch of the code revealed a bottleneck caused by image preprocessing. Loading
and decoding JPEG images with nearly 10 million pixels and reducing them to 518x518 pixels requires
as much computational power as, if not more than, the VisionTransformer inference that follows image
loading.</p>
        <p>Therefore, the initial step involved performing external preprocessing of the images in the PlantCLEF
2025 test dataset to reduce their size upstream. This made it possible to use the reduced images as input and
test the Rust solution. The list of plant species predicted by quadrat varied significantly depending on
the chosen preprocessing. In particular, preprocessing involving JPEG compression after resizing the
high-definition image showed significant variations in performance.</p>
        <p>Therefore, research on preprocessing was conducted along several lines: the interpolation used
for resizing, and in the case of post-resizing JPEG compression, the YCbCr subsampling scheme and
compression quality factor used by JPEG.</p>
        <p>Tests were conducted using the reference approach, which reduces images to the expected dimensions
of 518x518 pixels at the model input. This technical choice allows for multiple tests at reasonable
computational cost. It also allows for better evaluation and comparison of the different preprocessing
methods. Listing the plant species present in a quadrat image with a reduced resolution of 518x518 pixels
is difficult; every pixel counts for identifying plants. This differs from a tiling process, in which tiles are
extracted at the original resolution of the image by cropping it. In the latter case, the importance of
technical preprocessing choices is reduced. The detection threshold was set to a one percent probability
for all the experiments, which is the same value as the official 518x518 reference method.</p>
        <p>Finally, the impact of preprocessing, including post-resizing JPEG compression, was measured on the
two approaches implemented for PlantCLEF 2025: tiling and the high-resolution VaMIS approach.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Interpolations</title>
          <p>Interpolation necessarily plays a role in the final performance of the classifier. Reducing the image
size requires estimating the color of pixels in a grid that may not overlap the original geometry. This
process makes assumptions when choosing a surface model. This introduces an additional source of
noise that may affect the downstream deep learning model, thereby penalizing classification quality.
The following interpolations were tested: nearest neighbor, bilinear, bicubic, and Lanczos. Many other
interpolation methods exist and could have been evaluated as well. However, the decision was made to
limit the evaluation to the most commonly used methods.</p>
          <p>
            Nearest neighbor interpolation is unique because it simply involves selecting the pixel in the source
grid that is closest to the target pixel. In contrast, bilinear, bicubic, and Lanczos interpolations are
convolution products, i.e., weighted averages of the pixels in the neighborhood of the target pixel
projected onto the source grid. Named after Hungarian mathematician and physicist Cornelius Lanczos,
the latter interpolation is calculated using the cardinal sine function and is known to reduce visual
artifacts such as blurring and aliasing. Figure 1 illustrates the filters associated with these convolution
products. [
            <xref ref-type="bibr" rid="ref33">33</xref>
            ] provides additional information.
          </p>
          <p>The classification performance of preprocessing was evaluated with each of the four interpolations
on grayscale test data images to avoid dependence on a particular color space and to focus on the
geometric calculations involved in reducing high-definition quadrat images to 518x518 pixels.
</p>
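          <p>The sketch below illustrates how such a comparison can be set up with the Rust image crate, assuming a hypothetical input file quadrat.jpg. The mapping of "bilinear" to FilterType::Triangle and "bicubic" to FilterType::CatmullRom is an approximation of the filters discussed here, and the snippet is not the competition pipeline itself.</p>
          <preformat><![CDATA[
use image::imageops::FilterType;

// Reduce a high-definition quadrat image to the 518x518 model input with one
// of the interpolation filters compared in this section. Illustrative sketch.
fn resize_quadrat(path: &str, filter: FilterType) -> image::DynamicImage {
    let img = image::open(path).expect("cannot open quadrat image");
    // resize_exact ignores the aspect ratio, like the reference 518x518 method.
    img.resize_exact(518, 518, filter)
}

fn main() {
    for (name, filter) in [
        ("nearest", FilterType::Nearest),
        ("bilinear", FilterType::Triangle),   // bilinear filter
        ("bicubic", FilterType::CatmullRom),  // bicubic-family filter
        ("lanczos", FilterType::Lanczos3),
    ] {
        let small = resize_quadrat("quadrat.jpg", filter);
        small
            .save(format!("quadrat_518_{name}.png"))
            .expect("cannot save resized image");
    }
}
]]></preformat>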
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. JPEG compression: YCbCr scheme and quality factor</title>
          <p>YCbCr subsampling scheme Various preprocessing methods were considered, including
recompressing the resized images to JPEG format. The initial aim was to reduce disk space and subsequently
optimize classification performance. This required selecting the YCbCr subsampling scheme and
compression quality used by JPEG.</p>
          <p>The YCbCr color space, obtained via a linear transformation from RGB, distinguishes the Y luminance
channel from the Cb and Cr chroma channels. To reduce data size while accounting for human visual
perception, JPEG allows the chroma channels to be subsampled relative to the luminance channel
according to standardized schemes:
• 4:4:4 : No subsampling. The chroma and luminance components have the same resolution.
• 4:2:2 : chroma is subsampled horizontally by a factor of 2.
• 4:2:0 : chroma is subsampled horizontally and vertically by a factor of 2.
• 4:1:1 : chroma is subsampled horizontally by a factor of 4.</p>
          <p>These schemes are summarized in Figure 2.</p>
          <p>A fifth, less common scheme appears in the test data, in which chroma is subsampled vertically by a
factor of two. It is denoted 4:2:2v in this article.</p>
          <p>
            JPEG compression quality In JPEG compression, the quality factor is a number between 0 and 100
that determines how high frequencies are filtered in the discrete cosine transform (DCT, [
            <xref ref-type="bibr" rid="ref34">34</xref>
            ]) and how
the remaining coefficients are quantized. For example, a higher quality factor preserves more high
frequencies, provides more accurate quantization, and results in less compression of the image.
          </p>
          <p>The prediction performance of the single inference reference method was assessed by reducing
high-definition quadrat images to a resolution of 518x518 pixels. The tested preprocessing methods
involve post-resizing JPEG compression, including a comparison of five subsampling schemes and JPEG
compression qualities ranging from 75 to 100 percent.</p>
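          <p>A minimal sketch of the post-resizing recompression step is given below, using the image crate's JpegEncoder with a chosen quality factor. It only illustrates the quality parameter: this crate does not expose the YCbCr subsampling scheme, which was controlled through MagickRust/ImageMagick in the actual pipeline, and the file names are hypothetical.</p>
          <preformat><![CDATA[
use image::codecs::jpeg::JpegEncoder;
use std::error::Error;
use std::fs::File;
use std::io::BufWriter;

// Re-encode an already resized 518x518 image as JPEG with a given quality factor (0-100).
fn recompress(img: &image::DynamicImage, quality: u8, out_path: &str) -> Result<(), Box<dyn Error>> {
    let rgb = img.to_rgb8(); // JPEG has no alpha channel
    let file = BufWriter::new(File::create(out_path)?);
    let mut encoder = JpegEncoder::new_with_quality(file, quality);
    encoder.encode_image(&rgb)?;
    Ok(())
}

fn main() -> Result<(), Box<dyn Error>> {
    let small = image::open("quadrat_518.png")?; // hypothetical resized image
    for quality in [75u8, 85, 94, 100] {
        recompress(&small, quality, &format!("quadrat_518_q{quality}.jpg"))?;
    }
    Ok(())
}
]]></preformat>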
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. High-resolution processing approaches</title>
        <sec id="sec-3-4-1">
          <title>3.4.1. Tiling method</title>
          <p>
            Principle This paper explored the tiling method used in previous editions [
            <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
            ] at different scales.
First, the source image is resized to a multiple of 518 by cropping the long edge. Then, it is partitioned
without overlap into square samples of size 518 x 518, which corresponds to the input expected by
the model. The dimension multiplier corresponds to the scale. For example, at a scale of three, nine
adjacent tiles are produced without overlap, as illustrated in Figure 3.
          </p>
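          <p>A possible implementation of this tiling step is sketched below with the image crate. The centered crop of the long edge and the Lanczos filter are illustrative assumptions; only the overall scheme (square crop, resizing to scale x 518, partition into adjacent 518x518 tiles) follows the description above.</p>
          <preformat><![CDATA[
use image::imageops::FilterType;
use image::{DynamicImage, GenericImageView};

// Tiling at a given scale: crop the long edge to obtain a square image, resize it
// to scale*518, then cut it into scale*scale adjacent 518x518 tiles. Sketch only.
fn tiles_at_scale(img: &DynamicImage, scale: u32) -> Vec<DynamicImage> {
    let side = img.width().min(img.height());
    // Centered crop of the long edge so the image becomes square (assumption).
    let x0 = (img.width() - side) / 2;
    let y0 = (img.height() - side) / 2;
    let square = img.crop_imm(x0, y0, side, side);
    // Resize to a multiple of the 518-pixel model input.
    let resized = square.resize_exact(518 * scale, 518 * scale, FilterType::Lanczos3);
    let mut tiles = Vec::new();
    for ty in 0..scale {
        for tx in 0..scale {
            tiles.push(resized.crop_imm(tx * 518, ty * 518, 518, 518));
        }
    }
    tiles // scale = 3 yields the nine adjacent tiles of Figure 3
}
]]></preformat>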
          <p>Each tile is fed into the VisionTransformer image encoder, which returns a vector of deep features
stored on the hard drive. During the competition, different scales were calculated and aggregated
(from scale one to scale ten). The preprocessing implements JPEG compression after resizing the
high-definition image.</p>
          <p>
            Prediction calculation and aggregation The model’s linear classification head is applied to the
deep features of each tile to obtain the logits. The probabilities for each species are obtained by applying
the SoftMax function. Probabilities at the image scale are obtained by applying maximum or average
pooling per species. Initially used for convolutional neural networks [
            <xref ref-type="bibr" rid="ref35">35</xref>
            ], the logic of maximum pooling
is as follows: If a species is detected in a tile, its probability in that tile is high, and the max() operator
"brings up" this high score to the quadrat image scale. This corresponds well to the detection of presence
in the image, regardless of the tile position or its occupied surface within the image.
          </p>
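          <p>The maximum pooling step can be summarized by the following plain-Rust sketch, which keeps, for each of the 7,806 species, the highest probability observed over all tiles of a quadrat image.</p>
          <preformat><![CDATA[
// Aggregate per-tile species probabilities to the quadrat-image level with maximum
// pooling: a species detected with a high probability in any tile keeps that high
// score at the image level. Plain-Rust sketch; shapes are illustrative (n_species = 7806).
fn max_pool_species(tile_probs: &[Vec<f32>], n_species: usize) -> Vec<f32> {
    let mut image_probs = vec![0.0f32; n_species];
    for probs in tile_probs {
        for (species, p) in probs.iter().enumerate() {
            if *p > image_probs[species] {
                image_probs[species] = *p;
            }
        }
    }
    image_probs
}
]]></preformat>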
        </sec>
        <sec id="sec-3-4-2">
          <title>3.4.2. High-resolution inference (VaMIS)</title>
          <p>VaMIS (Variable Model Input Size) methods analyze high-resolution quadrat images using a single
inference and adapt the machine learning model. This reduces computational costs compared to the
tiling method, which performs multiple inferences at diferent scales.</p>
          <p>
            In the VisionTransformer [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] inference process, the image is split into 14x14 pixel patches. Each
patch is linearly projected, and its position in the image is encoded by adding a vector of learned
parameters. According to Transformer [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] terminology, this becomes a "token." These tokens interact
in pairs via the attention module, much like words in a sentence. The second VisionTransformer module
is the feed-forward module, which processes the tokens individually. There is a distinction between
local tokens, which come from image patches according to the aforementioned method, and global
tokens (CLS and registers), which have an initial value learned during model training.
Global attention In the case of global attention, all tokens are considered in the calculation
simultaneously. This standard calculation is the most natural method of image size extension. The model
analyzes the entire square image and can refine its predictions by taking co-occurrences between plant
species into account. The standard VaMIS method is available in Candle [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ].
          </p>
          <p>
            Window attention This method, known as Window Shifted Attention (WSA) [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ], partitions tokens
based on their initial spatial positions within the image. Within the attention module, tokens only interact
with tokens belonging to the same window (see Figure 4). In the 3x3 case, for example, this corresponds
to replacing standard VaMIS attention with nine attentions on adjacent windows of the image. After
the window attention calculation is complete, the global tokens (CLS and registers, respectively) are
averaged over the nine windows to obtain the global tokens for the entire image. This technical choice
is natural but would certainly benefit from specific fine-tuning of the adapted model.
          </p>
          <p>Windows smaller than 518x518 can be chosen, provided the resized image size is a multiple of the
window size. For example, a window VaMIS with a size of 1680x1680 and 5x5 windows of size 336x336
or 8x8 windows of size 210x210 pixels is possible.</p>
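          <p>The window partition itself reduces to simple index arithmetic on the patch grid, as in the plain-Rust sketch below (illustrative, not the competition implementation): for a 1554x1554 input, the 111x111 local tokens are split into 3x3 windows of 37x37 tokens.</p>
          <preformat><![CDATA[
// Assign each local token of a VaMIS input to an attention window, based on its
// spatial position in the patch grid. Illustrative sketch.
fn window_of_token(grid_side: usize, windows_per_side: usize, token_index: usize) -> usize {
    let tokens_per_window_side = grid_side / windows_per_side; // e.g. 111 / 3 = 37
    let row = token_index / grid_side;
    let col = token_index % grid_side;
    (row / tokens_per_window_side) * windows_per_side + (col / tokens_per_window_side)
}

fn main() {
    // Token 0 falls in window 0; the last token (index 111*111 - 1) falls in window 8.
    assert_eq!(window_of_token(111, 3, 0), 0);
    assert_eq!(window_of_token(111, 3, 111 * 111 - 1), 8);
    println!("window partition checks passed");
}
]]></preformat>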
          <p>Hybrid attention One limitation of window attention is that local tokens from diferent windows
never interact with each other. Hybrid attention, which alternates between global and window attention,
allows information to diffuse better across the whole image by enabling all tokens to interact within
given Transformer blocks.</p>
          <p>
            The Segment Anything image encoder [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ] uses this approach, spacing global attention evenly.
For instance, in a VisionTransformer with a "ViT-Base" dimension [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], such as the one used in the
competition with 12 blocks, global attention is applied to blocks 3, 6, 9, and 12. The other blocks use
window attention, as detailed in Table 5.
          </p>
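          <p>The resulting schedule can be expressed as a simple rule, sketched below in plain Rust under the assumption that global attention is placed every third block of the 12-block ViT-Base, as in Table 5.</p>
          <preformat><![CDATA[
// Hybrid attention schedule in the spirit of the SegmentAnything image encoder:
// every `period`-th block (1-indexed) uses global attention, the others use window
// attention. Illustrative sketch.
fn uses_global_attention(block_1_indexed: usize, period: usize) -> bool {
    block_1_indexed % period == 0
}

fn main() {
    let global_blocks: Vec<usize> = (1..=12)
        .filter(|b| uses_global_attention(*b, 3))
        .collect();
    assert_eq!(global_blocks, vec![3, 6, 9, 12]);
    println!("global attention in blocks {:?}", global_blocks);
}
]]></preformat>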
          <p>The three VaMIS methods described above were tested at a single resolution of 1554x1554 (equivalent
to scale three of the tiling). Due to a lack of computing resources, the deep learning models were not
fine-tuned for these methods and were only used for inference. The preprocessing implements JPEG
compression after resizing the high-definition image.</p>
        </sec>
        <sec id="sec-3-4-3">
          <title>3.4.3. Implementation of attention calculation in CUDA</title>
          <p>The attention module calculates the similarity between all pairs of tokens in the image, resulting in an attention matrix. The attention matrix has N² elements, where N is the number of tokens. VisionTransformers divide an input image into patches of a fixed size (e.g., 14 x 14 pixels for DINOv2) and then convert the patches into tokens. The number of tokens increases linearly with the image’s surface area and therefore quadratically with its dimensions (width and length) while maintaining its aspect ratio. Thus, doubling the dimensions of an image multiplies the size of the attention matrix by 16.</p>
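          <p>A back-of-the-envelope estimate, sketched below in plain Rust, shows the scale of the problem when the attention matrix is materialized explicitly in f32 for the 12 heads of the ViT-Base: roughly 91 MB per block at 518x518, but about 7.3 GB per block at 1554x1554. The per-block accounting and the f32 assumption are illustrative.</p>
          <preformat><![CDATA[
// Rough memory estimate for the explicit attention matrix of a DINOv2Reg4 ViT-Base
// at a given square input size: N² elements per head, 12 heads, 4 bytes per f32.
fn attention_matrix_bytes(side: usize) -> usize {
    let tokens = (side / 14) * (side / 14) + 1 + 4; // local tokens + CLS + registers
    let heads = 12;
    tokens * tokens * heads * 4
}

fn main() {
    // 518x518: about 0.09 GB per block; 1554x1554 (scale three): about 7.3 GB per block.
    for side in [518usize, 1554] {
        println!("{side}: {:.2} GB", attention_matrix_bytes(side) as f64 / 1e9);
    }
}
]]></preformat>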
          <p>
            VaMIS methods rely on increasing the model input size, but the limits of the graphics card are quickly
reached using the explicit method to calculate the attention module because the attention matrix must
be stored in GPU memory. FlashAttention [
            <xref ref-type="bibr" rid="ref36">36</xref>
            ] optimizes this calculation and limits GPU memory
usage. However, this feature was unavailable on the graphics card used during the competition because
it was too old. Therefore, a new implementation of the attention module in the CUDA language was
necessary.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Submission calculation: thresholding and species selection</title>
        <p>The above methods provide a vector of presence scores for each species at the image level. The list of
species predicted by the model is obtained by thresholding; that is, by listing the species whose score
(e.g. probability) exceeds a chosen detection threshold.</p>
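        <p>The corresponding selection rule can be sketched in a few lines of plain Rust, assuming the Top-15 cap and the one percent threshold of the reference method; the helper name and signature are illustrative.</p>
        <preformat><![CDATA[
// Turn a 7806-dimensional species score vector into the final prediction list:
// keep at most the `top_k` best species, and only those whose score reaches the
// detection threshold. Illustrative helper, not the competition code.
fn predicted_species(scores: &[f32], threshold: f32, top_k: usize) -> Vec<usize> {
    let mut ranked: Vec<usize> = (0..scores.len()).collect();
    // Sort species indices by decreasing score (scores assumed free of NaN).
    ranked.sort_by(|a, b| scores[*b].partial_cmp(&scores[*a]).unwrap());
    ranked
        .into_iter()
        .take(top_k)
        .filter(|i| scores[*i] >= threshold)
        .collect()
}

fn main() {
    let scores = vec![0.002f32, 0.40, 0.03, 0.008, 0.55];
    // With a one percent threshold and a Top-15 cap, species 4, 1 and 2 are kept.
    assert_eq!(predicted_species(&scores, 0.01, 15), vec![4, 1, 2]);
}
]]></preformat>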
        <p>For the tiling method, the threshold was slightly optimized by testing different levels and observing
the public F1 score of the respective submissions. However, this method is costly in terms of the number
of submissions required.</p>
        <p>The detection thresholds for the VaMIS methods were chosen roughly, considering the average
number of species detected. One run was performed for each VaMIS method to allow for initial
validation of the Rust implementation.</p>
        <p>Semi-automatic methods were tested occasionally to adjust the detection thresholds per transect
because some transects are more difficult to analyze and require different thresholds. These thresholds
were determined using a semi-automatic procedure with dichotomy to maintain an average number of
species similar to a reference classification submission. A multiplicative parameter adjusts this number
globally to optimize performance. These methods were primarily applied to tiling techniques.</p>
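        <p>The dichotomy step can be sketched as a simple bisection on the threshold, as below in plain Rust: the threshold of a transect is adjusted until the average number of predicted species (capped at 15) matches a reference value. The function is illustrative and not the exact semi-automatic procedure used during the competition.</p>
        <preformat><![CDATA[
// Bisection search for a per-transect detection threshold so that the average number
// of predicted species matches a reference value. `probs_per_image` holds one
// species-probability vector per quadrat image of the transect. Illustrative sketch.
fn threshold_for_target(probs_per_image: &[Vec<f32>], target_avg_species: f64) -> f32 {
    // Average number of species (capped at 15) predicted with threshold `t`.
    let avg_count = |t: f32| -> f64 {
        let total: usize = probs_per_image
            .iter()
            .map(|p| p.iter().filter(|s| **s >= t).count().min(15))
            .sum();
        total as f64 / probs_per_image.len() as f64
    };
    let (mut lo, mut hi) = (0.0f32, 1.0f32);
    for _ in 0..30 {
        let mid = 0.5 * (lo + hi);
        if avg_count(mid) > target_avg_species {
            lo = mid; // too many species kept: raise the threshold
        } else {
            hi = mid; // too few species kept: lower the threshold
        }
    }
    0.5 * (lo + hi)
}
]]></preformat>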
        <p>Finally, manual optimizations were briefly explored through a detailed analysis of the species
prediction probabilities generated by the model and subsequent submissions. For example, if a species
frequently appeared within a transect, it was occasionally generalized across all images of that transect.
Conversely, species appearing only once (or only a few times) within a transect were deemed outliers
and subsequently excluded from the predictions. Such considerations were applied across various scales:
the entire dataset, individual transects, and images from the same location within a transect. Given that
these methods incorporate metadata rather than relying solely on visual data, performance metrics
with and without these optimizations are provided to illustrate the quality and generalizability of the
approaches outlined in this technical note.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing</title>
        <sec id="sec-4-1-1">
          <title>4.1.1. Interpolations</title>
          <p>Most of the test data (89 percent) is used to calculate the private F1 score, making it more robust than
the public F1 score from a statistical perspective. To perform a relevant analysis of the results, this
section mainly lists the private F1 statistics according to different methods and parameter sets.
Figure 5 illustrates the classification performance of different interpolations. Nearest neighbor
interpolation is the least effective, while Lanczos outperforms the others. It is notable that an F1 score of 0.07906
is achieved with grayscale images (Lanczos interpolation), compared to a score of 0.12077 with the
reference approach of the oficial script, which uses the three RGB channels.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. JPEG compression: YCbCr sub-sampling scheme and quality factor</title>
          <p>Figure 6 illustrates the classification performance for different YCbCr subsampling schemes. Two
"modes", or two maxima, are observed for different sets of parameters, summarized in Table 6. We
find that a JPEG quality of 100 percent does not improve multi-label classification performance for the
considered model. Performance improves as soon as this quality is slightly reduced. In particular, all
schemes perform remarkably well (private F1 score greater than 0.22) with a JPEG quality of 75 percent.</p>
          <p>The 4:1:1 subsampling scheme, which reduces chroma by a factor of four relative to luminance,
performs well for all JPEG qualities. In contrast, the 4:4:4 scheme, which does not subsample, requires
significantly reduced JPEG compression quality (75 percent) to perform well.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. High-resolution processing approaches</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Tiling method</title>
          <p>The tiling schemes are summarized in Table 7. The tiling scheme that provides the best performance
is the one with 91 tiles and a private F1 of 0.35013. Schemes with more tiles do not seem to improve
performance. Note that the above results do not use metadata and only evaluate visual detection
performance.</p>
          <p>Based on the results of the 91-tiles scheme, the detection thresholds were subsequently optimized by
studying the average number of species per transect. Manual optimization involves quickly analyzing
the predictions and species lists visually at different levels of image grouping, such as the entire dataset,
a transect, images from the same location within a transect, or an individual quadrat image. For
instance, frequent species within a group of test images were generalized and outliers were removed.
The performance results are summarized in Table 8.</p>
          <p>Optimizing the threshold per transect improves the private F1 score by approximately +0.007. Manual
optimization improves the private F1 score by an additional +0.0075 on top of this initial gain.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. High-resolution inference (VaMIS)</title>
          <p>
            High-resolution inference methods without fine-tuning achieve the performance shown in Table 9.
Among these methods, the hybrid attention method based on the SegmentAnything image encoder
attention scheme [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ] seems to be the most effective.
          </p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Preprocessing: performance delta on tiling and VaMIS approaches</title>
        <p>All of the performance results shown below were obtained using a purely visual model. The detection
threshold was the same for all of the images in the test dataset, and no metadata was used.</p>
        <p>Two types of preprocessing were compared: Standard (without JPEG compression) and with JPEG
compression after resizing. The Standard approach uses the Rust Image crate, which can be compared to
Python/Pillow, while the JPEG compression approach uses the MagickRust crate, which uses
Wand/ImageMagick and JPEG compression. Due to preprocessing, the performance delta indicated in Table 10
can be observed for the different tiling schemes.</p>
        <p>For the VaMIS models, the performance deltas are listed in Table 11.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>
        Based on the results, Lanczos interpolation performed the best out of those tested. This is a common
finding in signal theory [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. It was used to conduct JPEG compression tests after resizing images.
      </p>
      <p>Interestingly, a JPEG quality of 100 percent with no subsampling (4:4:4 scheme) does not yield optimal
performance in multi-label classification. The tests revealed two distinct "modes" in which the model
performed well: one is the 4:1:1 subsampling scheme with a JPEG quality of 94 percent; the other is the
4:2:2 scheme, which requires a quality of 85 percent.</p>
      <p>These two modes are related to the characteristics of the data used to train the model, which include
a quasi-unique 4:2:2 sampling scheme and two main levels of compression quality for the training images:
76 percent (for more than three images out of four) and 95 percent. These uniform JPEG parameters determine
the entropy, or variance, and more generally the distribution of the RGB color channels of the input
pixels during training. It is possible that the model has learned to process only images with these
characteristics.</p>
      <p>When a test image is resized from a high resolution, such as 3000x3000 pixels, to the model’s input size
of 518x518 pixels—effectively dividing each side by a factor of approximately six—the resulting image
exhibits a significantly higher equivalent JPEG quality compared to its original version. Furthermore,
during tile extraction at a scale of six, which involves minimal resizing, two-thirds of the test images
retain a compression quality of 98 percent (see Table 3). Overall, all inference methods tend to supply
the model with images of higher quality than the one used during its training phase (see Table 1).</p>
      <p>Figure 7 illustrates the impact of various compression methods on the YCbCr channels. The model
was primarily trained using images with the 4:2:2 scheme (bottom row, middle image). The RGB
images are almost indistinguishable to the human eye, following the principle of YCbCr subsampling.
However, the model may be more sensitive to these variations than the human eye. Thus, changing the
compression scheme between the training and test data constitutes an additional domain shift.</p>
      <p>
        The VisionTransformer appears to treat additional data in the input image, beyond the entropy
expected by the model, as noise, which reduces its ability to focus on classification details. Figure 8
compares the 12 attention maps of the final block of the model when it is provided with an uncompressed
image and an image compressed with the optimal parameter set, "Scheme 4:1:1 — Quality 94." The first
map shows slightly noisier attention, while the second map shows more focused attention on certain
points. From the model’s point of view, JPEG compression can be seen as a regularizing factor for the
image, as the model expects precise entropy on the input pixels. It is noteworthy that the benefits of
JPEG compression in a different deep learning context have been previously discussed [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ].
      </p>
      <p>
        From the point of view of the final classification task, the data lost during compression after resizing
is certainly informative, particularly for reducing confusion between species thanks to the details of
neighboring pixels contained in the high frequencies of the JPEG discrete cosine transform (DCT, [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]).
Training the model on images with various compression schemes, and also without any compression,
could make the model less constrained regarding the expected input entropy, improve multi-label
classification performance on quadrat images, and confirm the present analysis.
      </p>
      <p>Figure 9 allows for a visual comparison of different submissions resulting from different preprocessing
methods. Submissions with JPEG quality ranging from 91 to 95, as well as those with bilinear or bicubic
interpolation, are similar. The ImageMagick submission is somewhat isolated, indicating specific
preprocessing. The official submission is similar to preprocessing with the JPEG 4:2:2, 4:2:2v, and 4:4:4
schemes.</p>
      <p>The tiling approach achieves the best results for high-resolution analysis. It gives a private F1 of
0.35013 with a scheme involving six scales. The high-resolution inference approach (VaMIS) achieves a
private F1 of 0.22711 with the hybrid attention method, which alternates between global and window
attention. These results improved with preprocessing that applies JPEG compression after resizing, which
accounts for performance deltas of +0.03354 and +0.05511, respectively.</p>
      <p>Further research concerning the color space used during image resizing could be pursued to study
preprocessing more exhaustively. Some color spaces, such as LAB and LUV, use nonlinear transformations,
and interpolating in them can produce different images and different classification performance.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Preprocessing plays a significant role in the classification performance of computer vision models,
especially when analyzing high-definition quadrat images. In these images, plants may occupy only a
small area, making each pixel of the reduced image of great importance.</p>
      <p>The model is conditioned by the training data. Classic training on the ImageNet dataset requires
image normalization of the same name during the testing phase. It focuses on the first two moments
of the input image tensor. More generally, the model is affected by the entropy of the input tensors.
Therefore, special focus should be given to the format of the training images and their preprocessing.</p>
      <p>For instance, having a monolabel training dataset with original, high-resolution images from the
sensors before any resizing would make it possible to apply various preprocessing techniques during training,
such as different interpolation methods or JPEG compression parameters. This would teach the model to consider the
image details in the high frequencies of the JPEG DCT, improving visual classification performance and
making the model more robust to preprocessing.</p>
      <p>Additional tests could be conducted for high-resolution image analysis. The tiling approach with
scales beyond the native resolution of the quadrat images and/or with overlap should be explored
further. Indeed, tiles that framed the plants more precisely facilitated identification. Similarly,
single-shot high-resolution inference approaches (VaMIS) could benefit from fine-tuning the model’s input
size. As previously indicated for tiling, smaller windows for methods involving window attention would
refine the spatial framing of each plant within an image.</p>
      <p>Additionally, contest participants most likely focused on tiling schemes and methods for aggregating
predictions from these tilings. Some may have even tested and fine-tuned other deep learning models.
However, preprocessing optimization is not a research avenue usually prioritized to improve the
classification performance of a computer vision model. Therefore, we can expect the solutions proposed
by other participants to be combined with the preprocessing optimization proposed here to achieve
cumulative gains in classification performance.</p>
      <p>Finally, since preprocessing is relatively independent of the final multi-label classification task for
these high-definition images, the study presented in this paper may be applicable to other competitions
and computer vision use cases beyond the scope of the PlantCLEF 2025 competition.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
      <p>I would like to thank the Pl@ntNet team for organizing this competition, which enabled us to test new
methods and avenues of research. Some of these methods were unexpected and daring, but they were
all relevant and instructive.</p>
      <p>
        Thanks are also extended to Jean Senellart and Grégoire Leboucher from Quandela [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] for their
collaboration and for conducting experiments to improve the performance of the machine learning
model using quantum photonics.
      </p>
      <p>Gratitude is expressed to Rosine Choupe for facilitating the preprocessing experiment in Pl@ntNet
by assisting with field photography and sharing her equipment to diversify the sensors used.</p>
      <p>
        Finally, thanks are due to Guillaume Gomez for reviewing the source code and for providing accessible,
comprehensive, and up-to-date Rust language documentation [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], which greatly facilitates learning.
      </p>
    </sec>
    <sec id="sec-8">
      <title>8. Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used Mistral AI’s LeChat to check grammar and
spelling. The author also used DeepL Translation and DeepL Write to facilitate the translation of
this document. After using these tools, the author reviewed and edited the content as needed and takes
full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Supplementary Experiment in Collaboration with Quandela</title>
      <p>During the final week of the competition, a late collaboration was established with Quandela, a company
specializing in quantum photonics and offering machine learning solutions that leverage this technology.</p>
      <p>
        The Python libraries Perceval and Merlin [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] enable the design of quantum circuits and their seamless
integration into a PyTorch pipeline as an additional processing layer, similar to a conventional linear
layer (with a given number of input neurons and output neurons).
      </p>
      <p>Although the preliminary results were not conclusive at this stage, the experiments conducted as
part of the participation in PlantCLEF 2025, in collaboration with Quandela, are briefly outlined here to
facilitate their continuation or to inspire future research.</p>
      <p>The computations were performed according to the principles of quantum photonics: the quantum
circuit, known as the BosonSampler, consists of two main types of components: phase shifters and
beam splitters. The floating-point input values are encoded into the superposed input state given to the
circuit to perform quantum computations. The phase shifters and beam splitters then modify the
probability of observing a photon in one specific mode, creating a superposed output state different
from the input one. Those output probabilities are then converted back into floating-point values. This
is called strong simulation: everything is calculated on a classical computer, which is why the method is
described as "quantum-inspired".
      <p>It is also possible to run the same algorithm on an actual quantum computer, running the same
experiment multiple times (called shots) and measuring the output state, thus obtaining an estimation
of the output probabilities. See Figure 10 for an example of quantum circuit.</p>
      <p>A first approach aimed to analyze the 2,105 quadrat images and predict the species present among
the 7,806. The classification method used resizes the test images to the model’s input size of 518x518
pixels. The quantum layer was inserted between the image encoder and the linear classifier, with input
and output dimensions set to 768 floating-point values. The core idea behind this method is to create a
new embedding based on the Vision Transformer one, using quantum computation to explore new areas of the latent
space. This process involved semi-manual techniques, including manually testing diferent quantum
circuits, as the Quantum Layer was not fully optimized for GPU execution at that time, leaving some
room for improvement on this project. Indeed, optimizing such a high-dimensional quantum circuit
using gradient descent proved challenging due to the large volume of data to be processed, especially in
a limited amount of time. The linear classifier was fine-tuned using a limited set of training images (a
few hundred thousand), selected by capping the number of training images for each of the 7,806 species.</p>
      <p>
        A second approach was explored to reduce computational complexity and allow for multiple tests:
the study was restricted to the 15 quadrat test images from the LISAH-JAS transect and the 78 species
detected by the official public Python script submission (which processes images directly at a resolution
of 518x518 pixels). Two successive linear layers with dimensions 768 → 32 → 78 were trained (initialized
by SVD factorization of the weights of the initial model’s linear classifier) to limit the number of
parameters according to the bottleneck principle ([
        <xref ref-type="bibr" rid="ref39 ref40">39, 40</xref>
        ]). The second linear layer was then removed,
resulting in an image encoder that outputs deep features of size 32, containing the necessary information
to classify the 78 selected species. A quantum layer of size 32x32 was inserted, followed by a linear
classifier. The fine-tuning of the classifier was significantly accelerated. Several different quantum
circuit hyperparameters and network configurations were tested (Figure 11 shows one of them). However,
tests conducted over just three days did not yield conclusive results. A longer experimentation period
and deeper exploration of this new technology’s capabilities might lead to more definitive outcomes,
especially with the release of MerLin ([
        <xref ref-type="bibr" rid="ref30">30</xref>
]), a PyTorch-compatible framework developed by Quandela to
create hybrid quantum-classical algorithms in an accessible and optimized way. In particular, it is now
possible to perform gradient descent through a quantum layer. Also, previous theoretical research [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ]
has indicated that quantum photonics exhibits certain universality properties. It is anticipated that these
properties could enhance the expressive power of neural networks, thereby improving classification
performance.
      </p>
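      <p>For reference, a minimal sketch of the SVD-based bottleneck initialization described above is given
below. The 768 → 32 → 78 shapes follow the text, while the function name and the 32x32 stand-in layer
are illustrative and do not reproduce the actual implementation.</p>
      <preformat>
# Sketch: initialize a 768 -> 32 -> 78 bottleneck from a trained 78x768
# classifier by truncated SVD (illustrative, not the actual code).
import torch
import torch.nn as nn

def svd_bottleneck(weight: torch.Tensor, bias: torch.Tensor, rank: int = 32):
    """Factor a (78, 768) classifier into 768 -> rank and rank -> 78 layers."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)  # (78,78), (78,), (78,768)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]
    down = nn.Linear(weight.shape[1], rank, bias=False)   # 768 -> 32
    up = nn.Linear(rank, weight.shape[0])                  # 32 -> 78
    with torch.no_grad():
        down.weight.copy_(Vh)       # (32, 768)
        up.weight.copy_(U * S)      # (78, 32): up(down(x)) approximates weight @ x + bias
        up.bias.copy_(bias)
    return down, up

# Stand-in for the pretrained classifier restricted to the 78 retained species.
classifier = nn.Linear(768, 78)
down, up = svd_bottleneck(classifier.weight.data, classifier.bias.data)

# After fine-tuning, `up` is dropped and a 32x32 quantum layer (stand-in here)
# plus a new 32 -> 78 linear classifier are appended to the 32-d deep features.
head = nn.Sequential(down, nn.Linear(32, 32), nn.Linear(32, 78))
      </preformat>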
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H. E.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pauchard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stoett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Truong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Galil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Hulme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ikeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sankaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>McGeoch</surname>
          </string-name>
          , et al.,
          <article-title>Ipbes invasive alien species assessment: summary for policymakers</article-title>
          ,
          <source>IPBES</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Magurran</surname>
          </string-name>
          , Measuring biological diversity, John Wiley &amp; Sons,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Martellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of PlantCLEF 2025: Multi-species plant identification in vegetation quadrat images</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janoušková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Martellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of LifeCLEF 2025:
          <article-title>Challenges on species presence prediction and identification, and individual animal identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF)</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <source>Overview of PlantCLEF</source>
          <year>2024</year>
          <article-title>: multi-species plant identification in vegetation plot images, in: CEUR workshop proceedings</article-title>
          , volume
          <volume>3740</volume>
          <source>of CEUR workshop proceedings, Guglielmo Faggioli and Nicola Ferro and Petra Galuščáková and Alba</source>
          García Seco de Herrera, Grenoble, France,
          <year>2024</year>
          , pp.
          <fpage>1978</fpage>
          -
          <lpage>1988</lpage>
          . URL: https://hal.inrae.fr/hal-04806900.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Afouard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Lombardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Pl@ntnet app in the era of deep learning</article-title>
          ,
          <source>in: ICLR: International Conference on Learning Representations</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Lombardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          , T. T. Høye,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dyrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Adapting a global plant identification model to detect invasive alien plant species in high-resolution roadside images</article-title>
          ,
          <source>Ecological Informatics</source>
          <volume>89</volume>
          (
          <year>2025</year>
          )
          <article-title>103129</article-title>
          . URL: https://www.sciencedirect.com/science/article/pii/S1574954125001384. doi:10.1016/j.ecoinf.2025.103129.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          , in: F. Pereira,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          Weinberger (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>25</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Denker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hubbard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Jackel</surname>
          </string-name>
          , Backpropagation Applied to Handwritten Zip Code Recognition,
          <source>Neural Computation</source>
          <volume>1</volume>
          (
          <year>1989</year>
          )
          <fpage>541</fpage>
          -
          <lpage>551</lpage>
          . doi:10.1162/neco.1989.1.4.541.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , arXiv preprint arXiv:2010.11929 (2020).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers</article-title>
          ),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <year>2019</year>
          . URL: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Masked autoencoders are scalable vision learners</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>16000</fpage>
          -
          <lpage>16009</lpage>
          . URL: https://arxiv.org/abs/2111.06377.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oquab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darcet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Moutakanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Vo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szafraniec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khalidov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Haziza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Nouby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Assran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ballas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Galuba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Howes</surname>
          </string-name>
          , P.-Y. Huang,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rabbat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , G. Synnaeve,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mairal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Labatut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , P. Bojanowski,
          <article-title>Dinov2: Learning robust visual features without supervision</article-title>
          ,
          <year>2023</year>
          . arXiv:2304.07193.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Darcet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oquab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mairal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Vision transformers need registers,
          <year>2024</year>
          . arXiv:2309.16588.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>W.</given-names>
            <surname>Saeed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Omlin</surname>
          </string-name>
          ,
          <article-title>Explainable ai (xai): A systematic meta-survey of current challenges and future opportunities</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>263</volume>
          (
          <year>2023</year>
          )
          <article-title>110273</article-title>
          . URL: https://www.sciencedirect.com/science/article/pii/S0950705123000230. doi:10.1016/j.knosys.2023.110273.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Rust</surname>
          </string-name>
          ,
          <year>2010</year>
          . URL: https://www.rust-lang.org, version 1.87.0.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>B.</given-names>
            <surname>Stroustrup</surname>
          </string-name>
          ,
          <article-title>The C++ Programming Language</article-title>
          , 4th ed.,
          <source>Addison-Wesley</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Perkel</surname>
          </string-name>
          ,
          <article-title>Why scientists are turning to rust</article-title>
          ,
          <source>Nature</source>
          <volume>588</volume>
          (
          <year>2020</year>
          )
          <fpage>185</fpage>
          -
          <lpage>186</lpage>
          . doi:10.1038/d41586-020-03382-2.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Balasubramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Baranowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Burtsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Rakamarić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ryzhyk</surname>
          </string-name>
          ,
          <article-title>System programming in rust: Beyond safety</article-title>
          ,
          <source>in: Proceedings of the 16th Workshop on Hot Topics in Operating Systems</source>
          , HotOS '17,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , p.
          <fpage>156</fpage>
          -
          <lpage>161</lpage>
          . URL: https://doi.org/10.1145/3102980.3103006. doi:10.1145/3102980.3103006.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>L.</given-names>
            <surname>Mazare</surname>
          </string-name>
          , Candle,
          <year>2023</year>
          . URL: https://github.com/huggingface/candle, version 0.9.1.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>G. van Rossum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L. D.</given-names>
            <surname>Jr</surname>
          </string-name>
          .,
          <string-name>
            <surname>Python</surname>
            <given-names>Tutorial</given-names>
          </string-name>
          ,
          <year>2020</year>
          . URL: https://docs.python.org/3/tutorial/, accessed: 2025-06-04.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          , G. Chanan,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Köpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tejani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chilamkurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          ,
          <string-name>
            <surname>Pytorch:</surname>
          </string-name>
          <article-title>An imperative style, high-performance deep learning library</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1912.01703. arXiv:1912.01703.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gustineli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miyaguchi</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <article-title>Stalter, Multi-label plant species classification with self-supervised vision transformers</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.06298. arXiv:2407.06298.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chulif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Ishrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. L.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Patch-wise inference using pre-trained vision transformers: Neuon submission to PlantCLEF</article-title>
          <year>2024</year>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          . URL: https://arxiv.org/abs/2103.14030.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Exploring plain vision transformer backbones for object detection</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2203.16527. arXiv:2203.16527.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirillov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mintun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ravi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rolland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gustafson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Whitehead</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , W.-Y. Lo,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          , Segment anything,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2304.02643. arXiv:2304.02643.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Quandela</surname>
          </string-name>
          , Merlin - photonic
          <source>quantum machine learning framework</source>
          ,
          <year>2025</year>
          . URL: https://merlinquantum.ai/, accessed: 2025-06-12.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>R. d'Andrimont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Yordanov</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Martinez-Sanchez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Haub</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Buck</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Haub</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Eiselt</surname>
            ,
            <given-names>M. van der Velde</given-names>
          </string-name>
          ,
          <article-title>Lucas cover photos 2006-2018 over the eu: 874 646 spatially distributed geo-tagged close-up photos with land cover and plant species label</article-title>
          ,
          <source>Earth System Science Data</source>
          <volume>14</volume>
          (
          <year>2022</year>
          )
          <fpage>4463</fpage>
          -
          <lpage>4472</lpage>
          . URL: https://essd.copernicus.org/articles/14/4463/2022/. doi:10.5194/essd-14-4463-2022.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>S.</given-names>
            <surname>Foy</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. McLoughlin</surname>
          </string-name>
          ,
          <article-title>Utilising dinov2 for domain adaptation in vegetation plot analysis</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>K.</given-names>
            <surname>Turkowski</surname>
          </string-name>
          ,
          <article-title>Filters for common resampling tasks</article-title>
          , in: Graphics gems,
          <year>1990</year>
          , pp.
          <fpage>147</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Yip</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Britanak</surname>
          </string-name>
          , Discrete cosine transform: Algorithms, advantages, applications,
          <year>1990</year>
          . URL: https://api.semanticscholar.org/CorpusID:12270940.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <source>arXiv preprint arXiv: 1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rudra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          ,
          <article-title>FlashAttention: Fast and memory-efficient exact attention with IO-awareness</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2205.14135. arXiv:2205.14135.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Salamah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <source>Jpeg inspired deep learning</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2410.07081. arXiv:2410.07081.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Tutoriel rust,
          <year>2025</year>
          . URL: https://blog.guillaume-gomez.fr/Rust, accessed: 2025-06-11.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <year>2015</year>
          . arXiv:1512.03385.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhmoginov</surname>
          </string-name>
          , L.-C. Chen,
          <article-title>Mobilenetv2: Inverted residuals and linear bottlenecks</article-title>
          ,
          <year>2019</year>
          . arXiv:1801.04381.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>M.</given-names>
            <surname>Reck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zeilinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bertani</surname>
          </string-name>
          ,
          <article-title>Experimental realization of any discrete unitary operator</article-title>
          ,
          <source>Phys. Rev. Lett</source>
          .
          <volume>73</volume>
          (
          <year>1994</year>
          )
          <fpage>58</fpage>
          -
          <lpage>61</lpage>
          . URL: https://link.aps.org/doi/10.1103/PhysRevLett.73.58. doi:10.1103/PhysRevLett.73.58.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>