<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tighnari: Multi-modal Plant Species Prediction Based on Hierarchical Cross-Attention Using Graph-Based and Vision Backbone-Extracted Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haixu Liu</string-name>
          <email>liuhaixu1998@foxmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Penghao Jiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zerui Tao</string-name>
          <email>ztao0063@uni.sydney.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muyan Wan</string-name>
          <email>mwan3425@uni.sydney.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qiuzhuang Sun</string-name>
          <email>qiuzhuang.sun@sydney.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Sydney</institution>
          ,
          <addr-line>Camperdown Campus, Sydney, 2006 NSW</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Predicting plant species composition in specific spatiotemporal contexts plays an important role in biodiversity management and conservation, as well as in improving species identification tools. Our work utilizes 88,987 plant survey records conducted in specific spatiotemporal contexts across Europe. We also use the corresponding satellite images, time series data, climate time series, and other rasterized environmental data such as land cover, human footprint, bioclimatic, and soil variables as training data to train the model to predict the outcomes of 4,716 plant surveys. We propose a feature construction and result correction method based on the graph structure. Through comparative experiments, we select the best-performing backbone networks for feature extraction in both temporal and image modalities. In this process, we build a backbone network based on the Swin-Transformer Block for extracting features from temporal cubes. We then design a hierarchical cross-attention mechanism capable of robustly fusing features from multiple modalities. During training, we adopt a 10-fold cross-fusion method based on fine-tuning and use a Threshold Top-K method for post-processing. Ablation experiments demonstrate the improvements in model performance brought by our proposed solution pipeline. This work achieves a private leaderboard score of 0.36242 in the GeoLifeCLEF 2024 LifeCLEF &amp; CVPR-FGVC Competition, securing third place in the rankings (Team name: Miss Qiu).</p>
      </abstract>
      <kwd-group>
<kwd>Swin-Transformer</kwd>
        <kwd>Computer Vision Backbone</kwd>
        <kwd>Graph Feature Extract</kwd>
        <kwd>Hierarchical Cross-Attention Fusion Mechanism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Background and related literature</title>
        <p>Predicting the composition of plant species over segmented time and spatial scales plays an important
role in managing and protecting ecosystem biodiversity and improving species identification tools.
Therefore, the LifeCLEF lab of the CLEF conference and the FGVC11 workshop of CVPR jointly host
the GeoLifeCLEF 2024 competition centered around this task [1, 2].</p>
        <p>
We review the strategies of past winners in this competition. The fourth-place entry [3] in 2022 mentions a
"Patched" approach, where variables from eight different modalities provided that year are aligned into a
single image format of (256, 256, 3). These modalities are then processed separately by ResNet50 [4], and
the output features are concatenated and passed through a linear layer, with Top-K processing applied to
the outputs. In contrast, the second-place entry from the same year [5] abandons data from other
modalities and only uses remote sensing data, creating NDVI image-format features from NIR and RGB
data. They train multiple models based on ResNet50, DenseNet201, and Inception-v4, and fuse the logits.
The champion's strategy in 2023 [6, 7] introduces three backbone networks based on ResNet. The first
network solely extracts features from bioclimatic raster data, while the second and third networks each
include three backbones for extracting features from bioclimatic rasters, satellite images, and soil rasters,
with varying depths in these two networks. Finally, logits fusion is applied to these three models. At the
same time, they adopt a three-step training strategy to attempt to learn the information in PO. These
successful past entries and the official baseline significantly influence our model development.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Our method</title>
        <p>Our study uses survey data of 11,255 species to train a machine-learning model for this goal. Each
sample in the dataset comprises satellite imagery and time series [8, 9, 10] linked with geographical
coordinates, climate time series [11, 12], and other rasterized environmental data such as land cover,
human footprint, bioclimatic, and soil variables.</p>
<p>To effectively extract features across these modalities for accurate predictions, we propose the
following machine learning solution pipeline. First, we create a graph structure based on the
available data, treating Survey IDs as nodes. Nodes are connected based on whether they fall within the
same ecological niche (less than 10 kilometers apart) and share geographical and annual similarities.
Subsequently, we aggregate the labels of all adjacent nodes for each node, using these as new features
for that node. In our model, we use a Swin-Transformer-based method for extracting temporal features,
instead of the commonly used ResNet18 network. Additionally, we provide an optional replacement
for the Swin-Transformer Tiny, used in the above baseline model for image feature extraction, with
EfficientNet-B0. This substitution can significantly expedite model training with minimal impact on
prediction accuracy. We also develop a hierarchical cross-attention mechanism (HCAM) to fuse features
extracted from different modalities, which effectively uses information from multimodal data. Last, in
the post-processing steps, we improve the traditional Top-K rule for multi-task classification. Specifically,
we integrate a series of thresholds into the Top-K rule for model outputs (see Section 3.7). We then use
the graph structure and the above model outputs for the final species prediction. The details of this
novel solution pipeline are provided in Section 3.</p>
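<p>As an illustration of the Threshold Top-K idea, here is a minimal sketch (with hypothetical values of K and the threshold, not the authors' tuned settings): at most K top-scoring species are kept, but only those whose predicted probability also clears the threshold.</p>

```python
# Illustrative "Threshold Top-K" rule (function name and defaults are ours,
# not the authors' code): rank species by probability, keep at most k of
# them, and drop any whose probability does not exceed the threshold.
def threshold_top_k(probs, k=25, threshold=0.15):
    # probs: list of (species_id, probability) pairs for one survey
    ranked = sorted(probs, key=lambda p: p[1], reverse=True)
    return [sid for sid, p in ranked[:k] if p > threshold]
```

Unlike plain Top-K, this returns fewer than K species when the model is not confident, which matters because surveys vary widely in how many species they record.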
      </sec>
      <sec id="sec-1-3">
        <title>1.3. Contributions</title>
<p>We summarize our contributions as follows:
1. In preprocessing, we construct a graph structure over independent Survey IDs, using the labels of
nodes (Survey IDs) with adjacency relationships to generate new features. This graph structure
improves the final prediction results.
2. Our backbone network for visual features modifies the Swin-Transformer structure to extract
temporal characteristics. This strategy is shown to enhance model performance.
3. We design a hierarchical cross-attention mechanism to fuse features from different modalities,
which strategically mitigates overfitting.
4. Our proposed post-processing strategy combines the advantages of the threshold method and the
Top-K approach, also improving the final prediction accuracy.</p>
        <p>Our model, named Tighnari, is inspired by the Forest Ranger character in the popular open-world
game “Genshin Impact,” who is adept at identifying a variety of species in rainforests and is committed
to ecological conservation. Just as Tighnari ensures the health of the rainforest ecosystems, we aspire
for our model to accurately predict the composition of plant species in given spatiotemporal contexts
and make significant contributions to environmental protection. The model’s acronym, TIGH, stands
for T for Transformer, I for Image processing (computer vision) backbone, G for Graph feature extract,
H for Hierarchical cross-attention fusion mechanism.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Exploratory Data Analysis</title>
      <p>Exploratory Data Analysis (EDA) is essential for unveiling the motivations behind our choice of modeling
techniques. We divide our analysis into several steps, starting with the visualization of unstructured
data (shown in Figure 1 and Figure 2; both figures display time series of four climatic features for four samples).</p>
      <p>We find that time series data exhibit strong periodicity, which inspires us to consider whether folding
the time series according to its periodicity to transform 1D data into 2D data might facilitate more
efective feature extraction.</p>
      <p>Subsequently, we perform outlier detection in tabular data, followed by data cleaning and imputation
of missing values. We group the data by year and region to observe the distribution of features and the
correlations between them under different groupings.</p>
<p>The analysis reveals significant variations in feature distribution and correlations across different
regions, leading us to hypothesize that these variations could influence the distribution of species
(shown in Figure 3 and Figure 4). This hypothesis is confirmed by further examining the prevalence of
leading species in different regions. Additionally, we note that feature distribution and correlations do
not show significant differences between close years, but observable differences emerge across more
widely separated years. Hence, we conclude that sharing the same year and region is a prerequisite for
establishing an edge between two nodes (supporting label aggregation).</p>
<p>Next, we visualize the geographic locations of PA and PO Survey IDs to explore whether they exhibit
any clustering tendencies (shown in Figure 5).</p>
      <p>The visualization indicates a tendency for the survey locations to cluster. We analyze some of these
smaller clusters, calculating their radii, and estimate that the radius around each small cluster center
is approximately 10 kilometers. To prevent a node from being overwhelmed by an excessive number
of adjacent nodes, which could lead to a reduction in the variance of the aggregated feature vectors, we
introduce further constraints on edge creation: no edge would exist between nodes if their geographic
distance exceeded 10 kilometers.</p>
      <p>We then visualize the labels, observing the frequency of species occurrences and the frequency
distribution of the number of species recorded per survey (shown in Figure 6). This analysis aids us in
setting a reasonable range for K in the Top-K process. It can be observed that the optimal value of K
should be between 0 and 100.</p>
      <p>Finally, using survey IDs as nodes, we construct a graph based on the aforementioned rules. We
examine the distribution of each node’s degree (the sum of edge weights) on the graph. We find the
distribution to be highly uneven, which could potentially degrade the quality of aggregated features
on the graph. Direct normalization of the aggregated feature vectors by the degree could excessively
diminish the values corresponding to rarer species. Inspired by the Attention mechanism [13], which
normalizes attention scores using a square root transformation, we adopt this technique to use the
square root of the degree as the divisor for normalizing aggregated feature vectors.</p>
      <p>Through visualization, we discover that this approach allows the divisor used to normalize the feature
vectors of each node in the graph to approximate a normal distribution, and also results in a smaller
range of divisors (shown in Figure 7).</p>
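<p>The square-root normalization described above can be sketched as follows (a minimal reconstruction; the function name is ours):</p>

```python
import math

# Square-root degree normalization: divide each node's aggregated label
# vector by the square root of its degree (the sum of its edge weights)
# rather than by the degree itself, so that values corresponding to rarer
# species are not diminished as strongly.
def normalize_aggregate(aggregated, degree):
    return [v / math.sqrt(degree) for v in aggregated]
```

Dividing by sqrt(degree) instead of degree shrinks the divisor's range, which is exactly the effect observed in Figure 7.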
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this section, we provide a detailed description of each step in the modeling process and the motivations
behind them.</p>
      <sec id="sec-3-1">
        <title>3.1. Table Data Cleaning and Missing Value Imputation</title>
        <p>Initially, we group the metadata for both PA and PO by Survey ID. We then replace outliers in the
grouped metadata of PA and the test set with null values, and impute missing values with the mean.
Categorical data are then one-hot encoded. Subsequently, for the label (speciesId), we encode each
element according to its corresponding number using 0-1 encoding.</p>
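<p>The 0-1 label encoding can be sketched as follows (an illustrative reconstruction, not the authors' code):</p>

```python
# Multi-hot (0-1) encoding for speciesId: each survey's label becomes a
# binary vector over all 11,255 species, with a 1 at every observed
# species index and 0 elsewhere.
def encode_labels(species_ids, num_species=11255):
    vec = [0] * num_species
    for sid in species_ids:
        vec[sid] = 1
    return vec
```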
<p>Next, we access all tabular data for the training and test sets from the EnvironmentalRasters folder.
We notice that the human footprint column contains −1 and extreme outliers. Based on the observed
data distribution, we set all values greater than 255 to 255 and those less than 0 to 1. Subsequently, we
merge all tables by Survey ID, replacing all infinite and infinitesimal values with nulls, and again impute
all missing values with the mean. Finally, the training sets of PA and the test sets are merged again
with the cleansed EnvironmentalRasters data by Survey ID to produce the cleaned tabular modality
data. In the subsequent sections, we represent the features extracted from tables as F.</p>
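<p>The human-footprint cleaning rule above can be sketched in pure Python (a hedged illustration; the actual pipeline presumably operates on tabular dataframes):</p>

```python
# Clip and impute one human-footprint column: values greater than 255 are
# capped at 255, values below 0 are raised to 1 (mirroring the rule in the
# text), and remaining missing entries (None) are imputed with the mean of
# the known values.
def clean_footprint(values):
    clipped = [None if v is None else (255 if v > 255 else (1 if 0 > v else v))
               for v in values]
    known = [v for v in clipped if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in clipped]
```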
        <p>
Following this, we clean the time series data by folding it into cubes. In the baseline method, all null
values in the tensor are replaced with zeros. We change this replacement value to the mean of the tensor.
Additionally, as the Swin-Transformer cannot accept prime numbers as the dimensions of input image
matrices, we trim the shapes of the two sets of time series cubes from (4, 19, 12) and (6, 4, 21) to (4, 18, 12)
and (6, 4, 20), respectively. This is justified because the last year in the series already has many missing
values, so trimming directly does not result in significant information loss.
        </p>
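<p>This cube preparation step can be sketched as follows (our reconstruction with numpy; we assume the middle axis of the (4, 19, 12) cube is the year axis being trimmed):</p>

```python
import numpy as np

# Prepare one time-series cube: replace NaNs with the tensor mean (rather
# than zero, as in the baseline), then trim the prime-sized year axis,
# e.g. (4, 19, 12) down to (4, 18, 12), dropping the mostly-missing last year.
def prepare_cube(cube):
    filled = np.where(np.isnan(cube), np.nanmean(cube), cube)
    return filled[:, :18, :]  # trim the 19-year axis to 18
```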
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Graph Construction and Utilization</title>
<p>Graph construction and utilization are highlights of our work. To establish graph relationships among
samples, we base our approach on two fundamental assumptions. First, we assume a clustering
tendency in the spatial distribution of individual species, meaning that if a species appears in nodes
surrounding a particular Survey ID, the likelihood of its occurrence at that Survey ID increases. Second,
we hypothesize that within the same ecological environment, there is a correlation between the
spatiotemporal distributions of different species, implying that samples (nodes) close in geographical
location and year exhibit higher similarity in species composition. Our earlier data visualization efforts
validated these assumptions.</p>
<p>Our graph construction process is two-fold. The first step involves establishing a base graph for
aggregating labels from adjacent nodes. Our rationale and actions are as follows: visualization revealed
significant differences in ecological characteristics and species distribution among different regions,
and even within the same region across widely separated years. Consequently, we determine that nodes
being in the same year and region is a prerequisite for an edge (supporting label aggregation).</p>
<p>However, simply adding edges between nodes of the same year and region can result in highly
consistent feature vectors after label aggregation (with minimal variance), and the varying numbers
of nodes across different regions and years can lead to significant imbalances in feature vector values,
thus affecting feature quality. To counter this, we further restrict the edge creation conditions.
Visualization showed clustering tendencies in survey locations; we identify some smaller clusters and
calculate their radii, with an estimated radius of about 10 kilometers around each cluster center. Therefore,
we stipulate that nodes over 10 kilometers apart are not connected by an edge. Moreover, we
want the edge weight between two nodes to increase as their distance decreases, hence we set the
edge weight as the maximum allowable distance (10 kilometers) minus the actual distance between the
two nodes. Based on these criteria, we add an edge for every pair of nodes that meets the conditions,
thus forming the graph.</p>
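<p>Under these rules, an edge weight can be computed from the great-circle distance as sketched below (our own illustration using the standard haversine formula, with an Earth radius of about 6371 km):</p>

```python
import math

# Edge rule sketch: compute the great-circle distance between two surveys
# from their latitude/longitude; nodes farther apart than 10 km get no edge
# (None), otherwise the weight is 10 minus the distance, so closer nodes
# receive larger weights.
def edge_weight(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    dist_km = 6371 * 2 * math.asin(math.sqrt(a))  # haversine distance
    return None if dist_km > 10 else 10 - dist_km
```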
        <p>Next, we need to establish rules for generating a graph feature vector (GFV) for each node through
label aggregation from adjacent nodes. We convert each node’s neighboring node labels into 0-1 encoded
vectors, where the presence of a species is marked as 1 and absence as 0, resulting in each node’s label
being a 11255-dimensional vector. For any node, the weighted sum of all its neighboring nodes’ label
vectors and corresponding edge weights constitutes the node’s GFV. To avoid the severe imbalance in the
numerical distribution of GFV mentioned in the previous section, we adopt a square-root normalization
trick similar to that used in attention mechanisms, using the square-root of the degree as the divisor for
the GFV. Additionally, to prevent global imbalances in the numerical distribution of GFVs among all
nodes, we normalize all nodes’ GFVs and reassign them accordingly.</p>
        <p>Finally, we add each Survey ID from the test set as nodes to this graph, determining whether to create
edges with existing training set nodes based on the established criteria (note that edges should not be
generated between test set nodes to avoid affecting the statistics of node degrees). We then aggregate
the GFVs for inference on the test set nodes. The specific calculation formulas are as follows:
GFV_c = ( Σ_{a=1}^{n} W_{ca} × L_a ) / √(D_c), (1)
W_{ca} = 10 − 6371 × θ_{ca}, (2)
where c represents the ID of the current node, a represents the ID of a node connected to it, W_{ca}
represents the edge weight between these two nodes, L_a represents the label vector of the adjacent
node, D_c represents the degree of the current node, and θ_{ca} represents the radian central angle
between the two points, computed from their latitudes and longitudes (with 6371 km as the Earth's radius).
        </p>
<p>In the second step, we clone the graph with its nodes, edges, weights, years, regions, coordinates,
and labels into a new graph, marking training and test set nodes with category labels. We then add
auxiliary nodes from the PO metadata table, grouped by Survey ID, and create edges with qualifying
test nodes according to the first step's edge creation conditions. For each test node, we select
the N1 adjacent training nodes with the highest weights (closest geographical locations) and identify
species appearing more than M1 times, generating a list. We then select the N2 auxiliary nodes with the
highest weights, identify species appearing more than M2 times, and compile another list. Merging
and deduplicating these two lists provides a high-probability species list for each test node, used for
correcting model output in post-processing. The default settings are N1 = 5, M1 = 4, N2 = 10, M2 = 8.
Algorithm 2: Complete Graph Construction and Species List Compilation with Sorting
1: Input: graph G with all attributes
2: Output: enhanced species lists for test nodes
3: procedure CloneGraph(G)
4:   G′ ← clone of G with nodes, edges, weights, years, labels
5:   Label nodes in G′ as training or test based on category labels
6: end procedure
7: procedure LabelAuxiliaryNodes(G′, PO)
8:   for each node in PO grouped by Survey ID do
9:     Apply edge creation conditions from step 1
10:    if the node qualifies then
11:      Create edges to test nodes in G′
12:    end if
13:  end for
14: end procedure
15: procedure GenerateSpeciesLists(G′)
16:  for each test node t in G′ do
17:    Initialize species count maps count_train and count_aux
18:    ◁ Sort and process adjacent training nodes
19:    Sort all adjacent training nodes of t by weight in descending order
20:    T ← top 5 nodes from the sorted list
21:    for each node n in T do
22:      for each species s in n.labels do
23:        Increment count_train[s] by 1
24:      end for
25:    end for
26:    L_train ← species in count_train where count > 4
27:    ◁ Sort and process auxiliary nodes
28:    Sort all eligible auxiliary nodes by weight in descending order
29:    A ← top 10 nodes from the sorted list
30:    for each node n in A do
31:      for each species s in n.labels do
32:        Increment count_aux[s] by 1
33:      end for
34:    end for
35:    L_aux ← species in count_aux where count > 8
36:    ◁ Merge and deduplicate lists
37:    L_t ← merge and remove duplicates from L_train and L_aux
38:  end for
39: end procedure
40: procedure PostProcessOutput(G′)
41:  for each test node t in G′ do
42:    Use L_t to correct the model output
43:  end for
44: end procedure
45: Call CloneGraph(G)
46: Call LabelAuxiliaryNodes(G′, PO)
47: Call GenerateSpeciesLists(G′)
48: Call PostProcessOutput(G′)</p>
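<p>The species-list step of Algorithm 2 can be sketched in Python as follows (identifiers and data layout are our own illustration, not the authors' code):</p>

```python
from collections import Counter

# Count species over the top-n highest-weight neighbours of a test node and
# keep those appearing more than m times.
def frequent_species(neighbours, n, m):
    # neighbours: list of (weight, species_set) pairs, in any order
    ranked = sorted(neighbours, key=lambda x: x[0], reverse=True)[:n]
    counts = Counter(s for _, species in ranked for s in species)
    return {s for s, c in counts.items() if c > m}

# Merge the training-neighbour list and the auxiliary (PO) list, using the
# default settings from the text: top 5 training nodes with count above 4,
# top 10 auxiliary nodes with count above 8. Set union deduplicates.
def high_probability_list(train_nb, aux_nb):
    return frequent_species(train_nb, 5, 4) | frequent_species(aux_nb, 10, 8)
```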
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Temporal Feature Extraction</title>
        <p>Another highlight of our work is the development of a temporal feature extraction method based on
the Swin-Transformer Block. This approach is inspired by Haixu Wu et al., who used a CNN-based
Inception backbone network to extract features from folded time series data in TimesNet [14]. Wu and
colleagues argue that this method, compared to traditional time series neural networks, is not only better
at capturing multi-scale sequential relationships but also has a stronger capability for spatio-temporal
information fusion. It shows superior performance across various time series analysis tasks and offers
more efficient training and inference. Our goal in processing time series is to obtain higher quality
features that are more conducive to modality fusion, rather than making better predictions about future
time points. Therefore, we believe that using a visual backbone network to process time series cubes
should yield better feature extraction results than traditional time series models.</p>
        <p>In the current research on deep learning technology, whether for image processing or time series
prediction tasks, methods based on Transformers are considered superior to those based on CNNs.
Consequently, we experiment with a backbone network specially designed for extracting temporal
features using Swin-Transformer Blocks and Vision Transformer Blocks.</p>
        <p>
Taking Swin-Transformer as an example, for time series cubes cropped to sizes (4, 18, 12) and (6, 4, 20),
we set the Patch size to (3, 3) and (2, 5), and the Window size to (3, 2) and (2, 3), respectively. We stack
a Swin-Transformer Stage with a depth of 2 and 12 attention heads and another with a depth of 6 and
24 attention heads to create the backbone network for handling our specific time series cubes. The
attention function can be defined mathematically by the equation:
        </p>
<p>Attention(Q, K, V) = SoftMax(QKᵀ / √d + P) V,
where Q represents the query matrix, K represents the key matrix, V represents the value matrix, d
is the dimensionality of the keys and queries, typically used for scaling, and P represents the positional
encoding for time sequences. Unlike the classic Swin-Transformer, where a two-dimensional vector
indicates the absolute position of patches in an image, here we employ a one-dimensional vector to
represent the position of each patch in a flattened sequence. This modification better adapts to
temporal tasks. Apart from that, the meanings of other symbols and formulas are the same as those in
the classic Swin-Transformer, and are not reiterated here [15].</p>
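<p>A minimal numpy sketch of this attention, with the 1-D positional term P simplified to a fixed additive bias, is as follows (an illustration, not the trained model):</p>

```python
import numpy as np

# Scaled dot-product attention with an additive positional term P, as in the
# equation above: scores are QK^T / sqrt(d) + P, normalized row-wise with a
# numerically stable softmax, then applied to V.
def attention(Q, K, V, P):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + P
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```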
        <p>We depart from the standard practice of stacking four stages in the Swin-Transformer backbone
network because the dimensions of the time series cubes are too small. With each stacked stage, the
dimensions for the subsequent stage are halved. Moreover, the Swin-Transformer Block does not accept
odd dimensions for input feature maps, making two blocks the most logical configuration for our
current time series cube dimensions.</p>
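<p>The dimension arithmetic behind this two-stage limit can be checked directly (our illustration):</p>

```python
# Patch-grid arithmetic for the (4, 18, 12) cube: a (3, 3) patch size yields
# a 6 x 4 grid of patches, and one patch-merging step halves that to 3 x 2.
# A further halving would require even dimensions, so a third stage cannot
# be stacked.
def patch_grid(h, w, patch):
    return h // patch, w // patch

grid = patch_grid(18, 12, 3)              # 6 x 4 patches
merged = (grid[0] // 2, grid[1] // 2)     # 3 x 2 after patch merging; 3 is odd
```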
        <p>The depths of the two stacked blocks are chosen to be 2 and 6, corresponding to the depths of the
second and third stages in the classic Swin-Transformer backbone network. This configuration means
that they extract shallow and deep features, respectively. This 1:3 depth ratio has also been adopted by
subsequent backbone networks, such as ConvNeXt [16].</p>
        <p>
          We select 12 and 24 as the attention head counts for the two stages, following the counts used in the
third and fourth stages of the classic Swin-Transformer backbone network. Increasing the number of
attention heads allows the model to independently capture features across more subspaces. Since the
feature map sizes entering the third and fourth stages in the original model are already quite small,
similar to the size of our time series cubes, having many attention heads does not overly increase the
computational burden. Therefore, we use as many attention heads as possible to more comprehensively
extract features from the time series cubes. The input and output representations of the final model are
F_T = Temporal-Swin-T(T), (3)
where T ∈ R^(S × Y × C), (4)
S represents the number of seasons or months, Y represents the number of years, C represents the
number of channels, and F_T are the features after Swin-Transformer processing.
        </p>
<p>We give a comparison diagram to reveal the differences and connections between our model and
the classic Swin-Transformer (shown in Figure 9).</p>
        <p>Based on this rationale, we also design a Vision Transformer (ViT) backbone network for extracting
temporal information. To validate our approach, we design rigorous comparative experiments using
our new temporal feature extraction network to replace the ResNet18 [4] used in the baseline for
extracting time series features. Given the small size of the time series cubes relative to images, we
primarily choose small backbone networks to prevent overfitting and gradient vanishing.</p>
        <p>Our comparative backbone networks include the baseline’s ResNet18 [4], the improved version
of Inception, Xception41 [17] from TimesNet [14], the CNN-based lightweight backbone networks
EfficientNet-B0 [18] and MobileNetV3 [19], as well as our designed Swin-Transformer and Vision
Transformer backbone networks for time series. The final experimental results demonstrate that our
custom-designed temporal feature extraction networks perform optimally, with Swin-Transformer
showing the fastest gradient descent and the best results on the private leaderboard.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Image Feature Extraction</title>
        <p>Although this is a multi-task classification problem, the low resolution of satellite images clearly does
not allow for accurate prediction of plant species in a given region. Therefore, the primary motivation
for processing images is still to extract high-quality features that are conducive to modality fusion. In
the baseline, the image feature extraction network employed is the tiny version of Swin-Transformer,
which was presented as a Best Paper at ICCV 2021 [15]. We note that most models are trained with
input sizes of (224,224) or (384,384). To better utilize the weights preserved in the pre-trained model
and retain the original information carried by the satellite images, we resize the satellite images from
(128,128) to (224,224) before inputting them into the pre-trained model for fine-tuning. Experimental
evidence shows that this adjustment significantly enhances the model’s performance.</p>
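<p>The resize step can be illustrated as follows (nearest-neighbour upsampling shown for simplicity; the actual pipeline presumably uses a library interpolation):</p>

```python
import numpy as np

# Map a (128, 128, C) satellite patch to the (224, 224, C) input expected by
# the pre-trained backbone, using nearest-neighbour index mapping: each
# output pixel copies the source pixel at the proportionally scaled position.
def resize_nearest(img, size=224):
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]
```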
<p>To explore whether there are better alternatives, we conduct comparative experiments with other
models such as EfficientNet-B0 [18], ConvNeXt-Base [16], and ViT-Base [20], all pre-trained with an
input size of (224,224). The choice of ConvNeXt-Base and ViT-Base is based on their status, along
with Swin-Transformer, as current state-of-the-art (SOTA) computer vision backbone networks, and
they are comparable to the tiny version of Swin-Transformer in terms of the number of parameters.
EfficientNet-B0 is selected partly because it is one of the most powerful feature extraction networks
among traditional CNN-based backbones, seen as a superior alternative to the ResNet scheme.
Moreover, EfficientNet-B0 is exceptionally lightweight and converges quickly, both in terms of
parameter count and training time, allowing for significant optimization of the overall model
training and inference time without substantial performance loss.</p>
        <p>Our comparative experiments reveal that using the ResNet18 scheme for both temporal and image
feature extraction achieved the highest accuracy. However, using Swin-Transformer for temporal feature
extraction and EfficientNet-B0 for image feature extraction reduced training time by approximately 75%
without a significant decrease in accuracy, thanks to faster per-epoch training durations and overall
faster convergence. The input image is I ∈ R^(H × W × C).</p>
        <p>
          F_I = Swin-Transformer-Tiny(I), (5)
where F_I is the output feature map, H represents the height of satellite images, W represents the
width, and C represents the number of channels.
        </p>
      </sec>
      <sec id="sec-3-5">
<title>3.5. Hierarchical Cross-Attention Fusion Mechanism (HCAM)</title>
<p>Another significant highlight of our work is the introduction of a hierarchical cross-attention fusion
mechanism to address the challenge of efficiently fusing feature vectors with varying information
densities extracted from different modalities. In the aforementioned steps, we have extracted information
(T, I, F′, G′) from four modalities: time-series modal features (T), satellite imagery modal features (I), and
tabular modal features (F), as well as the graph modal features (G) we constructed and extracted ourselves.
F′ and G′ represent F and G after being processed by fully connected layers.</p>
        <p>In terms of information density, satellite imagery modal features are the most dense because the
backbone network essentially compresses information carried by multiple channels of an image into a
limited set of features for classification mapping via a fully connected layer. Time-series and tabular
modal features are less dense, as in our model, we attempt to map them to feature vectors of the same
or even higher dimensions. Graph modal features are the sparsest; although we aggregate features from
different nodes, the majority of elements in a graph feature vector relative to all 11,255 dimensions are
still marked as 0.</p>
        <p>Initially, we try to use concatenation for modality fusion similar to the baseline, but due to the high
dimensionality and sparsity of the graph feature vectors, the model’s loss reduction process is unstable.
Consequently, we decide to use cross-attention, more commonly seen in multimodal learning, to attempt
fusion of these modalities. However, cross-attention supports only the fusion of two modalities at a
time, requiring six uses for pairwise fusion among four modalities, and still necessitating concatenation
to integrate each cross-attention output. This not only increases computational overhead but also fails
to ensure a controllable reduction in training loss. After multiple adjustments using cross-attention, we
determine the optimal modality fusion structure, with specific operations and motivations as follows:</p>
        <p>Firstly, the features extracted from the satellite imagery modal are the densest in information.
However, observations of the raw data reveal that satellite images primarily provide features of the
landscape and vegetation cover, such as color and density of foliage, at the location of the current
SurveyID. These features may map to higher-order latent features such as seasonal climate and ecological
environment. Meanwhile, the time-series features record climate characteristics and vegetation changes
of the area, and the tabular features mainly document the ecological environment characteristics of the
area. We consider the time-series and tabular features as two sets of queries, querying keys of climate
and vegetation change features, and ecological environment features respectively, both derived from the
image features through two parallel linear layers. This setup allows the calculation of attention scores for
the time-series and image modalities on climate and vegetation changes, and for the image and tabular
modalities on ecological features. The outputs of the cross-attention from these two groups are then
concatenated after being calculated from two sets of values mapped from the image features through
two parallel linear layers. This concatenated output serves as the final output of the cross-attention
calculation for these three modalities. Simultaneously, when the features of the three modalities are
fed into this cross-attention module, they are concatenated and mapped to the same dimension as the
output of the cross-attention through a linear layer serving as a cutoff, and added together, forming
a residual connection. This addition enhances the robustness of the cross-attention module during
training. The cross-attention (CA) function can be defined mathematically by the equations:
Q1 = T × W_{Q1}, Q2 = S′ × W_{Q2},
K1 = I × W_{K1}, K2 = I × W_{K2},
V1 = I × W_{V1}, V2 = I × W_{V2},</p>
        <p>
          O1 = Softmax(Q1 K1ᵀ / √d1) V1, O2 = Softmax(Q2 K2ᵀ / √d2) V2,
O = Concat(O1, O2), C2 = Concat(T, I, S′),
CA = O + Linear(C2).
(
          <xref ref-type="bibr" rid="ref6">6</xref>
          )
In our initial concept, we consider using the graph modal as the primary modality, given that its features
are directly aggregated from the labels of adjacent nodes. Based on the assumptions mentioned at the
beginning of section 3.2, these features should closely approximate the current node’s label. Therefore,
we intend to use the features from other modalities to correct the graph modal, aiming to achieve higher
accuracy. However, during training, the difference in information density between the graph modal
features and other modal features is too great. Whether through direct concatenation, mapping three
modalities’ features to the graph feature vector’s dimension for addition, or through cross-attention, it
is ineffective in guiding the graph feature modal to generate accurate labels for prediction samples
(nodes).
        </p>
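        <p>Under the symbol assignments above, a minimal PyTorch sketch of the trimodal cross-attention of Eq. (6) is given below, treating each modality feature vector as a single token; all dimensions and layer names are illustrative assumptions, not the authors' exact configuration.</p>

```python
# Trimodal cross-attention with a residual shortcut (sketch of Eq. 6).
import math
import torch
import torch.nn as nn

class TrimodalCA(nn.Module):
    def __init__(self, d_t, d_i, d_s, d_attn):
        super().__init__()
        self.q1 = nn.Linear(d_t, d_attn)   # Q1 from time-series features T
        self.q2 = nn.Linear(d_s, d_attn)   # Q2 from tabular features S'
        self.k1 = nn.Linear(d_i, d_attn)   # two parallel key maps from image I
        self.k2 = nn.Linear(d_i, d_attn)
        self.v1 = nn.Linear(d_i, d_attn)   # two parallel value maps from image I
        self.v2 = nn.Linear(d_i, d_attn)
        # shortcut: Concat(T, I, S') mapped to the CA output dimension
        self.shortcut = nn.Linear(d_t + d_i + d_s, 2 * d_attn)

    def forward(self, t, i, s):
        # treat each modality vector as a length-1 token sequence
        t1, i1, s1 = t.unsqueeze(1), i.unsqueeze(1), s.unsqueeze(1)
        q1, q2 = self.q1(t1), self.q2(s1)
        k1, k2 = self.k1(i1), self.k2(i1)
        v1, v2 = self.v1(i1), self.v2(i1)
        d = q1.size(-1)
        o1 = torch.softmax(q1 @ k1.transpose(-1, -2) / math.sqrt(d), -1) @ v1
        o2 = torch.softmax(q2 @ k2.transpose(-1, -2) / math.sqrt(d), -1) @ v2
        o = torch.cat([o1, o2], dim=-1)            # O = Concat(O1, O2)
        c = torch.cat([t1, i1, s1], dim=-1)        # C2 = Concat(T, I, S')
        return (o + self.shortcut(c)).squeeze(1)   # residual connection

m = TrimodalCA(d_t=4, d_i=6, d_s=5, d_attn=8)
out = m(torch.randn(3, 4), torch.randn(3, 6), torch.randn(3, 5))
```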
        <p>In ecology, the composition of vegetation species in a specific spacetime is often determined by
factors such as climate conditions, ecological environments, and vegetation changes. Considering
that the species appearing in a node’s adjacent nodes can represent these three major features of the
node’s spacetime, albeit in a too sparse representational form, we decide to reduce the dimensionality
of the graph feature vector. We compress it to the same dimension as the concatenated vector of the
other three modalities, then perform another multi-head cross-attention (MHCA) calculation. Here, the
concatenated vector of the other three modal features serves as the Query, with the compressed graph
feature vector being mapped as Key and Value by two parallel linear layers, calculating the multi-head
cross-attention and outputting. Our rationale for this choice is that the concatenated vector of the three
modal features represents the measured characteristics of climate conditions, ecological environment,
and vegetation changes in the current spacetime, while the compressed graph feature vector represents
the observational results of these characteristics on the plant species combinations we are focusing on.
We aim to reveal the causal relationships between them through cross-attention. The Multi-head cross
attention function can be defined mathematically by the equations:
G′ = Linear(G), C = Concat(T, I, S′),
Q = C × W_Q, K = G′ × W_K, V = G′ × W_V,
MHCA(Q, K, V) = Softmax(Q Kᵀ / √d) V.</p>
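        <p>A minimal sketch of this second-stage multi-head cross-attention follows, using torch.nn.MultiheadAttention, whose internal key/value projections play the role of the two parallel linear layers; the model dimension, head count, and batch size are illustrative assumptions.</p>

```python
# Second-stage MHCA: Concat(T, I, S') as Query, compressed graph vector as Key/Value.
import torch
import torch.nn as nn

d_model, n_heads, B = 16, 4, 2
compress = nn.Linear(11255, d_model)      # compress the sparse 11,255-dim graph vector
mhca = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

c = torch.randn(B, 1, d_model)            # stand-in for Concat(T, I, S') at d_model
g = compress(torch.randn(B, 11255)).unsqueeze(1)  # compressed graph features G'
out, _ = mhca(query=c, key=g, value=g)    # MHCA(Q, K, V)
```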
        <p>Ultimately, we concatenate the output of this multi-head cross-attention with the output of the
trimodal cross-attention module mentioned previously, organizing it into the final output through a fully
connected layer.</p>
        <p>
          O_final = Linear(Concat(MHCA, CA)),
(
          <xref ref-type="bibr" rid="ref7">7</xref>
          )
output = Sigmoid(O_final).
(
          <xref ref-type="bibr" rid="ref8">8</xref>
          )
This approach significantly improves the model’s performance during training. Observing the loss
curves for the training and validation sets during the training process, it is evident that the model’s
overfitting was markedly reduced, and the loss on the validation set could decrease further.
        </p>
        <p>Throughout the process, we use two cross-attention layers, performing three cross-attention
calculations. Initially, we select the image modal feature with the highest information density and calculate
cross-attention with the time-series and tabular modal features, which have relatively high information
densities. Subsequently, we calculate the cross-attention between the features of the three modalities and
the compressed graph modal feature in one go. The selection of modal features for each cross-attention
layer reflects a hierarchical relationship, thus we name this approach the Hierarchical Cross-Attention
Fusion Mechanism. The schematic diagram of HCAM is as follows:</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Mixup + 10-Fold Cross Fusion Training Strategy</title>
        <p>We adopt the Mixup training strategy provided in the baseline, which is a very common method for
data augmentation during training. Specifically, this method involves randomly shuffling the order
of training data and labels in the current batch and then performing a weighted addition with the
unshuffled training data and labels. The input matrices T, I, S, and G, as well as the label matrix Y,
are shuffled to create T̃, Ĩ, S̃, G̃, and Ỹ:
T_mix = λT + (1 − λ)T̃,
I_mix = λI + (1 − λ)Ĩ,
S_mix = λS + (1 − λ)S̃,
G_mix = λG + (1 − λ)G̃,
Y_mix = λY + (1 − λ)Ỹ.</p>
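        <p>The Mixup step above can be sketched as follows; sampling λ from a Beta(α, α) distribution is the standard Mixup choice and an assumption here, since the paper does not state how λ is drawn.</p>

```python
# Mixup across all input modalities and the multi-hot label matrix with one
# shared shuffle, so inputs and labels stay aligned.
import torch

def mixup(tensors, alpha=0.4):
    """tensors: list of batch-aligned matrices (e.g. T, I, S, G, Y)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # λ ~ Beta(α, α)
    perm = torch.randperm(tensors[0].size(0))                     # shared shuffle
    return [lam * x + (1.0 - lam) * x[perm] for x in tensors]

T = torch.randn(8, 10)                          # toy time-series features
Y = torch.randint(0, 2, (8, 5)).float()         # toy multi-hot labels
T_mix, Y_mix = mixup([T, Y])
```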
        <p>This approach enhances the robustness of the training process, improves the generalization
performance of the model, smooths out the distribution of samples across different categories, and makes
originally sparse labels relatively dense.</p>
        <p>
          Moreover, to fully utilize the training data and reduce model overfitting, we employ a ten-fold
cross-fusion technique to train the model. The concept of ten-fold cross-fusion is an improvement
over ten-fold cross-validation. The specific operation involves dividing the dataset into ten parts, with
each part serving once as the validation set, while the other nine parts are used to train a brand new
model. The logits output by these ten new models are then averaged. However, this approach also
results in a training efficiency about ten times lower than before. To address this drawback, we reason
that the training datasets used for each model are highly overlapping, and except for the first model, the
training of the subsequent nine models can be considered as fine-tuning the first model using a slightly
altered dataset. Motivated by this, we optimize our training strategy. For the first model, we initialize
parameters and train it using an early stopping strategy. For each subsequent model, we clone the
parameters of the first model and fine-tune it on a newly combined training set. This method reduces
the number of training epochs for subsequent models, thereby enhancing training efficiency. The
original dataset is D; we divide it into ten parts {D_1, D_2, . . . , D_10}. We train a model M_k by setting
each D_k as the validation dataset: M_k = Train(D ∖ D_k),
(
          <xref ref-type="bibr" rid="ref9">9</xref>
          )
then we compute the average of the logits from these models:
ŷ = (1/10) Σ_{k=1}^{10} M_k(x),
(
          <xref ref-type="bibr" rid="ref10">10</xref>
          )
where x is the model input. The training process and results visualization for each model are as follows:
        </p>
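        <p>The fine-tuning-based ten-fold cross-fusion can be sketched as below, where `train`, `finetune`, and the toy linear model are stand-in placeholders we introduce for illustration, not the paper's actual training loops.</p>

```python
# Ten-fold cross-fusion: train M_1 fully, fine-tune M_2..M_10 from its weights,
# then average the ten models' logits.
import copy
import torch
import torch.nn as nn

def train(model, data):      # placeholder: full training with early stopping
    return model

def finetune(model, data):   # placeholder: a short fine-tuning run
    return model

folds = [torch.randn(4, 3) for _ in range(10)]   # D split into D_1..D_10
models = []
for k in range(10):
    train_data = [f for j, f in enumerate(folds) if j != k]  # D \ D_k
    if k == 0:
        models.append(train(nn.Linear(3, 2), train_data))    # first model: full training
    else:
        models.append(finetune(copy.deepcopy(models[0]), train_data))

x = torch.randn(1, 3)
logits = torch.stack([m(x) for m in models]).mean(0)  # average the ten logit outputs
```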
        <p>The binary cross-entropy (BCE) loss is one of the most common loss functions for multi-label learning
[6]. For an observed data point (x, y), including full positive and negative classes, the BCE loss is
calculated as follows:</p>
        <p>
          BCE(y, ŷ) = − (1/S) Σ_{s=1}^{S} [ 1_{y_s=1} log(ŷ_s) + 1_{y_s=0} log(1 − ŷ_s) ],
(
          <xref ref-type="bibr" rid="ref11">11</xref>
          )
where ŷ_s = f_s(x) ∈ [0, 1] is the model's predicted probability of presence for species s under input x,
and 1_A denotes the indicator function, i.e., 1_A = 1 if assertion A is true, or 0 otherwise.
(
          <xref ref-type="bibr" rid="ref12">12</xref>
          )
        </p>
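        <p>Written out directly, the multi-label BCE loss can be implemented as below; the explicit form is equivalent to PyTorch's built-in binary cross-entropy.</p>

```python
# Explicit multi-label binary cross-entropy over S species.
import torch

def bce(y, y_hat, eps=1e-7):
    # y: multi-hot labels in {0,1}; y_hat: predicted probabilities in [0,1]
    y_hat = y_hat.clamp(eps, 1 - eps)          # numerical stability
    per_species = y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)
    return -per_species.mean()                 # mean over species and batch

y = torch.tensor([[1., 0., 1.]])
p = torch.tensor([[0.9, 0.2, 0.7]])
loss = bce(y, p)
```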
      </sec>
      <sec id="sec-3-7">
        <title>3.7. Post-Processing: Threshold Top-K and Output Correction</title>
        <p>Based on the official requirements, the final formula for calculating the score is as follows:
F1 = (1/N) Σ_{i=1}^{N} TP_i / (TP_i + (FP_i + FN_i)/2), (13)
where TP_i = number of species predicted and present, i.e., |Ŷ_i ∩ Y_i|; FP_i = number of species
predicted but not present, i.e., |Ŷ_i ∖ Y_i|; and FN_i = number of species not predicted but present, i.e., |Y_i ∖ Ŷ_i|.
In the baseline, the classic multi-class task method of Top-K is used to filter the model’s output.
Typically, in conventional Top-K, a value for K is manually set or initially a range is predefined, within
which K is enumerated to observe which K yields the best performance on the validation set inference
results, and this K is then applied to the test set inference process.</p>
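        <p>A set-based transcription of the per-survey score in Eq. (13) could look as follows; this is an illustration, not the official scoring code.</p>

```python
# Per-survey F1 from a predicted species set and a true species set.
def sample_f1(pred, true):
    pred, true = set(pred), set(true)
    tp = len(pred & true)     # species predicted and present
    fp = len(pred - true)     # species predicted but not present
    fn = len(true - pred)     # species present but not predicted
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom > 0 else 0.0
```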
        <p>However, we observe drawbacks with this method. Some Survey IDs contain dozens or even hundreds
of species. Although these species might have high probabilities in the model’s output, they are truncated
if they do not rank within the top K. Additionally, some test set Survey IDs, even with low probabilities
for each species, are still forced to output the top K species by probability rank.</p>
        <p>Therefore, we improve the Top-K algorithm by setting a range of thresholds (0.1 to 0.5, in steps of
0.01) for each possible K to filter out species with predicted probabilities below these thresholds. By
exhaustively combining K and threshold values and comparing the scores of the validation set outputs
processed through them, we can identify the optimal pair of K and threshold.</p>
        <p>Let the probability of the model output be P = {p_1, p_2, . . . , p_11255}, where p_s is the predicted
probability of species s. We filter the output by setting the threshold τ and K values:
S = {s | p_s &gt; τ} ∪ Top-K(P),
(14)
where S is the set of filtered species.</p>
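        <p>The Threshold Top-K filter can be sketched as below, following the union form of Eq. (14); K = 44 and τ = 0.23 are the values selected later in this section, and the toy probabilities are made up.</p>

```python
# Threshold Top-K post-processing: union of the above-threshold species
# and the Top-K species by predicted probability.
import numpy as np

def threshold_top_k(probs, k=44, tau=0.23):
    probs = np.asarray(probs)
    top_k = set(np.argsort(probs)[::-1][:k])                # Top-K species ids
    above = {s for s, p in enumerate(probs) if p > tau}     # species above τ
    return top_k | above

selected = threshold_top_k([0.9, 0.1, 0.5, 0.05], k=2, tau=0.23)
```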
        <p>We can see that the darker colored regions represent higher scores for each pair of parameters. In the
end, we choose the optimal K of 44 and the optimal threshold of 0.23 (shown in Figure 11:
(a) Heatmap, (b) 3D Plot).</p>
        <p>We note that there are 11,255 species IDs, but only 5,016 appear in PA, with the remainder in PO,
prompting us to correct the model’s data through mining of the PO data. The specific approach involves
merging and deduplicating the list of high-probability species for each test set node obtained in section
3.2 with the model’s prediction results to produce the final output. In addition, we find that generating
a high-probability species list using only the PO data (auxiliary nodes) is more consistently beneficial
compared to generating a high-probability species list using both PO and PA data for result correction.</p>
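        <p>The merge-and-deduplicate correction can be sketched with plain Python sets; the species IDs below are made-up illustrations.</p>

```python
# Output correction: union the model's filtered predictions with the
# high-probability species list mined from PO auxiliary nodes (section 3.2).
model_pred = {12, 87, 345}            # species kept by Threshold Top-K
po_high_prob = {87, 901, 1444}        # high-probability species from PO neighbours
final = sorted(model_pred | po_high_prob)   # merge and deduplicate
```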
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In this chapter, we present the experiments conducted to validate the superiority of the proposed model
and analyze the experimental results. We omit the hyperparameter tuning process in the paper because
each backbone network corresponds to different model hyperparameters and training hyperparameters.
Given the limited time, it is challenging to prove the optimality of the selected hyperparameters through
grid search. Instead, we judge the current parameters’ potential to cause overfitting or underfitting by
observing the gradient descent process, or whether the current parameter combination results in lower
loss and higher scores compared to previous combinations. We perform exploratory tuning for each
model involved in the comparative and ablation experiments to ensure that the current hyperparameter
combination is the best among all attempts. We also employ an early stopping strategy, which minimizes
the impact of hyperparameter changes on training when there is no significant overfitting or underfitting.</p>
      <p>At the end of this chapter, we provide a table of the hyperparameters used for the final submission. We
select the following metrics to analyze the model’s performance, including public and private leaderboard
scores on the official test set. Additionally, since we ultimately need to fuse the logits output by the
models, we introduce the loss on the validation and training sets. Since the calculation of BCE essentially
equals the sum of entropy and KL divergence, it describes the difference between the probability
distribution of the model output and the distribution of the true labels. Therefore, the logits of models
with lower validation and training set losses result in better fusion effects than models with lower
validation scores but higher losses. Finally, we explore how to effectively make the model more lightweight,
using the training time per epoch as a metric.</p>
      <p>Our experiments are all conducted on the Colab platform running on an A100 instance. The specific
computational resources include 83.5GB of memory and 40GB of GPU memory.</p>
      <p>To facilitate the customization of model parameters, we use the Timm library instead of the torchvision
library to instantiate models. Surprisingly, the baseline reconstructed with the Timm library
achieves a better private leaderboard score compared to the official baseline (the official baseline’s
private leaderboard score is 0.31626, while the baseline improved by optimizing the Top-K strategy
achieves a private leaderboard score of 0.32359).</p>
      <sec id="sec-4-1">
        <title>4.1. Comparative Experiments</title>
        <p>For the satellite image feature extraction network, the comparative experiments are as follows:</p>
        <p>Bold indicates the best score for this metric, while blue, if present, indicates a score close to the
best. Training loss is used to help determine if overfitting has occurred and does not represent model
performance. We observe that, based on the scores, the best-performing model is ViT-Base, followed
by EfficientNet-B0. However, our goal is to identify the models most suitable for logits fusion, so we
also focus on the loss. We find that ViT-Base and Swin-T-Tiny have very similar validation losses,
making ViT-Base the preferred model. In terms of training time, EfficientNet-B0 and Swin-T-Tiny are
the most efficient models. Considering the number of epochs to convergence, EfficientNet-B0 reaches
the overfitting threshold in almost half the time of Swin-T-Tiny, but its test set loss is higher than that of
Swin-T-Tiny. Therefore, EfficientNet-B0 can be considered a successful attempt at model lightweighting.
However, for the highest score after fusion, the focus should still be on ViT-Base and Swin-T-Tiny.</p>
        <p>Because the comparative experiments of the temporal feature extraction network and the image
feature extraction backbone network are conducted simultaneously, the corresponding image feature
extraction network during the comparative experiments of the temporal feature extraction network is
still the Swin-T-Tiny from the baseline. The comparative experiments are as follows:</p>
        <p>Based on the experimental results, the Swin-T backbone network we proposed for extracting temporal
features achieves the highest private leaderboard score and the lowest validation loss. It is also the only
backbone network that scored higher than the Baseline. In terms of training time and convergence
epochs, it is comparable to the Baseline. Therefore, we believe that the Swin-T backbone network can
replace ResNet18 in the Baseline.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Ablation Studies</title>
        <p>Based on the conclusions drawn from the comparative experiments, we initially replace the backbone
networks for image and temporal feature extraction starting from the Baseline model. Subsequently, we
attempt to use the Swin-T backbone network for temporal feature extraction and ViT-Base for satellite
image feature extraction, but encounter difficulties with gradient descent. We analyze this issue and
find that the significant difference in the number of parameters between the two networks, along
with the misalignment in their gradient descent speeds, leads to this problem. Since we cannot effectively
resolve this issue, we opt to use Swin-T-Tiny, which has a similar validation loss, as the satellite image
feature extraction network. We also try replacing Swin-T-Tiny with EfficientNet-B0, but this only
accelerates training without improving the scores. Therefore, we determine that the optimal
backbone network for temporal feature extraction is Swin-T, and for satellite image feature extraction
is Swin-T-Tiny.</p>
        <p>Next, we attempt to introduce the Graph modality. We find that models incorporating the Graph
modality showed significant improvement in validation scores and the lowest validation loss so far,
but the public and private leaderboard scores decrease. Upon examining the output, we discover that
including the Graph modality makes the model more aware of some minority classes that are often
overlooked, significantly increasing their logits values. However, under the Top-K output rules, these
minority classes still do not rank high enough, and not all boosted minority classes are present. Those
that do not appear show up in the logits values of some Survey IDs with generally low confidence. This
results in better validation loss but lower private leaderboard scores. This phenomenon inspires us to
propose Threshold Top-K as an improved post-processing algorithm.</p>
        <p>We then try using HCAM (Hierarchical Cross-Attention Mechanism) for feature fusion. We find
that the validation loss of models fused with HCAM increases slightly, but the public and private
leaderboard scores return to the levels before integrating the Graph modality. This demonstrates that
HCAM effectively optimizes the logits distribution while improving model scores.</p>
        <p>To verify whether the model incorporating GFV (Graph Feature Vector) + HCAM is indeed more
suitable for ten-fold cross-validation fusion, we conduct ten-fold cross-validation training on the model
with GFV + HCAM and the baseline model using Swin-T for temporal feature extraction and
Swin-Transformer-Tiny for image feature extraction with feature concatenation for modality fusion. We
find that the model incorporating GFV + HCAM scores higher on the leaderboard and validation set
compared to the control group.</p>
        <p>It is important to note that the validation set mentioned in the ten-fold cross-validation method is
the same as the validation set used in previous ablation experiments. However, since the complete
training set is used in ten-fold cross-validation, the data in the validation set is actually included in
the training set. Therefore, the validation set scores here are higher than in experiments without
ten-fold cross-validation, and should only be compared between experiments using the same ten-fold
cross-validation method.</p>
        <p>In the official competition submission, we overlook checking the output of the model incorporating
GFV + HCAM and simply judge our GFV construction and HCAM design to have failed based on the
public leaderboard score. Consequently, we choose the previous model with ten-fold cross-validation
training and applied post-processing. However, when completing this paper, we realize that the model
incorporating GFV + HCAM might have more potential for fusion. Subsequent experiments confirm
that this approach can indeed provide further improvements in leaderboard scores.</p>
        <p>We acknowledge that the current results lack persuasive power for the ablation experiments on
HCAM and Graph modality features. Besides the reasons we analyzed, HCAM and the MLP used for
compressing Graph modality features are the parts of our model structure that are more sensitive to
parameter changes. Our lack of time for sufficient hyperparameter analysis is also one reason for the
unsatisfactory ablation experiment data. In future work, we will explore the optimal parameters and
structures for these two sub-networks.</p>
        <p>Finally, we apply our proposed post-processing methods to the model outputs. From the ablation
experiments, it is evident that both Threshold Top-K and corrections based on PO data improve model
scores. However, since the validation set cannot contain species present only in PO, the validation set
scores are inaccurate in this context. Furthermore, the post-processing operations do not affect the
model’s training and inference efficiency, so they do not need to be included in the time cost comparison.</p>
        <p>The final results are summarized in the table below (* denotes that the expected improvement is not
achieved, thus the approach is abandoned):</p>
        <p>To provide a more intuitive confirmation of our analysis of the training process losses of different
models discussed in this chapter, we visualize the training processes of all models. This facilitates the
comparison of the number of epochs required for different models to reach their optimal loss (shown in
Figure 11).</p>
        <p>Here, ’CV’ stands for the comparative experiments of the visual modality, while ’TC’ represents the
comparative experiments of the Temporal Cubes modality.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Conclusion</title>
        <p>Our comprehensive comparative experiments demonstrate that we select the most suitable feature
extraction backbone networks for each modality’s data. Through rigorous ablation studies, we prove that
each improvement we propose incrementally enhances the model’s performance, ultimately achieving
a score of 0.36242 on the private leaderboard and third place (the leaderboard showed 0.35292, but we
continued to optimize some parameters while running the experiments to finally reach 0.36242). In addition
to proposing high-scoring solutions, we also introduce a lightweight model and an efficient training
strategy. Without significantly sacrificing accuracy, we reduce the training time for the
ten-fold cross-fusion model (by more than 50%) as well as the time for a single epoch and the total number of
epochs (by approximately 75%). Thus, we manage to train ten models for logits fusion in roughly the
same amount of time it previously took to train one baseline model, achieving prediction results far
exceeding the baseline score. It is worth mentioning that we average the outputs of all models that
score higher than the baseline and apply post-processing, ultimately achieving a private leaderboard
score of 0.36501.</p>
        <p>In summary, from a theoretical perspective, our research provides new insights into the extraction of
temporal information for modality fusion tasks, namely folding it into a 2D matrix and extracting features
through a visual backbone network. We also design a robust fusion method for features extracted from
multiple modalities with varying information densities, using a hierarchical cross-attention mechanism
to dilute features from high to low density progressively. Additionally, we propose a graph-based feature
construction method and output correction post-processing algorithm for the multi-task classification
task of species prediction under specific spacetime, which often involves extremely unbalanced or
sparse labels. From an application perspective, our research is of significant importance in fields such
as ecology, agriculture, environmental protection, and climate change studies.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Future Work</title>
        <p>Despite our comprehensive comparative and ablation experiments, our work may be further explored
in the following ways:
• We will further utilize the organized PO data, noting that there are some high-quality publishers
within PO whose surveys of the same ID often contain many species. Additionally, the competition
provides supplementary data about other modalities for Survey IDs in PO. We believe that these
Survey IDs appearing in PO can be selected through certain logical criteria to serve as training
samples for weakly supervised learning, incorporated into PA. This approach will better leverage
crowdsourced data, reduce the workload of data collection, and enhance model accuracy.
• Referring to past programs, highly ranked teams would extract the rasters around the Survey
ID geographic location from the tif files provided by EnvironmentalRasters as images to be
processed. However, due to computational and time constraints, we ultimately give up on implementing
this scheme, and we plan to use this part of the data in our subsequent work to realize a leap in
model performance.
• Our current model establishes graph relationships solely for feature aggregation and result
calibration. In future work, we plan to use graph neural networks to replace the current method
of manually setting weights for adjacent nodes in feature aggregation. This will support the use
of weakly supervised and semi-supervised learning training strategies to progressively correct
labels of weakly supervised and semi-supervised nodes, thereby improving training outcomes.
• We hope to introduce NAS technology to optimize the hyperparameters of our 2D time series
feature extraction network based on Swin-Transformer and the hierarchical cross-attention
mechanism, further enhancing the model’s performance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The data for this paper is organized and published by INRIA. We express our gratitude to all the
institutions and individuals involved in data collection and processing, including but not limited to the
Global Biodiversity Information Facility (GBIF, www.gbif.org), NASA, Soilgrids, and the Ecodatacube
platform. Additionally, this project has received funding from the European Union’s Horizon Research
and Innovation program under grant agreements No. 101060639 (MAMBO project) and No. 101060693
(GUARDEN project) [1, 2]. All authors contribute helpful ideas during the course of the competition and
participate in writing and revising the paper, so all authors are co-first authors. All of the authors, as
corresponding authors, are obliged to reply to emails to provide readers with the relevant code and data
of this work and explain the details of the work. Among them, Haixu Liu, as the first corresponding
author, is responsible for the necessary communication for the publication of the article. Our final
submission CSV download link is as follows: submission_036242.csv and submission_036501.csv. The
link to the code we use to run and obtain the final submission is as follows: code.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,</p>
      <p>Attention is all you need, Advances in neural information processing systems 30 (2017).
[14] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, M. Long, Timesnet: Temporal 2d-variation modeling for
general time series analysis, in: The eleventh international conference on learning representations,
2022.
[15] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, B. Guo, Swin transformer: Hierarchical vision
transformer using shifted windows, in: Proceedings of the IEEE/CVF international conference on
computer vision, 2021, pp. 10012–10022.
[16] Z. Liu, H. Mao, C. Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A convnet for the 2020s, in:
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp.
11976–11986.
[17] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the</p>
      <p>IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
[18] M. Tan, Q. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, in:</p>
      <p>International conference on machine learning, PMLR, 2019, pp. 6105–6114.
[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, H. Adam, Mobilenets:
Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861
(2017).
[20] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, N. Houlsby,
An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint
arXiv:2010.11929 (2020).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Marcos</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Palard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of GeoLifeCLEF 2024: Species presence prediction based on occurrence data and high-resolution remote sensing images</article-title>
          ,
          <source>in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          , et al.,
          <article-title>Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lorieul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <article-title>Species distribution modeling based on aerial images and environmental features with convolutional neural networks</article-title>
          ,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2123</fpage>
          -
          <lpage>2150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kellenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tuia</surname>
          </string-name>
          ,
          <article-title>Block label swap for species distribution modelling</article-title>
          ,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2103</fpage>
          -
          <lpage>2114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H. Q.</given-names>
            <surname>Ung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wada</surname>
          </string-name>
          ,
          <article-title>Leverage samples with single positive labels to train CNN-based models for multi-label plant species prediction</article-title>
          ,
          <source>Working Notes of CLEF</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          , et al.,
          <article-title>Overview of LifeCLEF 2023: Evaluation of AI models for the identification and prediction of birds, plants, snakes and fungi</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>416</fpage>
          -
          <lpage>439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Potapov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kommareddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kommareddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Turubanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pickens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <article-title>Landsat analysis ready data for global land cover and land cover change mapping</article-title>
          ,
          <source>Remote Sensing</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <fpage>426</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Witjes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Parente</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>van Diemen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hengl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Landa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Brodský</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Glušica</surname>
          </string-name>
          ,
          <article-title>A spatiotemporal ensemble machine learning framework for generating land use/land cover time-series maps for Europe (2000-2019) based on LUCAS, CORINE and GLAD Landsat</article-title>
          ,
          <source>PeerJ</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>e13573</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Witjes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Parente</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Križan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hengl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antonić</surname>
          </string-name>
          ,
          <article-title>Ecodatacube.eu: Analysis-ready open environmental data cube for Europe</article-title>
          ,
          <source>PeerJ</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>e15478</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Karger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Conrad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Böhner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kawohl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kreft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Soria-Auza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kessler</surname>
          </string-name>
          ,
          <article-title>Climatologies at high resolution for the earth's land surface areas</article-title>
          ,
          <source>Scientific Data</source>
          <volume>4</volume>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Karger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Conrad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Böhner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kawohl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kreft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Soria-Auza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kessler</surname>
          </string-name>
          ,
          <article-title>Data from: Climatologies at high resolution for the earth's land surface areas</article-title>
          ,
          <source>EnviDat</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>