<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Predicting Plant Species Distribution with a Multimodal Swin Transformer Network: A GeoLifeCLEF 2025 Report</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aman R. Syayfetdinov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Higher School of Economics (HSE)</institution>
          ,
          <addr-line>Moscow, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Predicting the spatial and temporal distribution of plant species is a key challenge for biodiversity monitoring and conservation planning. In this report, we present our solution to the GeoLifeCLEF 2025 challenge, which requires predicting the presence of plant species from satellite images and time series, climate time series, and other rasterized environmental data. Our approach uses a multimodal network with separate encoders for images, climate time series, and satellite time series. During inference, we apply a fixed probability threshold to produce multi-label predictions. Without any pseudo-labeling or ensembling, our model achieves a macro-F1 score of 0.218 on the public leaderboard and 0.192 on the private leaderboard, placing 6th. We analyze the impact of each modality and discuss directions for further improvement.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal Deep Learning</kwd>
        <kwd>Species distribution modeling</kwd>
        <kwd>Biodiversity</kwd>
        <kwd>LifeCLEF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Species distribution modeling [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] plays a crucial role in biodiversity conservation by predicting
species occurrence probabilities across spatial-temporal contexts. The recent spread of
geolocated species observations, which cover thousands of species, has created opportunities for
data-driven approaches. The GeoLifeCLEF 2025 competition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], part of the LifeCLEF 2025 lab
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the FGVC12 workshop organized in conjunction with the CVPR 2025 conference, leverages
this potential through a large-scale multimodal prediction task: identifying the plant species likely
to be observed at different locations using various environmental data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The GeoLifeCLEF 2025 competition presents two main complexities: extreme data
heterogeneity and challenging labeling constraints. It is similar to previous editions [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], with a
training dataset comprising 90,000 surveys across Europe, each documenting observed plant
species (from 5,000 unique taxa) alongside multimodal environmental descriptors. These include
satellite imagery (RGB/NIR patches), bioclimatic data cubes, climate time series,
land cover classifications, updated human footprint indices, soil properties, and elevation data.
Furthermore, only a small subset provides complete species labels (Presence-Absence data),
while the majority (about 5M records) offer only single positive annotations (Presence-Only
data). This creates a strong partial-label scenario for multi-species prediction. This year, the
test set includes 10,000 additional plots, resulting in almost 14K overall, with significant location
shifts that induce domain adaptation challenges.
      </p>
      <p>In this paper, we introduce an adaptive prediction method that combines a tuned probability
threshold with a top-k fallback mechanism. This approach provides more contextually relevant
predictions than a fixed-k method while preventing overly sparse outputs. We show
that a single, non-ensembled model can achieve a top-tier ranking. Our experiments indicate
that focusing on the richest data modalities and curating the training set is more
effective than including all available data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>The GeoLifeCLEF 2025 dataset contains presence-absence (PA) and presence-only (PO)
observations. The PO data include about 5 million observations and report only the presence,
not the absence, of certain plant species in specific areas. In contrast, the PA data comprise around 90K
surveys covering about 5K unique species of the European flora and report both the presence and
absence of plant species. The total number of surveys in the test set was approximately 16K.</p>
      <p>In the PA training data, the species distribution exhibits extreme imbalance: 50% of taxa have 16
occurrences, while only 20% exceed 110 observations. Geographically, the data skew towards western
Europe; a map of the locations can be seen in Figure 1. More detailed descriptions can be found
on the competition’s homepage<sup>1</sup>.</p>
      <p>Each survey pairs GPS coordinates with the following environmental data:</p>
      <list list-type="bullet">
        <list-item><p>Satellite imagery: 128 m × 128 m Sentinel-2 RGB/NIR patches (10 m resolution) and Landsat time series (6-band quarterly composites);</p></list-item>
        <list-item><p>Climatic data: 19 bioclimatic rasters (1 km resolution);</p></list-item>
        <list-item><p>Soil variables: 19 SoilGrids properties (e.g., pH, organic carbon);</p></list-item>
        <list-item><p>Human footprint: 16 time-dynamic pressure indices (1993/2009);</p></list-item>
        <list-item><p>Land cover (500 m resolution) and elevation (30 m resolution).</p></list-item>
      </list>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation Metric</title>
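      <p>The competition uses a macro-averaged F<sub>1</sub>-score to evaluate species presence predictions.
This metric gives equal weight to all species, mitigating bias from class imbalance by
evaluating each taxon’s performance independently. For each test plot <italic>i</italic> with true species set <italic>Y<sub>i</sub></italic>,
the model predicts ranked presence probabilities <italic>p<sub>i,1</sub></italic>, <italic>p<sub>i,2</sub></italic>, ..., <italic>p<sub>i,C</sub></italic>. After thresholding these
probabilities to obtain binary predictions, the metric is computed as</p>
      <disp-formula><tex-math><![CDATA[\text{Macro } F_1 = \frac{1}{C} \sum_{c=1}^{C} F_{1,c}, \qquad \text{where} \qquad F_{1,c} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_{i,c}}{TP_{i,c} + (FP_{i,c} + FN_{i,c})/2}.]]></tex-math></disp-formula>
      <p>Here <italic>TP<sub>i,c</sub></italic>, <italic>FP<sub>i,c</sub></italic> and <italic>FN<sub>i,c</sub></italic> are the true positives, the false positives and the false negatives of the
i-th input sample for species <italic>c</italic>, respectively, <italic>N</italic> is the number of test plots, and <italic>C</italic> represents all 5K species. This formulation emphasizes
balanced performance across rare and common taxa.</p>
      <p><sup>1</sup> https://www.kaggle.com/competitions/geolifeclef-2025/data</p>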
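      <p>As an illustration of how a macro-averaged F<sub>1</sub>-score can be computed from thresholded predictions, the following minimal sketch evaluates the per-species term TP / (TP + (FP + FN) / 2) and averages it over species. It assumes that the counts for each species are pooled over all plots and that a species with no true or predicted occurrences scores zero; both conventions, as well as the function names, are assumptions made for this example and may differ from the official evaluation script.</p>
      <code language="python"><![CDATA[
```python
from collections import Counter


def f1_from_counts(tp, fp, fn):
    """F1 written as TP / (TP + (FP + FN) / 2); returns 0.0 when undefined (assumption)."""
    denom = tp + (fp + fn) / 2
    return tp / denom if denom > 0 else 0.0


def macro_f1(true_sets, pred_sets, species):
    """Average the per-species F1 so that every species has equal weight."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_sets, pred_sets):
        for s in pred & truth:   # predicted and present
            tp[s] += 1
        for s in pred - truth:   # predicted but absent
            fp[s] += 1
        for s in truth - pred:   # present but missed
            fn[s] += 1
    scores = [f1_from_counts(tp[s], fp[s], fn[s]) for s in species]
    return sum(scores) / len(scores)


# Toy example: two plots, three species (IDs are arbitrary).
true_sets = [{1, 2}, {2, 3}]
pred_sets = [{1, 2}, {2}]
score = macro_f1(true_sets, pred_sets, species=[1, 2, 3])
```
]]></code>
      <p>In this toy example, species 1 and 2 are predicted perfectly while species 3 is missed entirely, so the rare species pulls the average down as strongly as a common one would.</p>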
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This section describes our proposed multimodal network and the methods that were tried during
this competition.</p>
      <sec id="sec-4-1">
        <title>4.1. Model architectures</title>
        <p>
          Our solution for the GeoLifeCLEF 2025 challenge was based on the baselines provided by
the organizers in current and previous competitions. However, through experimentation, we
determined that our approach from the previous year [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which used vectors of satellite time
series and scalar environmental values, did not yield performance improvements this time.
        </p>
        <p>
          Our final model, illustrated in Figure 2, is a multimodal neural network that exclusively
integrates three rich data sources: Sentinel-2 satellite imagery, bioclimatic data cubes, and
Landsat data cubes. The core of our architecture consists of three specialized encoders, one
for each data modality, based on a modified Swin Transformer [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] architecture. Each encoder
processes its respective data type to generate a feature embedding; the embeddings are then fused and
passed to a final classification head to predict the species.
        </p>
        <p>The Landsat data encoder uses a Swin_t model trained from scratch. To prepare the data,
we first apply a LayerNorm to stabilize the input distribution. The primary modification
adapts the model’s first convolutional layer (the patch embedding layer) to accept 6
input channels instead of the default 3, which lets the encoder accommodate the rich multi-spectral
and temporal information. To focus the model on feature extraction,
we replace its final classification head with an Identity layer, so that it outputs a 768-dimensional
feature vector.</p>
        <p>Similarly, the encoder for the 19 bioclimatic variables (structured as a 4-channel input)
uses a Swin_t architecture also trained from scratch. It is preceded by a LayerNorm layer,
accommodating the diversity of Bioclim data, and its initial patch embedding layer is modified
to handle 4 input channels. Like the Landsat encoder, its classification head is removed to output
high-level features.</p>
        <p>
          To leverage the high-resolution multispectral Sentinel-2 imagery (Red, Green, Blue, and
Near-Infrared bands), we employed a Swin_t model pre-trained on the ImageNet-1K dataset
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. This supervised pre-training provides a powerful starting point for feature extraction. We
adapted the pre-trained model for 4-channel input by modifying the first convolutional layer’s
weights. The weights for the original three channels (RGB) were preserved, while the weights
for the new fourth channel (NIR) were initialized by averaging the RGB channel weights. This
strategy retains the valuable features learned during pre-training while accommodating the
additional spectral band. The classification head was also replaced with an Identity layer.
        </p>
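        <p>The channel-extension step for the Sentinel-2 encoder can be sketched without any deep learning framework. The snippet below treats the first-layer kernel as a nested list of shape [out_channels][3][height][width] and appends a fourth (NIR) input channel equal to the mean of the RGB channels. The function name and the toy kernel are hypothetical; in practice the same operation would be applied to the weight tensor of the patch embedding convolution.</p>
        <code language="python"><![CDATA[
```python
def extend_rgb_kernel_to_rgbn(weight):
    """Return a copy of `weight` with a 4th input channel = mean of the RGB channels.

    weight: nested list indexed as [out_channel][in_channel (R, G, B)][row][col].
    """
    extended = []
    for filt in weight:
        r, g, b = filt
        rows, cols = len(r), len(r[0])
        # New NIR kernel: element-wise average of the three pre-trained kernels.
        nir = [[(r[y][x] + g[y][x] + b[y][x]) / 3.0 for x in range(cols)]
               for y in range(rows)]
        # The original RGB kernels are kept unchanged.
        extended.append([r, g, b, nir])
    return extended


# Toy kernel: one output filter, three input channels, 2x2 spatial size.
w = [[[[1.0, 2.0], [3.0, 4.0]],
      [[3.0, 4.0], [5.0, 6.0]],
      [[5.0, 6.0], [7.0, 8.0]]]]
w4 = extend_rgb_kernel_to_rgbn(w)
```
]]></code>
        <p>Because the RGB kernels are copied unchanged, the extended layer initially responds to the visible bands as the pre-trained one did, while the NIR channel starts from a neutral average rather than from random values.</p>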
        <p>
          The 768-dimensional feature vectors produced by each of the three Swin_t backbones are
first passed through separate projection heads. Each projection head consists of a Linear layer,
BatchNorm1d, a GELU [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] activation, and Dropout [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], transforming each feature vector into
a 1000-dimensional representation.
        </p>
        <p>
          These projected features are then concatenated into a single 3000-dimensional vector. This
fused vector is fed into a final classification head, a multi-layer perceptron (MLP) with GELU
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] activations and Dropout [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which produces the final species predictions. This
multimodal fusion design lets the model combine cues from all three data
sources effectively. The full implementation can be found on Kaggle<sup>2</sup>.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Training and inference</title>
        <p>To manage the highly imbalanced, long-tailed distribution of species occurrences in the
training data, we first filtered the dataset. We kept only plant species with more than 5
recorded occurrences, which reduced the number of target classes from the original 5,016 to
a more manageable 3,425. This occurrence threshold was determined empirically
to balance the trade-off between class coverage and model stability.</p>
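        <p>The species filtering step can be illustrated with a short sketch. It assumes the PA labels are available as a mapping from survey ID to a list of species IDs; this data layout and the function name are hypothetical and chosen only for the example. Species are counted once per survey, and only those with more than <monospace>min_occurrences</monospace> surveys are kept and re-indexed as target classes.</p>
        <code language="python"><![CDATA[
```python
from collections import Counter


def filter_species(survey_species, min_occurrences=5):
    """Keep species that occur in more than `min_occurrences` surveys.

    Returns the kept species IDs, a species-to-class-index mapping,
    and the surveys with the rare species removed.
    """
    counts = Counter(s for species in survey_species.values() for s in set(species))
    kept = sorted(s for s, n in counts.items() if n > min_occurrences)
    class_index = {s: i for i, s in enumerate(kept)}
    filtered = {
        survey_id: [s for s in species if s in class_index]
        for survey_id, species in survey_species.items()
    }
    return kept, class_index, filtered


# Toy data with a lower threshold so the effect is visible.
toy = {"a": [10, 20], "b": [10, 30], "c": [10, 20], "d": [20]}
kept, class_index, filtered = filter_species(toy, min_occurrences=2)
```
]]></code>
        <p>In the toy data, species 30 appears in a single survey and is dropped, so survey "b" keeps only species 10.</p>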
        <p>The model was trained on the Presence-Absence (PA) data for 12 epochs. We employed
the AdamW optimizer with a learning rate of 8e-5 and a Cosine Annealing Learning Rate
Scheduler (CosineAnnealingLR) to promote stable convergence. Given the multi-label nature of
the problem (multiple species can be present at one location), we used Binary Cross-Entropy
(BCE) loss. Training was conducted with a batch size of 300.</p>
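        <p>For reference, the shape of a cosine annealing schedule can be written in closed form. The sketch below assumes the scheduler is stepped once per epoch, that its period equals the 12 training epochs, and that the minimum learning rate is zero; the period and the minimum (<monospace>t_max</monospace>, <monospace>eta_min</monospace>) are not stated in the text and are illustrative values.</p>
        <code language="python"><![CDATA[
```python
import math


def cosine_annealing_lr(epoch, base_lr=8e-5, t_max=12, eta_min=0.0):
    """Closed-form cosine annealing: decays from base_lr towards eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))


# Learning rate used during each of the 12 epochs (epoch index 0..11).
schedule = [cosine_annealing_lr(epoch) for epoch in range(12)]
```
]]></code>
        <p>Under these assumptions the learning rate starts at 8e-5, reaches half of that value at epoch 6, and decreases monotonically in between.</p>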
        <p>For the final predictions on the test set, we developed a hybrid, threshold-based strategy that
deviates from the baseline approach of simply predicting a fixed number of the most probable
species. Our method is designed to be more adaptive to the local biodiversity of each observation
point. After passing the test data through the model, we apply a Sigmoid function to the output
logits to obtain a probability score between 0 and 1 for each of the 3,425 species. We then apply
a probability threshold of 0.18. Any species with a score exceeding this threshold is classified
as present. This threshold was carefully tuned on a validation set to balance precision and
recall. To ensure a minimum number of predictions for each observation and avoid generating
overly sparse results, we implemented a fallback mechanism. If the number of species predicted
using the 0.18 threshold is fewer than 14, we discard those predictions and instead select the
14 species with the highest probability scores for that specific observation. This fallback value
was chosen as it approximates the median number of species per plot in the training data,
providing a data-driven baseline that prevents overly conservative predictions while being
robust to outliers.</p>
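        <p>The hybrid inference rule described above can be sketched for a single observation as follows. The threshold of 0.18 and the fallback size of 14 come from the text; the function names and the toy logits are hypothetical.</p>
        <code language="python"><![CDATA[
```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_species(logits, threshold=0.18, min_species=14):
    """Return class indices above `threshold`, or the top `min_species` if too few pass."""
    probs = [sigmoid(z) for z in logits]
    selected = [i for i, p in enumerate(probs) if p > threshold]
    if len(selected) < min_species:
        # Fallback: discard the thresholded set and take the most probable classes.
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        selected = sorted(ranked[:min_species])
    return selected


# Toy example with 5 classes and a fallback size of 2.
confident = predict_species([2.0, -3.0, -1.0, -4.0, 0.0], min_species=2)
sparse = predict_species([-3.0, -2.0, -4.0, -5.0, -2.5], min_species=2)
```
]]></code>
        <p>In the first toy call three classes exceed the threshold and are all returned; in the second none does, so the two most probable classes are returned instead.</p>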
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental results</title>
      <sec id="sec-5-1">
        <title>5.1. Experimental settings</title>
        <p>To tune hyperparameters and validate our architectural choices, we partitioned the official
training set into a training subset (80%) and a validation subset (20%). The training subset was
used for model optimization, while the validation subset provided a held-out estimate of
performance for model selection. For comparing different model versions during this
development phase, we benchmarked performance using the top-25 most probable species for each
observation. This standardized metric allowed for a consistent and direct comparison between
models, removing the potential bias introduced by our custom probability thresholding strategy
(described in Section 4.2). The detailed training settings are shown in Table 1.</p>
        <p><sup>2</sup> https://www.kaggle.com/code/lonansyayf/2025-model-geolifeclef</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Imbalanced data</title>
        <p>The dataset exhibits a strong long-tailed distribution, where most species have far fewer presence
records than absence records. We explored several mitigation techniques. While methods like
applying a pos_weight to the Binary Cross-Entropy (BCE) loss were tested, they did not yield
significant improvements. Our most successful strategy was a multi-faceted approach combining
data curation and regularization. We constrained the training problem by focusing on species
with an occurrence count greater than 5. This reduced the number of classes from 5,016 to
3,425, allowing the model to learn more robust features for species with a reasonable number
of examples. Table 3 shows the impact of species filtering on model performance. Our second
strategy involved the hybrid inference approach detailed in Section 4.2: we applied a tuned
probability threshold and used a top-14 fallback for observations with sparse predictions. The
performance of different thresholding strategies is presented in Table 4.</p>
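        <p>For readers unfamiliar with the <monospace>pos_weight</monospace> option mentioned above, the following sketch shows how such a weight enters the binary cross-entropy for a single logit: the loss term of a present species is multiplied by the weight, while the term of an absent species is unchanged. The function name and the example weight of 3 are illustrative and are not the values used in our experiments.</p>
        <code language="python"><![CDATA[
```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logit, target, pos_weight=1.0):
    """Binary cross-entropy on a raw logit with a weight on the positive term.

    Uses -log(sigmoid(x)) = softplus(-x) and -log(1 - sigmoid(x)) = softplus(x).
    """
    positive_term = pos_weight * target * softplus(-logit)
    negative_term = (1.0 - target) * softplus(logit)
    return positive_term + negative_term


plain = weighted_bce_with_logits(0.0, 1.0)                      # present species, no weighting
weighted = weighted_bce_with_logits(0.0, 1.0, pos_weight=3.0)   # same sample, weight 3
absent = weighted_bce_with_logits(0.0, 0.0, pos_weight=3.0)     # absent species is unaffected
```
]]></code>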
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Encoder Architecture</title>
        <p>
          A key finding from our ablation studies was the superior performance of the Swin Transformer
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] architecture (see Table 5). We conducted comparative experiments using a standard ResNet18
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] model as the backbone for each modality encoder. The Swin_t-based encoders consistently
outperformed their ResNet-based counterparts on our validation set. This suggests that the
Swin Transformer’s hierarchical feature representation and its ability to model long-range
dependencies are particularly well-suited for extracting discriminative information from the
complex spatial patterns found in satellite, bioclimatic, and Landsat data cubes.
        </p>
        <p>
          Regularization was another key factor in improving the overall score. To prevent
overfitting and improve generalization, we applied standard image augmentations (such as
rotation and random brightness/contrast adjustments). We used Dropout [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] throughout the
network and Batch Normalization in the projection layers after the modality encoders.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we presented the multimodal deep learning framework that secured 6th place in
the GeoLifeCLEF 2025 competition. Our approach integrated Sentinel-2 imagery, Landsat
time series, and bioclimatic data cubes using a network of specialized Swin Transformer encoders. Our
experiments indicate that this architecture is effective for processing complex, rasterized
environmental data. The success of our method hinged on three key contributions: the tailored
Swin Transformer backbones, a pragmatic data curation strategy that prioritized signal quality
over data quantity, and a hybrid inference method combining a tuned probability threshold
with a top-14 fallback.</p>
      <p>To support reproducibility and encourage further research, the complete source code for our
solution is publicly available on Kaggle<sup>3</sup>. Looking ahead, several avenues for enhancement exist.
While our model achieves strong performance, a more exhaustive hyperparameter optimization
could yield further gains. The most significant opportunity for improvement, however, lies in
data enrichment. A key future direction would be to revisit the integration of the Presence-Only
(PO) data, potentially using advanced techniques to correct for sampling bias, which could
substantially improve the model’s generalization across rare species. Finally, incorporating
additional environmental data layers and exploring more sophisticated data augmentation
methods remain promising areas for future research.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used DeepSeek to check grammar and spelling,
as well as to paraphrase and reword the text. After using this tool/service, the author reviewed
and edited the content as needed and takes full responsibility for the publication’s content.</p>
      <p><sup>3</sup> https://www.kaggle.com/code/lonansyayf/2025-model-geolifeclef</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N. E.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Edwards</surname>
            <suffix>Jr.</suffix>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Pearman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Svenning</surname>
          </string-name>
          ,
          <article-title>New trends in species distribution modelling</article-title>
          ,
          <source>Ecography</source>
          <volume>33</volume>
          (
          <year>2010</year>
          )
          <fpage>985</fpage>
          -
          <lpage>989</lpage>
          . doi:10.1111/j.1600-0587.2010.06953.x.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of GeoLifeCLEF 2025: Plant species presence prediction with environmental and high-resolution remote sensing data</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janoušková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Čermák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Papafitsoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Martellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinatier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of LifeCLEF 2025: Challenges on species presence prediction and identification, and individual animal identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF)</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Palard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Geoplant: Spatial plant species prediction dataset</article-title>
          ,
          <source>NEURIPS</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Marcos</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Palard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of GeoLifeCLEF 2024: Species composition prediction with high spatial resolution at continental scale using remote sensing</article-title>
          ,
          <source>in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Marcos</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of GeoLifeCLEF 2023: Species presence prediction based on occurrence data and high-resolution remote sensing images</article-title>
          ,
          <source>in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Syayfetdinov</surname>
          </string-name>
          ,
          <article-title>Multimodal networks for species distribution modeling</article-title>
          ,
          <year>2024</year>
          . URL: https://ceur-ws.org/Vol-3740/paper-208.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <year>2021</year>
          . doi:10.1109/ICCV48922.2021.00986.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>ImageNet: a Large-scale hierarchical image database</article-title>
          ,
          <source>in: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          . doi:10.1109/CVPR.2009.5206848.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <article-title>Gaussian error linear units (GELUs)</article-title>
          (
          <year>2016</year>
          ). arXiv:1606.08415.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <year>2014</year>
          )
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . doi:10.1109/CVPR.2016.90.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>