<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multimodal networks for Species Distribution Modeling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aman R. Syayfetdinov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Moscow Institute of Physics and Technology (MIPT)</institution>
          ,
          <addr-line>Dolgoprudny, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
<p>Understanding the spatial and temporal distribution of plant species is important for many biodiversity management and conservation scenarios. This paper presents a solution to the GeoLifeCLEF challenge, which involves predicting the presence of plant species using satellite images and time series, climate time series and other rasterized environmental data. The multimodal model leveraged satellite images, bioclimatic cubes and feature vectors of satellite time series and environmental scalar values. With the selected presence probability threshold for inference, this method reached an F1-score of 0.347 on the public and 0.345 on the private leaderboard, placing us 9th on the leaderboard.</p>
      </abstract>
      <kwd-group>
<kwd>Species distribution modeling</kwd>
        <kwd>Biodiversity</kwd>
        <kwd>LifeCLEF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The GeoLifeCLEF 2024 competition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is held jointly as part of the LifeCLEF 2024 lab [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
the FGVC11 workshop. Just like in the GeoLifeCLEF 2023 competition [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the goal is to predict
a list of species most likely to be observed at a given location using various geographical and
environmental data such as satellite images and time series, climatic time series, and other
rasterized data: land cover, human footprint, bioclimatic, and soil variables. Typically, the
task of species distribution modelling [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] has challenges associated with imbalances in species
presence and absence in the data, large-scale multimodal learning, and plant species diversity.
Its results could be useful for predicting biodiversity change and mitigating environmental
pressures from human activities.
      </p>
      <p>The GeoLifeCLEF 2024 training data includes a collection of observations of plants in Europe.
Each survey consists of a list of plant species with the GPS coordinates and a set of variables
characterizing the landscape and environment around them. There are around 90K surveys
with around 5K unique plant species in the dataset.</p>
      <p>This technical report presents the selected approach to the competition, which is a multimodal
network based on bioclimatic cubes, Sentinel image patches (RGB patch and NIR patch) and a
vector of climate, elevation, human footprint, land cover, SoilGrids and Landsat time series data.
Training code can be found here<sup>1</sup>.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data and Evaluation Metric</title>
      <p>Data plays an important role in predicting plant species distribution at a given location and time.
In this section, we briefly present the data and the evaluation metric used for the competition.</p>
      <sec id="sec-2-0">
        <title>2.1. Data</title>
        <p>This paragraph describes the standard GeoLifeCLEF 2024 dataset. The training
dataset contains presence-absence (PA) surveys and presence-only (PO) surveys. PO data
includes about 5 million observations and reports only the presence, not the absence, of certain plant
species in specific areas. PA data, on the other hand, combines around 90K surveys with about
5K unique species of the European flora and reports both presence and absence of plant species. In
the solution only presence-absence surveys were used, and everywhere below the report refers only
to this type of data. The total number of surveys in the test set was 5K.</p>
        <p>The distribution of the number of observations per plant species in the training dataset is shown
in Figure 1. Almost 50% of plant species in the training data have fewer than 16 occurrences
and only 20% have more than 110 occurrences. Almost all observations were made in
Western Europe; a map of locations can be seen in Figure 2. More detailed descriptions can be
found at the competition's homepage<sup>2</sup>.</p>
        <p>Each survey is paired with the following covariates:</p>
        <list list-type="bullet">
          <list-item><p>Satellite image patches: 128m×128m RGB-NIR patches centered at each observation, at a resolution of 1 meter per pixel;</p></list-item>
          <list-item><p>Satellite time series: up to 20 years of values for six satellite bands (R, G, B, NIR, SWIR1, and SWIR2);</p></list-item>
          <list-item><p>Environmental rasters: various climatic, pedologic, land use, and human footprint variables at the European scale, provided as scalar values, time series, and original rasters.</p></list-item>
        </list>
      </sec>
      <sec id="sec-2-1">
        <title>2.2. Evaluation Metric</title>
        <p>The evaluation metric for the GeoLifeCLEF 2024 competition is the samples-averaged F1-score
computed on a set of species presence-absence samples. The F1-score is an average
measure of overlap between the predicted and actual set of species present at a given location
and time. Each observation is associated with a list of ground-truth labels corresponding
to the observed plant species. For each observation, the submissions provide a set of species
predicted present. The micro F1-score is then computed using:</p>
        <p><disp-formula><tex-math>F_1 = \frac{1}{N}\sum_{j=1}^{N}\frac{TP_j}{TP_j + (FP_j + FN_j)/2}</tex-math></disp-formula></p>
        <p>where TP<sub>j</sub>, FP<sub>j</sub> and FN<sub>j</sub> are the numbers of true positives, false positives and false negatives of
the j-th sample, respectively, and N is the number of samples used for evaluation.</p>
        <p><sup>1</sup>https://www.kaggle.com/code/lonansyayf/baseline-with-modifications/notebook</p>
        <p><sup>2</sup>https://www.kaggle.com/competitions/geolifeclef-2024/data</p>
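<p>As a minimal illustration (not the official evaluation code), the samples-averaged F1-score can be computed in a few lines of Python, with each survey represented as a set of species ids:</p>
<preformat>
```python
def samples_averaged_f1(y_true, y_pred):
    """y_true, y_pred: lists of sets of species ids, one set per survey.
    For each sample j: F1_j = TP_j / (TP_j + (FP_j + FN_j) / 2),
    then the scores are averaged over all samples."""
    scores = []
    for truth, pred in zip(y_true, y_pred):
        tp = len(truth.intersection(pred))   # species correctly predicted present
        fp = len(pred.difference(truth))     # predicted present but absent
        fn = len(truth.difference(pred))     # present but not predicted
        denom = tp + (fp + fn) / 2
        scores.append(tp / denom if denom else 1.0)
    return sum(scores) / len(scores)
```
</preformat>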
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section describes the methods that were tried during the competition. The strategy was centered
around the baseline model<sup>3</sup> provided by the competition organizers. The baseline F1-score is
0.31 on the public set. This model leveraged all environmental data and utilized a multimodal
neural network with separate feature extractors returning a single prediction set, in order to
take advantage of every modality (satellite images, bioclimatic cubes, Landsat cubes). The main
change was to replace the Landsat cubes with a vector of satellite time series and environmental
scalar values, everywhere below called the feature vector. In addition, only plant species with more
than 10 occurrences were used to train the model.</p>
      <sec id="sec-3-1">
        <title>3.1. Feature vector</title>
        <p>The feature vector consists of climate, elevation, human footprint, land cover, SoilGrids and Landsat
time series data. Methods for compiling this data are taken from the public notebook<sup>4</sup>. Climatic
time series data was merged within a 10-year time window. Some positions had missing values,
which were filled with spatial interpolation: since there were densely populated
measurements near the missing regions, missing values were filled with values from the
nearest neighbors. Finally, each survey had a feature vector of 1198 values. The train and test
versions can be found here. Before being passed to the model, feature vectors are normalized with a standard
scaler.</p>
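<p>The imputation and normalization steps can be sketched as follows (a simplified version assuming Euclidean distance on raw coordinates; function names are illustrative):</p>
<preformat>
```python
import numpy as np

def impute_nearest(coords, feats):
    """Fill missing feature values with the value of the nearest survey
    (Euclidean distance on coordinates) that has the feature measured.
    Assumes every column has at least one measured value."""
    feats = feats.copy()
    for j in range(feats.shape[1]):
        missing = np.isnan(feats[:, j])
        if not missing.any():
            continue
        known_idx = np.where(~missing)[0]
        # pairwise distances between surveys with missing values and known ones
        d = np.linalg.norm(coords[missing][:, None, :] - coords[known_idx][None, :, :], axis=-1)
        feats[missing, j] = feats[known_idx[np.argmin(d, axis=1)], j]
    return feats

def standard_scale(train, test):
    """Fit a standard scaler on the train features and apply it to both splits."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)  # guard against constant features
    return (train - mu) / sd, (test - mu) / sd
```
</preformat>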
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model architecture</title>
        <p>
          The architecture closely follows the baseline model, incorporating a multimodal neural network
that utilizes three distinct feature extractors for bioclimatic rasters (19 channels), satellite images
(4-channel RGB with NIR), and feature vectors (1198 channels). These outputs are combined
and processed through fully connected layers to generate predictions. The first bioclimatic head
involves layer normalization, ResNet18 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] without pretrained weights, and a dropout [6] with a
0.1 probability. The second image head employs a Swin Transformer [7] model with ImageNet
[8] weights and a dropout layer with a 0.1 probability. Prior to this stage, image data undergo
augmentation techniques like random rotation, random brightness contrast, and normalization.
The third head comprises a sequence of layer normalization and three linear layers with GELU
[9] activation function, along with dropout set at a 0.1 probability (the first layer mapping from
1198 to 1198, the second and the third layers map to 1000 outputs). Subsequently, the bioclimatic
and feature outputs are normalized and combined with the image output. The final classifier is
constructed with three linear layers utilizing GELU activation function and dropout at a 0.1
probability.
        </p>
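<p>The shape bookkeeping of the feature-vector head and the fusion step can be sketched as follows (an illustrative NumPy forward pass with random weights; the actual model is a trained neural network and the names here are illustrative):</p>
<preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of the GELU activation [9]
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Illustrative random weights for the third head:
# 1198 -> 1198 -> 1000 -> 1000, each linear layer followed by GELU.
W1, W2, W3 = (rng.normal(0.0, 0.02, s) for s in [(1198, 1198), (1198, 1000), (1000, 1000)])

def feature_head(x):
    h = layer_norm(x)
    h = gelu(h @ W1)
    h = gelu(h @ W2)
    return gelu(h @ W3)

def fuse(bioclim_out, image_out, feature_out):
    # the bioclimatic and feature outputs are normalized, then combined
    # with the image output before the final classifier
    return layer_norm(bioclim_out) + layer_norm(feature_out) + image_out
```
</preformat>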
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Training and inference</title>
        <p>The model was trained on PA data for 12 epochs using the Adam optimizer with a learning rate
of 8e-5, binary cross-entropy (BCE) loss, and a batch size of 128. During training, we
focused on plant species with more than 10 occurrences, resulting in 2857 unique
species out of a total of 5015. It is important to highlight that the occurrence threshold value
was determined through experimentation.</p>
        <p>In the final approach to inference, the strategy used in the baseline notebook was changed. Rather
than forecasting the 25 most probable species for every observation in the test dataset, a selected
probability threshold of 0.18 was used: species with probabilities surpassing
this value were classified as present. Additionally, test observations with fewer than 4
predicted species were assigned the 4 most likely species.</p>
        <p><sup>3</sup>https://www.kaggle.com/code/picekl/sentinel-landsat-bioclim-baseline-0-31626</p>
        <p><sup>4</sup>https://www.kaggle.com/code/gobyeonggeon/preprocess-visualize-spatial-data-eda-xgb</p>
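<p>The inference rule can be sketched as follows (a simplified reimplementation; function and argument names are illustrative):</p>
<preformat>
```python
import numpy as np

def select_species(probs, threshold=0.18, min_species=4):
    """probs: array of shape (n_surveys, n_species) with predicted
    presence probabilities. Species above the threshold are predicted
    present; surveys with fewer than min_species predictions instead
    receive the top min_species most probable species."""
    predictions = []
    for p in probs:
        chosen = np.where(p > threshold)[0]
        if min_species > chosen.size:
            chosen = np.argsort(p)[::-1][:min_species]  # fall back to top-k
        predictions.append(sorted(int(i) for i in chosen))
    return predictions
```
</preformat>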
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental results</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental settings</title>
        <p>Experiments were conducted with the multimodal network described in Section 3.2. The detailed
training settings are shown in Table 1. For comparing different versions of the model, we used the
25 most probable species, to remove the bias introduced by the probability threshold described in Section 3.3.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Usage of feature vector</title>
        <p>In order to investigate the impact of using the feature vector head, we conducted an ablation study.
Table 2 presents the detailed results. It seems that, with the selected hyperparameters, the combination
of bioclimatic, image and feature heads gives the best performance of around 0.32 on both public
and private scores. The performance of other configurations is about 0.31 or less.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Imbalanced data</title>
        <p>As mentioned before, the dataset is strongly unbalanced: for almost
all species the number of observations detecting their presence is much smaller than the number
of observations detecting their absence. We tried to solve this problem in different ways, for
example by adding pos_weight to the BCE loss and by adding different data augmentations. The final option
was to limit the set of species on which the model is trained, taking only those with more than
10 occurrences. Table 2 shows how the score depends on the threshold
for the occurrence number. Another adjustment was lowering the probability threshold above which
a species was considered present. For those observations that had
fewer than 4 species predicted present we assigned the 4 most likely plant species. Results for different
probability thresholds are presented in Table 3.</p>
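<p>The class-weighted BCE idea can be sketched as follows (a minimal NumPy illustration; in PyTorch this corresponds to the pos_weight argument of BCEWithLogitsLoss):</p>
<preformat>
```python
import numpy as np

def weighted_bce(probs, targets, pos_weight):
    """Binary cross-entropy with a per-class positive weight, e.g.
    pos_weight[c] = n_absences[c] / n_presences[c], so rare presences
    contribute more to the loss."""
    eps = 1e-7
    probs = np.clip(probs, eps, 1 - eps)  # avoid log(0)
    loss = -(pos_weight * targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    return loss.mean()
```
</preformat>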
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We presented the working principles of our submission to the GeoLifeCLEF 2024 challenge and
discussed some of the key findings from the results. We have not conducted an expansive, let alone
exhaustive, hyperparameter search and believe that doing so could raise performance somewhat. The
main achievements were selecting a proper model architecture, choosing the training data and changing
the inference strategy. In the final solution, we did not use PO data or the training strategies employed in
previous years [10, 11]. Obviously, using more data would help generalization, and it
is certainly high on the list of improvements that need to be made. Further improvements could also
be achieved by searching for better backbone models, such as Inception-v4 [12] or
the Vision Transformer ViT-B/16 [13], for the different modalities, and by using an ensemble of various
models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Marcos</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Palard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of GeoLifeCLEF 2024:
          <article-title>Species presence prediction based on occurrence data and high-resolution remote sensing images</article-title>
          ,
          <source>in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          , et al.,
          <source>Overview of lifeclef</source>
          <year>2024</year>
          :
          <article-title>Challenges on species distribution prediction and identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Marcos</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of GeoLifeCLEF 2023:
          <article-title>Species presence prediction based on occurrence data and highresolution remote sensing images</article-title>
          ,
          <source>in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N. E.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          , T. C. Edwards Jr.,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Pearman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Svenning</surname>
          </string-name>
          ,
          <article-title>New trends in species distribution modelling</article-title>
          ,
          <source>Ecography</source>
          <volume>33</volume>
          (
          <year>2010</year>
          )
          <fpage>985</fpage>
          -
          <lpage>989</lpage>
          . doi:10.1111/j.1600-0587.2010.06953.x
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>K.</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Ren</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name>,
          <article-title>Deep residual learning for image recognition</article-title>,
          <source>in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>,
          <year>2016</year>, pp. <fpage>770</fpage>-<lpage>778</lpage>. doi:10.1109/CVPR.2016.90.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929-1958.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, 2021. doi:10.1109/ICCV48922.2021.00986.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, F.-F. Li, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248-255. doi:10.1109/CVPR.2009.5206848.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] D. Hendrycks, K. Gimpel, Gaussian error linear units (GELUs), 2016. arXiv:1606.08415.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] H. Ung, R. Kojima, S. Wada, Leverage samples with single positive labels to train CNN-based models for multi-label plant species prediction, 2023.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] B. Kellenberger, D. Tuia, Block label swap for species distribution modelling, 2022.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, AAAI Conference on Artificial Intelligence 31 (2017). doi:10.1609/aaai.v31i1.11231.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, 2021. arXiv:2010.11929.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>