<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Multimodal networks for Species Distribution Modeling Notebook for the LifeCLEF Lab at CLEF 2024</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Aman</forename><forename type="middle">R</forename><surname>Syayfetdinov</surname></persName>
							<email>asysyfetdinov@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Moscow Institute of Physics and Technology (MIPT)</orgName>
								<address>
									<settlement>Dolgoprudny</settlement>
									<country key="RU">Russian Federation</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Multimodal networks for Species Distribution Modeling Notebook for the LifeCLEF Lab at CLEF 2024</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">4EAC6110A0ED959F6EC0D8F979B2F0A9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:00+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Species distribution modeling</term>
					<term>Biodiversity</term>
					<term>LifeCLEF</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Understanding the spatial and temporal distribution of plant species is important for many biodiversity management and conservation scenarios. This paper presents a solution to the GeoLifeCLEF challenge, which involves predicting the presence of plant species using satellite images and time series, climate time series, and other rasterized environmental data. The multimodal model leveraged satellite images, bioclimatic cubes, and feature vectors of satellite time series and environmental scalar values. With the selected presence probability threshold for inference, this method reached an 𝐹1-score of 0.347 on the public and 0.345 on the private leaderboard, placing us 9th.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The GeoLifeCLEF 2024 competition <ref type="bibr" target="#b0">[1]</ref> is held jointly as part of the LifeCLEF 2024 lab <ref type="bibr" target="#b1">[2]</ref> and the FGVC11 workshop. As in the GeoLifeCLEF 2023 competition <ref type="bibr" target="#b2">[3]</ref>, the goal is to predict a list of species most likely to be observed at a given location using various geographical and environmental data such as satellite images and time series, climatic time series, and other rasterized data: land cover, human footprint, bioclimatic, and soil variables. The task of species distribution modelling <ref type="bibr" target="#b3">[4]</ref> typically involves challenges associated with imbalance between species presence and absence in the data, large-scale multimodal learning, and plant species diversity. Its results could be useful for predicting biodiversity change and mitigating environmental pressures from human activities.</p><p>The GeoLifeCLEF 2024 training data includes a collection of observations of plants in Europe. Each survey consists of a list of plant species with GPS coordinates and a set of variables characterizing the landscape and environment around them. There are around 90K surveys with around 5K unique plant species in the dataset. This technical report presents the selected approach to the competition: a multimodal network based on bioclimatic cubes, Sentinel image patches (RGB patch and NIR patch) and a vector of climate, elevation, human footprint, land cover, SoilGrids and Landsat time series data. Training code can be found here<ref type="foot" target="#foot_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Data and Evaluation Metric</head><p>Data plays an important role in predicting plant species distribution at a given location and time. In this section, we briefly present the data and the evaluation metric used for the competition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Data</head><p>This paragraph describes the standard GeoLifeCLEF 2024 dataset. The training dataset contains presence-absence (PA) surveys and presence-only (PO) surveys. PO data includes about 5 million observations and reports only the presence, not the absence, of certain plant species in specific areas. PA data, on the other hand, comprises around 90K surveys with about 5K unique species of the European flora and reports both presence and absence of plant species. Only presence-absence surveys were used in our solution, and everywhere below the report refers only to this type of data. The total number of surveys in the test set was 5K.</p><p>The distribution of the number of observations per plant species in the training dataset is shown in Figure <ref type="figure" target="#fig_0">1</ref>. Almost 50% of plant species in the training data have fewer than 16 occurrences and only 20% have more than 110. Almost all observations were made in Western Europe; a map of locations is shown in Figure <ref type="figure" target="#fig_1">2</ref>. More detailed descriptions can be found on the competition's homepage<ref type="foot" target="#foot_1">2</ref>.</p><p>Each survey is paired with the following covariates:</p><p>• Satellite image patches: 128m×128m RGB-NIR patches centered at each observation, at a resolution of 1 meter per pixel; • Satellite time series: up to 20 years of values for six satellite bands (R, G, B, NIR, SWIR1, and SWIR2); • Environmental rasters: various climatic, pedologic, land use, and human footprint variables at the European scale, provided as scalar values, time series, and original rasters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Evaluation Metric</head><p>The evaluation metric for the GeoLifeCLEF 2024 competition is the samples-averaged 𝐹 1 -score computed on a set of species presence-absence samples. The 𝐹 1 -score measures the overlap between the predicted and actual sets of species present at a given location and time. Each observation 𝑖 is associated with a list of ground-truth labels 𝑌 𝑖 corresponding to the observed plant species. For each observation, the submissions provide a set of predicted present species 𝑃 𝑖,1 , 𝑃 𝑖,2 , ..., 𝑃 𝑖,𝑅 𝑖 . The samples-averaged 𝐹 1 -score is then computed as:</p><formula xml:id="formula_0">F_1 = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + (FP_i + FN_i)/2}</formula><p>where 𝑇 𝑃 𝑖 , 𝐹 𝑃 𝑖 and 𝐹 𝑁 𝑖 are the true positives, false positives and false negatives of the 𝑖-th sample, respectively, and 𝑁 is the number of samples for evaluation.</p></div>
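As an illustration, the samples-averaged 𝐹1 above can be computed with a few lines of Python (a minimal sketch; the function names are ours, not part of the competition toolkit):

```python
def sample_f1(true_species, pred_species):
    # Per-survey F1: overlap between predicted and ground-truth species sets.
    tp = len(true_species & pred_species)
    fp = len(pred_species - true_species)
    fn = len(true_species - pred_species)
    if tp + fp + fn == 0:
        return 1.0  # empty prediction matching empty ground truth
    return tp / (tp + (fp + fn) / 2)

def samples_averaged_f1(y_true, y_pred):
    # y_true, y_pred: lists of species-id sets, one per survey.
    return sum(sample_f1(t, p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example: two surveys
y_true = [{1, 2, 3}, {4, 5}]
y_pred = [{2, 3, 7}, {4, 5}]
score = samples_averaged_f1(y_true, y_pred)  # (2/3 + 1) / 2
```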
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>This section describes the methods that were tried during the competition. Our strategy was centered around the baseline model <ref type="foot" target="#foot_2">3</ref> provided by the competition organizers; its 𝐹 1 -score is 0.31 on the public set. This model leveraged all environmental data and utilized a multimodal neural network with separate feature extractors, returning a single prediction set in order to take advantage of every modality (satellite images, bioclimatic cubes, Landsat cubes). The main change was to replace the Landsat cubes with a vector of satellite time series and environmental scalar values, referred to below as the feature vector. In addition, only plant species with an occurrence number greater than 10 were used to train the model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Feature vector</head><p>The feature vector consists of climate, elevation, human footprint, land cover, SoilGrids and Landsat time series data. Methods for compiling this data are taken from the public notebook <ref type="foot" target="#foot_3">4</ref>. Climatic time series data were merged within a 10-year time window. Some positions had missing values, which were filled with spatial interpolation: since measurements were dense near the missing regions, missing values were filled with values from the nearest neighbors. Finally, each survey had 1198 feature-vector values. The train and test versions can be found here. Before being passed to the model, feature vectors are normalized with a standard scaler.</p></div>
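The gap-filling and normalization steps described above can be sketched as follows (a simplified stand-in with illustrative array and function names; the actual preprocessing follows the referenced public notebook):

```python
import numpy as np

def fill_nearest(coords, values):
    """Fill NaNs with the value of the nearest survey (Euclidean distance
    on coordinates) that has a measurement. Brute-force sketch."""
    values = values.copy()
    known = ~np.isnan(values)
    for i in np.where(~known)[0]:
        d = np.linalg.norm(coords[known] - coords[i], axis=1)
        values[i] = values[known][np.argmin(d)]
    return values

def standard_scale(X):
    """Zero-mean, unit-variance normalization per feature column."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.where(sigma == 0, 1.0, sigma)

coords = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
vals = np.array([10.0, np.nan, 30.0])
filled = fill_nearest(coords, vals)  # nearest neighbor of (0,1) is (0,0)
```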
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Model architecture</head><p>The architecture closely follows the baseline model, incorporating a multimodal neural network that utilizes three distinct feature extractors for bioclimatic rasters (19 channels), satellite images (4-channel RGB with NIR), and feature vectors (1198 values). Their outputs are combined and processed through fully connected layers to generate predictions. The first, bioclimatic head consists of layer normalization, a ResNet18 <ref type="bibr" target="#b4">[5]</ref> without pretrained weights, and dropout <ref type="bibr" target="#b5">[6]</ref> with a 0.1 probability. The second, image head employs a Swin Transformer <ref type="bibr" target="#b6">[7]</ref> model with ImageNet <ref type="bibr" target="#b7">[8]</ref> weights and a dropout layer with a 0.1 probability. Prior to this stage, image data undergo augmentations such as random rotation, random brightness/contrast, and normalization. The third head comprises layer normalization and three linear layers with the GELU <ref type="bibr" target="#b8">[9]</ref> activation function, along with dropout at a 0.1 probability (the first layer maps from 1198 to 1198, the second and third layers map to 1000 outputs). Subsequently, the bioclimatic and feature outputs are normalized and combined with the image output. The final classifier is constructed with three linear layers using the GELU activation function and dropout at a 0.1 probability.</p></div>
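A shape-level sketch of this three-head design follows. The NumPy stand-ins below replace the trained ResNet18, Swin Transformer, and MLP heads with random linear maps, so only the tensor dimensions, not the learned behavior, are representative; all names and input sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, n_out):
    # Random-weight stand-in for a trained layer (illustrative only).
    W = rng.normal(size=(x.shape[-1], n_out)) * 0.01
    return x @ W

batch = 2
bioclim = rng.normal(size=(batch, 19 * 64))  # stand-in for flattened 19-channel raster features
image = rng.normal(size=(batch, 4 * 64))     # stand-in for RGB+NIR patch features
feats = rng.normal(size=(batch, 1198))       # the 1198-value feature vector

h_bio = linear(bioclim, 1000)              # bioclimatic head -> 1000 outputs
h_img = linear(image, 768)                 # image head -> 768 (Swin-T embedding size)
h_ft = linear(linear(feats, 1198), 1000)   # feature head: 1198 -> 1198 -> 1000

fused = np.concatenate([h_bio, h_img, h_ft], axis=1)  # 2768 fused features
logits = linear(fused, 2857)               # classifier over the 2857 kept species
```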
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Training and inference</head><p>The model was trained on PA data for 12 epochs using the Adam optimizer with a learning rate of 8e-5, binary cross entropy (BCE) loss, and a batch size of 128. During training, we focused on plant species with an occurrence number greater than 10, resulting in 2857 unique species out of a total of 5015. It is worth highlighting that this occurrence threshold was determined through experimentation.</p><p>In the final approach to inference, the strategy used in the baseline notebook was changed. Rather than forecasting the 25 most probable species for every observation in the test dataset, a selected threshold of 0.18 was used: species with probabilities surpassing this value were classified as present. Additionally, test observations with fewer than 4 predicted species were assigned the 4 most likely species.</p></div>
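The inference rule described above can be sketched as follows (an illustrative implementation; the function name is ours, not from the released code):

```python
import numpy as np

def predict_species(probs, threshold=0.18, min_species=4):
    """Species whose probability exceeds `threshold` are predicted present;
    surveys with fewer than `min_species` predictions fall back to the
    top-`min_species` most probable species."""
    out = []
    for p in probs:
        idx = np.where(p > threshold)[0]
        if len(idx) < min_species:
            idx = np.argsort(p)[::-1][:min_species]
        out.append(set(idx.tolist()))
    return out

probs = np.array([
    [0.9, 0.5, 0.3, 0.2, 0.1],       # four species above the threshold
    [0.05, 0.04, 0.03, 0.02, 0.01],  # none above -> top-4 fallback
])
preds = predict_species(probs)
```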
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Experimental settings</head><p>Experiments were conducted with the multimodal network described in Section 3.2. The detailed training settings are shown in Table <ref type="table" target="#tab_0">1</ref>. For comparing different versions of the model, we used the 25 most probable species in order to remove the bias introduced by the probability threshold described in Section 3.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Usage of feature vector</head><p>In order to investigate the impact of the feature vector head, we conducted an ablation study. Table <ref type="table">2</ref> presents the detailed results. With the selected hyperparameters, the combination of bioclimatic, image and feature heads gives the best performance of around 0.32 on both public and private scores; the other configurations score about 0.31 or less.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Imbalanced data</head><p>As mentioned before, the dataset is strongly imbalanced: for almost all species, the number of observations detecting their presence is much smaller than the number detecting their absence. We tried to address this problem in several ways, for example by adding pos_weight to the BCE loss and by adding different data augmentations. The final option was to limit the set of species the model is trained on, taking only those with an occurrence number greater than 10. Table <ref type="table" target="#tab_1">3</ref> shows how the score depends on the occurrence-number threshold. Another change was lowering the probability threshold above which a species was considered present. For observations with fewer than 4 species present, we assigned the 4 most likely plant species. Results for different probability thresholds are presented in Table <ref type="table" target="#tab_2">4</ref>.</p></div>
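The occurrence-count filtering used for training can be sketched as follows (a toy illustration; the helper name is ours):

```python
from collections import Counter

def filter_rare_species(survey_labels, min_occurrences=10):
    """Keep only species appearing in more than `min_occurrences` surveys,
    as in the final training setup. Sketch on lists of label lists."""
    counts = Counter(s for labels in survey_labels for s in set(labels))
    kept = {s for s, c in counts.items() if c > min_occurrences}
    return [[s for s in labels if s in kept] for labels in survey_labels]

# Toy example: species "a" appears in 3 surveys, "b" in only 1
surveys = [["a", "b"], ["a"], ["a"]]
filtered = filter_rare_species(surveys, min_occurrences=2)
```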
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>We presented the working principles of our submission to the GeoLifeCLEF 2024 challenge and discussed some of the key findings. We have not conducted an extensive, let alone exhaustive, hyperparameter search and believe that doing so could raise performance further. The main contributions were choosing a proper model architecture, selecting the training data, and changing the inference strategy. In the final solution, we did not use PO data or the training strategies used in previous years <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>. Using more data would likely improve generalization, and it is certainly high on the list of improvements to be made. Further gains might also come from searching for better backbone models for the different modalities, such as Inception-v4 <ref type="bibr" target="#b11">[12]</ref> or Vision Transformer (ViT-B/16) <ref type="bibr" target="#b12">[13]</ref>, and from using an ensemble of various models.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Histogram of the number of occurrences of plant species in the training dataset. The horizontal axis uses a logarithmic scale for readability.</figDesc><graphic coords="3,113.87,84.19,367.54,190.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Map of Europe with the observation distribution. Train data locations are green points; test data locations are red points.</figDesc><graphic coords="3,169.60,319.94,256.08,170.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Selected multimodal architecture. The bioclimatic, image and feature heads map to 1000, 768 and 1000 outputs, respectively. The stacked outputs then pass through linear layers mapping to the 2857 species (species with occurrence number &gt; 10).</figDesc><graphic coords="5,89.29,84.19,425.70,242.10" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Training hyper-parameters</figDesc><table><row><cell></cell><cell cols="2">Hyper-parameters</cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>Batch size</cell><cell></cell><cell>128</cell><cell></cell><cell></cell></row><row><cell></cell><cell>Optimizer</cell><cell></cell><cell>Adam</cell><cell></cell><cell></cell></row><row><cell></cell><cell>Learning rate</cell><cell></cell><cell>8e-5</cell><cell></cell><cell></cell></row><row><cell></cell><cell>Lr scheduler</cell><cell></cell><cell>CosineAnnealingLR</cell><cell></cell><cell></cell></row><row><cell></cell><cell cols="2">Number of epochs</cell><cell>12</cell><cell></cell><cell></cell></row><row><cell>Table 2</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">Ablation study of the feature vector head usage</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="4">Bioclimatic head Image head Feature head Landsat head</cell><cell cols="2">𝐹 1 -score Public Private</cell></row><row><cell>✓</cell><cell>✓</cell><cell>-</cell><cell>✓</cell><cell>0.315</cell><cell>0.316</cell></row><row><cell>✓</cell><cell>✓</cell><cell>✓</cell><cell>✓</cell><cell>0.317</cell><cell>0.317</cell></row><row><cell>-</cell><cell>✓</cell><cell>✓</cell><cell>✓</cell><cell>0.306</cell><cell>0.311</cell></row><row><cell>✓</cell><cell>✓</cell><cell>✓</cell><cell>-</cell><cell>0.322</cell><cell>0.323</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3</head><label>3</label><figDesc>Score depending on the number of occurrences of plant species for model training</figDesc><table><row><cell>Species with number of occurrences</cell><cell cols="2">𝐹 1 -score Public Private</cell></row><row><cell>&gt;0 (5096 in total)</cell><cell>0.322</cell><cell>0.323</cell></row><row><cell>&gt;5 (3425 in total)</cell><cell>0.322</cell><cell>0.326</cell></row><row><cell>&gt;10 (2857 in total)</cell><cell>0.326</cell><cell>0.329</cell></row><row><cell>&gt;15 (2511 in total)</cell><cell>0.324</cell><cell>0.328</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 4</head><label>4</label><figDesc>Score depending on the presence probability threshold</figDesc><table><row><cell>Probability threshold</cell><cell cols="2">𝐹 1 -score Public Private</cell></row><row><cell>0.4</cell><cell>0.309</cell><cell>0.303</cell></row><row><cell>0.3</cell><cell>0.334</cell><cell>0.332</cell></row><row><cell>0.2</cell><cell>0.346</cell><cell>0.345</cell></row><row><cell>0.15</cell><cell>0.345</cell><cell>0.342</cell></row><row><cell>0.1</cell><cell>0.329</cell><cell>0.327</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://www.kaggle.com/code/lonansyayf/baseline-with-modifications/notebook</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://www.kaggle.com/competitions/geolifeclef-2024/data</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://www.kaggle.com/code/picekl/sentinel-landsat-bioclim-baseline-0-31626</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://www.kaggle.com/code/gobyeonggeon/preprocess-visualize-spatial-data-eda-xgb</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of GeoLifeCLEF 2024: Species presence prediction based on occurrence data and high-resolution remote sensing images</title>
		<author>
			<persName><forename type="first">L</forename><surname>Picek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Botella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Servajean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Deneu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">Marcos</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Palard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Larcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Leblanc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Estopinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bonnet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2024 -Conference and Labs of the Evaluation Forum</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of lifeclef 2024: Challenges on species distribution prediction and identification</title>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Picek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Espitalier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Botella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Deneu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Marcos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Estopinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Leblanc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Larcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Šulc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hrúz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Servajean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Matas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference of the Cross-Language Evaluation Forum for European Languages</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Overview of GeoLifeCLEF 2023: Species presence prediction based on occurrence data and high-resolution remote sensing images</title>
		<author>
			<persName><forename type="first">C</forename><surname>Botella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Deneu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Estopinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Servajean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">Marcos</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2023 -Conference and Labs of the Evaluation Forum</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">New trends in species distribution modelling</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">E</forename><surname>Zimmermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">C</forename><surname>Edwards</surname><genName>Jr</genName></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>Graham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">B</forename><surname>Pearman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-C</forename><surname>Svenning</surname></persName>
		</author>
		<idno type="DOI">10.1111/j.1600-0587.2010.06953.x</idno>
	</analytic>
	<monogr>
		<title level="j">Ecography</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="985" to="989" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2016.90</idno>
	</analytic>
	<monogr>
		<title level="m">2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Dropout: A simple way to prevent neural networks from overfitting</title>
		<author>
			<persName><forename type="first">N</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="1929" to="1958" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Swin transformer: Hierarchical vision transformer using shifted windows</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Guo</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICCV48922.2021.00986</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">ImageNet: a Large-scale hierarchical image database</title>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F.-F</forename><surname>Li</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2009.5206848</idno>
	</analytic>
	<monogr>
		<title level="m">2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="248" to="255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gimpel</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1606.08415</idno>
		<title level="m">Gaussian error linear units (GELUs)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Leverage samples with single positive labels to train CNNbased models for multi-label plant species prediction</title>
		<author>
			<persName><forename type="first">H</forename><surname>Ung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kojima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wada</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Block label swap for species distribution modelling</title>
		<author>
			<persName><forename type="first">B</forename><surname>Kellenberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tuia</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Inception-v4, Inception-ResNet and the impact of residual connections on learning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alemi</surname></persName>
		</author>
		<idno type="DOI">10.1609/aaai.v31i1.11231</idno>
	</analytic>
	<monogr>
		<title level="j">AAAI Conference on Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">An image is worth 16x16 words: Transformers for image recognition at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.11929</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
