Multimodal networks for Species Distribution Modeling
Notebook for the LifeCLEF Lab at CLEF 2024

Aman R. Syayfetdinov¹
¹ Moscow Institute of Physics and Technology (MIPT), Dolgoprudny, Russian Federation

Abstract
Understanding the spatial and temporal distribution of plant species is important for many biodiversity management and conservation scenarios. This paper presents a solution to the GeoLifeCLEF challenge, which involves predicting the presence of plant species using satellite images and time series, climate time series and other rasterized environmental data. The multimodal model leverages satellite images, bioclimatic cubes and feature vectors built from satellite time series and environmental scalar values. With a selected presence probability threshold at inference, this method reached an $F_1$-score of 0.347 on the public and 0.345 on the private leaderboard, placing 9th on the leaderboard.

Keywords
Species distribution modeling, Biodiversity, LifeCLEF

This is a technical report for the final contribution to the GeoLifeCLEF 2024 challenge, submitted under the pseudonym “Lonan Syayf”, which obtained ninth place (out of 51 competitors) on the private leaderboard.

1. Introduction
The GeoLifeCLEF 2024 competition [1] is held jointly as part of the LifeCLEF 2024 lab [2] and the FGVC11 workshop. As in the GeoLifeCLEF 2023 competition [3], the goal is to predict a list of species most likely to be observed at a given location using various geographical and environmental data such as satellite images and time series, climatic time series, and other rasterized data: land cover, human footprint, bioclimatic, and soil variables. Typically, the task of species distribution modelling [4] faces challenges associated with imbalances between species presence and absence in the data, large-scale multimodal learning, and plant species diversity. Its results could be useful for predicting biodiversity change and mitigating environmental pressures from human activities.
The GeoLifeCLEF 2024 training data includes a collection of plant observations in Europe. Each survey consists of a list of plant species with GPS coordinates and a set of variables characterizing the surrounding landscape and environment. There are around 90K surveys with around 5K unique plant species in the dataset.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
asysyfetdinov@gmail.com (A. R. Syayfetdinov)
ORCID: 0009-0005-5170-0829 (A. R. Syayfetdinov)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

This technical report presents the selected approach to the competition: a multimodal network based on bioclimatic cubes, Sentinel image patches (RGB patch and NIR patch) and a vector of climate, elevation, human footprint, land cover, SoilGrids and Landsat time series data. Training code can be found here¹.

2. Data and Evaluation Metric
Data plays an important role in predicting plant species distribution at a given location and time. In this section, we briefly present the data and the evaluation metric used for the competition.

2.1. Data
This subsection describes the standard GeoLifeCLEF 2024 dataset. The training dataset contains presence-absence (PA) surveys and presence-only (PO) surveys. The PO data includes about 5 million observations and reports only the presence, not the absence, of certain plant species in specific areas. The PA data, in contrast, combines around 90K surveys covering about 5K unique species of the European flora and reports both presence and absence of plant species. Only the presence-absence surveys were used in the solution, and the rest of this report refers to this type of data only. The test set contains 5K surveys.
The distribution of the number of observations per plant species in the training dataset is shown in Figure 1. Almost 50% of plant species in the training data have fewer than 16 occurrences, and only 20% have more than 110 occurrences. Almost all observations were made in Western Europe; a map of locations can be seen in Figure 2. More detailed descriptions can be found on the competition's homepage².

Each survey is paired with the following covariates:
• Satellite image patches: 128 m × 128 m RGB-NIR patches centered at each observation, at a resolution of 1 meter per pixel;
• Satellite time series: up to 20 years of values for six satellite bands (R, G, B, NIR, SWIR1, and SWIR2);
• Environmental rasters: various climatic, pedologic, land use, and human footprint variables at the European scale, provided as scalar values, time series, and original rasters.

¹ https://www.kaggle.com/code/lonansyayf/baseline-with-modifications/notebook
² https://www.kaggle.com/competitions/geolifeclef-2024/data

Figure 1: Histogram of the distribution of plant species occurrences in the training dataset. The horizontal axis uses a logarithmic scale for readability.

Figure 2: Map of Europe with the distribution of observations. Training locations are shown as green points, test locations as red points.

2.2. Evaluation Metric
The evaluation metric for the GeoLifeCLEF 2024 competition is the samples-averaged $F_1$-score computed on a set of species presence-absence samples. The $F_1$-score is an average measure of overlap between the predicted and actual sets of species present at a given location and time. Each observation $i$ is associated with a list of ground-truth labels $Y_i$ corresponding to the observed plant species. For each observation, the submissions provide a set of predicted species presences $P_{i,1}, P_{i,2}, \dots, P_{i,R_i}$.
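As a minimal sketch of this metric, the snippet below computes a per-survey score from the counts of true positives, false positives and false negatives and averages it over surveys. It assumes that predictions and ground truth are given as Python sets of integer species identifiers; the function names are hypothetical, and scoring an empty prediction against an empty ground truth as 0 is an assumption, not something stated by the competition.

```python
from typing import Sequence, Set


def sample_f1(predicted: Set[int], actual: Set[int]) -> float:
    """F1 for a single survey: TP / (TP + (FP + FN) / 2)."""
    tp = len(predicted & actual)   # species predicted and observed
    fp = len(predicted - actual)   # species predicted but not observed
    fn = len(actual - predicted)   # species observed but not predicted
    denominator = tp + (fp + fn) / 2
    # Assumption: an empty prediction for an empty ground truth scores 0.
    return tp / denominator if denominator > 0 else 0.0


def samples_averaged_f1(predictions: Sequence[Set[int]],
                        truths: Sequence[Set[int]]) -> float:
    """Mean of the per-survey F1 values over all surveys."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not predictions:
        raise ValueError("at least one survey is required")
    total = sum(sample_f1(p, y) for p, y in zip(predictions, truths))
    return total / len(predictions)


# Toy example with three surveys (species identifiers are made up).
predictions = [{1, 2, 3}, {4}, {5, 6}]
truths = [{1, 2}, {4, 7}, {8}]
score = samples_averaged_f1(predictions, truths)
print(round(score, 4))  # per-survey values are 0.8, 2/3 and 0.0
```

Working with sets keeps the per-survey counts explicit; with a multi-hot label matrix the same quantities could be obtained from element-wise products instead.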
The $F_1$-score is then computed using:

$$F_1 = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + (FP_i + FN_i)/2}$$

where $TP_i$, $FP_i$ and $FN_i$ are the numbers of true positives, false positives and false negatives for the $i$-th sample, respectively, and $N$ is the number of samples used for evaluation.

3. Methodology
This section describes the methods that were tried during the competition. The strategy was centered around the baseline model³ provided by the competition organizers, whose $F_1$-score is 0.31 on the public set. The baseline leverages all environmental data and uses a multimodal neural network with separate feature extractors, one per modality (satellite images, bioclimatic cubes, Landsat cubes), to return a single prediction set. The main change was to replace the Landsat cubes with a vector of satellite time series and environmental scalar values, referred to below as the feature vector. In addition, only plant species with more than 10 occurrences were used to train the model.

3.1. Feature vector
The feature vector consists of climate, elevation, human footprint, land cover, SoilGrids and Landsat time series data. The methods for compiling this data are taken from a public notebook⁴. Climatic time series data was merged within a 10-year time window. Some positions had missing values, which were filled by spatial interpolation: because measurements were dense near the missing regions, missing values were filled with values from the nearest neighbors. In the end, each survey had a feature vector of 1198 values. The train and test versions can be found here. Before being passed to the model, feature vectors are normalized with a standard scaler.

3.2. Model architecture
The architecture closely follows the baseline model: a multimodal neural network with three distinct feature extractors for bioclimatic rasters (19 channels), satellite images (4 channels, RGB with NIR), and feature vectors (1198 channels).
These outputs are combined and processed through fully connected layers to generate predictions. The first head, for bioclimatic data, consists of layer normalization, a ResNet18 [5] without pretrained weights, and dropout [6] with probability 0.1. The second head, for images, uses a Swin Transformer [7] model with ImageNet [8] weights and a dropout layer with probability 0.1. Before this stage, image data undergoes augmentation such as random rotation and random brightness/contrast changes, followed by normalization. The third head comprises layer normalization followed by three linear layers with GELU [9] activation and dropout with probability 0.1 (the first layer maps from 1198 to 1198, and the second and third layers map to 1000 outputs). Subsequently, the bioclimatic and feature outputs are normalized and combined with the image output. The final classifier consists of three linear layers with GELU activation and dropout with probability 0.1.

Figure 3: Selected multimodal architecture. The bioclimatic, image and feature heads map to 1000, 768 and 1000 outputs, respectively. The stacked outputs then pass through linear layers mapping to the 2857 species (species with more than 10 occurrences).

³ https://www.kaggle.com/code/picekl/sentinel-landsat-bioclim-baseline-0-31626
⁴ https://www.kaggle.com/code/gobyeonggeon/preprocess-visualize-spatial-data-eda-xgb

3.3. Training and inference
The model was trained on PA data for 12 epochs using the Adam optimizer with a learning rate of 8e-5, binary cross-entropy (BCE) loss and a batch size of 128. During training, we focused on plant species with more than 10 occurrences, resulting in 2857 unique species out of a total of 5015. The occurrence threshold was determined through experimentation. In the final approach to inference, the strategy used in the baseline notebook was changed.
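To make the changed inference rule concrete, here is a minimal sketch that contrasts a fixed top-k selection with a probability threshold plus a fallback to the most likely species. The threshold of 0.18, the minimum of 4 species and the baseline's k of 25 are the values reported in this section; the function names and the dictionary-based input format are hypothetical and chosen only for illustration, so this is not the code of the submission.

```python
from typing import Dict, List


def top_k_species(probabilities: Dict[int, float], k: int = 25) -> List[int]:
    """Baseline-style rule: always return the k most probable species."""
    ranked = sorted(probabilities, key=lambda s: probabilities[s], reverse=True)
    return sorted(ranked[:k])


def threshold_species(probabilities: Dict[int, float],
                      threshold: float = 0.18,
                      min_species: int = 4) -> List[int]:
    """Threshold rule with a fallback to the min_species most probable species."""
    ranked = sorted(probabilities, key=lambda s: probabilities[s], reverse=True)
    selected = [s for s in ranked if probabilities[s] > threshold]
    if len(selected) < min_species:
        # Too few species pass the threshold: take the most likely ones instead.
        selected = ranked[:min_species]
    return sorted(selected)


# Toy predicted probabilities for two surveys (species identifiers are made up).
survey_a = {101: 0.90, 102: 0.50, 103: 0.30, 104: 0.20, 105: 0.19, 106: 0.05}
survey_b = {201: 0.40, 202: 0.10, 203: 0.08, 204: 0.02, 205: 0.01}

print(threshold_species(survey_a))  # five species exceed the threshold
print(threshold_species(survey_b))  # only one exceeds it, so the top 4 are used
```

Unlike the fixed top-k rule, the number of predicted species here varies per survey, which matters for a samples-averaged metric because every extra false positive lowers the per-survey score.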
Rather than predicting the 25 most probable species for every observation in the test dataset, a probability threshold of 0.18 was used: species whose predicted probability exceeded this value were classified as present. Additionally, test observations with fewer than 4 predicted species were assigned the 4 most likely species.

4. Experimental results
4.1. Experimental settings
Experiments were conducted with the multimodal network described in Section 3.2. The detailed training settings are shown in Table 1. To compare different versions of the model, we used the 25 most probable species per observation, which removes the bias introduced by the probability threshold described in Section 3.3.

Table 1
Training hyper-parameters

Hyper-parameter     Value
Batch size          128
Optimizer           Adam
Learning rate       8e-5
Lr scheduler        CosineAnnealingLR
Number of epochs    12

4.2. Usage of feature vector
To investigate the impact of the feature vector head, we conducted an ablation study. Table 2 presents the detailed results. It seems that, with the selected hyperparameters, the combination of bioclimatic, image and feature heads gives the best performance, around 0.32 on both public and private scores. The other configurations score between 0.306 and 0.317.

Table 2
Ablation study on the use of the feature vector head

Bioclimatic head   Image head   Feature head   Landsat head   Public $F_1$   Private $F_1$
✓                  ✓            -              ✓              0.315          0.316
✓                  ✓            ✓              ✓              0.317          0.317
-                  ✓            ✓              ✓              0.306          0.311
✓                  ✓            ✓              -              0.322          0.323

Table 3
Score depending on the number of occurrences of plant species used for model training

Species with number of occurrences   Public $F_1$   Private $F_1$
>0 (5096 in total)                   0.322          0.323
>5 (3425 in total)                   0.322          0.326
>10 (2857 in total)                  0.326          0.329
>15 (2511 in total)                  0.324          0.328

4.3.
Imbalanced data
As mentioned before, the dataset is strongly imbalanced: for almost all species, the number of observations recording their presence is much smaller than the number recording their absence. We tried to address this problem in several ways, for example by adding pos_weight to the BCE loss and by adding different data augmentations. The final option was to limit the set of species on which the model is trained to those with more than 10 occurrences. Table 3 shows how the score depends on the occurrence threshold. Another measure was lowering the probability threshold above which a species is considered present. Observations with fewer than 4 predicted species were assigned the 4 most likely plant species. Results for different probability thresholds are presented in Table 4.

Table 4
Score depending on the presence probability threshold

Probability threshold   Public $F_1$   Private $F_1$
0.4                     0.309          0.303
0.3                     0.334          0.332
0.2                     0.346          0.345
0.15                    0.345          0.342
0.1                     0.329          0.327

5. Conclusion
We presented the working principles of our submission to the GeoLifeCLEF 2024 challenge and discussed some key findings. We did not conduct an extensive, let alone exhaustive, hyperparameter search and believe that doing so could raise performance somewhat. The main contributions were the choice of model architecture, the selection of training data and the change of inference strategy. In the final solution, we did not use PO data or the training strategies used in previous years [10, 11]. Using more data would likely improve generalization and is high on the list of improvements to be made. Further improvements may come from searching for better backbone models, such as Inception-v4 [12] or the Vision Transformer ViT-B/16 [13], for the different modalities, and from using an ensemble of various models.

References
[1] L. Picek, C. Botella, M.
Servajean, B. Deneu, D. Marcos Gonzalez, R. Palard, T. Larcher, C. Leblanc, J. Estopinan, P. Bonnet, A. Joly, Overview of GeoLifeCLEF 2024: Species presence prediction based on occurrence data and high-resolution remote sensing images, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.
[2] A. Joly, L. Picek, S. Kahl, H. Goëau, V. Espitalier, C. Botella, B. Deneu, D. Marcos, J. Estopinan, C. Leblanc, T. Larcher, M. Šulc, M. Hrúz, M. Servajean, J. Matas, et al., Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2024.
[3] C. Botella, B. Deneu, J. Estopinan, M. Servajean, D. Marcos Gonzalez, A. Joly, Overview of GeoLifeCLEF 2023: Species presence prediction based on occurrence data and high-resolution remote sensing images, in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, 2023.
[4] N. E. Zimmermann, T. C. Edwards Jr., C. H. Graham, P. B. Pearman, J.-C. Svenning, New trends in species distribution modelling, Ecography 33 (2010) 985–989. doi:10.1111/j.1600-0587.2010.06953.x.
[5] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
[6] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958.
[7] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, 2021. doi:10.1109/ICCV48922.2021.00986.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, F.-F.
Li, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
[9] D. Hendrycks, K. Gimpel, Gaussian error linear units (GELUs) (2016). arXiv:1606.08415.
[10] H. Ung, R. Kojima, S. Wada, Leverage samples with single positive labels to train CNN-based models for multi-label plant species prediction, 2023.
[11] B. Kellenberger, D. Tuia, Block label swap for species distribution modelling, 2022.
[12] C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, AAAI Conference on Artificial Intelligence 31 (2016). doi:10.1609/aaai.v31i1.11231.
[13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, 2021. arXiv:2010.11929.