Block Label Swap for Species Distribution Modelling

Benjamin Kellenberger, Devis Tuia
Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland

Abstract
We present our solution to the GeoLifeCLEF 2022 challenge, which consists of identifying the species (out of 17,034 floral and faunal taxa) present at data points over the contiguous U.S. and France, based on remote sensing data and other covariates. We cast the objective as a classification problem and regularise the hard-assigned, single species label with a random label swap with another sample in spatial vicinity during training. Ensembling multiple deep learning models that each ingest three or six satellite remote sensing bands, we achieve a top-30 accuracy of 31.22% on the private test set, placing us second on the leaderboard, 0.31 percentage points behind the contest winners. We discuss our design choices and reflect on the results, possible future work, and the wider objective of species distribution modelling in general.

Keywords: Species Distribution Modelling, Deep Learning, Remote Sensing

This is a technical report for our contribution to the GeoLifeCLEF 2022 challenge¹, submitted under the pseudonym "Matsushita-san", with which we obtained second place (out of 56 competitors) on the private leaderboard.

¹ https://www.kaggle.com/competitions/geolifeclef-2022-lifeclef-2022-fgvc9

1. Introduction
Species distribution modelling is a research discipline that aims at predicting the likelihood of sighting a taxon (floral, faunal, fungal) at a particular location in space (and optionally time) on earth [1, 2, 3]. Doing so generally involves correlating species sightings (occurrence records) with covariates describing the species' environment, such as climatic, pedologic or biotic (e.g., interactions like symbiosis or predation) variables. This process has been extensively researched in the field of ecology, and a large variety of studies and approaches have been published over time, for example for forests [4], fungi [5], marine taxa [6], and more. Due to the probabilistic-correlative nature of species distribution modelling, as well as increasingly large and varied data sources like satellite remote sensing (for environmental covariates) and crowdsourcing (for species observation records), the topic has recently gained attention in the machine learning community [7]. The GeoLifeCLEF 2022 challenge [8], as part of the LifeCLEF 2022 challenges [9], is a testimony to this, as it provides a benchmark to train, evaluate, and compare machine learning models for a large number of species and data points. It has been hosted for three years and has resulted in steady increases in the performance of prediction models (see e.g. Seneviratne [10] for the winning solution of the previous GeoLifeCLEF 2021 contest). The following sections describe the dataset underlying the challenge, as well as our entry in it.
2. Data
The GeoLifeCLEF 2022 challenge is an evolution of the GeoLifeCLEF location-based species prediction competition, which has been hosted since 2020 based on the GeoLifeCLEF 2020 dataset [11]. Here, the aim is to predict the species occurring at around 1.8 million locations over the contiguous U.S. and France, with occurrence records obtained from the public iNaturalist crowdsourcing initiative². To do so, multiple data sources describing the environment of each location ("covariates") are provided: high-resolution (1 m) satellite remote sensing imagery (RGB, near-infrared (NIR), altitude) and a land cover product, climatic and topographic rasters at lower resolution (250 m to 0.5 arcseconds), as well as the position of the observational data points in degrees lat/lon. Evaluation of the contenders' model performance is done through the top-30 accuracy in species prediction on a held-out test dataset (i.e., a prediction of the 30 most likely present species has to be submitted for each data point in the test set; a prediction is deemed "correct" if the ground truth species is among the 30). Compared to the preceding GeoLifeCLEF 2020 dataset, the 2022 version features one major change: all species with fewer than three data points in the training and validation sets combined have been discarded. This reduced the total number of species from 31,435 to 17,034.

² https://www.inaturalist.org

3. Approach
The key methodology pursued in this study mostly corresponds to the one described in [12], primarily regarding the spatial block-label swap principle described below. This approach allowed us to beat the winning submission of the previous year's GeoLifeCLEF competition [10] (submitted and evaluated post challenge runtime), without the expensive self-supervised pre-training employed by the winners of the 2021 challenge. For the 2022 challenge, we make the following fundamental changes compared to [12]: (i.) we drop the originally proposed curriculum learning approach, as it did not improve performance on the validation or test sets but led to serious model overfitting; (ii.) we replace the original land cover input layer with NDVI [13], computed from the red and NIR bands available:

$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{red}}{\mathrm{NIR} + \mathrm{red}}$   (1)

The general model architecture is shown in Figure 1. By default, the model is composed of two feature extractors of identical architecture, but with separate weights, that accept spatial remote sensing rasters. The architectures used for the feature extractors vary (see Table 1), but we apply the same general principle to all (i.e., we keep all layers except for the final, fully-connected classification layer usually appended). We group the remote sensing rasters into "packets" of three-band inputs, i.e., RGB (packet 1), as well as NIR, altitude, and NDVI (packet 2), and feed each packet into its own feature extractor, joined together by feature stacking. We then apply dropout [14] with a relatively high probability of 0.45 to this stacked feature vector (we found this probability to lead to better generalisation), followed by a final fully-connected layer that maps from the stacked latent feature vectors to the 17,034 species classes.

Figure 1: General architectural design used for all models trained in this study. The two feature extractors ingest different parts of the remote sensing data and are of identical architecture, but do not share parameters. Their outputs (latent feature vectors, taken prior to the original but removed classifier) get stacked and subjected to dropout and a fully-connected layer that maps to the 17,034 species. Variations like the RGB-only model or the taxonomy predictor are not shown.
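To make Figure 1 concrete, the following is a minimal PyTorch sketch of the two-branch design, not our exact training code: it uses ResNet-50 backbones from Torchvision (one of the variants in Table 1), assumes an input tensor stacking the five raw bands in the order red, green, blue, NIR, altitude, and computes NDVI on the fly via Eq. (1); class and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torchvision

class TwoBranchClassifier(nn.Module):
    """Two identically structured feature extractors with separate weights,
    followed by feature stacking, dropout, and a species classifier."""

    def __init__(self, num_species=17034, p_drop=0.45):
        super().__init__()

        def make_backbone():
            m = torchvision.models.resnet50(pretrained=True)
            feat_dim = m.fc.in_features      # 2048 for ResNet-50
            m.fc = nn.Identity()             # drop the original ImageNet classifier
            return m, feat_dim

        self.branch_rgb, dim = make_backbone()   # packet 1: RGB
        self.branch_aux, _ = make_backbone()     # packet 2: NIR, altitude, NDVI
        self.dropout = nn.Dropout(p_drop)
        self.classifier = nn.Linear(2 * dim, num_species)

    def forward(self, x):
        # x: B x 5 x 256 x 256, band order assumed to be R, G, B, NIR, altitude
        rgb = x[:, :3]
        red, nir, alt = x[:, 0:1], x[:, 3:4], x[:, 4:5]
        ndvi = (nir - red) / (nir + red + 1e-8)  # Eq. (1), small epsilon for stability
        aux = torch.cat([nir, alt, ndvi], dim=1)
        feats = torch.cat([self.branch_rgb(rgb), self.branch_aux(aux)], dim=1)
        return self.classifier(self.dropout(feats))
```

In the actual experiments, Inception-v4 and DenseNet-201 backbones from TIMM replace the ResNet-50 branches for most runs (Table 1), and the RGB-only variant uses a single branch with half the latent feature size.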
We encountered different levels of instability of the batch normalisation layers [15], leading to degraded and, in some cases, unusable performance. We hypothesise this to be due to the comparably small variance of pixels in the remote sensing products, conflicting with the weights pre-trained on ImageNet [16]. This instability affected both the RGB and the NIR+altitude+NDVI branch of our models. We experimented with replacing all batch normalisation layers with instance normalisation [17], but found this solution to slightly lag behind in terms of performance as well. We also attempted to train one model (#3a in Table 1) from scratch (i.e., without pre-training), hoping for the model to learn more sensible statistics. However, this attempt was unsuccessful; the model would not exceed around 5% top-30 accuracy, even on the training set. Hence, we retain all batch normalisation layers, but always keep them in training mode, even during inference (i.e., band-wise means and variances are always inferred from the batch and not learnt). This technically decreases stability, as the predictions depend on the size and composition of the batch. We thus use the same batch size during inference as during training and employ test-time augmentation to partially remedy the prediction variability issue and increase prediction robustness overall.

3.1. Spatial block-label swap
As in [12], we attempt to address the presence-only limitation of the dataset with an ad hoc relaxation in geographic space. To do so, we let the model implicitly infer additional knowledge about a particular data point from its spatial neighbours. In practice, we proceed by a heuristic we term "spatial block-label swap": we construct a grid of square cells (size 0.01° × 0.01° lat/lon), spanning an area from the northwesternmost to the southeasternmost training point in the dataset. We then assign each data point to its encompassing grid cell. During training, we look up the grid cell for each training point and replace the target species ID label with another one encountered in the same grid cell, with a probability of 10%. We experimented with different probabilities and different swapping heuristics (e.g., increasing the swap likelihood based on the species occurrence histogram per grid cell), but found this not to make much of a difference. In the end, we perform the spatial block-label swap in 10% of cases and simply replace the species class with the one taken from another training set sample, drawn uniformly from all the other samples within the grid cell, irrespective of any species abundances or spatial proximity to the current sample. Note that this other sample can carry the species already assigned to the current sample; the true percentage of labels swapped is thus likely less than 10%. As-is, the spatial block-label swap provided us with a boost of about two percent in test accuracy in the previous challenge, so we used it for all model variations in this work.
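A compact sketch of this heuristic is given below; the tuple layout of the occurrence records and the helper names are our own, illustrative choices, not the exact implementation.

```python
import random
from collections import defaultdict

def build_grid_index(points, cell_size=0.01):
    """Assign every training point to a square lat/lon grid cell.
    `points` is assumed to be a list of (sample_id, lat, lon, species_id) tuples;
    the 0.01 degree cell size follows the description in Section 3.1."""
    grid = defaultdict(list)
    for sample_id, lat, lon, species_id in points:
        cell = (int(lat // cell_size), int(lon // cell_size))
        grid[cell].append((sample_id, species_id))
    return grid

def swap_label(sample_id, lat, lon, species_id, grid, cell_size=0.01, p_swap=0.1):
    """With probability p_swap, replace the target species with the species of
    another sample drawn uniformly from the same grid cell. The drawn sample may
    carry the same species, so the effective swap rate is below p_swap."""
    if random.random() >= p_swap:
        return species_id
    cell = (int(lat // cell_size), int(lon // cell_size))
    neighbours = [spec for sid, spec in grid[cell] if sid != sample_id]
    return random.choice(neighbours) if neighbours else species_id
```

During training, `swap_label` would be called once per sample when assembling a batch, so the relaxation is re-drawn at every epoch.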
3.2. Variance-based model ensembling
For this contest we experiment with multiple model instantiations of the same idea (spatial block-label swap). The variations between the different model runs are listed in Table 1 and can be summarised as follows:

• Variation of model architecture: we employ three types of architectures as base feature extractors, i.e., ResNet-50 [18], DenseNet-201 [19], and Inception-v4 [20].

• Variation of model inputs: in general, we only use the provided remote sensing products and discard the environmental/climatic rasters, coordinates, and other auxiliary inputs. One model (#2 in Tables 1 and 2) only receives RGB imagery; this model thus has only one feature extractor, with half the latent feature vector size, but is otherwise trained the same way as the others. We experimented with models that also processed bioclimatic rasters and/or GPS coordinates, but further work would be required to integrate the different data sources appropriately.

• Optional auxiliary task: we design one model (#5) to not just predict the species, but also the other available levels of taxonomy (genus, family, kingdom), each with a separate fully-connected layer on top of the common, fused features after the dropout layer, mapping to 6,467 (genus), 1,286 (family), and 2 (kingdom) outputs, respectively. At test time, we discard the additional taxonomy predictions.

• Variations in pre-training: generally, models start from weights pre-trained on ImageNet [16]. For model #7 we attempted a different strategy based on meta-learning [21], specifically Almost No Inner Loop (ANIL; [22]): here, we initialise the model as usual, but add a second fully-connected layer that maps to 20 outputs. We then perform meta-learning as follows: we construct a task by drawing two samples for each of 20 randomly selected species classes and shuffling the species label indices $\in [0, 19]$. The meta model head then needs to be able to assign samples to the correct index within a task with little adaptation (cf. few-shot learning). To do so, we use the first 20 samples (support set; one per species) to train the fully-connected layer in an inner loop, and then obtain the error and gradients of the model trying to predict the second 20 samples (query set). We then backpropagate this error ($\mathcal{L}_{\mathrm{meta}}$) in the outer loop, teaching the model to learn to adapt quickly to different species; a simplified, first-order sketch of this procedure is given after this list. We train the other fully-connected layer with the original species labels as usual ($\mathcal{L}_{\mathrm{cls}}$). The full loss then is $\mathcal{L} = \lambda \cdot \mathcal{L}_{\mathrm{meta}} + \mathcal{L}_{\mathrm{cls}}$, with hyperparameter $\lambda = 0.6$. We use stochastic gradient descent with learning rates of 0.05 (inner loop) and 0.01 (outer loop), reduced by a factor of ten at epoch 10, weight decay $10^{-4}$, and momentum 0.9 (outer) and 0.0 (inner) for the meta-learning loops. One epoch corresponds to 10,000 tasks drawn at random from the training set. Finally, after 17 epochs, we discard the fully-connected layer used for meta-learning and fine-tune the model as usual for another eight epochs.

• Variations of hyperparameters and training routines: by default, we train and test all models with all random number generators (NumPy and PyTorch) primed with the same seed value for maximal reproducibility. For one model (#3 in Table 1), we perform two more training and inference runs with two different seeds. This allows us to check the stability of the model with respect to the randomness of parameter initialisation, image ordering, etc., and to compare such variations to the other modifications described above. We expect such random factors to cause less accuracy variation than the explicit model/training alterations. The remaining differences in performance caused by such randomisation of parameters still provide a (small) increase in heterogeneity during model ensembling, as described below.
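As referenced in the pre-training item above, the following is a heavily simplified, first-order sketch of one ANIL-style step. The task sampler, the single inner step, and the pairing of the regular classification loss with the query samples are illustrative assumptions rather than the exact procedure used for model #7.

```python
import copy
import torch
import torch.nn.functional as F

def anil_meta_step(backbone, meta_head, species_head, task_sampler,
                   lam=0.6, inner_lr=0.05, inner_steps=1):
    """One first-order ANIL-style step. `task_sampler` is a hypothetical helper
    returning a 20-species task: a support and a query batch with labels
    remapped to [0, 19], plus the original species IDs of the query samples."""
    (x_sup, y_sup), (x_qry, y_qry), y_species = task_sampler()

    # Inner loop: adapt a throw-away copy of the 20-way head on the support
    # set only; the backbone stays frozen here, as in ANIL.
    with torch.no_grad():
        f_sup = backbone(x_sup)
    fast_head = copy.deepcopy(meta_head)
    inner_opt = torch.optim.SGD(fast_head.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        F.cross_entropy(fast_head(f_sup), y_sup).backward()
        inner_opt.step()

    # Outer loop: query loss through the adapted head. In this first-order
    # approximation the backbone receives gradients via the query features,
    # without differentiating through the inner update.
    f_qry = backbone(x_qry)
    loss_meta = F.cross_entropy(fast_head(f_qry), y_qry)

    # Regular 17,034-way species loss; computing it on the query samples is an
    # assumption made here for compactness.
    loss_cls = F.cross_entropy(species_head(f_qry), y_species)

    return lam * loss_meta + loss_cls   # L = lambda * L_meta + L_cls
```

The returned loss would then be backpropagated with the outer-loop optimiser; the adapted head copy is discarded after each task.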
We selectively apply more variations, such as different learning rates, random seeds, etc.; a summary of model run-specific alterations is given in Table 1. All other model hyperparameters are kept identical between setups; for a complete list see Table 3 in the Appendix.

#        | architecture | # inputs | batch size | base LR | LR steps | comments
1        | ResNet-50    | 6        | 32         | 0.045   | 6, 12    |
2        | ResNet-50    | 3        | 32         | 0.045   | 12       |
3a, b, c | Inception-v4 | 6        | 64         | 0.045   | 6, 12    | three models trained with different random seeds (54311121, 9236457, 1348)
4        | Inception-v4 | 6        | 64         | 0.01    | 12, 24   |
5        | Inception-v4 | 6        | 64         | 0.045   | 6, 12    | included taxonomy prediction as auxiliary task
6        | Inception-v4 | 6        | 64         | 0.045   | 6, 12    | trained on training and validation sets combined
7        | Inception-v4 | 6        | 64         | 0.045   | 6        | pre-trained with meta-learning
8        | DenseNet-201 | 6        | 32         | 0.045   | 6, 12    |
9        | ensemble     |          |            |         |          | variance-weighted combination of all models above

Table 1: Models overview, with changes made to the default set of hyperparameters (Table 3). "# inputs" denotes whether the model received three (RGB only) or six (RGB, NIR + altitude + NDVI) inputs. LR steps denote the epoch number(s) at which the current learning rate (LR) is divided by a factor of ten.

The final submission then consists of an ensemble of all ten model runs (models 1–8 in Table 1, including the three random seeds for model #3), averaged together as follows (Figure 2): we iterate over the entire test set and draw batches (identical to the training batch size) of non-augmented images. Then, we apply test-time augmentation by subjecting the batch to simple normalisation, as well as three or seven (cf. Table 2) applications of the training augmentation routine, providing a total of either four or eight realisations of the batch. We then predict softmax-activated species probability vectors with each of the trained models for all realisations, and calculate the species-wise mean and realisation-wise variance of the softmax scores for each model. The averaging over test-time augmentation runs already provides a degree of stability per model; in addition, we hypothesise the variance over test-time augmentation runs to provide us with a crude notion of model confidence. The idea behind this is to obtain a surrogate weight per model per sample: we assume not all models to be equally good at predicting a specific sample, due to the training variations, which force them to focus on different aspects of the dataset. Hence, we obtain multiple predictions per sample per model to assess the models' prediction stability, measured by the variance in softmax-activated scores: if a model is highly certain about a data point, we expect the confidences to vary very little across different test-time augmentations; if the model is less confident, it will predict different class probabilities for each augmentation. Using a softmax over the (complemented) variances across models allows us to perform a simple multiplication of weights and scores: as it rescales the per-model weights to sum to one, we can simply multiply them with the per-model score vectors for all classes and sum them together along the model dimension. The general process of ensembling here is, to a certain degree, related to Bayesian dropout [23], with the exception that we use test-time augmentations instead of multiple forward passes with dropout for each model layer on the same image.
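A minimal sketch of the per-model test-time augmentation bookkeeping described above follows; the `augment` callable (applying the training augmentation routine to an already normalised batch) is a hypothetical helper. Batch normalisation layers are switched to training mode, as discussed at the beginning of Section 3, while dropout remains disabled.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def tta_statistics(model, batch, augment, n_aug=3):
    """Per-model test-time augmentation statistics: mean and variance of the
    softmax scores over one plain pass plus `n_aug` augmented passes."""
    model.eval()
    for m in model.modules():
        # Batch statistics at inference, as described in Section 3.
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()
    realisations = [batch] + [augment(batch) for _ in range(n_aug)]
    probs = torch.stack([torch.softmax(model(x), dim=1) for x in realisations])
    # Each returned tensor has shape batch_size x num_species.
    return probs.mean(dim=0), probs.var(dim=0)
```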
For the final model, we then record mean and variance predictions per model, data point and species, and combine them as a weighted average: we normalise the per-model variances to the [0, 1] range and calculate a softmax of one minus the normalised variances across the models, for each test data point separately. This provides us with a weight for each model, which we multiply with its predicted score vector. We then sum all vectors together and obtain a variance-weighted average over all model runs and test-time augmentations for each species class, from which we extract the top-30 most confidently predicted species (Figure 2).

Figure 2: Flowchart of our ensembling strategy. We use each model to predict n test-time augmentations (tta) of the inference data point at hand, resulting in n vectors of (softmax-activated) confidences per species. We then calculate the mean and variance of the confidences per model per species. The variance then gets [0, 1]-normalised, complemented (1 − x; i.e., values towards 1 correspond to lower variances) and softmax-activated across all model runs. This results in weights that approximate model trustworthiness. We multiply these weights element-wise with the stacked mean confidences and sum along the model dimension to obtain the final per-species scores.
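Continuing the sketch above, the variance-based weighting and top-30 extraction of Figure 2 could look as follows; the axes over which the [0, 1] normalisation is computed (here per test point, across models and species) are an assumption on our part.

```python
import torch

def variance_weighted_ensemble(means, variances, k=30, eps=1e-8):
    """Combine per-model TTA statistics into final scores and top-30 species
    (cf. Figure 2). `means` and `variances` have shape M x B x S
    (models x test points x species)."""
    # Min-max scale the variances to [0, 1] for each test point (assumption:
    # normalisation is taken jointly over models and species per test point).
    v_min = variances.amin(dim=(0, 2), keepdim=True)
    v_max = variances.amax(dim=(0, 2), keepdim=True)
    v_norm = (variances - v_min) / (v_max - v_min + eps)
    # Complement and softmax across the model dimension: low variance -> high weight.
    weights = torch.softmax(1.0 - v_norm, dim=0)
    final_scores = (weights * means).sum(dim=0)        # B x S
    top30 = final_scores.topk(k, dim=1).indices        # B x 30 species indices
    return final_scores, top30
```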
We distribute the training of the models to multiple machines: two desktop workstations (AMD Ryzen 9 3950X, 128 GB RAM, one NVIDIA GeForce RTX 3090 with 24 GB VRAM), each running Ubuntu 20.04 LTS, and one node of a high-performance cluster (2x Intel Xeon Gold, 12.4 TB RAM, using one of two available NVIDIA Tesla V100 PCIe with 32 GB VRAM), running Red Hat Enterprise Linux 7. All code is implemented in Python 3.8.10 and PyTorch 1.9.0 and uses CUDA 11.1. We use the pre-trained model and architecture provided by Torchvision for ResNet-50, and pre-trained models and architectures from PyTorch Image Models (TIMM version 0.5.4)³ for DenseNet-201 and Inception-v4.

³ https://rwightman.github.io/pytorch-image-models

4. Results and Discussion
Table 2 lists the model runs and the resulting top-30 accuracies.

#            | epoch | # tta | top-30 train | top-30 val | top-30 test (public) | top-30 test (private)
1            | 15    | 4     | 32.43 | 26.55  | 28.02 | 27.49
2            | 28    | 4     | 31.07 | 27.64  | 29.66 | 28.82
3a           | 14    | 4     | 35.98 | 28.92  | 29.91 | 29.02
3b           | 15    | 4     | 35.97 | 28.50  | 29.52 | 29.32
3c           | 12    | 4     | 34.93 | 28.29  | 30.15 | 29.64
4            | 30    | 8     | 35.74 | 28.35  | 29.39 | 29.48
5            | 14    | 4     | 35.28 | 28.10  | 29.03 | 29.23
6            | 14    | 8     | 35.99 | 37.35* | 30.04 | 29.77
7            | 8     | 8     | 33.59 | 26.70  | 27.44 | 28.70
8            | 14    | 8     | 31.16 | 27.70  | 29.33 | 28.97
9 (ensemble) |       |       |       | 31.62* | 31.76 | 31.22

Table 2: Performances obtained by the different models. # tta: number of test-time augmentations employed (for val and test set performances only). Top-30 values are provided as accuracies (%); for the train and val sets they are calculated directly; for the test sets they correspond to 1 − error as reported on the competition leaderboard. *Performances are not comparable to other rows, since the models have been (partially) trained on the validation set. Performance of the ensemble on the training set has not been tested due to time constraints.

We can observe among these results that performances are relatively stable across the different individual models, all lying within 2.28 percentage points of top-30 accuracy. This is also reflected in models 3a, b, and c, which are identical except for the different seeds used to initialise the random number generators. Their performance lies within 0.62 percentage points on the private test set, which seems reasonably close. Yet, it highlights a recurring phenomenon: differences in benchmark performance are often less pronounced in practice than they seem. When ensembled, accuracy rises by almost 1.5 percentage points compared to the best single model. The best score on the private test set (approx. 90% of the test data points, as opposed to 10% in the public test set), a 31.22% top-30 accuracy, placed this ensemble in second place, 0.31 percentage points behind the contest winners.

A number of design variations between models resulted in slight performance increases. Switching from ResNet-50 to Inception-v4 provided one of the strongest performance increases, at about one percentage point of gain. DenseNet-201 provided only a marginal boost over ResNet-50. It is unclear how much of this gain is attributable to potentially better pre-training of the Inception-v4 compared to the ResNet-50 model (in any case, we found pre-training on ImageNet to be a requirement for proper learning, despite the dataset's generous number of samples). More advanced architectures might be worth trying in future work. Note, however, that we tried using a Vision Transformer, ViT B/16 [24], but without success: apart from being unacceptably slow to train, despite powerful GPUs, this model exhibited the worst degree of overfitting by far (similar training performance to the other models, but only 5% validation top-30 accuracy). A second variation that seemed to provide a minor boost was to train a model on the validation set as well. However, choosing the exact stopping epoch was complicated by overfitting not only to the training set (as with all models) but also to the validation set.

Conversely, some attempts turned out not to influence performance for the better. Among those is experiment 4, which is identical to 3a apart from a smaller initial learning rate and a reduction of the rate at a later stage. We assumed that a model trained with an overall smaller learning rate would take up speed more slowly but reach a better optimum in the long run, which turned out not to be the case. Also, the rather involved pre-training strategy of model #7 using meta-learning (ANIL) did not provide any benefit to the final score. Post-challenge, this was to be expected: we originally devised this strategy to better cope with the severely long-tailed class distribution of the dataset. The original intuition was that drawing species classes equiprobably (via task sampling) and forcing the model to adapt to all species classes quickly (via meta-learning) would steer the learning focus of the model towards the rarer species, thereby addressing the imbalance issue. While we could indeed observe this behaviour to some extent, it had the detrimental effect of degrading performance for the dominant classes. This is a common phenomenon in class-balanced learning; the utility of such strategies boils down to a single unknown: whether the test set is balanced or not. Long-tailed learning methods oftentimes assume the test set to be perfectly balanced, so they try to artificially compensate for the bias during training.
However, datasets are rarely balanced in real life, and GeoLifeCLEF 2022 is no exception. Hence, a model is still better off predicting the most common species proportionally, especially given the (unbalanced) top-30 accuracy used as the measure of prediction quality.

Perhaps surprisingly, using only RGB imagery (experiment 2) gave better performance than also including NIR, altitude and NDVI (experiment 1), despite the model having about half the parameters and less information per sample at hand. It appears that the complexity of the objective of predicting a single species per data point exceeds the information encoded in the inputs. The auxiliary task of predicting three more taxonomy levels (experiment 5) made no difference to the species prediction. We tried a large number of additional experiments that were not submitted and are hence not shown here; listing them all would exceed the available space. A few noteworthy observations from these runs are, however, the following: (i.) optimisers that estimate per-parameter momentum, in particular Adam [25], resulted in the model completely failing to learn anything (we assume the implicit regularisation of stochastic gradient descent [26] to be a vital requirement, possibly due to the sheer imbalance of the dataset); (ii.) any attempts to counteract the long-tailed class distribution, from strong re-weighting to balanced softmax losses [27] and meta-learning [22], worsened performance, likely because the validation and test sets are similarly imbalanced; (iii.) any attempts to reduce overfitting on the training set, such as stronger weight decay, simply resulted in lower performance on all data splits; (iv.) separating between territories (U.S. and France), either by training two separate models or by adjusting confidences post hoc based on where a species does or does not occur, made no difference.

5. Conclusion
We presented the working principles of our submission to the GeoLifeCLEF 2022 challenge and discussed some of the key findings of the results, as well as ideas that did not work. We have not conducted an expansive, let alone exhaustive, hyperparameter search and believe that doing so could raise performance somewhat. The relatively strong degree of overfitting on the training set and the visible improvement of performance via ensembling indicate that more sophisticated strategies for better generalisation are needed. The spatial block-label swap already improves matters here (in the previous GeoLifeCLEF 2021 challenge on the GeoLifeCLEF 2020 dataset, it reduced the gap between training and validation performance from 23.48% to 5.96% [12]). The inclusion of environmental covariates, as done by the winning team⁴, is certainly high on the list of improvements to be made, especially since they more directly correspond to measurements and observations of properties one might expect the habitat of a species to be characterised by. We had little luck including these measurements in our deep models directly (the winning team did so with a separate, random-forest-based predictor), but further work might continue researching this idea towards joint reasoning across multiple covariates.

⁴ https://www.kaggle.com/competitions/geolifeclef-2022-lifeclef-2022-fgvc9/discussion/327055

Acknowledgments
I would like to acknowledge my workstation and GPU that have always been there for me. Can't say the same about Ubuntu and NVIDIA drivers; I kind of know my way around this match made in hell, but it still is a hot mess as usual. Also worth acknowledging (or admitting?)
is the somewhat illogical nature of the team name used, but for a different reason than one might expect: Matsushita (松下) is a Japanese surname, for example of 松下幸之助 (Matsushita Kōnosuke), founder of the Panasonic Corporation. The suffix "-san" (-さん), however, is used to address others of equal or lower rank, never oneself. Hence, "Matsushita" would have been a better pseudonym overall.

The team name "Matsushita-san" was a bit odd. Indeed, Matsushita Kōnosuke was a very famous person; however, it is strange to attach "-san" to one's own name. Therefore, "Matsushita" would be more appropriate than "Matsushita-san". Apologies for this, and for the poor Japanese.

Appendix
Table 3 lists the default hyperparameters used throughout all experiments. For model-specific alterations, such as different learning rates (LR) and schedules, please refer to Table 1.

hyperparameter | value
random seed | 54311121
model pre-training | ImageNet; Torchvision (ResNet-50) and TIMM (Inception-v4, DenseNet) models
batch size | 64
loss | softmax cross-entropy
loss weight | squared inverse [0, 1]-normalised frequency of species classes in training set
optimiser | stochastic gradient descent
base LR | 0.045
LR reduction steps (epoch) | 6, 12, 24
LR reduction gamma (multiplier) | 0.1
weight decay | 0.0001
momentum | 0.0
dropout | p: 0.45
training and test-time augmentations:
random horizontal flip | p: 0.5
random vertical flip | p: 0.5
random noise multiplier | [0.0, 0.05] × band-wise image mean
random noise offset | [−0.05, 0.05]
normalisation | [0, 1]-scaling followed by whitening on band-wise mean and std. dev. values of all training set points
inference augmentations:
normalisation | [0, 1]-scaling followed by whitening on band-wise mean and std. dev. values of all training set points
spatial quantisation grid cell size | 0.1° lat/lon
spatial quantisation label replacement probability | p: 0.1

Table 3: Default hyperparameters common to all experiments.

References
[1] A. Guisan, N. E. Zimmermann, Predictive habitat distribution models in ecology, Ecological Modelling 135 (2000) 147–186. doi:10.1016/S0304-3800(00)00354-9.
[2] M. B. Araújo, A. Guisan, Five (or so) challenges for species distribution modelling, Journal of Biogeography 33 (2006) 1677–1688. doi:10.1111/j.1365-2699.2006.01584.x.
[3] N. E. Zimmermann, T. C. Edwards Jr, C. H. Graham, P. B. Pearman, J.-C. Svenning, New trends in species distribution modelling, Ecography 33 (2010) 985–989.
[4] M. Pecchi, M. Marchi, V. Burton, F. Giannetti, M. Moriondo, I. Bernetti, M. Bindi, G. Chirici, Species distribution modelling to support forest management. A literature review, Ecological Modelling 411 (2019) 108817.
[5] T. Hao, G. Guillera-Arroita, T. W. May, J. J. Lahoz-Monfort, J. Elith, Using species distribution models for fungi, Fungal Biology Reviews 34 (2020) 74–88.
[6] S. M. Melo-Merino, H. Reyes-Bonilla, A. Lira-Noriega, Ecological niche models and species distribution models in marine environments: A literature review and spatial analysis of evidence, Ecological Modelling 415 (2020) 108837.
[7] S. Beery, E. Cole, J. Parker, P. Perona, K. Winner, Species distribution modeling for machine learning practitioners: A review, in: ACM SIGCAS Conference on Computing and Sustainable Societies (COMPASS '21), 2021. URL: http://arxiv.org/abs/2107.10400. doi:10.1145/3460112.3471966. arXiv:2107.10400.
[8] T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Joly, Overview of GeoLifeCLEF 2022: Predicting species presence from multi-modal remote sensing, bioclimatic and pedologic data, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, 2022.
[9] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Durso, H. Glotin, R. Planqué, W.-P. Vellinga, A. Navine, H. Klinck, T. Denton, I. Eggel, P. Bonnet, M. Šulc, M. Hruz, Overview of LifeCLEF 2022: An evaluation of machine-learning based species identification and species distribution prediction, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2022.
[10] S. Seneviratne, Contrastive representation learning for natural world imagery: Habitat prediction for 30,000 species, 2021.
[11] E. Cole, B. Deneu, T. Lorieul, M. Servajean, C. Botella, D. Morris, N. Jojic, P. Bonnet, A. Joly, The GeoLifeCLEF 2020 dataset, arXiv preprint arXiv:2004.04192 (2020).
[12] B. Kellenberger, E. Cole, D. Marcos, D. Tuia, Training techniques for presence-only habitat suitability mapping with deep learning, in: 2022 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), IEEE, 2022.
[13] C. J. Tucker, Red and photographic infrared linear combinations for monitoring vegetation, Remote Sensing of Environment 8 (1979) 127–150. doi:10.1016/0034-4257(79)90013-0.
[14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (2014) 1929–1958.
[15] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, PMLR, 2015, pp. 448–456.
[16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[17] D. Ulyanov, A. Vedaldi, V. Lempitsky, Instance normalization: The missing ingredient for fast stylization, arXiv preprint arXiv:1607.08022 (2016).
[18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[19] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[20] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[21] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1126–1135.
[22] A. Raghu, M. Raghu, S. Bengio, O. Vinyals, Rapid learning or feature reuse? Towards understanding the effectiveness of MAML, arXiv preprint arXiv:1909.09157 (2019).
[23] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.
[24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[25] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations, 2015.
[26] Z. Zhu, J. Wu, B. Yu, L. Wu, J. Ma, The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects, arXiv preprint arXiv:1803.00195 (2018).
[27] J. Ren, C. Yu, X. Ma, H. Zhao, S. Yi, et al., Balanced meta-softmax for long-tailed visual recognition, Advances in Neural Information Processing Systems 33 (2020) 4175–4186.