=Paper=
{{Paper
|id=Vol-2696/paper_272
|storemode=property
|title=Overview of LifeCLEF Location-based Species Prediction Task 2020 (GeoLifeCLEF)
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_272.pdf
|volume=Vol-2696
|authors=Benjamin Deneu,Titouan Lorieul,Elijah Cole,Maximilien Servajean,Christophe Botella,Pierre Bonnet,Alexis Joly
|dblpUrl=https://dblp.org/rec/conf/clef/DeneuLCSBBJ20
}}
==Overview of LifeCLEF Location-based Species Prediction Task 2020 (GeoLifeCLEF)==
Overview of LifeCLEF location-based species prediction task 2020 (GeoLifeCLEF) Benjamin Deneu1,2 , Titouan Lorieul1 , Elijah Cole3 , Maximilien Servajean4 , Christophe Botella5 , Pierre Bonnet6 , Alexis Joly1 1 INRIA, UMR LIRMM, Univ Montpellier, France 2 AMAP, Univ Montpellier, CIRAD, CNRS, INRAE, IRD, Montpellier, France 3 Caltech, Pasadena, US 4 LIRMM, Université Paul Valéry, University of Montpellier, CNRS, Montpellier, France 5 CNRS, LECA, France 6 CIRAD, UMR AMAP, F-34398 Montpellier, France Abstract. Understanding the geographic distribution of species is a key concern in conservation. By pairing species occurrences with environmen- tal features, researchers can model the relationship between an environ- ment and the species which may be found there. To advance the state- of-the-art in this area, a large-scale machine learning competition called GeoLifeCLEF 2020 was organized. It relied on a dataset of 1.9 million species observations paired with high-resolution remote sensing imagery, land cover data, and altitude, in addition to traditional low-resolution climate and soil variables. This paper presents an overview of the com- petition, synthesizes the approaches used by the participating groups, and analyzes the main results. In particular, we highlight the ability of remote sensing imagery and convolutional neural networks to improve predictive performance, complementary to traditional approaches. Keywords: LifeCLEF, biodiversity, environmental data, species dis- tribution, evaluation, benchmark, species distribution models, methods comparison, presence-only data, model performance, prediction, predic- tive power 1 Introduction In order to make informed conservation decisions it is essential to understand where different species live. Citizen science projects now generate millions of geo- located species observations every year, covering tens of thousands of species. But how can these point observations be used to predict what species might be found at a new location? Copyright c 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0) Fig. 1: Each species observation is paired with high-resolution covariates (clock- wise from top left: RGB imagery, IR imagery, altitude, land cover). A common approach is to build a species distribution model (SDM) [3], which uses a location’s environmental covariates (e.g. temperature, elevation, land cover) to predict which species may be found there. Once trained, the model can be used to make predictions for any location where those covariates are available. Developing an SDM requires a dataset where each species occurrence is paired with a collection of environmental covariates. However, many existing SDM datasets are both highly specialized and not readily accessible, having been assembled by scientists studying particular species or regions. In addition, the provided environmental covariates are typically coarse, with resolutions ranging from hundreds of meters to kilometers per pixel. In this work, we present the results of the GeoLifeCLEF 2020 competition. This competition aimed at bridging these gaps by (i) sharing a large-scale dataset of observations paired with high-resolution covariates and (ii) defining a common evaluation methodology to measure the predictive performance of models trained on this dataset. The dataset is based on over 1.9 million observations of plant and animal species. Each observation is paired with high-resolution satellite imagery as well as traditional environmental covariates (e.g. climate, altitude and soil variables). To the best of our knowledge, this is the first publicly available dataset to pair remote sensing imagery with species observations. Our hope is that this analysis-ready dataset and associated evaluation methodology will (i) make the SDM problem more accessible to machine learning researchers and (ii) facilitate novel research in large-scale, high-resolution, and remote-sensing-based species distribution modeling. 2 Dataset and Evaluation Protocol Data collection: A detailed description of the GeoLifeCLEF 2020 dataset is provided in [1]. For completeness, we give a brief description here. The GeoLife- CLEF 2020 dataset consists of 1,921,123 observations from the US (1,097,640) and France (823,483) covering 31, 435 plant and animal species. Each species ob- servation is paired with high-resolution covariates (RGB-IR imagery, land cover and altitude) as illustrated in Figure 1. These high-resolution covariates are re- sampled to a spatial resolution of 1 meter per pixel and provided as 256 × 256 images covering a 256m × 256m square centered on each observation. RGB-IR imagery come from the 2009-2011 cycle of the National Agriculture Imagery Program (NAIP) for the U.S.1 , and from the BD-ORTHO R 2.0 and ORTHO- HR R 1.0 databases from the IGN for France2 . Land cover data originates from the National Land Cover Database (NLCD) [7] for the U.S. and from CESBIO3 for France. All elevation data comes from the NASA Shuttle Radar Topography Mission (SRTM)4 . In addition, the dataset also includes traditional coarser res- olution covariates: 19 bio-climatic rasters (30arcsec2 /pixel (above 1km2 /pixel) from WorldClim [6]) and 8 pedologic rasters (250m2 /pixel, from SoilGrids [5]). The details on these rasters are given in Table 1. Train-test split: The full set of occurrences was split in a training and test- ing set using a spatial block holdout procedure (see Figure 2). This limits the effect of spatial auto-correlation in the data as explained in [8]. This means that a model cannot achieve a high performance by simply interpolating between training samples. The split was based on a global grid of 5 km × 5 km quadrats. 2.5% of the quadrats were randomly sampled for the test set, and the remaining quadrats were assigned to the training set. Evaluation metric: For each occurrence in the test set, the goal of the task was to return a candidate set of species with associated confidence scores. The main evaluation criterion is an adaptive variant of the top-K accuracy. Contrary to a classical top-K accuracy, this metric accounts for the fact that the number of species K may not be the same at each location. It is computed by thresholding the confidence score of the predictions and keeping only the species above that threshold. The threshold is determined automatically so as to have K = 30 1 National Agriculture Image Program, https://www.fsa.usda.gov 2 https://geoservices.ign.fr 3 http://osr-cesbio.ups-tlse.fr/~oso/posts/2017-03-30-carte-s2-2016/ 4 https://lpdaac.usgs.gov/products/srtmgl1v003/ Name Description Resolution bio 1 Annual Mean Temperature 30 arcsec bio 2 Mean Diurnal Range (Mean of monthly (max temp - min temp)) 30 arcsec bio 3 Isothermality (bio 2/bio 7) (* 100) 30 arcsec bio 4 Temperature Seasonality (standard deviation *100) 30 arcsec bio 5 Max Temperature of Warmest Month 30 arcsec bio 6 Min Temperature of Coldest Month 30 arcsec bio 7 Temperature Annual Range (bio 5-bio 6) 30 arcsec bio 8 Mean Temperature of Wettest Quarter 30 arcsec bio 9 Mean Temperature of Driest Quarter 30 arcsec bio 10 Mean Temperature of Warmest Quarter 30 arcsec bio 11 Mean Temperature of Coldest Quarter 30 arcsec bio 12 Annual Precipitation 30 arcsec bio 13 Precipitation of Wettest Month 30 arcsec bio 14 Precipitation of Driest Month 30 arcsec bio 15 Precipitation Seasonality (Coefficient of Variation) 30 arcsec bio 16 Precipitation of Wettest Quarter 30 arcsec bio 17 Precipitation of Driest Quarter 30 arcsec bio 18 Precipitation of Warmest Quarter 30 arcsec bio 19 Precipitation of Coldest Quarter 30 arcsec orcdrc Soil organic carbon content (g/kg at 15cm depth) 250 m phihox Ph x 10 in H20 (at 15cm depth) 250 m cecsol cation exchange capacity of soil in cmolc/kg 15cm depth 250 m bdticm Absolute depth to bedrock in cm 250 m clyppt Clay (0-2 micro meter) mass fraction at 15cm depth 250 m sltppt Silt mass fraction at 15cm depth 250 m sndppt Sand mass fraction at 15cm depth 250 m bldfie Bulk density in kg/m3 at 15cm depth 250 m Table 1: Summary of environmental variable rasters provided. results per occurrence on average on the test set. Traditional top-K accuracy with K = 30 is used as secondary evaluation metric. See [1] for full details and justification. Course of the challenge: The training data was publicly shared in early April 2020 through the AICrowd platform5 . Any research team wishing to par- ticipate in the evaluation could register on the platform and download the data. The test data was shared a few weeks later but without the species labels, which were kept secret. Each team could then submit up to 10 submissions correspond- ing to different methods or different settings of the same method. A submission (also called a run) takes the form of a CSV file containing the predictions of the method being evaluated for all observations in the test set. For each submission, the evaluation metrics are computed and made visible to the participant. Once the submission phase was closed (mid-June), the participants could also see the evaluation metric values of the other participants. As a last important step, each 5 https://www.aicrowd.com/ (a) US (b) France Fig. 2: Occurrences distribution over the US and France. Blue dots represent training data, red dots represent test data. participant was asked to provide a working note, i.e. a detailed report containing all technical information required to reproduce the results of the submissions. All LifeCLEF working notes are reviewed by at least two members of LifeCLEF organizing committee to ensure a sufficient level of quality and reproducibility. 3 Participants and methods 40 participants registered for the GeoLifeCLEF 2020 challenge and downloaded the dataset. Only two teams succeeded in submitting results in the end: Stan- ford and LIRMM. A major obstacle to participation was the volume of data as well as the computing power needed to train a model. For instance, it took the LIRMM team almost two weeks to train a convolutional neural network on the full dataset using 8 GPUs. The details of the methods and systems used by the two participants are summarized below. – LIRMM: This research team uploaded four submissions to the AICrowd platform but they reported problems for two of them afterwards, so we only report the correct ones here: • Submission 1 - Random forest trained on environmental feature vectors only (i.e. on the 27 climatic and soil variables). • Submission 3 - Convolution neural network trained on high-resolution image covariates (i.e. on 6-channel tensors composed of RGB-IR images, land cover image and altitude image). More information about the used methods can be found in the individual working note of this team [2]. – Stanford: This research team uploaded five submissions to the AICrowd platform. Most of the submissions were based on deep neural networks, but the authors informed us that they encountered convergence issues resulting in performance poorer than expected. Thus, it was mutually agreed that they would not provide a working note describing their method in detail. Only one of their submissions (referred as Submission 3) was valid. This was a baseline method that always predicted the list of the most frequent species in the training set. 4 Global results of the evaluation Participant Submission # Adaptive top-30 acc. Top-30 acc. LIRMM Submission 3 23.3% 23.5% LIRMM Submission 1 21.3% 20.4% Stanford Submission 3 4.8% 4.8% Table 2: GeoLifeCLEF 2020 main results, adaptive top-30 accuracy and top-30 accuracy per submission (sorted by decreasing adaptive top-30 accuracy). In Table 2, we report the performance measured for each of the 8 submis- sions. The main outcome is that the method achieving the best results (LIRMM Submission 3) was based solely on a convolutional neural network (CNN) trained on the high-resolution covariates (RGB-IR imagery, land cover, and altitude). It did not make use of any bioclimatic or soil variables, which are often considered to be the most informative in the ecological literature. On the contrary, LIRMM Submission 1 was a machine learning method classically used for species distri- bution models [4] trained solely on the climatic and soil variables. This shows two things: (i) important information explaining the species composition is con- tained in the high-resolution covariates, and, (ii) convolutional neural networks are able to capture this information. The performance achieved by the baseline predictor of Stanford shows that the other methods are consistently better than just returning the most common species everywhere. 5 Complementary analysis In this section we provide complementary analyses of the submitted results fo- cusing on certain aspects of the dataset. In particular, we will consider the two main methods submitted which we denote: – RF (env.): the model from LIRMM Submission 1 consisting of a random forest trained solely on environmental variables; – CNN (high res.): the model from LIRMM Submission 3 consisting of a CNN trained on the high-resolution covariates. 1.0 0.9 0.8 proportion of occurrences 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 proportion of species Fig. 3: Cumulative distribution of number of occurrences per species (ordered from most to least represented species in the training set) highlighting the long tail characteristic of the dataset. Species-wise performance: First, we study the performance of the models de- pending on species frequencies. The dataset contains 31, 435 species distributed according to a long tail distribution which can be seen in Figure 3. For example, the curve shows that the most common 10% of species represent more than 80% of the occurrences. On the other hand, the least common 70% of species account for fewer than 5% of the occurrences. In such long-tailed contexts, metrics that are averaged over all observations can seem satisfactory while the model only performs well on a few common species. To evaluate model performance across all species, we computed the top-30 accuracy for each species independently and then averaged the resulting scores. The results, presented in Figure 4, show that the top-30 accuracy averaged over the species is much lower than the one averaged over the occurrences. This is expected because the average over species gives much more weight to less represented species. However, it is interesting to compare how this affects each of the two models. The CNN drops from an accuracy of 23.5% to an accuracy of 13.2% with a 44% relative loss. The random forest falls from 20.4% to 6.9% with a 66% relative loss. RF (env.) is thus more affected by the average over species. To better understand this, Figure 5 shows the performance of the two models on each species of the test set ranked by their frequency in the training set (with an adaptive sliding average for display purposes). It shows that the RF (env.) CNN (high res.) 20 15 10 5 0 Top-30 acc. (occ.) Top-30 acc. (species) Fig. 4: Comparison of top-30 accuracy averaged over the occurrences with top-30 accuracy averaged over the species. CNN outperforms the random forest on nearly every species. The random forest is slightly better for only the most frequent species. Figure 6 shows a similar graph but instead of ranking the species by their frequency, species are ranked by the average top-30 accuracy achieved by the model. On this figure we can see how many species are predicted by the model with a top-30 accuracy over a given threshold. We can see in particular that for the CNN (high res.) model, 2, 300 species over the 7, 794 species are predicted with a top-30 accuracy greater than 0. In comparison, only 1, 100 have a top- 30 accuracy greater than 0 when using the RF (env.) model. Moreover, we can observe that the CNN (high res.) model is able to perfectly predict (with a score of 1) more species than the RF (env.) model. The CNN (high res.) model perfectly predicts near 400 species of the test set whereas as the RF (env.) model only predicts 150 species perfectly. Analysis by kingdom and geographic area: As the dataset covers France and US and contains both plants and animals species, it is interesting to compare the accuracy obtained by the models over these criteria. The resulting top-30 accuracy values are provided in Figure 7. Concerning the prediction of plants vs. animals, both models have similar results as shown in Figure 7a. Both have a slightly better prediction on plants, which can in part be explained by the greater number of occurrences per plant species than animals. Concerning the geographical area, Figure 7b shows that the performance is globally lower in France, in particular for animals where the performance is dramatically low. The Table 3 gives the average number of occurrences per species by kingdoms and regions. It can be noted that the very poor prediction performance on animals in France may be due to the low number of occurrences per animal species, on average 12.6. The lower performance on plants in France, however, is more difficult to explain since the average number of occurrences per species is pretty high (238). A possible interpretation could be that the lack of animal occurrences globally degrades the performance of the model, for instance because animals 100 RF (env.) CNN (high res.) 80 species top-30 accuracy 60 40 20 0 0 1000 2000 3000 4000 5000 6000 7000 species ranked from most to least frequent in training data Fig. 5: Top-30 accuracy of the models over species with an adaptive sliding av- erage. Species are ranked on the x-axis by their frequency in the training set. could be stronger markers of the habitat. But this could be due to other reasons such as distribution of occurrences over species or to macro-ecological differences between US and France. Nb. of occurrences Nb. of species Nb. of occ. per sp. Kingdom US France US France US France Plants 524,280 741,010 11,369 3,114 46.1 238.0 Animals 551,563 61,865 13,882 4,899 39.7 12.6 All 1,622,120 802,875 25,251 8,013 42.6 100.2 Table 3: Statistics of the number of occurrences, of species and of occurrences per species grouped by kingdoms and regions computed on the training set. Fusion of the predictions of the high-resolution CNN and classical punctual environmental model: As discussed in previous sections, an im- portant outcome of this evaluation is that the CNN (high res.) model performs better than the more classical RF (env.) model which means that the CNN (high res.) model is able to capture important information explaining the species dis- tribution from the high-resolution covariates. Now, an important question is to what extent this information is complementary to the information extracted by 100 RF (env.) CNN (high res.) 80 species top-30 accuracy 60 40 20 0 0 1000 2000 3000 4000 5000 6000 7000 species ranked from highest to lowest top-30 accuracy Fig. 6: Top-30 accuracy of the models over species. Species are ranked on the x-axis by the top-30 accuracy of the models on that species. First species is the best predicted by the model, last species is the worst predicted by the model. The two curves are represented on the same plot but the ranking is different for each model. 30 All 30 All Animals Animals Plants Plants 25 25 20 20 15 15 10 10 5 5 0 0 RF (env.) CNN (high res.) US FR (a) Per model. (b) Per country for CNN (high res.). Fig. 7: Top-30 accuracy per kingdom and geographical area 30 25 25.5% 23.5% 20 20.4% 15 10 5 0 RF (env.) CNN (high res.) Late fusion Fig. 8: Late fusion of RF (env.) and CNN (high res.) improves the global top-30 accuracy. the RF (env.) model. The RF (env.) model is based solely on climatic and soil variables which are classically used in ecology to model the niche of the species, i.e. their environmental preferences. It is likely that these ecological preferences may also be partially inferred by the CNN (high res.) from the high-resolution covariates. For instance, it can recognize a particular habitat from the topology, landscape, or the forest’s canopy. But it may also miss some important biocli- matic factors that cannot be inferred from the chosen high-resolution covariates. To assess how much complementary information is captured by the CNN (high res.) and RF (env.) models, we computed the accuracy of a simple fusion ap- proach consisting of averaging the predicted probabilities of each model. In prac- tice, we first had to renormalize the probability values provided in the the sub- mitted CSV files because the number of predictions was limited to 150 per obser- vation because of storage issues. The result of the fusion approach compared to each model alone is presented in Figure 8. It shows that the top-30 accuracy of the fusion approach is consistently better (25.5%). This result suggests that the models capture some distinct information and that more advanced methods for combining the high-resolution variables with the environmental rasters should be explored. 6 Discussion and Conclusion The main outcomes of the evaluation conducted in this paper are related to the comparison of two radically different approaches: one approach based on high- resolution satellite imagery and convolutional networks, CNN (high res.), and one more classical approach based on bioclimatic and soil variables processed with a random forest model, RF (env.). Beyond the slightly better performance of the CNN (high res.) model, a more in-depth study of the species predic- tions reveals important differences between the models. The top-30 accuracy per species, in particular, reveals a much larger performance gap. Moreover, by comparing the predictions species by species according to their frequency, we ob- served that the CNN is actually much better for the large majority species, and particularly for the less represented ones. This raises several points questions re- lated to model evaluation. In the context of species distribution studies, it seems particularly important to be able to predict the distribution of rare species, es- pecially for protection and conservation purposes. However, a raw evaluation of the performance of the models on occurrences is biased by the long-tailed distribution of occurrences per species. The CNN (high res.) model trained on high-resolution covariates is both better on less represented species and able to predict more species perfectly than the RF learned on environmental variables, while it is slightly less efficient on more frequent species. Random forest tends to predict mostly the most frequent species. Even if the random forest’s prediction is good on average on the test occurrences, it is clearly less relevant than the predictions of the CNN. Unfortunately, as both models were trained on different data, it is difficult to determine whether the origin of this difference lies in the model structure or the input data. It is important to note, however, that the high spatial resolution data has made it possible to learn a model capable of rivaling and even surpassing a model derived from a more classical approach learned on environmental variables. In addition, combining two approaches with late fusion produces a gain in performance, indicating that the models have captured complementary information. The use of high-resolution data seems to be an interesting way to learn models with high predictive power. None of the models submitted by participants were able to use the high and low resolution data together. However, if even simple late fusion improves performance, it is likely that a model trained on both data sources simultaneously will provide even better performance. Acknowledgement This project has received funding from the French National Research Agency under the Investments for the Future Program, referred as ANR-16-CONV-0004 and from the European Union’s Horizon 2020 research and innovation program under grant agreement No 863463 (Cos4Cloud project). References 1. Cole, E., Deneu, B., Lorieul, T., Servajean, M., Botella, C., Morris, D., Jo- jic, N., Bonnet, P., Joly, A.: The GeoLifeCLEF 2020 dataset. arXiv preprint arXiv:2004.04192 (2020) 2. Deneu, B., Servajean, M., Joly, A.: Participation of LIRMM / Inria to the GeoLife- CLEF 2020 challenge. In: CLEF working notes 2020, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2020, Thessaloniki, Greece. (2020) 3. Elith, J., Leathwick, J.R.: Species Distribution Models: Ecological Explanation and Prediction Across Space and Time. Annual Review of Ecology, Evolution, and Sys- tematics (2009) 4. Evans, J.S., Murphy, M.A., Holden, Z.A., Cushman, S.A.: Modeling species distri- bution and change using random forest. In: Predictive species and habitat modeling in landscape ecology, pp. 139–159. Springer (2011) 5. Hengl, T., de Jesus, J.M., Heuvelink, G.B., Gonzalez, M.R., Kilibarda, M., Blagotić, A., Shangguan, W., Wright, M.N., Geng, X., Bauer-Marschallinger, B., et al.: Soil- grids250m: Global gridded soil information based on machine learning. PLoS one 12(2) (2017) 6. Hijmans, R.J., Cameron, S.E., Parra, J.L., Jones, P.G., Jarvis, A.: Very high res- olution interpolated climate surfaces for global land areas. International Journal of Climatology: A Journal of the Royal Meteorological Society 25(15), 1965–1978 (2005) 7. Homer, C., Dewitz, J., Yang, L., Jin, S., Danielson, P., Xian, G., Coulston, J., Herold, N., Wickham, J., Megown, K.: Completion of the 2011 national land cover database for the conterminous united states – representing a decade of land cover change information. Photogrammetric Engineering & Remote Sensing 81(5), 345– 354 (2015) 8. Roberts, D.R., Bahn, V., Ciuti, S., Boyce, M.S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J.J., Schröder, B., Thuiller, W., et al.: Cross- validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40(8), 913–929 (2017)