Location-Based Plant Species Prediction Using
 A CNN Model Trained On Several Kingdoms -
 Best Method Of GeoLifeCLEF 2019 Challenge

    Mathilde Negri3,4, Maximilien Servajean2,3, Benjamin Deneu1,3, and Alexis Joly1,3

                         1 Inria, Montpellier, France
                 alexis.joly@inria.fr, benjamin.deneu@inria.fr
                         2 LIRMM, Montpellier, France
             mathilde.negri@lirmm.fr, maximilien.servajean@lirmm.fr
                 3 Université Paul Valéry, Montpellier, France
                     4 Université de Montpellier, France


        Abstract. This technical report describes the model that achieved the
        best performance in the GeoLifeCLEF 2019 challenge, whose objective
        was to evaluate methods for predicting plant species from their
        geographical location. Our method is based on an adaptation of the
        Inception v3 architecture, initially designed for the classification
        of RGB images. We modified the input layer of this architecture so as
        to process the spatialized environmental tensors as images with 77
        distinct channels. Using this architecture, we trained several models
        that mainly differed in the training data used and in the predicted
        output classes. One of the main objectives, in particular, was to
        compare the performance of a model trained on plant occurrences only
        with that of a model trained on all available occurrences, including
        species of other kingdoms. Our results show that the global model
        performs consistently better than the plant-specific model. This
        suggests that the convolutional neural network is able to capture some
        inter-dependencies among all species and that this information
        significantly improves the generalisation capacity of the model for
        any species.


1     Introduction

Predicting a list of the most likely species present at a given location can be
very useful. First, it could improve species identification processes and tools
(whether automated, semi-automated or based on classical field guides or
floras) by reducing the list of candidate species observable at a given
location. More generally, it could facilitate biodiversity inventories through
the development of location-based recommendation services (typically on mobile
phones) as well as the involvement of non-expert nature observers. Finally, it
could be used for educational purposes through biodiversity discovery
applications providing innovative features such as contextualized educational
pathways.

    Copyright © 2019 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12
    September 2019, Lugano, Switzerland.

    The GeoLifeCLEF challenge is related to the problem of Species Distribution
Modelling (SDM) in ecology. The goal of SDM is to predict the spatial
distribution of a species over a territory [1]; in our case, we used both
spatial positions and environmental data. SDMs have become increasingly
important in the last few decades for the study of biodiversity, conservation
ecology, landscape management, the preservation of rare and/or endangered
species, the measurement of the impact of human activity or climate change on
species, etc.
    Concretely, SDMs infer the spatial distribution of a given species, usually
from a set of geolocalized occurrences of that species (collected by
naturalists, field ecologists, nature observers, citizen science projects,
etc.).
    Recently, SDMs based on deep neural networks have begun to appear [2].
These experiments have shown that such models can have good predictive power,
potentially better than the models conventionally used in ecology, such as
MaxEnt [3]. Deep neural networks can learn complex non-linear transformations
in a wide variety of fields. In addition, they provide an opportunity to learn
an environmental representation space common to many species, which stabilizes
predictions from one species to another and improves them globally. Finally,
spatial patterns of environmental variables often contain useful information
about species distribution, but are generally not considered in conventional
models. In contrast, convolutional neural networks can exploit this
information to improve their predictions.
    In this paper, we present a study, carried out as part of the GeoLifeCLEF
challenge, that evaluates the ability of a convolutional neural network to
learn the ecological preferences of species from patches of environmental
rasters provided as input (temperature, soil type, etc.). The rest of the
paper is organized as follows.

 – Section 2 gives an overview of the various data we used to build our model.
 – Section 3 provides the detailed description of our model.
 – Section 4 presents the results of the experiments and their analysis.


2     Data

2.1   Occurrences

The occurrence data sets are described in the challenge overview [4]. We used
the following data sets:

 – plantnet data set: composed of 2,367,145 plant species occurrences with
   uncertain identifications, because they come from the automatic
   identification of plant images by the Pl@ntnet application.
 – glc 18: Global Biodiversity Information Facility (GBIF) data set (the same
   data set as in last year's challenge), composed of 291,392 occurrences of
   3,336 plant species observed on French territory between 1835 and 2017.
 – trusted data set: a sample of the plantnet data set without identification
   uncertainty.
 – no plant: 10,618,839 species occurrences from the GBIF database, belonging
   to other kingdoms (such as mammals, birds, amphibians, insects, fungi, etc.).
   We removed occurrences that did not match any environmental raster, for
   example birds observed in the middle of the sea.
We then filtered the database to keep only the plant species listed by the
challenge.
We built our training sets from these data sets in three different ways:
 – full noplant: we trained our model on all the data we had (plantnet, glc 18
   and no plant), including animal observations and species from other
   kingdoms.
 – full predictions: we used only the Pl@ntnet and glc 18 data sets.
 – filter predictions: we tried a weighted random selection scheme, the same as
   the one described in the challenge notes for the test set. For each
   occurrence s_i in our data set, we computed a weight w_i which corresponds
   to the importance of this plant over an area of 2 kilometres. We then drew a
   random number z_i between 0 and the maximum weight, and kept the occurrence
   only if z_i < w_i. We applied this sampling to 2 data sets: the plantnet
   data set with uncertain identification set at 70%, which gave a training
   set of around 32,000 occurrences after sampling, and the pl trusted data
   set, which gave around 27,000 occurrences after sampling.
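
The selection rule of the filter predictions scheme can be sketched as follows. This is only a minimal illustration under stated assumptions, not the sampling code used by the challenge or by our runs: the function name weighted_subsample and the toy weights are hypothetical, and the computation of the weights w_i from the 2 km neighbourhood is deliberately left out.

```python
import random


def weighted_subsample(weights, rng):
    """Keep occurrence i only if a uniform draw z_i falls below its weight w_i.

    `weights` holds one precomputed weight w_i per occurrence (how they are
    derived from the 2 km neighbourhood is outside this sketch).
    Returns the indices of the occurrences that are kept.
    """
    w_max = max(weights)
    kept = []
    for i, w_i in enumerate(weights):
        z_i = rng.uniform(0.0, w_max)  # random number between 0 and the maximum weight
        if z_i < w_i:
            kept.append(i)
    return kept


# Toy example with hypothetical weights.
rng = random.Random(0)
toy_weights = [0.0, 0.2, 1.0, 0.5]
kept = weighted_subsample(toy_weights, rng)
```

With this rule, an occurrence with a zero weight is never kept, and the probability of keeping an occurrence grows linearly with its weight, which explains why the sampled training sets are much smaller than the original ones.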

2.2    Environmental rasters
As input features, we associated with each occurrence 33 raster windows, each
cropped from the corresponding global environmental raster provided by the
challenge [4]. An extraction protocol is given by the challenge (see footnote 5).
    These environmental rasters were constructed from various open data sets,
including CHELSA Climate, ESDB soil pedological data, Corine Land Cover 2012
soil occupation data, CGIAR-CSI evapotranspiration data, USGS Elevation data
(Data available from the U.S. Geological Survey.) and BD Carthage hydrologic
data. Together, these data aim to represent as well as possible the
environment in which the plant is observed. Figure 1 displays them for a given
area and shows the diversity of information available for one occurrence near
Montpellier.
    In the following, we denote by x ∈ X an occurrence, each x being
associated with a spatial position p(x) in the spatial domain D, a species
label y(x) and an environmental tensor g(x) of size 64x64x33. We denote by P
the set of all spatial positions p covered by X. It is important to note that
a given spatial position p_0 ∈ P usually corresponds to several occurrences
x_j ∈ X, with p(x_j) = p_0, observed at that location. In the training set, up
to several hundred occurrences can be located at the same place.

                          Fig. 1. Environmental rasters

5 http://www.github.com/maximiliense/GLC19

    The environmental data provided [4] is composed of tensors of 64x64x33
pixels. Most of the variables are continuous, such as average temperature,
altitude or distance to water, as can be seen in Figure 1. Thus, the
corresponding 64x64 pixel matrices can be processed as classical image
channels provided as CNN inputs. Some of the variables are of ordinal type
(such as ESDB v2); they can also be treated as additional channels of the
CNN, since the order of their pixel values remains meaningful. This is not
the case for categorical variables such as the Corine Land Cover soil type
variable provided within GeoLifeCLEF. This variable can take up to 48
different categorical values, but 3 of them are not used, so we kept 45
categories. The order of these values has no meaning. Consequently, we
preferred unstacking the corresponding channel into 45 binary channels and
concatenating them with the other 32 channels. Thus, instead of a tensor of
64x64x33 pixels, we obtain a tensor of 64x64x77 pixels.
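
The unstacking of the categorical channel can be illustrated with the short sketch below. It is a simplified example rather than our actual preprocessing code: the function names, the category codes 1 to 45 and the 4x4 patch size are hypothetical, and the tensor is stored channel-first as nested Python lists for readability.

```python
def unstack_categorical(channel, categories):
    """Turn one categorical channel into len(categories) binary channels.

    `channel` is a 2D list of category codes and `categories` lists the codes
    that are kept. Output channel c is 1 where the pixel equals categories[c]
    and 0 elsewhere.
    """
    return [
        [[1 if value == cat else 0 for value in row] for row in channel]
        for cat in categories
    ]


def build_input_tensor(other_channels, categorical_channel, categories):
    """Concatenate the non-categorical channels with the binary channels."""
    return other_channels + unstack_categorical(categorical_channel, categories)


# Toy patch size for readability; the paper uses 64x64 patches,
# 32 non-categorical channels and 45 land cover categories.
size = 4
n_other, n_categories = 32, 45
categories = list(range(1, n_categories + 1))  # hypothetical category codes
other = [[[0.0] * size for _ in range(size)] for _ in range(n_other)]
land_cover = [
    [(r * size + c) % n_categories + 1 for c in range(size)] for r in range(size)
]

tensor = build_input_tensor(other, land_cover, categories)
```

Each pixel activates exactly one of the 45 binary channels, so the network never has to interpret the arbitrary numeric codes of the land cover classes as magnitudes.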


3   Convolutional Neural Network
It has been previously shown in [2] that Convolutional Neural Networks (CNN)
may reach better predictive performance than classical models used in ecology.
    Our model attempts to predict the species most likely to be observed, based
on learned environmental features. Our architecture is very different from the
one used in last year's challenge [5]. We used an Inception V3 convolutional
neural network [6], which is made up of symmetric and asymmetric building
blocks, including convolutions, average pooling, max pooling, concatenations,
dropouts, and fully connected layers. We adapted the first layer of this
architecture so that it accepts 77 input channels, matching the depth of the
environmental tensor, and added a dropout to limit overfitting.
    For the last layer we used a softmax and a cross-entropy loss. The softmax
layer is computed as:

                 p_j = \frac{e^{h_j}}{\sum_{k=1}^{K} e^{h_k}}                 (1)

where h_k are the scores inferred by the network for each class and K is the
number of classes, i.e. the number of species the model was trained with.
The softmax activation p_j for a class j thus depends on all the scores in h.
We then used a cross-entropy loss [7]:

                 Loss = -\sum_{k=1}^{K} y_k \log(p_k)                         (2)

where y_k is a binary indicator equal to 1 if class k is the correct label for
the observation and 0 otherwise, and p_k is the predicted probability that the
observation belongs to class k.
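
As a worked illustration of equations (1) and (2), the sketch below computes the softmax and the cross-entropy loss for a toy score vector. It is written in plain Python for clarity and is not the PyTorch code used in our experiments; the function names and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Equation (1): p_j = exp(h_j) / sum_k exp(h_k)."""
    shift = max(scores)  # subtracting the max leaves p unchanged but avoids overflow
    exps = [math.exp(h - shift) for h in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_class):
    """Equation (2) with a one-hot indicator: only the true-class term remains."""
    return -math.log(probabilities[true_class])


scores = [2.0, 1.0, 0.1]  # toy scores h for K = 3 classes
probs = softmax(scores)
loss = cross_entropy(probs, true_class=0)
```

Because the indicator is 1 for a single class, the sum in equation (2) reduces to the negative log-probability assigned to the correct species.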
    Learning set up and parameters: all our experiments were conducted
using the PyTorch deep learning framework and were run on a single computing
node equipped with 4 Nvidia GTX 1080 Ti GPUs. We used the Stochastic Gradient
Descent optimization algorithm with a learning rate of 0.1, a momentum
of 0.9 and a batch size of 300.
We trained our model over multiple epochs and, to export our results, kept
the epoch at which the model reached the highest Top 1 score on our
validation set. For our best runs (full noplant predictions and full
predictions), we trained for only 20 epochs because of the amount of data;
for the two other runs, we trained our model for 180 epochs.
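
The two ingredients of this set up, the momentum update and the selection of the best epoch, can be sketched as follows. The update is written in the form used by PyTorch's SGD optimizer (velocity = momentum * velocity + gradient), assuming no dampening, weight decay or Nesterov correction, which is an assumption rather than a reported setting. The function names and the toy values are hypothetical.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.1, momentum=0.9):
    """One in-place SGD step with momentum on flat lists of floats."""
    for i, g in enumerate(grads):
        velocity[i] = momentum * velocity[i] + g
        params[i] -= lr * velocity[i]


def best_epoch(validation_top1):
    """Index (0-based) of the epoch with the highest validation Top 1 score."""
    return max(range(len(validation_top1)), key=validation_top1.__getitem__)


# Toy example: one parameter, the same gradient applied twice.
params, velocity = [1.0], [0.0]
sgd_momentum_step(params, [0.5], velocity)
sgd_momentum_step(params, [0.5], velocity)

# Toy validation curve (hypothetical scores, one per epoch).
chosen = best_epoch([0.01, 0.03, 0.02])
```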


4   Results
For GeoLifeCLEF 2019, the organizers chose as the main evaluation criterion
the accuracy over the first 30 answers, also called Top 30, i.e. the function
scoring 1 when the correct species is among the first 30 answers and 0
otherwise. As a
Rank runId top30 runName                                        participant
1    27007 0.1769 full noplant predictions                      LIRMM
2    27086 0.1687 RUN2 GRINNELL FULL INITIAL                    SaraSi
3    27087 0.1653 RUN3 GRINNELL TRANS                           SaraSi
4    27088 0.1646 RUN1 GRINNELL TESTSPECIES INITIAL             SaraSi
5    27006 0.1364 full predictions                              LIRMM
6    26997 0.1342 submit xgb spatial 4x4 all                    SSN CSE
7    26996 0.1288 submit xgb spatial allnoclc                   SSN CSE
8    27013 0.1273 submit xgb dep3 1                             SSN CSE
9    27069 0.1268 submission sel 4x4                            SSN CSE
10   27012 0.1263 submit xgb 4x4 all dep3                       SSN CSE
11   27070 0.1227 submission sel 1                              SSN CSE
12   27064 0.1198 submission1x1                                 SSN CSE
13   27067 0.1135 submission4x4                                 SSN CSE
14   27124 0.1135 Lot Of Lof 2                                  Lot of Lof
15   27089 0.1110 RUN4 ELTON TRANS                              SaraSi
16   27082 0.1090 RUN0 ELTON FULL INITIAL TESTSPECIES           SaraSi
17   26988 0.1063 submit xgb spatial                            SSN CSE
18   27123 0.0984 Lot Of Lof 3                                  Lot of Lof
19   27063 0.0864 Lot Of Lof 1                                  Lot of Lof
20   26875 0.0844 submission                                    SSN CSE
21   27102 0.0834 rfspatial                                     SSN CSE
22   26821 0.0570 submit                                        SSN CSE
23   27004 0.0470 plcomplete predictions                        LIRMM
24   27005 0.0465 inception glc19 filter predictions            LIRMM
25   26968 0.0205 run 14                                        sergiu atodiresei
26   26964 0.0191 run 10                                        sergiu atodiresei
27   26961 0.0190 run 7                                         sergiu atodiresei
28   26971 0.0184 run 17                                        sergiu atodiresei
29   26967 0.0180 run 13                                        sergiu atodiresei
30   26960 0.0168 run 6                                         sergiu atodiresei
31   27062 0.0159 run 20                                        sergiu atodiresei
32   26958 0.0146 run 3                                         sergiu atodiresei
33   26970 0.0102 run 16                                        sergiu atodiresei
34   26969 0.0099 run 15                                        sergiu atodiresei
35   26972 0.0089 run 18                                        sergiu atodiresei
36   26963 0.0079 run 9                                         sergiu atodiresei
37   26965 0.0068 run 11                                        sergiu atodiresei
38   26926 0.0067 run 4                                         sergiu atodiresei
39   26973 0.0064 run 19                                        sergiu atodiresei
40   26959 0.0063 run 5                                         sergiu atodiresei
41   26962 0.0062 run 8                                         sergiu atodiresei
42   26957 0.0061 run 2                                         sergiu atodiresei
43   26966 0.0058 run 12                                        sergiu atodiresei
44   26956 0.0058 run 1                                         sergiu atodiresei
            Table 1. Results of the GeoLifeCLEF 2019 challenge

      Fig. 2. Top 30 accuracy of every participant in GeoLifeCLEF 2019


second metric, they used the Mean Reciprocal Rank (MRR), which allows
comparison with the GeoLifeCLEF 2018 challenge.
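
Both metrics can be sketched in a few lines. This is an illustrative implementation, not the official evaluation script: the function names are hypothetical, and we assume that a test occurrence whose true species is absent from the ranked list contributes 0 to the MRR.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=30):
    """Fraction of occurrences whose true species is among the first k answers."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


def mean_reciprocal_rank(ranked_predictions, true_labels):
    """Average of 1 / rank of the true species (0 when it is not listed)."""
    total = 0.0
    for ranked, truth in zip(ranked_predictions, true_labels):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(true_labels)


# Toy example with three test occurrences and hypothetical species names.
predictions = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truths = ["a", "a", "z"]
top2 = top_k_accuracy(predictions, truths, k=2)
mrr = mean_reciprocal_rank(predictions, truths)
```

In this toy example the true species is ranked first, second and not at all, so the Top 2 accuracy is 2/3 and the MRR is (1 + 1/2 + 0) / 3 = 0.5.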
For this challenge, we submitted 4 different runs. The overall results are
shown in Figure 2 and Table 1.

 – The first run, full noplant predictions, is a CNN trained on every data set
   (plantnet, no plant, glc 18). We used the Pl@ntnet database without any
   filter (2,367,145 plant species occurrences), the train and test occurrences
   from last year's challenge (file GLC 2018.csv) and occurrences from other
   kingdoms (such as mammals, birds, amphibians, insects, fungi, etc.) from the
   GBIF database (file noPlant.csv). There are 34,375 different classes, so
   this CNN has a larger output layer, but at test time we let it choose only
   among the plant classes provided by the challenge. This method gave the
   best result, with a Top 30 of 0.1769 and an MRR of 0.031.
 – The second run, full predictions, is a CNN trained on the plantnet data set
   and the data set of last year's challenge. This means that the model was
   trained on plants only, with 3,859 classes. It ranked 5th in the challenge,
   behind the SaraSi runs, with a Top 30 of 0.1364.
 – The two other runs (plcomplete predictions and inception glc19 filter
   predictions) did not perform as well. For these runs, we sampled our
   training set as explained in the challenge protocol for the test set. This
   reduced the amount of training data, so the model did not capture as much
   information as the other models.
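
The restriction of the full noplant model to plant classes at test time can be sketched as below. This is a minimal illustration under stated assumptions, not our submission code: the function name, the toy scores and the class indices are hypothetical.

```python
def top_k_among_allowed(scores, allowed_classes, k=30):
    """Rank only the allowed class indices by decreasing score, keep the top k."""
    ranked = sorted(allowed_classes, key=lambda c: scores[c], reverse=True)
    return ranked[:k]


# Toy example: 5 output classes, of which 3 are plant species.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]  # hypothetical network outputs
plant_classes = [0, 2, 4]                # hypothetical indices of plant species
top2 = top_k_among_allowed(scores, plant_classes, k=2)
```

Classes 1 and 3 have the highest scores here but are ignored because they are not plants, so the returned list only contains plant indices, ordered by score.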

    In this case, it seems that the more data is available to the CNN, the
better the results. Our model appears to learn from the species of other
kingdoms that live in the same area. Indeed, the model trained on more than
plant occurrences alone gave better results for predicting plants: the CNN
has more classes to separate, yet classifies plants better. This suggests
that our CNN captures not only environmental information but also
interactions between different species. We can imagine that it also learns
from the species that live around the plants, and perhaps from the
ecological niche of an occurrence.


5    Conclusion
This paper describes our participation in the GeoLifeCLEF challenge, which
evaluates location-based species prediction models. We compared convolutional
neural networks trained on different data sets. The main conclusion of our
study is that the convolutional neural network was the best-performing model
and that it can learn information not only from environmental rasters but
also from interactions with species of other kingdoms. Indeed, it achieved
the best performance of the whole GeoLifeCLEF challenge when we trained it
with observations from other kingdoms.
    In future work, we will attempt to better understand what information the
CNN does capture from different data sets and how it could be improved.


References
1. Kearney, M., Porter, W.: Mechanistic niche modelling: combining physiological and
   spatial data to predict species’ ranges. Ecology Letters 12(4) (2009) pp. 334–350
2. Botella, C., Joly, A., Bonnet, P., Monestiez, P., Munoz, F.: A deep learning approach
   to species distribution modelling. (2018)
3. Phillips, S.J., Anderson, R.P., Schapire, R.E.: Maximum entropy modeling of species
   geographic distributions. Ecological Modelling 190(3) (2006) pp. 231–259
4. Botella, C., Servajean, M., Bonnet, P., Joly, A.: Overview of GeoLifeCLEF 2019: plant
   species prediction using environment and animal occurrences. In: CLEF working
   notes 2019. (2019)
5. Deneu, B., Servajean, M., Botella, C., Joly, A.: Location-based species
   recommendation using co-occurrences and environment - GeoLifeCLEF 2018 challenge.
   In: CLEF: Conference and Labs of the Evaluation Forum. CEUR Workshop
   Proceedings, Avignon, France (September 2018)
6. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception
   architecture for computer vision. CoRR abs/1512.00567 (2015)
7. de Boer, P.T., Kroese, D.P., Mannor, S., Rubinstein, R.Y.: A tutorial on the
   cross-entropy method. Annals of Operations Research 134 (Feb 2005) pp. 19–67