<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Plant identification based on noisy web data: the amazing performance of deep learning (LifeCLEF 2017)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hervé Goëau</string-name>
          <email>herve.goeau@cirad.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre Bonnet</string-name>
          <email>pierre.bonnet@cirad.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexis Joly</string-name>
          <email>alexis.joly@inria.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIRAD, UMR AMAP</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Inria ZENITH team</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LIRMM</institution>
          ,
          <addr-line>Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The 2017 edition of the LifeCLEF plant identification challenge is an important milestone towards automated plant identification systems working at the scale of continental floras, with 10,000 plant species living mainly in Europe and North America illustrated by a total of 1.1M images. Nowadays, such ambitious systems are enabled by the conjunction of the dazzling recent progress in image classification with deep learning and several outstanding international initiatives, such as the Encyclopedia of Life (EOL), aggregating the visual knowledge on plant species coming from the main national botany institutes. However, despite all these efforts, the majority of plant species still remain without pictures or are poorly illustrated. Outside the institutional channels, a much larger number of plant pictures are available and spread on the web through botanist blogs, plant lovers' web pages, image hosting websites and online plant retailers. The LifeCLEF 2017 plant challenge presented in this paper aimed at evaluating to what extent a large noisy training dataset collected through the web and containing many labelling errors can compete with a smaller but trusted training dataset checked by experts. To fairly compare both training strategies, the test dataset was created from a third data source, i.e. the Pl@ntNet mobile application, which collects millions of plant image queries all over the world. This paper presents the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.</p>
      </abstract>
      <kwd-group>
        <kwd>LifeCLEF</kwd>
        <kwd>plant</kwd>
        <kwd>leaves</kwd>
        <kwd>leaf</kwd>
        <kwd>flower</kwd>
        <kwd>fruit</kwd>
        <kwd>bark</kwd>
        <kwd>stem</kwd>
        <kwd>branch</kwd>
        <kwd>species</kwd>
        <kwd>retrieval</kwd>
        <kwd>images</kwd>
        <kwd>collection</kwd>
        <kwd>species identification</kwd>
        <kwd>citizen-science</kwd>
        <kwd>fine-grained classification</kwd>
        <kwd>evaluation</kwd>
        <kwd>benchmark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Thanks to the long-term efforts made by the biodiversity informatics community,
it is now possible to aggregate validated data about tens of thousands of species
worldwide. The international initiative Encyclopedia of Life (EoL), in particular,
is one of the largest repositories of plant pictures. However, despite these efforts,
the majority of plant species living on earth are still very poorly illustrated,
with typically only a few pictures of a single specimen or herbarium scans. More
pictures are available on the Web through botanist blogs, plant lovers' web pages,
image hosting websites and online plant retailers. But this data is much harder
to structure and contains a high degree of noise. The LifeCLEF 2017 plant
identification challenge proposes to study to what extent such noisy web data is
competitive with a relatively smaller but trusted training set checked by experts.
As a motivation, a previous study conducted by Krause et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] concluded that
training deep neural networks on noisy data was unreasonably effective for
fine-grained recognition. The PlantCLEF challenge completes their work in several
respects:
1. it extends their result to the plant domain, whose specificity is that the
available data on the web is scarcer, the risk of confusion higher and, finally,
the degree of noise higher;
2. it scales the comparison between trusted and noisy training data to 10K
species, whereas the trusted training sets used in their study were actually
limited to a few hundred species;
3. it uses a third-party test dataset that is not a subset of either the noisy
dataset or the trusted dataset, which allows a fairer comparison. More
precisely, the test data is composed of images submitted by the crowd of users
of the mobile application Pl@ntNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Consequently, it exhibits different
properties in terms of species distribution, picture quality, etc.
      </p>
      <p>In the following sections, we synthesize the resources and assessments of
the challenge, summarize the approaches and systems employed by the
participating research groups, and provide an analysis of the main outcomes.</p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>
        To evaluate the above-mentioned scenario at a large scale and in realistic
conditions, we built and shared three datasets coming from different sources. As
training data, in addition to the data of the previous PlantCLEF challenge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
we provided two new large datasets, both based on the same list of 10,000 plant
species (living mainly in Europe and North America):
Trusted Training Set EoL10K: a trusted training set based on the
online collaborative Encyclopedia of Life (EoL). The 10K species were selected
as the most populated species in EoL data after a curation pipeline (taxonomic
alignment, duplicate removal, herbarium sheet removal, etc.). The training set
contains 256,287 pictures in total but has a strong class imbalance, with a
minimum of 1 picture for Achillea filipendulina and a maximum of 1245 pictures for
Taraxacum laeticolor.
      </p>
      <p>Noisy Training Set Web10K : a noisy training set built through Web crawlers
(Google and Bing image search engines) and containing 1.1M images. This
training set is also imbalanced with a minimum of 4 pictures for Plectranthus
sanguineus and a maximum of 1732 pictures for Fagus grandifolia.</p>
      <p>The main objective of providing both datasets was to evaluate to what extent
machine learning techniques can learn from noisy data compared to trusted
data (as usually done in supervised classification). Pictures of EoL
themselves come from different sources, including institutional databases as well as
public data sources such as Wikimedia, iNaturalist, Flickr or various websites
dedicated to botany. This aggregated data is continuously revised and rated by
the EoL community, so that the quality of the species labels is globally very
good. On the other hand, the noisy web dataset contains more images but with
several types and levels of noise: some images are labeled with the wrong species
name (but sometimes with the correct genus or family), some are portraits of
a botanist specialized in the targeted species, some are labeled with the correct
species name but are drawings or herbarium sheets, etc.</p>
      <p>Pl@ntNet test set: the test data to be analyzed within the challenge is a
large sample of the query images submitted by the users of the mobile
application Pl@ntNet (iPhone4 &amp; Android5). It contains a large number of wild plant
species, mostly coming from the Western European flora and the North American
flora, but also plant species used all around the world as cultivated or
ornamental plants, including some endangered species.</p>
    </sec>
    <sec id="sec-3">
      <title>Task Description</title>
      <p>Based on the previously described testbed, we conducted a system-oriented
evaluation involving different research groups who downloaded the data and ran their
systems.</p>
      <p>Each participating group was allowed to submit up to 4 run files built from
different methods (a run file is a formatted text file containing the species
predictions for all test items). Semi-supervised, interactive or crowdsourced approaches
were allowed but had to be clearly signaled within the submission system. None
of the participants employed such methods.</p>
      <p>The main evaluation metric is the Mean Reciprocal Rank (MRR), a statistical
measure for evaluating any process that produces a list of possible responses to
a sample of queries, ordered by probability of correctness. The reciprocal rank
of a query response is the multiplicative inverse of the rank of the first correct
answer. The MRR is the average of the reciprocal ranks over the whole test set:
MRR = (1/|Q|) * sum_{i=1}^{|Q|} (1/rank_i),
where |Q| is the total number of query occurrences in the test set and rank_i is
the rank of the first correct answer for the i-th query.
4 https://itunes.apple.com/fr/app/plantnet/id600547573?mt=8
5 https://play.google.com/store/apps/details?id=org.plantnet</p>
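<p>For illustration, the metric defined above can be re-implemented in a few lines of Python (an illustrative sketch, not the official evaluation code of the challenge; the species names in the example are placeholders):</p>

```python
def reciprocal_rank(ranked_species, true_species):
    """Return 1/rank of the correct species in a ranked prediction list,
    or 0 if the correct species is not predicted at all."""
    for position, species in enumerate(ranked_species, start=1):
        if species == true_species:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(predictions, ground_truth):
    """MRR = (1/|Q|) * sum of reciprocal ranks over all |Q| queries."""
    rr = [reciprocal_rank(p, t) for p, t in zip(predictions, ground_truth)]
    return sum(rr) / len(rr)

# Example: three queries with the correct species at ranks 1, 2 and 3.
preds = [["Fagus grandifolia", "Quercus robur"],
         ["Quercus robur", "Fagus grandifolia"],
         ["Acer campestre", "Quercus robur", "Fagus grandifolia"]]
truth = ["Fagus grandifolia"] * 3
print(round(mean_reciprocal_rank(preds, truth), 3))  # (1 + 1/2 + 1/3)/3 = 0.611
```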
    </sec>
    <sec id="sec-4">
      <title>Participants and methods</title>
      <p>
        80 research groups registered to the LifeCLEF plant challenge 2017. Among this
large raw audience, 8 research groups finally succeeded in submitting run files.
Details of the methods and evaluated systems are synthesized below and
further developed in the working notes of the participants (CMP [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], FHDO
BCSG [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], KDE TUT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Mario MNB [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Sabanci Gebze [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], UM [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and
UPB HES SO [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]). Table 1 reports the results achieved by each run as well
as a brief synthesis of the methods used in each of them. In complement, the
following paragraphs give a few more details about the methods and the overall
strategy employed by each participant.
      </p>
      <p>
        CMP, Czech Republic, 4 runs, [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]: this participant based his work on the
Inception-ResNet-v2 architecture [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which introduces inception modules with
residual connections. An additional maxout fully-connected layer with batch
normalization was added on top of the network, before the classification
fully-connected layer. Hard bootstrapping was used for training with noisy labels. A
total of 17 models were trained using different training strategies: with
or without maxout, with or without pre-training on ImageNet, with or without
bootstrapping, with and without filtering of the noisy web dataset. CMP Run 1
is the combination of all 17 networks by averaging their results. CMP Run
3 is the combination of the 8 networks that were trained on the trusted EOL
data only. CMP Run 2 and CMP Run 4 are post-processings of CMP Run 1
and CMP Run 3 aimed at compensating for the asymmetry of class distributions
between the test set and the training sets.
      </p>
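<p>The hard bootstrapping used by CMP can be sketched as a modified cross-entropy whose target mixes the (possibly wrong) noisy label with the model's own most confident prediction. The Python sketch below is illustrative only; the mixing weight beta is a hyperparameter whose value in CMP's system is not specified here:</p>

```python
import math

def hard_bootstrap_loss(probs, noisy_label, beta=0.8):
    """Cross-entropy against a mixed target: beta * one-hot(noisy label)
    + (1 - beta) * one-hot(model's own argmax prediction)."""
    predicted = max(range(len(probs)), key=lambda k: probs[k])
    loss = 0.0
    for k, p in enumerate(probs):
        target = beta * (1.0 if k == noisy_label else 0.0) \
               + (1.0 - beta) * (1.0 if k == predicted else 0.0)
        if target > 0.0:
            loss -= target * math.log(p)
    return loss

# If the model confidently disagrees with the noisy label, part of the loss
# is anchored to its own prediction, dampening gradients from wrong labels.
probs = [0.7, 0.2, 0.1]                     # model believes class 0
print(hard_bootstrap_loss(probs, noisy_label=1))  # mixes classes 1 and 0
```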
      <p>
        FHDO BCSG, Germany, 4 runs, [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]: this participant also used the
Inception-ResNet-v2 architecture. Run 1 is based exclusively on the trusted EOL
dataset, following a two-phase fine-tuning approach: in a first phase, only the
last output layer is trained for a few epochs with a small learning rate, starting
from randomly initialized weights, and in a second phase the entire network is
trained for numerous epochs with a larger learning rate. For the test set, they
used an oversampling technique to increase the number of test samples with
10 crops (1 center, 4 corners and the mirrored crops). For Run 2, they kept
the same architecture but extended the trusted dataset with a filtered subset of
the noisy dataset: images from the web were added if their species label was in
the top-5 predictions of the model used in Run 1. Run 3 is the combination
of Runs 1 and 2.
      </p>
      <p>
        KDE TUT, Japan, 4 runs, [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]: this participant introduced a modified
version of the ResNet-50 model. Three of the intermediate convolutional layers used
for downsampling were modified by changing the stride value from 2 to 1 and
preceding them by max-pooling with a stride of 2, to optimize the coverage of the
inputs. Additionally, they swapped the downsampling operation with the
convolution to delay the downsampling, which has been shown to improve
performance by the authors of the ResNet architecture themselves. During
training they used data augmentation based on random crops, rotations and
optional horizontal flipping. Test images were also augmented through a single flip
operation and the resulting predictions averaged. Since the original ResNet-50
architecture was modified, no fine-tuning was used and the weights were learned
from scratch, starting with a large learning rate of 0.1. The learning rate
was dropped twice (to 0.01 and then 0.001) over 100 epochs according to a
schedule ratio of 4:2:1 indicating the number of iterations using the same learning rate
(limited to a total of 350,000 iterations in the case of the big noisy
dataset due to technical limitations). Runs 1, 2 and 3 were trained respectively on
the trusted dataset, the noisy dataset, and both datasets. The final Run 4 is a
combination of the outputs of the 3 runs.
      </p>
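<p>The top-5 filtering strategy used by FHDO BCSG for Run 2 can be sketched as follows (an illustrative sketch with hypothetical names; in the real pipeline, the model trained on trusted data is run over each crawled web image). Note that, as discussed in the results section, this kind of filtering did not ultimately improve performance:</p>

```python
def filter_noisy_images(noisy_items, trusted_model_topk, k=5):
    """Keep a web image only if its (possibly wrong) species label appears
    in the top-k predictions of a model trained on trusted data only.

    noisy_items: list of (image_id, species_label) pairs
    trusted_model_topk: callable mapping image_id -> ranked species list
    """
    kept = []
    for image_id, label in noisy_items:
        if label in trusted_model_topk(image_id)[:k]:
            kept.append((image_id, label))
    return kept

# Toy example with a stand-in for the trusted model's ranked predictions.
ranked = {"img1": ["A", "B", "C", "D", "E", "F"],
          "img2": ["B", "C", "D", "E", "F", "A"]}
kept = filter_noisy_images([("img1", "A"), ("img2", "A")], lambda i: ranked[i])
print(kept)  # only ("img1", "A") survives the top-5 check
```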
      <p>
        PlantNet, France, 1 run: the PlantNet team provided a baseline for the
task with the system used in the Pl@ntNet app, based on a slightly modified version
of the Inception-v1 model [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] as described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The system also includes a number
of thresholding and rejection mechanisms that are useful within the mobile app
but that also degrade the raw classification performance. This team submitted
only one run, trained on the trusted EOL dataset.
      </p>
      <p>
        Sabanci-Gebze, Turkey, 4 runs, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: inspired by the good results achieved
last year with a combination of a GoogLeNet [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and a VGGNet [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], this
team ran an ensemble classifier of 9 VGGNets. Each network was trained with
data augmentation techniques using random crops and random horizontal flips,
fine-tuning only the last two layers due to technical limitations. The
submitted Run 2 used models learned only on the EOL trusted dataset. For the
remaining runs, the models were trained for supplementary epochs introducing
complementary training images selected from the noisy dataset (about 60,000
images which matched the ground truth according to the models trained
for Run 2). Runs 1, 3 and 4 used respectively a Borda count, a maximum
confidence rule and a weighted combination of the outputs of the classifiers.
      </p>
      <p>
        Mario TSA Berlin, Germany, 4 runs, [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]: this participant used ensembles
of fine-tuned CNNs pre-trained on ImageNet, based on 3 architectures (GoogLeNet,
ResNet-152 and ResNeXt), each trained with bagging techniques. Data
augmentation multiplied the number of training images by 5, with random
cropping, horizontal flipping, and variations of saturation, lightness and rotation;
the intensity of these last three transformations decreased along with the
learning rate during training, to let the CNNs see patches
closer to the original image at the end of each training process. Test images
were also augmented and the resulting predictions averaged. MarioTsaBerlin
Run 1 results from the combination of the 3 architectures trained on the trusted
datasets only (EOL and PlantCLEF2016). Run 2 exploited both the trusted and
the noisy datasets to train four GoogLeNets, one ResNet-152 and one ResNeXt.
In Run 3, two additional GoogLeNets and one ResNeXt were trained using a
filtered version of the web dataset and images of the test set that received a
probability higher than 0.98 in Run 1. The last and "winning" run, MarioTsaBerlin
Run 4, finally combined all 12 trained models.
      </p>
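<p>The prediction-averaging scheme underlying these ensembles with test-time augmentation can be sketched as follows (illustrative only; `models` and `crops` stand in for the trained CNNs and the augmented test patches, e.g. 12 models x 5 patches = 60 averaged distributions per test image for the best run):</p>

```python
def ensemble_predict(models, crops):
    """Average class-probability distributions over every (model, crop)
    pair, as in an ensemble combined with test-time augmentation."""
    n_classes = len(models[0](crops[0]))
    summed = [0.0] * n_classes
    count = 0
    for model in models:
        for crop in crops:
            for k, p in enumerate(model(crop)):
                summed[k] += p
            count += 1
    return [s / count for s in summed]

# Toy example: two "models" returning fixed distributions for any crop.
m1 = lambda crop: [0.6, 0.4]
m2 = lambda crop: [0.2, 0.8]
print(ensemble_predict([m1, m2], crops=["center", "flipped"]))  # [0.4, 0.6]
```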
      <p>
        UM, Malaysia &amp; UK, 4 runs, [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]: this participant proposed an original
architecture called Hybrid Generic-Organ CNN (HGO-CNN) that was trained on the
trusted dataset (UM Run 1). Unfortunately, it performed worse than a standard
VGGNet model learned on the noisy dataset (UM Run 2). This can be partially
explained by the fact that the HGO-CNN model needs tagged images (flower,
fruit, leaf, ...), an information missing for the noisy dataset and only partially
available for the trusted dataset.
      </p>
      <p>
        UPB HES SO, Switzerland, 4 runs, [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]: this team trained the
historical AlexNet model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] using exclusively the trusted training dataset, and
focused the experiments on the solver part. Run 1 did not use weight decay for
regularization. Run 2 applied a large learning rate factor on the last
layer without updating this value during training. Runs 3 and 4 used a usual
learning rate schedule.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>We report in Figure 1 the performance achieved by the 29 collected runs. Table
1 provides the results achieved by each run as well as a brief synthesis of the
methods used in each of them.</p>
      <p>
        Trusted or noisy? As a first noticeable remark, the measured performances
are very high despite the difficulty of the task, with a median Mean
Reciprocal Rank (MRR) around 0.8 and a highest MRR of 0.92 for the best system,
Mario MNB Run 4. A second important remark is that the best results were
obtained mostly by systems that were learned on both the trusted and the noisy
datasets. Only two runs (KDE TUT Run 2 and UM Run 2) used exclusively
the noisy dataset, but they gave better results than most of the methods using only
the trusted dataset. Several teams also tried to filter the noisy dataset, based on
the predictions of a preliminary system trained only on the trusted dataset (i.e.
by rejecting pictures whose label contradicts the prediction). However,
this strategy did not improve the final predictor and even degraded the results.
For instance, Mario MNB Run 2 (using the raw Web dataset) performed better
than Mario MNB Run 3 (using the filtered Web dataset).
      </p>
      <p>
        Succeeding strategies with CNN models: regarding the methods used, all
submitted runs were based on Convolutional Neural Networks (CNN),
definitively confirming the supremacy of this kind of approach over previous methods.
A wide variety of popular architectures were trained from scratch or fine-tuned
from weights pre-trained on the ImageNet dataset: GoogLeNet [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and its
improved inception-v2 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and inception-v4 [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] versions, Inception-ResNet-v2 [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], ResNet-50
and ResNet-152 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], ResNeXt [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], VGGNet [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and even AlexNet [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. One
can notice that inception-v3 was not experimented with, despite being a
recent model giving state-of-the-art performances on other image classification
benchmarks. It is important to notice that the best results of each team were obtained
with classifier ensembles (in particular Mario TSA Run 4, KDE TUT Run 4 and
CMP Run 1). Bootstrap aggregating (bagging) was very efficient in this context
to extend the number of classifiers by learning several models with the same
architecture but on different training and validation subsets. This is the case for
the best run, Mario TSA Run 4, which combined 7 GoogLeNets, 2 ResNet-152s and 3
ResNeXts trained on different datasets. The CMP team also combined numerous
models (up to 17 in Run 1) with various subsets of the training data and bagging
strategies, but all with the same Inception-ResNet-v2 architecture. Another key
to succeeding at the task was the use of data augmentation with usual
transformations such as random cropping, horizontal flipping and rotation, to increase
artificially the number of training samples and help the CNNs generalize
better. The two best teams used data augmentation in both the training and
the test phase. The Mario MNB team added two more interesting transformations,
i.e. color saturation and lightness modifications. They also correlated the
intensity of these transformations with the diminution of the learning rate during
training, to let the CNNs see patches closer to the original image at the end
of each training process. Last but not least, Mario MNB is the only team who
used exactly the same transformations in the training and test phases. Besides
the use of ensembles of classifiers, some teams also proposed modifications
of existing models. KDE TUT, in particular, modified the architecture of the
first convolutional layers of ResNet-50 and report consistent improvements in
their validation experiments [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. CMP also reported slight improvements on
Inception-ResNet-v2 by using a maxout activation function instead of ReLU. The
UM team proposed an original architecture called Hybrid Generic-Organ CNN, learned
on the trusted dataset (UM Run 1). Unfortunately, it performed worse than a
standard VGGNet model learned on the noisy dataset (UM Run 2). This can be
partially explained by the fact that the HGO-CNN model needs tagged images
(flower, fruit, leaf, ...), an information missing for the noisy dataset and only
partially available for the trusted dataset.
      </p>
      <p>
        The race for the most recent model? One can suppose that the most recent
models, such as Inception-ResNet-v2 or inception-v4, should lead to better results
than older ones such as AlexNet, VGGNet and GoogLeNet. For instance, the
runs with GoogLeNet and VGGNet by Sabanci [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], with a PReLU version of
inception-v1 by the PlantNet team, or with the historical AlexNet architecture
by the UPB HES SO team [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] yielded the worst results. However, one can
notice that the "winning" team also used numerous GoogLeNet models, while the
old VGGNet used in UM Run 2 gave quite high, intermediate results around an
MRR of 0.8. This highlights how important the training strategies are and
how ensembles of classifiers, bagging and data augmentation can greatly improve the
performance even without the most recent architectures from the state of the art.
      </p>
      <p>
        The race for GPUs: as discussed above, the best performances were obtained
with ensembles of very deep networks trained over millions of images with heavy
data augmentation techniques. In the case of the best run, Mario MNB Run 4,
test images were also augmented, so that the prediction for a single image finally
relies on the combination of 60 probability distributions (5 patches x 12 models).
Overall, the best performing systems require a huge GPU consumption, so that
their use in data-intensive contexts is limited by cost issues (e.g. the Pl@ntNet
mobile application accounts for millions of users). A promising solution to
this issue could be to rely on knowledge distilling [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Knowledge distilling
consists in transferring the generalization ability of a cumbersome model to a small
model by using the class probabilities produced by the cumbersome model as soft
targets for training the small model. Alternatively, more efficient architectures
and learning procedures should be devised.
      </p>
      <p>
        Performances by organs: the main idea here is to evaluate which kinds of
organs and associations of organs provide the best performances. Figure 2 gives
the min, max and average MRR scores for all runs, detailed for the 10 most
representative organ combinations (with at least 100 observations in the test dataset).
Surprisingly, the graph reveals that the majority of organ combinations share
more or less the same MRR scores, around 0.7 on average, highlighting how much
the systems based on CNNs tend to be robust to any combination of pictures.
However, as we already noticed in previous years of LifeCLEF, the majority of
the systems performed clearly better when a test observation contains one or
several pictures of flowers exclusively. Using at least one picture of a flower in a test
observation with other types of organs guarantees in a sense good identification
performances, if we look at the next three organ combinations (flower-fruit-leaf,
flower-leaf and flower-leaf-stem). On the opposite side, the systems have more
difficulties when a test observation contains pictures of leaves without any
flowers. It gets worse when an observation combines only pictures of leaves and
stems. This could be explained by the fact that stems are visually very different
from leaves, and both these kinds of pictures produce dissimilar and
non-complementary sets of species on the outputs of the CNNs. As a
complementary remark, we can notice that combining several pictures from
different types of organs generally causes wider ranges of min and max scores, highlighting
how sensitive the combination of organs can be to an inappropriate combination rule.
      </p>
      <p>
        Biodiversity-friendly evaluation metric: as in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the main idea here is to
evaluate how automated identification systems deal with the long-tail problem,
i.e. how well an automated system performs along the long-tailed distribution,
where basically very few species are well populated in terms of observations while
the vast majority of species contain few images. We therefore split the species into
3 categories according to the number of observations populating these species
in the datasets. Figure 3 gives the detailed MRR scores for three categories
of species: with a low, intermediate and high number of (trusted and noisy)
training images, respectively between 4 and 161 images, between 162 and 195
images, and between 196 and 1583 images (the three categories are balanced in
terms of total number of training images). First we can notice that, as
expected, for the majority of the systems, the performances are clearly lower on
the "intermediate" species than on the "high" species, and even more so on the "low"
species category. For instance, FHDO BCSG Run 3 is very affected by the
long-tail distribution problem, with a difference in MRR scores of about 0.5 between
the "high" and the "low" categories. However, on the opposite side, some runs,
like Mario TSA Berlin Runs 2 and 4, KDE TUT Runs 2 and 3, or to a lesser extent UM
Run 2, are definitely "biodiversity-friendly" since they are little affected by
the long-tail distribution and are able to maintain more or less equivalent MRR
scores for the three species categories. We can specifically highlight Run 2
from KDE TUT which, while using "only" three ResNet-50 models learned
from scratch, is able to guarantee an MRR score around 0.79 almost
independently of the number of training images per species. Moreover, we can notice
that all these remarkable runs, produced by the KDE TUT, Mario TSA Berlin and
UM teams, share the fact that they all used the entire noisy dataset without any
filtering process. All the attempts at filtering the noisy dataset seem to degrade
the performances on the "intermediate" and "low" categories, as for instance
for Mario TSA Berlin Runs 2 and 3 (respectively without and with filtering).
      </p>
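<p>The knowledge distilling idea mentioned above (under "The race for GPUs") can be sketched as training a small model against the softened class probabilities of a cumbersome model. This is a minimal illustration of soft targets with a temperature, not the implementation of any participant's system; the temperature value is an arbitrary example:</p>

```python
import math

def softmax_with_temperature(logits, T):
    """Soften a logit vector; a higher temperature T spreads the
    probability mass over more classes."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the cumbersome model's softened outputs
    (soft targets) and the small model's softened outputs."""
    teacher = softmax_with_temperature(teacher_logits, T)
    student = softmax_with_temperature(student_logits, T)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

# The student is pushed to match the teacher's full distribution, which
# carries more information than a one-hot label (e.g. confusable species).
print(distillation_loss([2.0, 1.0, 0.1], [3.0, 1.0, 0.2]))
```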
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>This paper presented the overview and the results of the LifeCLEF 2017 plant
identification challenge, following the six previous ones conducted within the CLEF
evaluation forum. This year the task was performed on the largest plant
image dataset ever published in the literature. This dataset was composed of two
distinct sources: a trusted set built from the online Encyclopedia of Life, and
a noisy dataset illustrating the same 10K species with more than 1M images
crawled from the web without any filtering. The main conclusion of our
evaluation is that convolutional neural networks (CNN) appear to be amazingly
effective in the presence of noise in the training set. All networks trained solely
on the noisy dataset outperformed the same models trained on the trusted
data. Even at a constant number of training iterations (i.e. at a constant
number of images passed to the network), it was more profitable to use the noisy
training data. This means that diversity in the training data is a key factor in
improving the generalization ability of deep learning. The noise itself seems to act
as a regularizer of the model. Beyond technical aspects, this conclusion is of
high importance in botany and biodiversity informatics in general. Data
quality and data validation issues are of crucial importance in these fields, and our
conclusion is somehow disruptive.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Affouard, A.,
          <string-name>
            <surname>Goeau</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lombardo</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Pl@ntnet app in the era of deep learning</article-title>
          .
          <source>In: 5th International Conference on Learning Representations (ICLR 2017), April 24-26, Toulon, France</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Atito</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yanikoglu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aptoula</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Plant identification with large number of classes: SabanciU-GebzeTU system in PlantCLEF 2017</article-title>
          . In:
          <source>Working notes of CLEF 2017 conference</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Goeau, H.,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Plant identification in an open-world (LifeCLEF 2016)</article-title>
          . In:
          <source>Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum</source>
          , Evora, Portugal, 5-8 September 2016. pp.
          <fpage>428</fpage>
          -
          <lpage>439</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hang</surname>
            ,
            <given-names>S.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aono</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Residual network with delayed max pooling for very large scale plant identification</article-title>
          .
          <source>In: Working notes of CLEF 2017 conference</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distilling the knowledge in a neural network</article-title>
          .
          <source>arXiv preprint arXiv:1503.02531</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ioffe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>
          .
          <source>CoRR abs/1502.03167</source>
          (
          <year>2015</year>
          ), http://arxiv.org/abs/1502.03167
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Goeau, H.,
          <string-name>
            <surname>Barbe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Selmi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Champ</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dufour-Kowalski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Affouard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carre</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Molino</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          , et al.:
          <article-title>A look inside the Pl@ntNet experience</article-title>
          .
          <source>Multimedia Systems</source>
          pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Goeau, H.,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spampinato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          , Fisher, R., Muller, H.:
          <article-title>Are species identification tools biodiversity-friendly?</article-title>
          <source>In: Proceedings of the 3rd ACM International Workshop on Multimedia Analysis for Ecological Data</source>
          . pp.
          <fpage>31</fpage>
          -
          <lpage>36</lpage>
          . MAED '14, ACM, New York, NY, USA (
          <year>2014</year>
          ), http://doi.acm.org/10.1145/2661821.2661826
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sapp</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toshev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duerig</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Philbin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>The unreasonable effectiveness of noisy data for fine-grained recognition</article-title>
          .
          <source>In: European Conference on Computer Vision</source>
          . pp.
          <fpage>301</fpage>
          -
          <lpage>320</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          :
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lasseck</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Image-based plant species identification with deep convolutional neural networks</article-title>
          .
          <source>In: Working notes of CLEF 2017 conference</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>Y.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chan</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          :
          <article-title>LifeCLEF 2017 plant identification challenge: Classifying plants using generic-organ correlation features</article-title>
          .
          <source>In: Working notes of CLEF 2017 conference</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ludwig</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piorek</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelch</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rex</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koitka</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          :
          <article-title>Improving model performance for plant image classification with filtered noisy images</article-title>
          .
          <source>In: Working notes of CLEF 2017 conference</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>CoRR abs/1409.1556</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sulc</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Learning with noisy and trusted labels for fine-grained plant recognition</article-title>
          .
          <source>In: Working notes of CLEF 2017 conference</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ioffe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alemi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Inception-v4, Inception-ResNet and the impact of residual connections on learning</article-title>
          .
          <source>arXiv preprint arXiv:1602.07261</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sermanet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anguelov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabinovich</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Toma</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stefan</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>UPB HES SO @ PlantCLEF 2017: Automatic plant image identification using transfer learning via convolutional neural networks</article-title>
          .
          <source>In: Working notes of CLEF 2017 conference</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>