<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bird Species Recognition via Neural Architecture Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Markus Muhling</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jakob Franz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolaus Korfhage</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernd Freisleben</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Marburg</institution>
          ,
          <addr-line>Hans-Meerwein-Straße 6, D-35032 Marburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the winning approach of the BirdCLEF 2020 challenge. The challenge is to automatically recognize bird sounds in continuous soundscapes. In our approach, a deep convolutional neural network model is used that directly operates on the audio data. This neural network architecture is based on a neural architecture search and contains multiple auxiliary heads and recurrent layers. During the training process, scheduled drop path is used as a regularization method, and extensive data augmentation is applied to the audio input. Furthermore, species location lists are used in the post-processing step to reject unlikely classes. Our best run on the test set obtains a classification mean average precision (cmap) score of 13.1% and a retrieval mean average precision (rmap) score of 19.2%.</p>
      </abstract>
      <kwd-group>
        <kwd>Bird Species Recognition</kwd>
        <kwd>BirdCLEF 2020</kwd>
        <kwd>Neural Architecture Search</kwd>
        <kwd>Gabor Wavelet Layer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Automatically identifying bird species in continuous sound recordings is an
important task for monitoring the populations of different bird species in forest
ecosystems. In the BirdCLEF 2020 challenge [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which is part of the LifeCLEF
challenge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], participants have to identify 960 bird species in 5-second
snippets of continuous audio recordings. The test data contains 153 soundscapes of
10 minutes length recorded at four different locations in Peru (Concesion de
Ecoturismo Inka Terra), the USA (Sierra Nevada/High Sierra in California and
Sapsucker Woods/Ithaca in New York), and Germany (Laubach). Each
soundscape contains large numbers of (overlapping) bird vocalizations. In contrast to
the BirdCLEF 2019 challenge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], no data other than the provided training data is
allowed to build the recognition system. This prohibits fine-tuning convolutional
neural networks (CNNs) pretrained on other datasets, as applied in the best
approaches of previous BirdCLEF challenges. This restriction is reasonable, since fine-tuning
a CNN pretrained for the task of image classification on the ILSVRC dataset
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] has proven to yield excellent results for many tasks, even for the task of bird
recognition using spectrogram images.
      </p>
      <p>This restriction makes it even more important to find an optimal neural
network architecture for the BirdCLEF 2020 challenge. For this purpose, we applied a
neural architecture search (NAS) approach. NAS is a current field of research that
allows an optimal neural network architecture for a specific problem to be learned
automatically and offers an alternative to the time-consuming task of manual
architecture optimization.</p>
      <p>
        The designed architecture operates directly on the audio input using a
Gabor wavelet transformation, similar to Zeghidour et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for speech recognition.
This transformation is integrated into the neural network architecture as a
complex 1-D convolutional layer.
      </p>
      <p>The contributions of the paper are as follows:
- A novel neural architecture search approach based on a memetic algorithm
is used to find an optimal neural network architecture for the task of bird
sound recognition.
- The proposed neural network architecture directly operates on the audio
input.</p>
      <p>The paper is organized as follows. Section 2 describes the bird recognition
approach, including data pre-processing, data augmentation, the construction of
the neural network architecture, and details of the training process. Experimental
results are presented in Section 3. Section 4 concludes the paper and outlines
areas for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>In this section, the proposed system for bird sound recognition is presented.
Section 2.1 describes the pre-processing steps. The data augmentation methods used
during the training process are specified in Section 2.2. The design of the neural
network architecture is explained in Section 2.3, including the Gabor wavelet
layer, the NAS approach, and the composition of the neural network using
multiple auxiliary heads and recurrent layers. Section 2.4 provides information about
the training process.</p>
      <sec id="sec-2-1">
        <title>Data Pre-processing</title>
        <p>
          First, the audio recordings are split into 5-second segments with an overlap of
0.25 seconds. Then, these snippets are classified as either bird sound or noise
(i.e., no audible bird sounds) using the heuristic of the BirdCLEF 2018 baseline
system [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The resulting sets are called Tsignal (bird sound segments) and Tnoise (noise segments).
Finally, the recordings are normalized to -3 dB and resampled to 22,050 Hz.
        </p>
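        <p>A minimal Python sketch of this pre-processing pipeline; interpreting the -3 dB normalization as peak normalization is an assumption, and the bird/noise classification itself follows the BirdCLEF 2018 baseline heuristic and is not reproduced here:</p>
        <preformat>
import numpy as np
import librosa

TARGET_SR = 22050
SEG_LEN, OVERLAP = 5.0, 0.25  # segment length and overlap in seconds

def preprocess(path):
    """Split a recording into 5 s segments with 0.25 s overlap,
    peak-normalized to -3 dB and resampled to 22,050 Hz."""
    audio, _ = librosa.load(path, sr=TARGET_SR, mono=True)  # resamples on load
    peak = np.abs(audio).max()
    if peak > 0:
        audio = audio * (10 ** (-3 / 20)) / peak  # -3 dB peak normalization
    seg = int(SEG_LEN * TARGET_SR)
    hop = int((SEG_LEN - OVERLAP) * TARGET_SR)
    return [audio[s:s + seg] for s in range(0, len(audio) - seg + 1, hop)]
        </preformat>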
      </sec>
      <sec id="sec-2-1b">
        <title>Data Augmentation</title>
        <p>[Figure 1: spectrograms of a training snippet under (a) no augmentation, (b) pitch augmentation, (c) mask augmentation, (d) noise augmentation, (e) loudness augmentation, (f) all previous augmentations, (g) noise snippets augmentation, and (h) all augmentations.]</p>
        <p>To avoid overfitting and to improve the generalization capabilities of the neural
network model under various recording conditions, the following data augmentation
methods are applied to the audio segments (see Figure 1):
- Pitch augmentation: the pitch is shifted by one to three semitones.
- Mask augmentation: a randomly chosen segment of 0.5 seconds duration is
masked with zeros.
- Noise augmentation: white noise is added to the training snippet.
- Loudness augmentation: the volume of a training snippet is scaled by a
randomly chosen factor within the range [0.25, 4] (0.25 leads to a decrease by a
factor of 4, 4 leads to an increase by a factor of 4).
- Noise snippets augmentation: each training snippet is a randomly weighted
sum of one augmented bird snippet from Tsignal and four noise snippets from
Tnoise.
        </p>
        <p>First, pitch, mask, noise, and loudness augmentation are applied to the
training segments of Tsignal. For this purpose, between one and four augmentation
methods are randomly selected and applied in random order. Second, noise
snippet augmentation is used. Figure 1 visualizes the impact of the different data
augmentation methods on the resulting audio spectrograms; a minimal sketch of
the augmentation pipeline follows.</p>
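        <p>The following Python sketch illustrates the pipeline under stated assumptions: the white-noise scale, the log-uniform sampling of the loudness factor, and the upward pitch-shift direction are not specified in the text and are chosen for illustration.</p>
        <preformat>
import random
import numpy as np
import librosa

SR = 22050  # sampling rate after pre-processing

def pitch_augment(x, sr=SR):
    """Shift pitch by one to three semitones (direction assumed upward)."""
    return librosa.effects.pitch_shift(x, sr=sr, n_steps=random.choice([1, 2, 3]))

def mask_augment(x, sr=SR):
    """Zero out a randomly chosen 0.5 s segment."""
    x = x.copy()
    start = random.randrange(0, len(x) - sr // 2)
    x[start:start + sr // 2] = 0.0
    return x

def noise_augment(x):
    """Add white noise (the scale is an assumption)."""
    return x + np.random.normal(0.0, 0.01, size=len(x))

def loudness_augment(x):
    """Scale the volume by a random factor in [0.25, 4] (log-uniform assumed)."""
    return x * np.exp(np.random.uniform(np.log(0.25), np.log(4.0)))

def augment(signal, noise_pool):
    """One to four augmentations in random order, then a randomly weighted
    sum of the bird snippet and four noise snippets from Tnoise."""
    ops = random.sample([pitch_augment, mask_augment, noise_augment,
                         loudness_augment], k=random.randint(1, 4))
    for op in ops:
        signal = op(signal)
    weights = np.random.rand(5)
    mix = weights[0] * signal
    for w, n in zip(weights[1:], random.sample(noise_pool, 4)):
        mix = mix + w * n
    return mix
        </preformat>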
      </sec>
      <sec id="sec-2-2">
        <title>Neural Network Architecture</title>
        <p>
          A NAS approach based on a memetic algorithm is used to find an optimal
convolutional neural network architecture for the task of bird sound recognition. The
overall network architecture is based on a Gabor wavelet layer directly operating
on the audio input, similar to Zeghidour et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for speech recognition. The
optimal cell structure found by NAS and multiple output heads including
recurrent layers are used to take further advantage of temporal information.
Gabor Wavelet Layer. The extraction of the audio spectrograms is integrated
into the neural network architecture using a Gabor wavelet layer. For this
purpose, a complex 1-D convolution with n filters is applied to the audio input, where
n is the number of frequencies. The weights of the complex kernels are created
using Gabor wavelets. The complex 1-D convolution is followed by applying the
logarithm and a zero centering normalization.
        </p>
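        <p>A numpy sketch of such a layer; the envelope width tied to the center frequency and the log/normalization constants are illustrative assumptions, not values from the paper:</p>
        <preformat>
import numpy as np

def gabor_kernels(n_filters, width, sr, f_min=170.0, f_max=10000.0):
    """Complex Gabor wavelet kernels at mel-scaled center frequencies."""
    def mel(f):
        return 2595 * np.log10(1 + f / 700)
    freqs = 700 * (10 ** (np.linspace(mel(f_min), mel(f_max), n_filters) / 2595) - 1)
    t = (np.arange(width) - width // 2) / sr
    kernels = []
    for f in freqs:
        sigma = 2.0 / f  # envelope width tied to the center frequency (assumption)
        envelope = np.exp(-0.5 * (t / sigma) ** 2)
        kernels.append(envelope * np.exp(2j * np.pi * f * t))  # complex carrier
    return np.stack(kernels)

def gabor_layer(audio, kernels, hop):
    """Complex 1-D convolution, log magnitude, zero-centering normalization."""
    rows = []
    for k in kernels:
        c = np.convolve(audio, k, mode="same")[::hop]  # strided complex conv
        rows.append(np.log(np.abs(c) + 1e-6))          # log of the magnitude
    spec = np.stack(rows)                              # (n_filters, time)
    return spec - spec.mean()                          # zero centering
        </preformat>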
        <p>
          Furthermore, we experimented with audio spectrograms generated by
applying the FFT, as provided by the baseline system of the BirdCLEF 2018 challenge
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. While our experiments on this dataset showed that using the Gabor wavelet
layer did not lead to quality improvements compared to using audio spectrograms
generated by the FFT, the implemented Gabor wavelet layer has some
advantages. First, data augmentation methods can also be flexibly applied to the audio
input. Second, arbitrary frequencies can be extracted, in contrast to the FFT with
its equidistant spacing. This allows us to directly use mel-scaled frequencies.
Neural Architecture Search. NAS is a recent field of research. The aim of
NAS is to automatically find an optimal neural network architecture for a
specific task. The NASNet architecture [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] is the first design that outperformed
handcrafted, manually optimized architectures for the task of image
classification. The main idea is to break down the search space and search for cells that
form the building blocks of the overall network architecture. Other approaches
primarily exhibit runtime improvements [
          <xref ref-type="bibr" rid="ref1 ref10 ref14 ref2 ref9">14, 9, 1, 10, 2</xref>
          ]. While the approaches of
Chen et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and Real et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] use an evolutionary search method, Dong et
al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] start with a huge, over-parameterized network, which is then shrunk and
locally optimized step by step.
        </p>
        <p>
          In the following, we describe a NAS approach to find an optimized
architecture for bird sound recognition. The approach uses a memetic search algorithm
that combines local optimization with an evolutionary algorithm. Like Zoph et
al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], we do not search for full network architectures but instead for relatively
small structures called cells that are later stacked in a predefined manner to build
the full network. NAS approaches are typically categorized according to the
search space, the search strategy, and the performance estimation/evaluation
strategy used, which are described in the following paragraphs.
        </p>
        <p>
          Search Space. As previously described, we search for cell structures that are
later upscaled to the full network architecture. Like the NASNet search space
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], a cell consists of blocks and operations. While a cell in the NASNet search
space consists of a fixed number of blocks and operations, our cell structure is
more flexible: a cell can be composed of a variable number of blocks, and each
block can contain a variable number of operations. The set of allowed operations
is as follows (one way to represent the resulting search space is sketched after the list):
- identity
- depth-wise separable convolution
- normal convolution
- max pooling
- average pooling
        </p>
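        <p>One way to represent this search space in Python; the class and field names are illustrative, not taken from the paper's code:</p>
        <preformat>
from dataclasses import dataclass, field
from typing import List, Tuple

OP_TYPES = ["identity", "sep_conv", "conv", "max_pool", "avg_pool"]
KERNEL_SIZES = [(3, 3), (5, 5), (7, 7)]  # allowed kernel sizes

@dataclass
class Operation:
    op_type: str              # one of OP_TYPES
    kernel: Tuple[int, int]   # one of KERNEL_SIZES (ignored for "identity")
    input_id: str             # output of one of the two previous cells
                              # or of a preceding block

@dataclass
class Block:
    # outputs of the operations are merged with an add layer
    operations: List[Operation] = field(default_factory=list)

@dataclass
class Cell:
    # a variable number of blocks; unused block outputs form the cell output
    blocks: List[Block] = field(default_factory=list)
    fitness: float = 0.0
        </preformat>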
        <p>Kernel sizes for the convolution and pooling operations are restricted to the
following sizes: {3×3, 5×5, 7×7}.</p>
        <p>The outputs of the operations within a block are merged with an add layer
that forms the output of the corresponding block. Therefore, each operation
contains some extra layers to satisfy shape constraints. The outputs of blocks
that have not been used as input to any other operation within the cell are
merged using either an add or a concatenate layer to form the output of the cell.
The input of an operation can be either the output of one of the previous two
cells or the output of a preceding block of the current or previous cell.</p>
        <p>Performance Estimation. For performance estimation, a network architecture
is constructed from the cell. First, a normal and a reduction cell are derived
from the cell structure. While the normal cell retains the spatial size of its
predecessor cell (i.e., every convolution or pooling operation has stride 1×1 and
appropriate padding), the reduction cell reduces the spatial size by a factor of 2
(i.e., convolutions and pooling operations have stride 2×2). Second, the overall
network is stacked, as shown in Figure 2.</p>
        <p>Algorithm 1: Neural architecture search</p>
        <preformat>
Input: initial population P, number of cells to visit r
for cell in P do
    // transform the cell into a full network
    cell.network  := cell-to-cnn(cell, F=8)
    cell.accuracy := train-and-eval(cell.network)
    cell.flops    := compute-flops(cell.network)
end
// compute fitness values of the population members
compute-fitness(P)
for round = 1 to r do
    parents := select-parents(P)
    // perform the crossover operation
    child-cell := create-child(parents)
    // perform mutation operations
    child-cell := mutate-child(child-cell)
    // add the child cell to the population
    P.append(child-cell)
    child-cell.network  := cell-to-cnn(child-cell, F=8)
    child-cell.accuracy := train-and-eval(child-cell.network)
    child-cell.flops    := compute-flops(child-cell.network)
    compute-fitness(P)
end
return max(P, key=lambda cell: cell.fitness)
        </preformat>
        <p>The output head is a global average pooling layer, followed by a densely
connected layer with softmax activation. The convolutions of the first reduction
cell have F filters. The number of filters is then doubled after every reduction
cell.</p>
        <p>To determine the quality (or fitness) of a cell, a full network is derived from
each cell with F = 8 initial filters. Then, the model is trained and evaluated on
a subset of the BirdCLEF 2020 training data set. For this purpose, 300 training
samples are randomly selected for each of 50 randomly chosen classes. The
resulting data set with 15,000 audio samples is split into a training and a validation
set with a validation split of 0.1. Samples of the same audio file are either all in
the training set or all in the validation set. The model is trained for 20 epochs using
the ADAM optimizer and a cosine learning rate scheduler. Finally, the accuracy
of the model is calculated on the validation set.</p>
        <p>Search Strategy. The idea of the search algorithm is to combine an evolutionary
algorithm with local optimization. This memetic search algorithm starts with a
population of cells randomly drawn from the search space.</p>
        <p>[Figure 2: the overall network is stacked from a Gabor wavelet layer followed by alternating reduction and normal cells; normal and recurrent output heads branch off at an intermediate stage and at the end of the network.]</p>
        <p>The fitness of a cell c combines its estimated accuracy with its computational cost,
where c.accuracy is the performance estimation and c.flops is the number of
floating point operations of cell c. acc_mean, acc_std, flops_mean, and flops_std are
the mean and the standard deviation of the validation accuracy and flops,
respectively, over all cells of the current population.</p>
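        <p>A plausible form of the fitness, assuming z-score normalization of accuracy and flops over the current population and a subtractive cost penalty (the exact formula is an assumption, as it is not reproduced here):</p>
        <preformat>
\mathrm{fitness}(c) = \frac{c.\mathrm{accuracy} - \mathrm{acc}_{\mathrm{mean}}}{\mathrm{acc}_{\mathrm{std}}}
                    - \frac{c.\mathrm{flops} - \mathrm{flops}_{\mathrm{mean}}}{\mathrm{flops}_{\mathrm{std}}}
        </preformat>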
        <p>
          The tournament selection method is used to select two cells of the current
population based on their fitness values. Cells with high fitness values have a
higher chance of being chosen. The two selected cells are used to create a new cell
by merging all blocks: considering the order of the blocks, all operations are
joined, so that block i of the new cell contains all operations of the i-th blocks of
the parents, whereby the input connections of the operations are preserved. In a
local optimization step, the most important operations per block are identified,
similar to the approach of Dong et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. For this purpose, a network is derived
from the child cell, trained for two epochs, and shrunk to the most important
operations. A sketch of the selection and crossover steps is given below.
        </p>
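        <p>This sketch reuses the Cell and Block classes from the search-space sketch above; the tournament size is an assumption:</p>
        <preformat>
import random

def tournament_select(population, k=3):
    """Return the fittest of k randomly drawn cells."""
    return max(random.sample(population, k), key=lambda cell: cell.fitness)

def create_child(a, b):
    """Block-wise merge: block i of the child contains all operations of the
    i-th blocks of both parents; input connections are preserved."""
    child_blocks = []
    for i in range(max(len(a.blocks), len(b.blocks))):
        ops = []
        if len(a.blocks) > i:
            ops += a.blocks[i].operations
        if len(b.blocks) > i:
            ops += b.blocks[i].operations
        child_blocks.append(Block(operations=ops))
    return Cell(blocks=child_blocks)
        </preformat>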
        <p>Afterwards, mutation operations are applied to the new cell, for example,
changing operation types, changing input connections, or adding and removing blocks.
Finally, the cell is added to the population, and the fitness values are recomputed.
This procedure is repeated for a certain number of rounds, and the cell with the
best fitness value is returned. The search algorithm is summarized in Algorithm 1.</p>
        <p>Multi-Head Model. In contrast to natural images, audio spectrograms
contain temporal information. It seems reasonable to use this information, since the
individual sounds of a bird's song may follow a certain chronological order. For
this reason, we added output heads with recurrent layers to the network
architecture. Altogether, the final network architecture contains two pairs of output
heads, each pair consisting of one head with and one head without recurrent layers:
one pair at the end of the network and one pair at an intermediate stage. Output
heads without a recurrent layer contain a global average pooling layer followed
by a densely connected layer with softmax activation. Recurrent output heads
contain two consecutive GRU layers with 128 units each, followed by a global
average pooling layer and a densely connected layer with softmax activation.
Each output head has its own loss function in the training phase, whereas for
inference the output heads are merged using a concat layer followed by global
average pooling. Figure 3 shows the structure of the network including all output
heads.</p>
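        <p>A Keras-style sketch of the two head types; the layout of the feature tensor and the axis flattening are assumptions:</p>
        <preformat>
from tensorflow.keras import layers

def normal_head(features, n_classes):
    """Head without recurrence: global average pooling, then softmax."""
    x = layers.GlobalAveragePooling2D()(features)
    return layers.Dense(n_classes, activation="softmax")(x)

def recurrent_head(features, n_classes):
    """Recurrent head: two GRU layers with 128 units over the time axis."""
    # flatten frequency and channel axes per time step
    # (assumes features shaped (batch, time, freq, channels) with static dims)
    x = layers.Reshape((features.shape[1], -1))(features)
    x = layers.GRU(128, return_sequences=True)(x)
    x = layers.GRU(128, return_sequences=True)(x)
    x = layers.GlobalAveragePooling1D()(x)
    return layers.Dense(n_classes, activation="softmax")(x)

# training: each head gets its own softmax log loss;
# inference: the four head outputs are concatenated and averaged.
        </preformat>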
      </sec>
      <sec id="sec-2-3">
        <title>Training Methodology</title>
        <p>
          We trained our models using a weighted loss function consisting of a softmax log
loss for each output head. The loss terms of the intermediate output heads are
weighted by a factor of 0.6, whereas the loss terms of the output heads at the
end of the network are weighted by a factor of 1. The ADAM optimizer [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is
used for the training process with a cosine learning rate scheduler. Furthermore,
scheduled drop path [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], where the drop rate dr is linearly increased to 0.4
throughout the training process, is used as a regularization method. With
scheduled drop path, each operation of the c-th of the network's n cells is dropped
(i.e., its output is set to zero) with probability dr · c/n.
        </p>
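        <p>A sketch of this schedule; the linear ramp and the depth scaling follow the text, while the expectation-preserving rescale is a standard choice not stated here:</p>
        <preformat>
import numpy as np

def drop_prob(epoch, total_epochs, cell_idx, n_cells, max_rate=0.4):
    """Drop rate ramps linearly to max_rate over training and scales with
    cell depth: an operation in the c-th of n cells is dropped with dr * c / n."""
    dr = max_rate * (epoch + 1) / total_epochs   # linear schedule
    return dr * cell_idx / n_cells               # cell_idx in 1..n_cells

def drop_path(x, p, training=True, rng=np.random):
    """Zero an operation's output with probability p during training."""
    if not training or p == 0.0:
        return x
    if p > rng.rand():
        return np.zeros_like(x)
    return x / (1.0 - p)  # rescale to keep the expected activation
        </preformat>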
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>In this section, we present the results of the two official and three post-challenge
submissions.</p>
      <sec id="sec-3-1">
        <title>Evaluation Metrics</title>
        <p>
          The task of the challenge is to identify all audible bird species in 5-second
snippets of continuous soundscape recordings from four different locations. In
the BirdCLEF 2020 challenge [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], the following two metrics are used to measure
the performance of a submitted run:
- Classification Mean Average Precision: cmap = (Σ_{c=1}^{C} AveP(c)) / C,
where C is the total number of classes and AveP(c) is the average precision
of class c.
- Retrieval Mean Average Precision: rmap = (Σ_{x=1}^{X} AveP(x)) / X,
where X is the total number of audio snippets of all test files and AveP(x)
is the average precision of snippet x.
        </p>
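        <p>A minimal numpy sketch of both metrics, assuming a score matrix of shape (snippets, classes) and a binary label matrix of the same shape:</p>
        <preformat>
import numpy as np

def average_precision(scores, labels):
    """AP of one ranked list: mean precision@k over the positive positions."""
    order = np.argsort(-scores)
    hits = labels[order].astype(float)
    if hits.sum() == 0:
        return 0.0
    prec_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((prec_at_k * hits).sum() / hits.sum())

def cmap(scores, labels):
    """One ranked list per class: rank all snippets for each class c."""
    return float(np.mean([average_precision(scores[:, c], labels[:, c])
                          for c in range(scores.shape[1])]))

def rmap(scores, labels):
    """One ranked list per snippet: rank all classes for each snippet x."""
    return float(np.mean([average_precision(scores[x], labels[x])
                          for x in range(scores.shape[0])]))
        </preformat>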
        <p>[Figure 4: cell architecture found by the first search run; operations such as max 3×3, identity, conv 3×3, sep 5×5, and sep 3×3 are merged by add nodes.]</p>
      </sec>
      <sec id="sec-3-1b">
        <title>Dataset</title>
        <p>
The training set consists of more than 70,000 recordings across 960 bird species
classes contributed by the Xeno-canto community. Exactly one foreground bird
species is assigned to each audio recording le. Additionally, metadata such as
recording location, recording date, elevation, and recording quality is provided.</p>
        <p>The validation set contains 12 soundscape files recorded at two different
locations (one in Peru, one in the USA). Each soundscape file has a duration of
10 minutes and is divided into 5-second snippets. For our evaluation, a list of
audible bird species is assigned to each 5-second snippet of the validation data.</p>
        <p>The test set consists of 153 soundscape files of 10 minutes duration recorded
at four different locations (one in Peru, two in the USA, and one in Germany).</p>
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>Two network architectures found by the NAS method described in Section 2.3
are evaluated. Altogether, we submitted five runs: two official runs with the first
architecture and three post-challenge runs with the second architecture.</p>
        <p>The first network architecture was found by running the search algorithm
described in Section 2.3 for 100 rounds. Figure 4 shows the cell architecture of
this network. The overall network was constructed as described in Section 2.3,
with the number of initial filters set to F = 16, and trained for 30 epochs, as
described in Section 2.4. We used 128 mel-scaled frequencies in the range of 170
Hz to 10,000 Hz. The width of the kernel of the Gabor wavelet layer was set to
w = 1024, resulting in spectrograms with a resolution of 256x128. The training
took 65 hours on four Nvidia TITAN XP GPUs.</p>
        <p>The trained network takes about 43 seconds to analyze a 10-minute
soundscape file. However, most of this time (about 40 s) is consumed by the
non-optimized, single-threaded pre-processing step. The inference time of the
network itself is only about 3 s, which corresponds to about 40 snippets per second.</p>
        <p>[Figure 5: connection of two consecutive cells of the second architecture; operations such as max 3×3, sep 7×7, and avg 3×3 are merged by add nodes.]</p>
        <p>The two official runs differ only in the post-processing step.</p>
        <p>Official Run 1. In the first run, the species location lists provided by the
challenge organizers are used to reject impossible species from the model's
predictions. This run obtains a cmap of 12.8% and a rmap of 19.3%; the cmap is 8.6
percentage points higher than that of the second best submission (4.2%) and thus
won the challenge.</p>
        <p>Official Run 2. For the second run, we used species lists per location and
season, derived from the training data, to remove unlikely species from the model's
predictions. This led to a slight decrease in cmap and a slight increase in rmap,
resulting in a cmap value of 12.7% and a rmap value of 19.8%.</p>
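        <p>The post-processing can be sketched as a simple masking step; the function and variable names are illustrative:</p>
        <preformat>
import numpy as np

def reject_unlikely(probs, class_names, allowed_species):
    """Zero out predictions for species not in the location (or
    location-and-season) list before ranking."""
    mask = np.array([name in allowed_species for name in class_names],
                    dtype=probs.dtype)
    return probs * mask
        </preformat>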
        <p>After the challenge, we ran the NAS algorithm for another 50 rounds and found a
new promising cell architecture. The network corresponding to the newly found
cell, with the number of initial filters set to F = 16, was used to submit three
further runs. Figure 5 shows how two consecutive cells of the network are connected.
While Post-Submission Runs 1 and 2 use the species lists per location and season,
Post-Submission Run 3 only applies the species lists per location in the
post-processing step.</p>
        <p>Post-Submission Run 1. The network used in this run was trained in the same
way as the network of Official Run 1, but only for 10 instead of 30 epochs, since
this seemed to be sufficient. This run obtains a cmap of 12.6% and a rmap of 20.3%.</p>
        <p>Post-Submission Run 2. For this run, the width of the kernel of the Gabor
wavelet layer was reduced from w = 1024 to w = 512, resulting in
higher-resolution spectrograms with 428x128 pixels. The network was initialized with the
weights obtained by Post-Submission Run 1 and fine-tuned for another 5 epochs.
This resulted in a small improvement of the cmap and rmap values (cmap =
12.7%, rmap = 20.6%).</p>
        <p>[Table 1: validation and test cmap/rmap scores of the two official runs and the three post-submission runs.]</p>
        <p>Post-Submission Run 3. In contrast to Post-Submission Run 1, this run uses a
different sampling strategy. Instead of sampling from the pre-classified bird sound
snippets (Tsignal), the batches are sampled from the audio files by extracting a
random five-second snippet per file. In this case, the number of training samples
per epoch equals the number of mono-species audio files. Due to the smaller
number of samples per epoch, the network was trained for 100 epochs. This run
obtains a cmap of 13.1% and a rmap of 19.2%, which is the best result of the
challenge in terms of cmap.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>The presented approach won the BirdCLEF 2020 challenge with a cmap of 12.8%
and a rmap of 19.3%. In the post-challenge phase, the scores were further
improved to 13.1% cmap and 20.6% rmap. However, the task of recognizing all
bird species in soundscapes is far from being solved.</p>
      <p>One possible reason could be the discrepancy between training and test data.
The training data consists of sound files of various lengths, and only the most
audible bird is labeled in each sound file. This means that there could be samples
in the training data where multiple birds, or even a bird other than the labeled
one, are audible. This can be misleading in the training process. However, the
task for validation and testing is to identify all audible birds in a snippet, which
is a much harder task than just recognizing the most audible bird in a recording.
A more fine-grained multi-label annotation of the training data could significantly
improve the quality of the bird sound recognition models.</p>
          <p>
            There are several possibilities to improve our approach with the existing
training data. There is room for improvement in the data augmentation step,
which has proven to be very important in previous challenges. For example, we
could apply further data augmentation methods to the audio data or even apply
some augmentation to the spectrograms produced by the Gabor wavelet layer.
Another option is to learn the weights of the complex 1-D convolution kernel
of the Gabor wavelet layer to produce better spectrograms, or to use a
self-supervised pre-training approach to learn a more suitable audio representation
[
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] and adapt it for birdcall identification. Furthermore, it may be beneficial to
run the neural architecture search algorithm for a longer period of time to find a
network that works even better on this task. Finally, the performance could be
improved at the cost of longer training and inference runtimes by scaling up the
network's capacity and using stronger regularization.
          </p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgement</title>
      <p>This work is funded by the Hessian State Ministry for Higher Education,
Research and the Arts (HMWK) (LOEWE Natur 4.0).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , Zhang,
          <string-name>
            <given-names>Q.</given-names>
            ,
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Mu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          :
          <article-title>Reinforced evolutionary neural architecture search (</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Searching for a robust neural architecture in four gpu hours</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on computer vision and pattern recognition</source>
          . pp.
          <volume>1761</volume>
          {
          <issue>1770</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Goeau, H.,
          <string-name>
            <surname>Kahl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deneu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Servajean</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cole</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Picek</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , Ruiz De Castan~eda, R., e, Lorieul,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Champ</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.P.</surname>
          </string-name>
          , Stoter,
          <string-name>
            <given-names>F.R.</given-names>
            ,
            <surname>Dorso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Eggel</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          , Muller, H.:
          <article-title>Overview of lifeclef 2020: a system-oriented evaluation of automated species identi cation and species distribution prediction</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          <year>2020</year>
          ,
          <article-title>CLEF: Conference and Labs of the Evaluation Forum</article-title>
          , Sep.
          <year>2020</year>
          , Thessaloniki,
          <string-name>
            <surname>Greece.</surname>
          </string-name>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kahl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clapp</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hopping</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Goeau, H.,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planque</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of birdclef 2020: Bird sound recognition in complex acoustic environments</article-title>
          .
          <source>In: CLEF task overview</source>
          <year>2020</year>
          ,
          <article-title>CLEF: Conference and Labs of the Evaluation Forum</article-title>
          , Sep.
          <year>2020</year>
          , Thessaloniki,
          <string-name>
            <surname>Greece.</surname>
          </string-name>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kahl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Stoter,
          <string-name>
            <given-names>F.R.</given-names>
            , Goeau, H.,
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Planque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.P.</given-names>
            ,
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          : Overview of BirdCLEF 2019:
          <article-title>Large-Scale Bird Recognition in Soundscapes</article-title>
          . In: Working Notes of CLEF 2019 -
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          . vol.
          <source>CEUR Workshop Proceedings</source>
          , pp.
          <volume>1</volume>
          {
          <issue>9</issue>
          .
          <string-name>
            <surname>CEUR</surname>
          </string-name>
          (Sep
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kahl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilhelm-Stein</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klinck</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kowerko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eibl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Recognizing birds from sound - the 2018 birdclef baseline system</article-title>
          .
          <source>arXiv preprint arXiv:1804</source>
          .
          <volume>07177</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kahl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilhelm-Stein</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klinck</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kowerko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eibl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Recognizing birds from sound - the 2018 birdclef baseline system</article-title>
          .
          <source>arXiv preprint arXiv:1804</source>
          .
          <volume>07177</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
          </string-name>
          , J.:
          <article-title>Adam: A method for stochastic optimization in: Proceedings of international conference on learning representations (</article-title>
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zoph</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hua</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuille</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Progressive neural architecture search</article-title>
          .
          <source>In: Proceedings of the European Conference on Computer Vision (ECCV)</source>
          . pp.
          <volume>19</volume>
          {
          <issue>34</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Real</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>Regularized evolution for image classi er architecture search</article-title>
          33, 4780{
          <fpage>4789</fpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Russakovsky</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satheesh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Ma,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Berg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.C.</given-names>
            ,
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.</surname>
          </string-name>
          :
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          (IJCV)
          <volume>115</volume>
          (
          <issue>3</issue>
          ),
          <volume>211</volume>
          {
          <fpage>252</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baevski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Collobert</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auli</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>wav2vec: Unsupervised pretraining for speech recognition</article-title>
          . arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>05862</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Zeghidour</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Usunier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kokkinos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schatz</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Synnaeve</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dupoux</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Learning lterbanks from raw speech for phone recognition</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Zoph</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasudevan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>Learning transferable architectures for scalable image recognition</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <volume>8697</volume>
          {
          <issue>8710</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>