<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Audio Based Bird Species Identification using Deep Learning Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elias Sprengel</string-name>
          <email>elias.sprengel@alumni.ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Jaggi</string-name>
          <email>jaggi@inf.ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yannic Kilcher</string-name>
          <email>yannic.kilcher@inf.ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Hofmann</string-name>
          <email>thomas.hofmann@inf.ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eidgenossische Technische Hochschule (ETH) Zurich</institution>
          ,
          <addr-line>Ramistrasse 101, 8092 Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present a new audio classification method for bird species identification. Whereas most approaches apply nearest neighbour matching [6] or decision trees [8] using extracted templates for each bird species, ours draws upon techniques from speech recognition and recent advances in the domain of deep learning. With novel preprocessing and data augmentation methods, we train a convolutional neural network on the biggest publicly available dataset [5]. Our network architecture achieves a mean average precision score of 0.686 when predicting the main species of each sound file and scores 0.555 when background species are used as additional prediction targets. As this performance surpasses current state-of-the-art results, our approach won this year's international BirdCLEF 2016 Recognition Challenge [3,4,1].</p>
      </abstract>
      <kwd-group>
        <kwd>Bird Identification</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>Audio Processing</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Bird Species Recognition</kwd>
        <kwd>Acoustic classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1.1</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <sec id="sec-2-1">
        <title>Motivation</title>
        <p>
          Large-scale, accurate bird recognition is essential for avian biodiversity
conservation. It helps us quantify the impact of land use and land management on bird
species and is fundamental for bird watchers, conservation organizations, park
rangers, ecology consultants, and ornithologists all over the world. Many books
have been published [
          <xref ref-type="bibr" rid="ref10 ref11 ref2">10,2,11</xref>
          ] to help humans determine the correct species, and
dedicated online forums exist where recordings can be shared and discussed [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
Nevertheless, because recordings spanning hundreds of hours need to be
carefully analysed and categorised, large-scale bird identification remains an almost
impossible task to perform manually. It therefore seems natural to look at ways
to automate the process. Unfortunately, a number of challenges have made this
task extremely difficult to tackle. Most prominent are:
– Background noise
– Multiple birds singing at the same time (multi-label)
– Difference between mating calls and songs
– Inter-species variance [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
– Variable length of sound recordings
– Large number of different species
        </p>
        <p>
          Because of these, most systems are developed to deal with only a small
number of species and require a lot of re-training and fine-tuning for each new species.
In this paper, we describe a fully automatic, robust machine learning method
that is able to overcome these issues. We evaluated our method on the biggest
publicly available dataset, which contains over 33'000 recordings of 999 different
species. We achieved a mean average precision (MAP) score of 0.69 and an
accuracy score of 0.58, which is currently the highest recorded score. Consequently, our
approach won the international BirdCLEF 2016 Recognition Challenge [
          <xref ref-type="bibr" rid="ref1 ref3 ref4">3,4,1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Approach</title>
        <p>We use a convolutional neural network with five convolutional and one dense
layer. Every convolutional layer uses a rectified linear (ReLU) activation function and is followed
by a max-pooling layer. For preprocessing, we split the sound file into a
signal part, where bird songs/calls are audible, and a noise part, where no bird is
singing/calling (background noise is still present in these parts). We compute
the spectrograms (Short-Time Fourier Transform) of both parts and split each
spectrogram into equally sized chunks. Each chunk can be seen as the
spectrogram of a short time interval (typically around 3 seconds). As such, we can use
each chunk from the signal part as a unique training/testing sample for our
neural network. A detailed description of every step is provided in the next
chapters.</p>
        <p>Figure 1 and Figure 2 give an overview of our training/testing pipeline.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Feature Generation</title>
      <p>The generation of good input features is vital to the success of the neural
network. There are three main stages. First, we decide which parts of the sound
file correspond to a bird singing/calling (signal parts) and which parts contain
noise or silence (noise parts). Second, we compute the spectrogram for both
the signal and the noise part. Third, we divide the spectrogram of each part into equally
sized chunks. We can then use each chunk from the signal spectrogram as a
unique sample for training/testing and augment it with a chunk from the noise
spectrogram.</p>
      <sec id="sec-3-1">
        <title>Signal/Noise Separation</title>
        <p>To divide the sound file into a signal and a noise part, we first compute the
spectrogram of the whole file. Note that all spectrograms in this paper are
computed in the same way. First the signal is passed through a short-time Fourier
transform (STFT); this is done using a Hanning window function (size 512, 75%
overlap). Then the logarithm of the amplitude of the STFT is taken. However,
the signal/noise separation is the exception to this rule because here, we do
not take the logarithm of the amplitude but instead divide every element by
the maximum value, such that all values end up in the interval [0, 1]. With the
spectrogram at hand, we are now able to look for the signal/noise intervals.</p>
        <p>Fig. 1: Overview of the pipeline for training the neural network. CNN stands for
convolutional neural network. During training, we use a batch size of 16 training
examples per iteration. However, due to memory limitations of the GPU, we
sometimes have to fall back to batches of size 8. [Diagram: feature generation (load
sound file, separate signal/noise, compute spectrogram, split into chunks, store samples),
then network training (additively combine multiple random signal samples of the same
class with multiple random noise samples to create a new sample, apply additional
augmentation (time/pitch shift), and train the CNN in batches of size 16 or 8).]</p>
        <p>[Fig. 2: Overview of the pipeline for testing: feature generation as above, then all
samples corresponding to one sound file are loaded, predictions are obtained from the
neural network, and the predictions are averaged and ranked by probability.]</p>
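        <p>As an illustration, the spectrogram computation described above can be sketched in Python as follows; the window size and overlap follow the text, while the SciPy routine and the small constant added before the logarithm are assumptions rather than our exact implementation:</p>
        <preformat>
# Minimal sketch: log-amplitude spectrogram with a Hanning window of size 512
# and 75% overlap (hop size 128). For the signal/noise separation the
# spectrogram is instead scaled to [0, 1] by its maximum value.
import numpy as np
from scipy.signal import stft

def compute_spectrogram(audio, sr, for_separation=False):
    # audio: 1-D NumPy array, sr: sampling rate in Hz
    _, _, z = stft(audio, fs=sr, window='hann', nperseg=512, noverlap=384)
    magnitude = np.abs(z)                      # shape: (frequency, time)
    if for_separation:
        return magnitude / magnitude.max()     # values in [0, 1]
    return np.log(magnitude + 1e-10)           # log-amplitude (assumed epsilon)
        </preformat>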
        <p>
          For the signal part we follow [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] quite closely. We first select all pixels in
the spectrogram that are three times bigger than the row median and three
times bigger than the column median. Intuitively, this gives us all the important
parts of the spectrograms, because a high amplitude usually corresponds to a
bird singing/calling. We set these pixels to 1 and everything else to 0. We apply
a binary erosion and dilation filter to get rid of the noise and join segments.
Experimentally we found that a 4 by 4 filter produced the best results. We
create a new indicator vector which has as many elements as there are columns
in the spectrogram. The i-th element in this vector is set to 1 if the i-th column
contains at least one 1, otherwise it is set to 0. We smooth the indicator vector
by applying two more binary dilation filters (filter size 4 by 1). Finally we scale
our indicator vector to the length of the original sound file. We can now use it
as a mask to extract the signal part. Figure 3 shows a visual representation of
each step.
        </p>
        <p>For the noise part we follow the same steps, but instead of selecting the pixels
which are three times bigger than the row and column median, we select all pixels
which are 2.5 times bigger than the row and column median. We then proceed as
described above but invert the result at the very end. Note that, by construction
of our algorithm, a single column should never belong to both the signal and the noise
part. On the other hand, it can happen that a column is not part of either the noise
or the signal part because we use different thresholds (3 versus 2.5). This is intended,
as it provides a safety margin for our selection process. The reasoning is that
everything that was not selected as either signal or noise provides almost no
information to the neural network. The bird is either barely audible/distorted
or the sound does not match our concept of background noise very well.</p>
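        <p>The following Python sketch summarizes the mask computation described above; the thresholds and filter sizes follow the text, while the specific SciPy morphology routines and the way the indicator vector is scaled to the audio length are assumptions:</p>
        <preformat>
# Minimal sketch of the signal/noise masks. `spec` is the max-normalised
# spectrogram (frequency x time), `n_samples` the length of the audio signal.
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def column_indicator(spec, threshold):
    # Select pixels larger than `threshold` times both the row median and the
    # column median (3 for the signal mask, 2.5 for the noise mask).
    row_median = np.median(spec, axis=1, keepdims=True)
    col_median = np.median(spec, axis=0, keepdims=True)
    pixels = np.logical_and(spec > threshold * row_median,
                            spec > threshold * col_median)
    # Binary erosion and dilation with a 4x4 filter to remove speckles and
    # join nearby segments.
    pixels = binary_erosion(pixels, structure=np.ones((4, 4)))
    pixels = binary_dilation(pixels, structure=np.ones((4, 4)))
    # Mark every column that contains at least one selected pixel ...
    indicator = pixels.any(axis=0)
    # ... and smooth with two more dilations (filter size 4 by 1).
    indicator = binary_dilation(indicator, structure=np.ones(4))
    indicator = binary_dilation(indicator, structure=np.ones(4))
    return indicator

def to_sample_mask(indicator, n_samples):
    # Scale the column indicator to the length of the original sound file.
    idx = np.linspace(0, len(indicator) - 1, num=n_samples).astype(int)
    return indicator[idx]

# signal_mask = to_sample_mask(column_indicator(spec, 3.0), n_samples)
# noise_mask  = to_sample_mask(~column_indicator(spec, 2.5), n_samples)  # inverted
        </preformat>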
        <p>The signal and noise masks split the sound file into many short intervals. We
simply join these intervals together to form one signal- and one noise-sound-file.
Everything that is not selected is disregarded and not used in any future steps.
The transition marks that occur when two segments are joined together are
usually not audible, because the cuts happen when no bird is calling/singing.
Furthermore, the use of the dilation filters, as described earlier, ensures that we
keep the number of generated intervals to a minimum when applying the masks.
From the two resulting sound files we can now compute a spectrogram for both
the signal and the noise part. Figure 4 shows an example.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Dividing the Spectrograms into Chunks</title>
        <p>[Fig. 3: Visual representation of each separation step: original spectrogram, selected
pixels, selected pixels after erosion, selected pixels after erosion and dilation, selected
columns, selected columns after first dilation, selected columns after second dilation.]</p>
        <p>[Fig. 4: Example of a sound file split into a signal part and a noise part, with an
STFT spectrogram computed for each.]</p>
        <p>As described in the last section, we compute a spectrogram for both the signal
and the noise part of the sound file. Afterwards we split both spectrograms into
chunks of equal size (we use a length of 512). The splitting is done for three
reasons. For one, we need a fixed-size input for our neural network architecture.
We could pad the input, but the large variance in the length of the recordings
would mean that some samples would contain over 99% padding. We could
also try to use varying step sizes in our pooling layers, but this would stretch
or compress the signal in the time dimension. In comparison, chunks allow us
to pad only the last part and keep our step size constant. Second, thanks to
our signal/noise separation method we do not have to deal with the issue of
empty chunks (without a bird calling/singing), which means we can use each
chunk as a unique sample for training/testing. Third, we can let the network
make multiple predictions per sound file (one prediction per chunk) and average
them to generate a final prediction. This makes our predictions more robust and
reliable. As an extension, one could try to merge multiple predictions in a more
sophisticated way but, so far, no extensive testing has been done.</p>
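        <p>A minimal sketch of the chunking step, assuming the spectrogram is a NumPy array of shape (frequency, time); padding the last chunk with zeros is an assumption:</p>
        <preformat>
import numpy as np

def split_into_chunks(spec, chunk_len=512):
    # Split along the time axis into chunks of 512 columns; only the last
    # chunk is padded to the full length.
    chunks = []
    for start in range(0, spec.shape[1], chunk_len):
        chunk = spec[:, start:start + chunk_len]
        pad = chunk_len - chunk.shape[1]
        if pad > 0:
            chunk = np.pad(chunk, ((0, 0), (0, pad)))
        chunks.append(chunk)
    return chunks
        </preformat>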
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Data Augmentation</title>
      <p>Because the number of sound files is quite small compared to the number of
classes (the training set of 24'607 files contains an average of only 25 sound
files per class), we need additional methods to avoid overfitting. Apart from
drop-out, data augmentation was one of the most important ingredients for
improving the generalization performance of the system. We apply four different data
augmentation methods. For an overview of the impact each data
augmentation method has, consult Table 1.</p>
      <p>Time Shift: Every time we present the neural network with a training example, we shift it
in time by a random amount. In terms of the spectrogram this means that we
cut it into two parts and place the second part in front of the first (wrap-around
shift). This creates a sharp discontinuity where the end of the second part meets
the beginning of the first part, but all the information is preserved. With this
augmentation we force the network to deal with irregularities in the spectrogram
and, more importantly, teach the network that bird songs/calls can appear at
any time, independent of the bird species.</p>
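      <p>Both this time shift and the pitch shift described in the next section can be sketched as wrap-around rolls of the spectrogram chunk (a NumPy array of shape (frequency, time)); the exact sampling of the offsets is an assumption:</p>
      <preformat>
import numpy as np

def random_time_shift(chunk, rng=np.random):
    # Cut the spectrogram into two parts and place the second part in front
    # of the first (wrap-around), preserving all information.
    offset = rng.randint(chunk.shape[1])
    return np.roll(chunk, offset, axis=1)

def random_pitch_shift(chunk, max_fraction=0.05, rng=np.random):
    # Vertical wrap-around shift of at most about 5% of the frequency bins.
    max_offset = max(1, int(max_fraction * chunk.shape[0]))
    offset = rng.randint(-max_offset, max_offset + 1)
    return np.roll(chunk, offset, axis=0)
      </preformat>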
      <sec id="sec-4-1">
        <title>Pitch Shift</title>
        <p>
          A review of different augmentation methods [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] showed that pitch shifts
(vertical shifts) also helped to reduce the classification error. We found that,
while a small shift (about 5%) seemed to help, a larger shift was not beneficial.
Again we used a wrap-around method to preserve the complete information.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Combining Same Class Audio Files</title>
        <p>
          We follow [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and add sound files that correspond to the same class. Adding is
a simple process because each sound file can be represented by a single vector. If
one of the sound files is shorter than the other, we repeat the shorter one as many
times as necessary. After adding two sound files, we re-normalize the result
to preserve the original maximum amplitude of the sound files. The operation
mimics the effect of multiple birds (of the same species) singing at the same
time. Adding files improves convergence because the neural network sees more
important patterns at once; we also found a slight increase in the accuracy of
the system (see Table 1).
        </p>
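        <p>A minimal sketch of this combination step; repeating the shorter file and scaling the sum back to the original maximum amplitude is one possible reading of the description above:</p>
        <preformat>
import numpy as np

def combine_same_class(a, b):
    # a, b: 1-D audio signals of the same class
    if len(a) > len(b):
        a, b = b, a                      # make `a` the shorter signal
    reps = int(np.ceil(len(b) / len(a)))
    a = np.tile(a, reps)[:len(b)]        # repeat the shorter file
    combined = a + b
    # Re-normalise to preserve the original maximum amplitude.
    target = max(np.abs(a).max(), np.abs(b).max())
    return combined / np.abs(combined).max() * target
        </preformat>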
      </sec>
      <sec id="sec-4-3">
        <title>Adding Noise</title>
        <p>One of the most important augmentation steps is to add background noise. In
Section 2.1 we described how we split each file into a signal and a noise part.
For every signal sample we can choose an arbitrary noise sample (since the
background noise should be independent of the class label) and add it on top
of the original training sample at hand. As for combining same-class audio files,
this operation should be done in the time domain by adding both sound files
and repeating the shorter one as often as necessary. We can even add multiple
noise samples. In our tests we found that three noise samples added on top of the
signal, each with a dampening factor of 0.4, produced the best results. This means
that, given enough training time, for a single training sample we eventually add
every possible background noise, which decreases the generalization error.</p>
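        <p>The noise augmentation can be sketched in the same fashion; the number of noise samples (three) and the dampening factor (0.4) follow the text, the rest is illustrative:</p>
        <preformat>
import numpy as np

def repeat_to_length(x, length):
    reps = int(np.ceil(length / len(x)))
    return np.tile(x, reps)[:length]

def add_noise(signal, noise_samples, factor=0.4):
    # Add each (randomly chosen) noise sample on top of the signal sample in
    # the time domain, dampened by `factor`.
    augmented = signal.copy()
    for noise in noise_samples:          # typically three noise samples
        augmented = augmented + factor * repeat_to_length(noise, len(signal))
    return augmented
        </preformat>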
        <p>[Figure: Network architecture: the input spectrogram chunk is processed by five
convolutional layers, each followed by max-pooling, then a dense layer with 1024 units
and a softmax layer with 1000 units.]</p>
        <p>We use batches of 8 or 16 training examples. We found that using 16 training
samples per batch produced slightly better results but, due to memory
limitations of the GPU, some models were trained with only 8 samples per batch.
If many samples from the same sound file are present in a single batch, the
performance of the batch normalization function drops considerably. We,
therefore, select the samples for each batch uniformly at random without replacement.
Normalizing the sound files beforehand might be an alternative solution.
We use the Nesterov momentum method to compute the updates for our weights.
The momentum is set to 0.9 and the initial learning rate is equal to 0.1. After 4
days of training (around 100 epochs) we reduce the learning rate to 0.01.</p>
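        <p>For illustration, a network in the spirit of the architecture and training setup described above could look as follows in PyTorch; the channel counts, kernel sizes and the pooling before the dense layer are assumptions rather than our exact configuration:</p>
        <preformat>
import torch
import torch.nn as nn

class BirdNet(nn.Module):
    def __init__(self, n_classes=1000):
        super().__init__()
        layers, in_ch = [], 1
        # Five convolutional layers, each with a ReLU activation followed by
        # max-pooling (channel counts are assumed).
        for out_ch in (32, 64, 128, 256, 256):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(2)]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d((4, 4))    # fixed-size summary (assumed)
        # One dense layer with 1024 units and a softmax output over 1000 classes
        # (the softmax is applied inside the cross-entropy loss).
        self.classifier = nn.Sequential(nn.Flatten(),
                                        nn.Linear(256 * 4 * 4, 1024), nn.ReLU(),
                                        nn.Linear(1024, n_classes))

    def forward(self, x):                           # x: (batch, 1, freq, time)
        return self.classifier(self.pool(self.features(x)))

model = BirdNet()
# Nesterov momentum as described: momentum 0.9, initial learning rate 0.1,
# later reduced to 0.01.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
criterion = nn.CrossEntropyLoss()
        </preformat>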
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>We evaluate our results locally by splitting the original training set into a training
and a validation set. To preserve the original label distribution we group files
by their class id (species) and use 10% of each group for validation and the
remaining 90% for training. Note that, even for our contest submissions, we
never trained on the validation set. Our contest results would probably improve
if training were performed on both the training and the validation set.</p>
      <p>Training the neural network takes a lot of time. We, therefore, chose a
subset of the training set, containing 50 different species, to fine-tune parameters.
This (20 times smaller) dataset enabled us to test over 500 different network
configurations. Our final configuration was then trained on the complete
training set (considering all 999 species) and reached an accuracy score of 0.59 and
a mean average precision (MAP) score of 0.67 on the local validation set (999
species). On the remote test set our best run reached a MAP score of 0.69 when
considering only the main (foreground) species, 0.55 when considering the
background species as well, and 0.08 when only background species were considered.
This means our approach outperformed the next best contestant by 17% in the
category where background species were ignored. Figure 6 shows a visual
comparison of the scores for all participants. As seen in Figure 6, we submitted a total
of four runs. The first run "Cube Run 1" was an early submission where
parameters had not yet been tuned and the model was only trained for a single day.
The second and third runs were almost identical, but "Cube Run 2" was trained
on spectrograms that were resized by 50% while "Cube Run 3" was trained on
the original-sized spectrograms. Both times the model was first trained for 4
days, using the Nesterov momentum method (momentum = 0.9, learning rate
= 0.1), and then trained for one more day with a decreased learning rate of 0.01.
Furthermore, "Cube Run 3" was trained with a batch size of 8 because of the
limited GPU memory, while "Cube Run 2" was able to use batches of size 16
(scaled spectrograms). Finally, "Cube Run 4" was created by simply averaging
the predictions from "Cube Run 2" and "Cube Run 3". We can see that "Cube
Run 4" outperformed all other submissions, which means that an ensemble of
neural networks could increase our score even further.</p>
      <sec id="sec-5-1">
        <title>Discussion</title>
        <p>
          Our approach surpassed state-of-the-art performance when targeting the
dominant foreground species. When background species were taken into account,
other approaches performed almost as well as ours. When no foreground species
was present, one other approach was able to outperform us. This should not
surprise us, considering our data augmentation and preprocessing methods. First of
all, we cut out the noise part, focusing only on the signal part. In theory
this should help our network to focus on the important parts, but in practice we
might disregard less audible background species. Second, we augment our
data by adding background noise from other files on top of the signal part. As
shown in Table 1, the score for identifying background species increases if we
train without this data augmentation technique. That means that, even though we
do not use any data augmentation method when dealing with the test set, the
network is still trained to ignore everything that happens in the background.
One possible solution would be to alter the cost function and target background
species as well. Another solution could be to employ a preprocessing step that
tries to split the original files into differently sized parts, each part containing
only one bird call/song. This is similar to [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which compares single bird calls/songs
instead of complete sound files.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Unsuccessful approaches we tested</title>
        <p>We tested a lot of different ideas and not all of them worked; we briefly list
them in this chapter to give a complete picture.</p>
        <p>Bi-directional LSTM Recurrent Neural Networks: We tried different cost
functions and parameters but were not able to match the performance of our
convolutional neural network.</p>
        <p>Regularization: We tested L1 and L2 regularization of all weights but found
that our generalization error did not decrease. Furthermore, adding these extra
terms made training considerably slower.</p>
        <p>Non-Square-Filters: For the convolutional layers we tried to use non-square
filters because we wanted to treat the time dimension differently from the
frequency dimension. We found, however, that small variations did not change the
performance, while an attempt with a 1D convolution (height of the filter equals the height
of the spectrogram) produced worse results.</p>
        <p>
          Deeper-Networks: We tried to add more layers to our neural network but the
performance dropped after adding the 5th layer. This seems to be a common
problem and many solutions have been proposed, for example, using highway
networks [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. We have not tested any of these proposed solutions but they
might be an important ingredient in an attempt to increase the accuracy of the
system even further.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Outlook</title>
      <p>We already mentioned a few improvements that could be made. One idea is to
use an ensemble of neural networks. Another idea is to modify our cost function
to consider the background species, or to present single bird calls/songs instead of
the currently fixed-size samples. One problem with the current approach is that
longer files, as they generate more chunks, seem more important to the network.
To combat this, we could show the same number of chunks for each class by
repeating chunks from classes with a lower number of chunks / shorter files.
Finally, the dataset provides us with a lot of meta-data: date, time and location,
to name a few. We currently rely only on the sound files, but incorporating
these values could greatly increase our score because we could narrow down the
total number of species which we need to consider. While testing parameters,
we found, for example, that with only 50 different species, we were able to reach
a MAP score of around 0.84 (compared to 0.67, our best score on the validation
dataset). Training models for different regions/species and combining them
using the meta-data therefore seems like a natural extension to the current
approach.</p>
      <p>Acknowledgements. We would like to thank Amazon AWS for providing
us with the computational resources, Ivan Eggel, Hervé Glotin, Hervé Goëau,
Alexis Joly, Henning Müller and Willem-Pier Vellinga for organizing this task,
the Xeno-Canto foundation for nature sounds as well as the French projects
Pl@ntNet (INRIA, CIRAD, Tela Botanica) and SABIOD Mastodons for their
support. Last but not least, E.S. would like to thank Samuel Bryner, Judith
Dittmer, Nico Neureiter and Boris Schröder-Esselbach for their helpful remarks
and guidance throughout this project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. ImageCLEF / LifeCLEF - multimedia retrieval in CLEF. http://www.imageclef.org/node/199, accessed: 2016-05-22
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>De</surname>
            <given-names>Schauensee</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.M.</given-names>
            ,
            <surname>Eisenmann</surname>
          </string-name>
          ,
          <string-name>
            <surname>E.</surname>
          </string-name>
          :
          <article-title>The species of birds of South America and their distribution</article-title>
          .
          <source>Academy of Natural Sciences; dist. by Livingston</source>
          , Narberth, Pa. (
          <year>1966</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Goeau, H.,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planque</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>LifeCLEF bird identification task 2016</article-title>
          . In: CLEF working notes
          <year>2016</year>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Goeau, H.,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spampinato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Champ</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planque</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palazzo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Muller, H.:
          <article-title>LifeCLEF 2016: multimedia life species identification challenges</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          <year>2016</year>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Goeau, H.,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spampinato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planque</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palazzo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Fisher,
          <string-name>
            <surname>B.</surname>
          </string-name>
          , et al.:
          <article-title>LifeCLEF 2015: multimedia life species identification challenges</article-title>
          .
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , pp.
          <volume>462</volume>
          -
          <fpage>483</fpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leveau</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Champ</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buisson</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Shared nearest neighbors match kernel for bird songs identification - LifeCLEF 2015 challenge</article-title>
          .
          <source>In: CLEF 2015</source>
          . vol.
          <volume>1391</volume>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lasseck</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Bird song classification in field recordings: winning solution for NIPS4B 2013 competition</article-title>
          . In
          <source>: Proc. of int. symp. Neural Information Scaled for Bioacoustics</source>
          , sabiod.org/nips4b, joint to NIPS, Nevada. pp.
          <volume>176</volume>
          -
          <issue>181</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lasseck</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Improved automatic bird identification through decision tree based feature selection and bagging</article-title>
          .
          <source>In: Working notes of CLEF 2015 conference (</source>
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Marler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tamura</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Song "dialects" in three populations of white-crowned sparrows</article-title>
          .
          <source>The Condor</source>
          <volume>64</volume>
          (
          <issue>5</issue>
          ),
          <volume>368</volume>
          -
          <fpage>377</fpage>
          (
          <year>1962</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Restall</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodner</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lentino</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <article-title>Birds of northern South America</article-title>
          . Christopher
          <string-name>
            <surname>Helm</surname>
          </string-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ridgely</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tudor</surname>
          </string-name>
          , G.:
          <article-title>Field guide to the songbirds of South America: the passerines</article-title>
          . University of Texas Press (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Schluter, J.,
          <string-name>
            <surname>Grill</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Exploring data augmentation for improved singing voice detection with neural networks</article-title>
          . In: Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR) (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>R.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greff</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Training very deep networks</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <volume>2368</volume>
          -
          <issue>2376</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Takahashi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gygli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfister</surname>
            , B.,
            <surname>Van Gool</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.</surname>
          </string-name>
          :
          <article-title>Deep convolutional neural networks and data augmentation for acoustic event detection</article-title>
          .
          <source>arXiv preprint arXiv:1604.07160</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. Xeno Canto Foundation:
          <article-title>Sharing bird sounds from around the world</article-title>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>