<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Bird Species Identification in Soundscapes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mario Lasseck</string-name>
          <email>Mario.Lasseck@mfn.berlin</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Museum für Naturkunde Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents deep learning techniques for audio-based bird identification in soundscapes. Deep Convolutional Neural Networks are trained to classify 659 species. Different data augmentation techniques are applied to prevent overfitting and to improve model accuracy and generalization. The proposed approach is evaluated in the BirdCLEF 2019 campaign and provides the best system to identify bird species in wildlife monitoring recordings. With an ensemble of different single- and multi-label classification models it obtains a classification mean average precision (c-mAP) of 35.6 % and a retrieval mean average precision (r-mAP) of 74.6 % on the official BirdCLEF test set. In terms of classification precision, single model performance surpasses the previous state-of-the-art by more than 20 %.</p>
      </abstract>
      <kwd-group>
        <kwd>Bird Species Identification</kwd>
        <kwd>Biodiversity Assessment</kwd>
        <kwd>Soundscapes</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Data Augmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        For the LifeCLEF bird identification task participating teams have to identify
different bird species in a large collection of audio recordings. The 2019 edition mainly
focuses on soundscapes. This is a more difficult task compared to previous editions
where species had to be identified mostly in mono-directional recordings with usually
only one prominent species present in the foreground. Soundscapes on the other hand
are recorded in the field, e.g. for wildlife monitoring, not targeting any specific
direction or individual animal. There can be a large number of simultaneously singing
species overlapping in time and frequency, arbitrary background noise depending on
weather conditions and sometimes very distant and faint calls. Identifying as many
species as possible in such a scenario remains challenging but is an important step
towards real-world wildlife monitoring and reliable biodiversity assessment. An
overview and further details about the BirdCLEF task are given in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It is, among others,
part of the LifeCLEF 2019 evaluation campaign [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The approach described in this paper uses neural networks and deep learning. It builds
on the work of previous solutions to the task and combines proven techniques with
new methods for data augmentation and multi-label classification.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Implementation</title>
      <sec id="sec-2-1">
        <title>Data Preparation</title>
        <p>
          All audio recordings are first high pass filtered at a frequency of 2 kHz (Q = 0.707)
and then resampled to 22050 Hz with the Sound eXchange (SoX) v14.4.1 audio
processing tool [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Soundscapes from the validation set are prepared for training by
cutting them into individual files according to their annotations. Starting from the
beginning of a file, whenever the label or set of labels changes, a new audio file is
generated with the corresponding labels. Additionally, a “noise only” file is created
from each soundscape by merging all parts without bird activity via concatenation.
Those files containing only background noise are later used together with other
background recordings for noise augmentation.
        </p>
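<p>The splitting step described above can be sketched as follows. This is an illustrative NumPy sketch, not the author's code: the annotation format (a list of non-overlapping, labeled time intervals) and the function name are assumptions.</p>
<preformat>
```python
import numpy as np

def split_by_annotation(audio, sr, events):
    """Cut a soundscape into labeled chunks wherever the label set changes
    and collect the unlabeled remainder as a 'noise only' signal.
    events: list of (start_s, end_s, labels) tuples, assumed non-overlapping.
    Returns (labeled_chunks, noise)."""
    events = sorted(events, key=lambda ev: (ev[0], ev[1]))
    chunks, noise_parts, cursor = [], [], 0
    for start_s, end_s, labels in events:
        s, e = int(start_s * sr), int(end_s * sr)
        if s > cursor:                        # gap without bird activity
            noise_parts.append(audio[cursor:s])
        chunks.append((audio[s:e], labels))   # one file per label set
        cursor = max(cursor, e)
    if len(audio) > cursor:                   # trailing noise
        noise_parts.append(audio[cursor:])
    noise = np.concatenate(noise_parts) if noise_parts else np.empty(0)
    return chunks, noise
```
</preformat>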
        <p>
          In order to also use the validation set for training, it is split into 8 folds via iterative
stratification for multi-label data [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. As a result, a small part of the validation set can
be used to evaluate model performance while the rest of the set can be added to the
Xeno-Canto [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] training set.
        </p>
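<p>The fold assignment can be sketched as a greedy variant of iterative stratification. The full algorithm is described in [4] and implemented, e.g., in scikit-multilearn; the simplified version below is an assumption-laden sketch that requires every sample to carry at least one label.</p>
<preformat>
```python
import numpy as np

def iterative_stratify(labels, n_folds=8, seed=0):
    """Greedy sketch of iterative stratification for multi-label data:
    repeatedly take the rarest remaining label and deal its samples to the
    folds that still need that label most.
    labels: list of label sets; returns an array with a fold index per sample."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    fold_of = np.full(n, -1)
    all_labels = sorted({l for s in labels for l in s})
    # desired number of examples of each label per fold
    desired = {l: np.full(n_folds, sum(l in s for s in labels) / n_folds)
               for l in all_labels}
    remaining = set(range(n))
    while remaining:
        # pick the rarest label that still has unassigned examples
        counts = {l: sum(l in labels[i] for i in remaining) for l in all_labels}
        lab = min((l for l in all_labels if counts[l] > 0), key=lambda l: counts[l])
        for i in [j for j in remaining if lab in labels[j]]:
            # assign to the fold that needs this label most (noise breaks ties)
            f = int(np.argmax(desired[lab] + rng.uniform(0, 1e-9, n_folds)))
            fold_of[i] = f
            for l in labels[i]:
                desired[l][f] -= 1.0
            remaining.discard(i)
    return fold_of
```
</preformat>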
        <p>
          To allow faster prototyping and to create a more diverse set of models for later
ensembling, different data subsets are formed targeting different numbers of species or
sound classes:
- Data set 1: 78 classes (with 7342 files from the training set)
- Data set 2: 254 classes (with 21542 files from the training set)
- Data set 3: 659 classes (with all 50145 files from the training set)¹
The smallest data set covers all 78 species present in the annotated soundscapes of the
validation set and only contains training files belonging to those classes (not
considering background species). The second data set consists of all files belonging to species
mainly present in the recording locations of the United States. To find out which
species are likely to be recorded in the US, the additionally provided eBird [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] data is
taken into account and all files belonging to a species with a frequency value above
zero for any time of the year are added to the first data set. The third data set finally
covers all 659 species and all available training files. The eBird data is also used to
create a list of unlikely species for both the Colombian and the US recording locations.
For some submissions this list is later used to set predictions of unlikely species to
zero for soundscapes in the test set, depending on their recording location.
¹ 8 files of the Xeno-Canto training set are excluded because they are corrupt or too small.
        </p>
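<p>The eBird-based filtering can be sketched in a few lines. The data layout (one frequency value per period of the year for each species) and both function names are assumptions for illustration, not the provided file format.</p>
<preformat>
```python
import numpy as np

def likely_species(ebird_freq):
    """ebird_freq: mapping species -> sequence of frequency values over the
    year. A species is considered likely for a location if its frequency is
    above zero at any time of the year."""
    return {sp for sp, freq in ebird_freq.items() if max(freq) > 0}

def mask_unlikely(predictions, species, likely):
    """Set prediction columns of species that are unlikely for the
    recording location to zero (the post-processing used for some runs)."""
    mask = np.array([sp in likely for sp in species], dtype=float)
    return predictions * mask
```
</preformat>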
      </sec>
      <sec id="sec-2-2">
        <title>Training Setup</title>
        <p>
          For audio-based bird species identification Deep Convolutional Neural Networks
pretrained on ImageNet [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] are fine-tuned with mel scaled spectrogram images
representing short audio chunks. Models are trained with PyTorch [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] utilizing PySoundFile
and LibROSA [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] python packages for audio file reading and processing. The same
basic pipeline as for the BirdCLEF 2018 task is used for data loading and can be
summarized as follows:
- Extract audio chunk from file with a duration of ca. 5 seconds
- Apply short-time Fourier transform
- Normalize and convert power spectrogram to decibel units (dB) via logarithm
- Convert linear spectrogram to mel spectrogram
- Remove low and high frequencies
- Resize spectrogram to fit input dimension of the network
- Convert grayscale image to RGB image
In each training epoch all training files are processed in random order to extract audio
chunks at random positions. Training is done with a batch size of ca. 100-200
samples using up to 3 GPUs (Nvidia 1080, 1080 Ti, Titan RTX). Categorical cross
entropy [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is used as loss function for single-label classification considering only
foreground species as ground truth targets. Stochastic gradient descent is used as
optimizer with momentum 0.9, weight decay 1e-4 and an initial learning rate of 0.1. The learning
rate is decreased at least once during training by a factor of ca. 10 whenever performance on
the validation set stops improving. If more than one species is assigned to an audio
chunk, as is the case for some validation soundscapes, one label is chosen randomly as
the ground truth target during training. Background species annotated for Xeno-Canto
files are not taken into account.
        </p>
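<p>The spectrogram front end (window size 1536, hop length 360, 310 mel bands, ca. 160-10300 Hz, 100 dB dynamic range; see the Data Augmentation section) can be sketched in plain NumPy. LibROSA's melspectrogram and power_to_db cover the same steps; the re-implementation below is for clarity only, and the exact normalization is an assumption.</p>
<preformat>
```python
import numpy as np

def mel_spectrogram_db(y, sr=22050, n_fft=1536, hop=360, n_mels=310,
                       fmin=160.0, fmax=10300.0, top_db=100.0):
    # short-time Fourier transform -> power spectrogram
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (frames, bins)

    # triangular mel filter bank covering fmin..fmax
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.fft.rfftfreq(n_fft, 1.0 / sr)
    fb = np.zeros((n_mels, len(bins)))
    for i in range(n_mels):
        lo, ctr, hi = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        fb[i] = np.clip(np.minimum((bins - lo) / (ctr - lo),
                                   (hi - bins) / (hi - ctr)), 0.0, None)
    mel_power = power @ fb.T                              # (frames, n_mels)

    # normalize to the maximum and convert to dB with a 100 dB floor
    db = 10.0 * np.log10(np.maximum(mel_power, 1e-10) / max(mel_power.max(), 1e-10))
    return np.maximum(db, -top_db).T                      # (n_mels, frames)
```
</preformat>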
        <p>
          Besides the common single-label classification approach, multi-label classification
models are trained as well, taking advantage of the multi-label annotations available
for the validation soundscapes, where in many cases two or more species are present at
the same time. For soundscapes, the multi-label approach also seems the better suited
classification method since recordings are mostly not focused on a single
target species. Two loss functions are tested for multi-label training. PyTorch’s
MultiLabelSoftMarginLoss creates a criterion that optimizes a multi-label one-versus-all
loss based on max-entropy [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The loss function BCEWithLogitsLoss combines a
sigmoid layer and a binary cross entropy layer [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
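<p>For reference, the loss computed by BCEWithLogitsLoss (a sigmoid folded into the binary cross entropy in a numerically stable form) can be written out explicitly. The NumPy re-implementation below is a sketch for clarity, including the optional positive-class weighting mentioned in the Discussion.</p>
<preformat>
```python
import numpy as np

def bce_with_logits(logits, targets, pos_weight=None):
    """NumPy equivalent of torch.nn.BCEWithLogitsLoss.
    logits, targets: arrays of shape (batch, n_classes); targets in {0, 1}."""
    # stable form: loss = (1 - t) * x + m + log(exp(-m) + exp(-x - m)),  m = max(-x, 0)
    max_val = np.maximum(-logits, 0.0)
    loss = ((1.0 - targets) * logits + max_val
            + np.log(np.exp(-max_val) + np.exp(-logits - max_val)))
    if pos_weight is not None:
        # optional per-class weighting of positive targets (class imbalance)
        loss = loss * (1.0 + (pos_weight - 1.0) * targets)
    return loss.mean()
```
</preformat>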
        <p>For the validation and test sets, audio chunks are extracted successively from each
file, with an overlap of 10 % for validation files during training and 80 % for files in
the test set. Predictions are summarized for each file and time interval by taking the
maximum over all chunks. For most submissions, different models are ensembled by
averaging their predictions for each species after normalizing the entire prediction
matrix to a minimum of 0.0 and a maximum of 1.0. To increase ensemble
performance a little further, in some cases it helped to clip very low and high
prediction values to -7.0 and 10.0, respectively, before normalization.</p>
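<p>The aggregation and ensembling steps above can be sketched as follows; the function names are hypothetical and the epsilon guarding the normalization is an assumption.</p>
<preformat>
```python
import numpy as np

def max_over_chunks(chunk_preds):
    # summarize chunk-level predictions for one file / time interval
    return np.max(np.asarray(chunk_preds, dtype=float), axis=0)

def ensemble_predictions(model_preds, clip=(-7.0, 10.0)):
    """Clip, min-max normalize each model's whole prediction matrix to
    [0, 1], then average across models."""
    normed = []
    for p in model_preds:
        p = np.asarray(p, dtype=float)
        if clip is not None:
            p = np.clip(p, clip[0], clip[1])             # tame extreme values
        p = (p - p.min()) / (p.max() - p.min() + 1e-12)  # whole-matrix min-max
        normed.append(p)
    return np.mean(normed, axis=0)
```
</preformat>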
      </sec>
      <sec id="sec-2-3">
        <title>Data Augmentation</title>
        <p>
          To increase model performance and improve generalization to different recording
conditions and habitats, the most effective data augmentation techniques from the
previous BirdCLEF edition [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] are applied in both time and frequency domain. New
methods are highlighted and explained below. The following methods are applied in
time domain regarding audio chunks:
- Chunk extraction at random position in file
- Duration jitter
- Local time stretching and pitch shifting
- Filter with random transfer function
- Random cyclic shift
- Adding audio chunks from files containing only background noise
- Adding audio chunks from files belonging to the same bird species (single-label)
- Adding audio chunks from files belonging to random bird species (multi-label)
- Random signal amplitude of chunks before summation
- Time interval dropout
A few new methods are added for this year’s challenge to augment individual audio
chunks in time domain before mixing them together:
Local time stretching and pitch shifting in time domain. The audio signal is
divided into segments, each having a randomly chosen duration between 0.5 and 4
seconds. To each segment, time stretching, pitch shifting, or both are applied individually
using the LibROSA library. The time stretching factor is randomly chosen from a
Gaussian distribution with a mean value of 1 and a standard deviation of 0.05. The pitch
is shifted by an offset randomly chosen from a Gaussian distribution with a mean value of
0 and a standard deviation of 25 cents (an eighth of a tone).
        </p>
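<p>A dependency-free skeleton of this segment-wise augmentation is shown below. With LibROSA, stretch_fn and shift_fn would be librosa.effects.time_stretch and librosa.effects.pitch_shift (25 cents = 0.25 semitones); they are injected as parameters here, and the 50 % per-segment chance for each operation is an assumption.</p>
<preformat>
```python
import numpy as np

def local_stretch_and_shift(y, sr, stretch_fn, shift_fn, seed=None):
    """Divide the signal into segments of random duration (0.5-4 s) and
    apply time stretching and/or pitch shifting per segment."""
    rng = np.random.default_rng(seed)
    out, pos = [], 0
    while len(y) > pos:
        seg_len = int(rng.uniform(0.5, 4.0) * sr)   # 0.5-4 s segments
        seg = y[pos:pos + seg_len]
        if rng.random() > 0.5:
            seg = stretch_fn(seg, rng.normal(1.0, 0.05))  # factor ~ N(1, 0.05)
        if rng.random() > 0.5:
            seg = shift_fn(seg, rng.normal(0.0, 0.25))    # offset ~ N(0, 25 cents)
        out.append(seg)
        pos += seg_len
    return np.concatenate(out)
```
</preformat>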
        <p>Filter with random transfer function. With a chance of ca. 20 %, audio chunks are
filtered in time domain using a Butterworth filter design with variable transfer
function. The following filter parameters are chosen randomly:
- Type: lowpass, highpass, bandpass, bandstop
- Order: 1-5
- Cutoff frequency: 1-22049 Hz
For bandpass and bandstop filter types the second (high) cutoff frequency is chosen
between the (low) cutoff frequency + 1 and 22049 Hz (Nyquist frequency - 1).
Depending on filter parameters and audio input, filter stability is not always guaranteed.
To prevent unbounded signals the original input is passed as output if the filter output
contains anything that is not a number between -1.0 and 1.0. Examples of a randomly
filtered audio recording are visualized in Figure 1.</p>
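<p>A sketch of this randomized filter using SciPy's Butterworth design is given below; the exact sampling of the cutoff frequencies and the function name are assumptions.</p>
<preformat>
```python
import numpy as np
from scipy.signal import butter, lfilter

def random_filter(y, sr=22050, chance=0.2, seed=None):
    """With ca. 20 % probability, filter the chunk with a randomly
    parameterized Butterworth filter; if the result is unstable (any value
    outside [-1, 1] or non-finite), fall back to the unfiltered input."""
    rng = np.random.default_rng(seed)
    if rng.random() > chance:
        return y
    btype = rng.choice(["lowpass", "highpass", "bandpass", "bandstop"])
    order = int(rng.integers(1, 6))          # order 1-5
    nyq = sr / 2.0
    lo = rng.uniform(1.0, nyq - 2.0)
    if btype in ("bandpass", "bandstop"):
        hi = rng.uniform(lo + 1.0, nyq - 1.0)
        wn = [lo / nyq, hi / nyq]            # normalized cutoff pair
    else:
        wn = lo / nyq
    b, a = butter(order, wn, btype=btype)
    out = lfilter(b, a, y)
    bad = ~np.isfinite(out) | (np.abs(out) > 1.0)
    return y if bad.any() else out
```
</preformat>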
        <p>
          Mixing random audio chunks for multi-label classification. For multi-label
classification, audio chunks from random files are mixed together and their corresponding
labels added to the target label set during training. Up to four audio chunks are added
with random signal amplitude to the original training sample with conditional
probabilities of 50, 40, 30 and 20 %. A similar technique was originally used by [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] for
image classification and has shown good results for multi-label audio classification as
well [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Here, however, labels are not weighted by signal amplitudes (or influenced
by weighting of the linear combination).
        </p>
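<p>The label-set mixing can be sketched as follows. The amplitude range of the random gain and the function name are assumptions; the conditional probabilities mean that drawing stops at the first chunk that is not added.</p>
<preformat>
```python
import numpy as np

def mix_random_chunks(chunk, labels, pool, probs=(0.5, 0.4, 0.3, 0.2), seed=None):
    """Sequentially add up to four random chunks from `pool` (a list of
    (audio, labels) pairs) with conditional probabilities 50/40/30/20 %.
    Added chunks get a random amplitude; their labels join the target set
    unweighted, unlike mixup's weighted labels."""
    rng = np.random.default_rng(seed)
    mixed, target = chunk.astype(float).copy(), set(labels)
    for p in probs:
        if rng.random() >= p:            # conditional: stop at first failure
            break
        extra, extra_labels = pool[rng.integers(len(pool))]
        gain = rng.uniform(0.1, 1.0)     # random signal amplitude (assumed range)
        n = min(len(mixed), len(extra))
        mixed[:n] += gain * extra[:n]
        target |= set(extra_labels)
    return mixed, target
```
</preformat>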
        <p>
          For background noise augmentation, besides using noise from validation files,
recordings without bird activity of the Bird Audio Detection (BAD) task 2018 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] are used.
The BAD data set is part of the IEEE AASP Challenge on Detection and
Classification of Acoustic Scenes and Events (DCASE) 2018 [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. It consists of audio files
from three separate bird sound monitoring projects each recorded under differing
conditions regarding recording equipment and background sounds.
        </p>
        <p>
          The audio chunk (or sum of chunks) is transformed to frequency domain via
short-time Fourier transform with a window size of 1536 samples and a hop length of 360
samples. Frequencies are mel scaled with low and high frequencies removed resulting
in a spectrogram with 310 mel bands representing a range of approximately 160 to
10300 Hz. Normalization and a logarithm are applied to the power spectrogram, yielding
a dynamic range of approximately 100 dB. The final spectrogram image is resized to
299x299 pixels to fit the input dimension of the InceptionV3 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] network or 224x224
pixels for ResNet [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] models. Resizing is performed with the Python Image Library
fork Pillow using randomly chosen interpolation filters of different qualities. Because
audio chunks are extracted with a random length (e.g. between 4.55 and 5.45 s by
applying a duration jitter of ca. half a second) a global time stretching effect is
obtained after resizing the variable length spectrogram images to a fixed width. Image
resizing is also applied to individual vertical and horizontal spectrogram segments to
accomplish piecewise or local time and frequency stretching (see below and [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for
more details). Since networks are pre-trained on RGB images, the grayscale image is
copied to all three colour channels. Further augmentation is applied in frequency
domain to the spectrogram image during training:
- Global frequency shifting/stretching
- Local time and frequency stretching
- Different interpolation filters for spectrogram resizing
- Colour jitter (brightness, contrast, saturation, hue)
More details on individual augmentation methods and their effect on identification
and detection performance in previous challenges can be found in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
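<p>The resize-to-RGB step can be sketched with Pillow; the 8-bit rescaling, the particular set of interpolation filters, and the function name are assumptions for illustration.</p>
<preformat>
```python
import numpy as np
from PIL import Image

def spectrogram_to_input(spec, size=299, seed=None):
    """Resize a (mel_bands x frames) spectrogram of variable width to the
    fixed network input size using a randomly chosen interpolation filter
    (global time stretching falls out of the variable chunk duration), then
    replicate the grayscale image across three RGB channels."""
    rng = np.random.default_rng(seed)
    filters = [Image.NEAREST, Image.BILINEAR, Image.BICUBIC, Image.LANCZOS]
    lo, hi = spec.min(), spec.max()
    img = Image.fromarray(((spec - lo) / (hi - lo + 1e-12) * 255).astype(np.uint8))
    img = img.resize((size, size), resample=filters[rng.integers(len(filters))])
    gray = np.asarray(img, dtype=np.float32) / 255.0
    return np.stack([gray, gray, gray])       # (3, size, size) RGB input
```
</preformat>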
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>For the first two submitted runs a single model was used to predict the species for
each file and time interval in the test set. All other runs used an ensemble of different
models. The main properties of individual models are listed in Table 2. Selected
results on the official BirdCLEF test set are summarized in Table 3 and further
described in the next section.</p>
      <p>
Run 1. In order to better compare results and to find out if and how much progress
was made on identification performance since last year, the best performing model of
the 2018 BirdCLEF edition [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] was retrained on this year’s data set. All files from
the Xeno-Canto training set but no validation soundscapes were used for training. The
model obtained a classification mAP of 21.3 % and a retrieval mAP of 44.7 % on the
test set.
      </p>
      <p>Run 2. For the second run, a single model (M2) was trained with the main properties
listed in Table 2. Validation soundscapes were also used for training and noise
augmentation. The third generation Inception model M2 used all above mentioned
augmentations except the new time domain methods (filtering and local time stretching
and pitch shifting). For this and all following models, BAD 2018 files were used for
noise augmentation. As a result, it is no longer necessary to segment training files
into signal and noise parts to get background material from the training set for noise
augmentation, as was done in previous challenge editions. This greatly simplifies the
preprocessing step. A c-mAP of 25.9 % and an r-mAP of 69.1 % is obtained on the test set,
resulting in a performance increase of 21.6 % and 54.6 %, respectively, compared to
the previous state-of-the-art (M1). Since M1 and M2 did not use the exact same training
set, and not all new time domain augmentations were applied for the training of M2,
the given progress is only a rough approximation.</p>
      <p>Run 3. For the third run, two models were ensembled: the 2nd run model and a
ResNet-152 model trained on the 254 classes training set. For this and the following runs,
predictions of unlikely species regarding recording location were set to zero.
Run 4. The 4th run used the ensemble of run 3 plus an additional multi-label
classification model (M7) trained on the 78 classes set. This three-model ensemble obtained the
highest retrieval mAP of 74.6 % on the test set.</p>
      <p>Run 5. The ensemble of run 5 consists of all previous models (except M1) plus an
additional 254 classes ResNet-152 model (M6) yielding a higher temporal resolution
of spectrogram image inputs. It mainly differs in the following parameters:
- FFT size: 512 (instead of 1536) samples
- FFT hop length: 256 (instead of 360) samples
- Chunk duration: 2.6 (instead of 5.0) seconds
- Duration jitter: 0.2 (instead of 0.45) seconds
- Number of mel bands: 155 (instead of 310)
- Start frequency: 0 (instead of 160) Hz
- End frequency: 11025 (instead of 10300) Hz
- Local time stretch chance: 40 (instead of 50) %
- Local time stretch factor min.: 0.95 (instead of 0.9)
- Local time stretch factor max.: 1.05 (instead of 1.1)
The run 5 ensemble obtained the highest classification mAP of 35.6 % on the test set.
Runs 6 to 10. Different combinations of the previously mentioned models were used
for runs 6 to 10. Different snapshots of the same model were also included for
ensembling, and two models were trained on different folds of the validation set.
Nevertheless, no further progress on identification performance on the test set was obtained.
For run 9 the same ensemble as for run 8 was used, except that run 9 did not use the eBird
data to set predictions of unlikely species to zero in the post-processing step. This
demonstrates once again that performance can be increased when unlikely birds are
filtered out for a recording location where the species composition is known in
advance.</p>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>The 2019 BirdCLEF edition had a clear focus on identifying birds in soundscapes
originating from real-world wildlife monitoring recordings. Although this was a much
more difficult task compared to previous editions, progress in model performance was
obtained by exploring new augmentation techniques and by combining different
single- and multi-label classification models.</p>
      <p>
        A large performance increase was obtained by adding random background noise
from other and/or similar habitats. A very good source for noise augmentation is the
data set of the DCASE 2018 Bird Audio Detection challenge (E1 vs. E2 in Table 1). It
is easily available and published under the Creative Commons Attribution licence
CC-BY 4.0 [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The BAD recordings cover a wide range of background noise and
atmosphere from a diverse set of different monitoring scenarios and are therefore well
suited to improve model generalization. On the other hand, in cases where the target
monitoring location is known in advance, a model can specifically be designed for a
certain habitat and greatly benefit by using background sounds of this particular
recording location for noise augmentation during training (E3 vs. E4 in Table 1).
      </p>
      <p>With additional methods like filtering audio chunks with a random transfer function
or applying local time stretching and pitch shifting in time domain, identification
performance can be further increased (E4 vs. E5 &amp; E6 vs. E7 in Table 1).
Unfortunately, training takes significantly longer especially if LibROSA’s time stretching and
pitch shifting algorithms are applied very frequently. Due to the longer training time it
was not possible to investigate the individual influence of each method or different
parameter settings on model performance. To save time, those techniques were only
applied to the original training sample (first audio chunk in the mix) and not, or only
with very low chance, to chunks added for augmentation. Both algorithms also seem
to blur the resulting spectrogram even with very subtle use (time stretching factor
close to 1, pitch shifting offset close to 0). Maybe a more efficient implementation
regarding processing time and quality would be a better choice in the future.</p>
      <p>Multi-label training was successfully applied for the 78 classes set (E4 vs. E6 in
Table 1). In contrast to the single-label approach, for multi-label classification the
residual network obtained better results compared to the Inception architecture (3rd vs.
4th column in Table 1). Unfortunately, training with a larger number of classes didn’t
work so well even when passing a weight vector as argument to the
BCEWithLogitsLoss function to compensate for class imbalances. One explanation for this might
be the exponential growth of possible label combinations depending on the number of
individual labels (classes) and the number of labels considered to be possible for a
single audio chunk. Maybe if species combination constraints are known a priori (e.g.
by distinguishing between diurnal and nocturnal birds) and applied to reduce the
number of possible label sets, models can also be trained successfully in a multi-label
fashion for a much larger number of species.</p>
      <p>To reproduce results and to provide a baseline for future BirdCLEF challenges and
further research on bird species identification, source code will be made available at:
www.animalsoundarchive.org/RefSys/BirdCLEF2019.</p>
      <p>Acknowledgments. I would like to thank Stefan Kahl, Alexis Joly, Hervé Goëau,
Willem-Pier Vellinga and Hervé Glotin for organising this task, Xeno-Canto for
providing the training data and the Cornell Lab of Ornithology and Paula Caycedo for
providing and annotating the soundscapes. I also want to thank the Museum für
Naturkunde for supporting my research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kahl</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stöter</surname>
            <given-names>FR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glotin</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planqué</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            <given-names>WP</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joly</surname>
            <given-names>A</given-names>
          </string-name>
          (
          <year>2019</year>
          )
          <article-title>Overview of BirdCLEF 2019: large-scale bird recognition in soundscapes</article-title>
          .
          <source>In: CLEF working notes 2019</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Joly</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goëau</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Botella</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kahl</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Servajean</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glotin</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonnet</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            <given-names>WP</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planqué</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stöter</surname>
            <given-names>FR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            <given-names>H</given-names>
          </string-name>
          (
          <year>2019</year>
          )
          <article-title>Overview of LifeCLEF 2019: Identification of Amazonian Plants, South &amp; North American Birds, and Niche Prediction</article-title>
          .
          <source>In: Proceedings of CLEF 2019</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. SoX Homepage, http://sox.sourceforge.net/, last accessed 2019/06/19
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Sechidis</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsoumakas</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            <given-names>I</given-names>
          </string-name>
          (
          <year>2011</year>
          )
          <article-title>On the Stratification of Multi-Label Data</article-title>
          . In:
          <string-name>
            <surname>Gunopulos</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hofmann</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malerba</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vazirgiannis</surname>
            <given-names>M</given-names>
          </string-name>
          <article-title>(eds) Machine Learning and Knowledge Discovery in Databases</article-title>
          .
          <source>ECML PKDD 2011. Lecture Notes in Computer Science</source>
          , vol
          <volume>6913</volume>
          . Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Xeno-Canto Homepage, https://www.xeno-canto.org/, last accessed 2019/06/19
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. eBird Homepage, https://ebird.org/home, last accessed 2019/06/19
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Deng</surname>
            <given-names>J</given-names>
          </string-name>
          et al. (
          <year>2009</year>
          )
          <article-title>Imagenet: A largescale hierarchical image database</article-title>
          .
          <source>In: IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2009</year>
          . pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Paszke</surname>
            <given-names>A</given-names>
          </string-name>
          et al. (
          <year>2017</year>
          )
          <article-title>Automatic differentiation in PyTorch</article-title>
          . In: NIPS-W
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>McFee</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McVicar</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balke</surname>
            <given-names>S</given-names>
          </string-name>
          et al. (
          <year>2019</year>
          )
          <article-title>librosa/librosa: 0.6.3</article-title>
          . Zenodo. https://doi.org/10.5281/zenodo.2564164
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. https://pytorch.org/docs/stable/nn.html#crossentropyloss, last accessed 2019/06/19
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. https://pytorch.org/docs/stable/nn.html#multilabelsoftmarginloss, last accessed 2019/06/19
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss, last accessed 2019/06/19
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lasseck</surname>
            <given-names>M</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>Audio-based Bird Species Identification with Deep Convolutional Neural Networks</article-title>
          .
          <source>In: Working notes of CLEF 2018</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Zhang</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cisse</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dauphin</surname>
            <given-names>YN</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez-Paz</surname>
            <given-names>D</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>mixup: Beyond empirical risk minimization</article-title>
          .
          <source>In: International Conference on Learning Representations</source>
          ,
          <year>2018</year>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Xu</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mi</surname>
            <given-names>H</given-names>
          </string-name>
          et al. (
          <year>2018</year>
          )
          <article-title>Mixup-Based Acoustic Scene Classification Using Multi-Channel Convolutional Neural Network</article-title>
          . arXiv preprint arXiv:1805.07319
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Stowell</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stylianou</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wood</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pamuła</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glotin</surname>
            <given-names>H</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>Automatic acoustic detection of birds through deep learning: the first Bird Audio Detection challenge</article-title>
          .
          <source>In: Methods in Ecology and Evolution</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. http://dcase.community/challenge2018/task-bird-audio-detection, last accessed 2019/06/19
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Szegedy</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            <given-names>V</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ioffe</surname>
            <given-names>S</given-names>
          </string-name>
          (
          <year>2015</year>
          )
          <article-title>Rethinking the Inception Architecture for Computer Vision</article-title>
          . arXiv preprint arXiv:1512.00567
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>He</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            <given-names>J</given-names>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          .
          <source>In: CVPR</source>
          ,
          <year>2016</year>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Lasseck</surname>
            <given-names>M</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>Acoustic Bird Detection with Deep Convolutional Neural Networks</article-title>
          .
          <source>In: Plumbley MD et al. (eds) Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)</source>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>147</lpage>
          , Tampere University of Technology.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>