Learning to Monitor Birdcalls From Weakly-Labeled
Focused Recordings
Jan Schlüter¹

¹ Institute of Computational Perception, Johannes Kepler University Linz, Austria


Abstract
Passive acoustic monitoring can support biodiversity assessments, but requires automated
analysis to be affordable at scale, which in turn requires labeled training data. Obtaining
labeled data for each deployed device or location is expensive. The BirdCLEF 2021 scientific
challenge tasked participants to train models on freely available weakly-labeled recordings
of individual birds from xeno-canto, and apply them to recordings of passive devices. The
ensemble of Convolutional Neural Networks (CNNs) described in this work achieved an F-score
of 0.672 across six recording locations, the twelfth best entry among 816 teams.

Keywords
birdcall identification, soundscapes, domain mismatch, deep learning




1. Introduction
For some species, such as birds, passive acoustic monitoring is an interesting option to
assess the status and trends of populations in an area. Detecting animal vocalizations
and identifying the species in audio recordings is labor-intensive, prompting research for
automated solutions to analyze recordings of monitoring devices. Many such solutions
are based on machine learning, which requires fitting a prediction model to labeled
recordings. Current prediction models are sensitive to recording conditions [1] and benefit
from being constrained to the set of species to be expected. For ideal performance, the
labeled recordings should thus be prepared with the same recording device and at the
same location that is to be monitored later using the model. Since labeling recordings is
labor-intensive, this represents a big hurdle for deployments at new locations.
   For some species, such as birds, there are freely available online databases of audio
recordings, such as the Macaulay Library [2], Tierstimmenarchiv [3], or xeno-canto [4].
The BirdCLEF 2021 scientific challenge [5], a part of the LifeCLEF initiative [6], explored
tapping into this resource for training prediction models for passive acoustic monitoring
of birds. Specifically, participants were provided with 62 874 recordings of 397 bird species
from xeno-canto as well as 20 labeled recordings of passive devices from two locations,
and tasked to detect and identify birds in 5-second intervals on 80 passive recordings

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
jan.schlueter@jku.at (J. Schlüter) · ORCID: 0000-0003-3862-6888 (J. Schlüter)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
Figure 1: Outline of the prediction model architecture: a frontend converts the raw
waveform (amplitude over time) into a spectrogram (pitch over time), a local predictor
produces per-class logits over time, and global pooling over time followed by a sigmoid
yields one probability per class.


from six locations. For all recordings, the geographic location and recording date and
time are known.
   This scenario is highly interesting in practice: if prediction models achieve satisfying
accuracy in this setting, they can be deployed to new locations without requiring matching
training data. Solving this task poses two major challenges:
  A. Most recordings from xeno-canto are focused recordings intended to capture the
     vocalizations of a particular bird, often done with directional microphones. In con-
     trast, the recordings to predict on were done unattended and with omnidirectional
     microphones (sometimes referred to as “soundscapes” [5]). This creates a strong
     domain mismatch between training and test recordings.
  B. The xeno-canto recordings are weakly-labeled: there is no timing information
     regarding bird vocalizations, and only the species intended to be recorded is known
     for sure; other species occurring in the background may be labeled or omitted. In
     contrast, predictions on unattended recordings are to be done in 5-second intervals
     and include all audible species.
In this work, we attempt to tackle the first challenge with preprocessing and data
augmentation, and the second one with multiple-instance learning and a two-level
inference procedure.
   The following section details our prediction models, followed by the training procedure
in Section 3 and inference in Section 4. Section 5 describes our experimental setup and
compares results both of single models and ensembles on a validation set and the official
test set. Section 6 discusses ideas that did not work, and Section 7 concludes the paper.


2. Models
The general model architecture is depicted in Figure 1: From an (arbitrarily long)
monophonic raw audio recording, a frontend computes a spectrogram-like representation.
A Fully-Convolutional Network (FCN) processes this representation into a time series
of logits for every class. When passed through a sigmoid, these would give us local
predictions at every time step. Since we do not have temporally accurate labels to train
these, only recording-wise labels, we apply a global pooling operation (over time) to
obtain a single logit per class. Passed through a sigmoid, these serve as our recording-wise
predictions.
  There are several options for the frontend, the local predictor and the global pooling
operation, which we will look at in the following sections.
2.1. Frontend
The frontend processes monophonic audio recordings of sample rate 32 kHz (this is the
rate the unattended recordings are done at, and all training recordings were resampled
to). It consists of the following operations applied in sequence:

A Short-Time Fourier Transform (STFT) computes a linear spectrogram. For two of
    the predictors, windows are 1486 samples long and start every 457 samples, resulting
    in 70 frames per second. For two pretrained predictors, the window and hop size
    is 1024 and 320, respectively, giving 100 frames per second. In both cases, Hann
    windowing is used and only magnitudes are kept.

A mel filterbank transforms spectrograms to mel spectrograms. Depending on the
    predictor, it has 80 bands from 27.5 Hz to 10 kHz, 80 bands from 27.5 Hz to 15 kHz,
    64 bands from 50 Hz to 8 kHz, or 64 bands from 50 Hz to 14 kHz.

A pointwise nonlinearity compresses magnitude values by passing them through either
     y = log(1 + 10^a x) or y = x^σ(a), where σ(a) = 1/(1 + exp(−a)) denotes the logistic
     sigmoid, and a is initialized to zero and learned by backpropagation.

Denoising: The recordings have very different background noise floors, both due to
    the environment and recording equipment. To help generalization, recordings
    are preprocessed by subtracting the median over time from each frequency band
    (separately for each recording or excerpt). For pretrained models, this step is either
    skipped, or a global offset is added after median subtraction to retain the maximum
    input value (otherwise the value range would not match what the models were
    trained on).

Normalization: To ensure inputs are in a suitable range (even when the magnitude
    compression changes during training), each frequency band is normalized over the
    batch and time dimension with Batch Normalization [7].
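
To make the sequence of operations concrete, the following is a minimal PyTorch sketch
of one frontend configuration (the 70 frames-per-second variant with 80 mel bands and
the learned log compression); the use of torchaudio and all parameter names are our
illustration, not the original code:

```python
import torch
import torchaudio

class Frontend(torch.nn.Module):
    """Waveform batch (batch, samples) -> normalized mel spectrogram."""

    def __init__(self, sr=32000, n_fft=1486, hop=457, n_mels=80,
                 fmin=27.5, fmax=10000.0):
        super().__init__()
        # STFT + mel filterbank in one step; power=1.0 keeps magnitudes
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sr, n_fft=n_fft, hop_length=hop,
            f_min=fmin, f_max=fmax, n_mels=n_mels, power=1.0)
        self.a = torch.nn.Parameter(torch.zeros(()))  # learned compression
        self.bnorm = torch.nn.BatchNorm1d(n_mels)     # per-band normalization

    def forward(self, wav):
        x = self.melspec(wav)                  # (batch, n_mels, frames)
        x = torch.log1p(10 ** self.a * x)      # y = log(1 + 10^a x)
        x = x - x.median(dim=-1, keepdim=True).values  # denoise per band
        return self.bnorm(x)                   # normalize over batch and time
```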

2.2. Local predictions
The purpose of the local predictor is to take the spectrogram computed by the frontend,
and produce 397 time series of logits, one for each bird species. Regarding the input
as an 80 pixel (or 64 pixel) high one-channel image, and the output as a 1 pixel high
397-channel image, the predictor should consist of a series of convolutions and pooling
operations that reduces the image height to 1 pixel and produces 397 channels. We used
four different such local predictors:

Vanilla ConvNet: An 8-layer network described in [8, p.3], trained on 70 frames per
      second, 80 mel bands spectrograms. It has a receptive field of 79×103 (79 fre-
      quency bands, 103 spectrogram frames, about 1.5 seconds), a temporal stride of 9
      spectrogram frames (7.778 predictions per second), and 3.74 million parameters.
      Compared to [8], it uses group normalization [9] with 16 groups instead of batch
      normalization.
ResNet: A 14-layer residual network with pre-activations described in [8, p.4], also
    trained on 70 frames per second, 80 mel bands spectrograms. Its receptive field is
    80 × 119, temporal stride is 9 frames, and it has 3.87 million parameters. Compared
    to [8], it uses group normalization with 16 groups instead of batch normalization,
    and crops the shortcut connections instead of padding the convolutions.

Cnn14: A 16-layer network described in [10], pretrained on AudioSet [11], trained on
     100 frames per second, 64 mel bands spectrograms (from 50 Hz to 8 kHz) with
     log(1 + 10^a x) magnitudes, a initialized to 5. The pretrained network’s global max
    + average pooling and final two layers are replaced with two 1×1 convolutions of 1024
    and 397 channels, respectively (with batch normalization and leaky rectification in
    between, and both convolutions preceded by 50% dropout [12]). It has a receptive
    field of 284×284 (2.84 seconds), stride of 32 frames (3.125 predictions per second),
    and 77.98 million parameters. The size of the receptive field exceeds the number of
    mel bands by far; this is possible because all convolutions are zero-padded.

ResNet38: A 38-layer residual network described in [10], also pretrained on AudioSet,
     trained on 100 frames per second, 64 mel bands spectrograms (from 50 Hz to
    14 kHz). The pretrained network’s final layers are replaced as described for Cnn14.
    Its receptive field is 2997×2997 (30 seconds), stride is 256 frames (one prediction
    every 2.56 seconds), and it has 71.01 million trainable parameters.
We optionally use 8-fold multi-sample dropout [13], implemented by replicating the inputs
and targets before the second to last dropout layer.
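
As an illustration of the head replacement described for Cnn14 and ResNet38, here is a
hypothetical sketch (the 2048 input channels match Cnn14's last convolutional block;
the leaky rectifier slope is an assumption):

```python
import torch.nn as nn

# Replaces the pretrained network's global pooling and final layers: two 1x1
# convolutions with 50% dropout before each, mapping the backbone's feature
# maps to 397 per-class logit maps (height 1 after the backbone's pooling).
head = nn.Sequential(
    nn.Dropout(0.5),
    nn.Conv2d(2048, 1024, kernel_size=1),
    nn.BatchNorm2d(1024),
    nn.LeakyReLU(0.1),
    nn.Dropout(0.5),
    nn.Conv2d(1024, 397, kernel_size=1),
)
```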

2.3. Global pooling
Up to here, the model produces a time series of logits for each class. The final step is to
pool these logits into a single prediction per class for the full recording, such that we
can compute (and minimize) the classification error wrt. the given global labels for the
recording.
  Global average pooling would distribute the gradient of the loss uniformly over all
time steps, training the network to predict a labeled bird everywhere in the recording.
Global max pooling would route the gradient only to the most confident detection of
each species. As a compromise, log-mean-exp pooling with (1/a) log((1/T) Σ_t exp(a x_t))
[14, Eq. 6] allows interpolating between taking the maximum (a → ∞) and the mean (a → 0).
With a = 1, the output depends on the largest few values, which is also where the
gradient is distributed to. This setting was used for most models. Some models updated
a with backpropagation, possibly using a separate a per species (since some species might
vocalize densely, warranting a small a, and others sparsely, requiring a large a).
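
A minimal sketch of this pooling (PyTorch; computing it via logsumexp for numerical
stability is our choice):

```python
import math
import torch

def log_mean_exp_pool(logits: torch.Tensor, a=1.0) -> torch.Tensor:
    """Pool a per-class logit series (batch, classes, time) over time:
    (1/a) log((1/T) sum_t exp(a x_t)). Interpolates between the mean
    (a -> 0) and the maximum (a -> inf); `a` may be a float or a learnable
    tensor broadcastable to (classes,)."""
    T = logits.shape[-1]
    a = torch.as_tensor(a, dtype=logits.dtype, device=logits.device)
    pooled = torch.logsumexp(a.unsqueeze(-1) * logits, dim=-1) - math.log(T)
    return pooled / a
```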


3. Training
The training procedure considers both challenges explained in Section 1: weak labels and
domain mismatch.
3.1. Excerpts
Ideally, the model would be trained on complete xeno-canto recordings – this is the only
way we can be sure all weak labels are correct. If we pick a random excerpt, it is not
guaranteed that all birds annotated to be present in the recording are also audible in
the chosen excerpt. However, the longest recordings are 3 hours, which is impractical.
As in [8], we train on randomly selected 30-second excerpts instead, hoping that most
annotated birds will be audible at least once. Files that are too short are looped to
make up 30 seconds. Validation uses the central 30 seconds of a recording.
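
A sketch of the excerpt selection, assuming mono waveforms as 1-D tensors:

```python
import torch

def random_excerpt(wav: torch.Tensor, sr: int = 32000, seconds: int = 30):
    """Return a random excerpt of the given length; loop too-short files."""
    n = seconds * sr
    if len(wav) < n:
        return wav.repeat(-(-n // len(wav)))[:n]   # loop, then cut to length
    start = torch.randint(0, len(wav) - n + 1, ()).item()
    return wav[start:start + n]
```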

3.2. Augmentation
To help the model generalize, especially in the light of the domain mismatch from focused
training recordings to unattended test recordings, we employ several data augmentation
strategies:

Adding noise: To make the models work under low signal-to-noise ratios (i.e., the con-
     ditions found in the unattended recordings), we mix the training excerpts with
     excerpts from the Chernobyl BiVA [15] and BirdVox-full-night [16] datasets. These
     datasets are precisely annotated with bird occurrences, so we can extract all parts
     void of birds. We started out carefully, but the best setting turned out to be
     mixing every training example with background noise, drawing a value p ∈ [0, 1)
     and scaling the noise with p and the bird recording excerpt with 1 − p (see the
     sketch after this list). For some models, we also set p = 1 with 1% or 0.1%
     probability, setting the labels to all zero in this case. And for some models, the
     chosen noise excerpt is scaled to a maximum absolute value in [0, 1) before mixing it.

Random downmixing: The model is trained on monophonic recordings, but xeno-canto
    usually has stereo recordings.1 We downmix them during training with a uniformly
    random weight p for the left channel, and 1 − p for the right channel.

Filterbank pitch shifting: Some models are trained with pitch shifting, cheaply imple-
      mented by modifying the mel filterbank: Instead of precomputing the filterbank,
      it is constructed on-the-fly, scaling the minimum and maximum frequency by the
      same random factor chosen uniformly between 0.95 and 1.05. A separate filterbank
      is constructed for every example in a minibatch and applied to the minibatch with
      a batched matrix multiplication (see the sketch after this list).

Magnitude warping: Drawing on an idea of Vladislav Kramarenko2 , for some models,
     we scaled a in the log(1 + 10^a x) magnitude transformation by a random factor
     between 0.5 and 1.5 during training, shifting the result such that the maximum
     output value matches that of the unmodified a (to not drastically change the
     value range).

    1
      For BirdCLEF 2021, only downmixed xeno-canto recordings were provided; we partly replaced them
with the original stereo files.
    2
      https://www.kaggle.com/c/birdsong-recognition/discussion/183269
High-frequency damping: Also drawing on an idea of Vladislav Kramarenko, for some
     models, we lowered up to 50% of the high-frequency part of the spectrum by up to
     50% magnitude, with a linear fade to the lowest frequency bin of the spectrum (the
     only bin not damped at all). His reasoning was that in the unattended recordings,
     birds are often farther away than in focused recordings, and high frequencies are
     damped more strongly with distance (which is indeed the case in forests [17]). We
      only apply this damping to models without median subtraction – it is applied before
      compressing magnitudes, so for models with approximately logarithmic magnitudes
      (log(1 + 10^5 x)) and median subtraction its effect would be canceled.
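
To illustrate two of these augmentations, here is a minimal PyTorch sketch of the noise
mixing and the on-the-fly filterbank construction (the function names and the torchaudio
helper are our choices, not necessarily the original implementation):

```python
import torch
import torchaudio

def mix_with_noise(excerpt, noise, p_silence=0.01):
    """Mix a training excerpt with an equally long bird-free noise excerpt:
    p * noise + (1 - p) * excerpt with p drawn uniformly from [0, 1). With
    probability p_silence, returns pure noise; the caller must then zero
    the target labels."""
    if torch.rand(()).item() < p_silence:
        return noise, True
    p = torch.rand(()).item()
    return p * noise + (1 - p) * excerpt, False

def pitch_shift_mel(spec_batch, n_mels=80, fmin=27.5, fmax=10000.0, sr=32000):
    """Filterbank pitch shifting for linear spectrograms (batch, freqs, time):
    scales fmin and fmax per example by a factor from [0.95, 1.05] and applies
    the per-example filterbanks with one batched matrix multiplication."""
    b, n_freqs, _ = spec_batch.shape
    factors = (0.95 + 0.1 * torch.rand(b)).tolist()
    banks = torch.stack([
        torchaudio.functional.melscale_fbanks(
            n_freqs, f_min=fmin * f, f_max=fmax * f,
            n_mels=n_mels, sample_rate=sr)
        for f in factors])                        # (batch, n_freqs, n_mels)
    return banks.transpose(1, 2) @ spec_batch     # (batch, n_mels, time)
```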

3.3. Optimization
To be able to monitor training progress, we split off 10% of the xeno-canto recordings as
a validation set, ensuring that no recordist is part of both the training and validation set
(as they tend to visit the same locations using the same equipment).
Optimization uses ADAM with mini-batches of 16 examples, an initial learning rate of
10^−3, β1 = 0.9, β2 = 0.999, ε = 10^−8, minimizing binary cross-entropy against all species
(for a single model, we reduced targets for background species to 0.6). The validation loss
is computed every 1000 update steps. If it does not improve over the current best value
for 10 such evaluations in a row, the learning rate is reduced to a tenth, and training
is continued. Training is stopped when the learning rate reaches 10^−6. For the two
pretrained models, we tried reducing the learning rate for the pretrained layers to 1% or
10% compared to the novel layers or to freeze the pretrained layers for some time, but it
turned out that using the full learning rate for all the layers from the start works best.
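
The schedule can be sketched as follows (`train_1000_steps` and `evaluate` are
placeholders for the actual training and validation code):

```python
import torch

def fit(model, train_1000_steps, evaluate, lr=1e-3):
    """Plateau-based schedule: validate every 1000 steps, divide the
    learning rate by 10 after 10 evaluations without improvement, stop
    once it reaches 1e-6."""
    opt = torch.optim.Adam(model.parameters(), lr=lr,
                           betas=(0.9, 0.999), eps=1e-8)
    best, bad = float("inf"), 0
    while lr > 1e-6:
        train_1000_steps(opt)
        loss = evaluate()
        if loss < best:
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= 10:
                lr, bad = lr / 10, 0
                for group in opt.param_groups:
                    group["lr"] = lr
    return model
```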


4. Inference
While the model is trained on 30-second excerpts, at test time, it needs to predict the set
of birds for a 10-minute recording in non-overlapping 5-second windows. We perform
this in two steps: establishing a set of species present in the 10-minute recording, and
detecting these species in 5-second windows. This allows tuning a separate threshold for
the second step, to reduce false negatives without impacting false positives too much.

4.1. Species set
The way the model is constructed (Section 2), it can be applied to arbitrarily long
recordings. We could thus apply it directly to a 10-minute recording to obtain a set of
species. However, the employed global pooling method (log-mean-exp pooling) includes a
division by the input length. If a bird appears only in the first two minutes of a recording,
its pooled prediction is lower than if it appeared throughout the ten minutes. We could
thus not distinguish low-confidence detections from high-confidence detections. Our
solution is to apply the model to 30-second windows overlapping by 50% with a threshold
of 0.5 and taking the union of all detections (or, equivalently, taking the maximum over
the prediction windows and applying the threshold afterwards).
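
In code, this first stage reduces to a per-class maximum over the pooled window logits
(a sketch; `window_logits` is assumed to be precomputed by applying the model to each
30-second window):

```python
import torch

def species_set(window_logits: torch.Tensor, threshold: float = 0.5):
    """window_logits: (n_windows, n_classes) pooled logits of 30 s windows
    overlapping by 50%. Returns a boolean mask of species deemed present."""
    return torch.sigmoid(window_logits.max(dim=0).values) >= threshold
```
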
4.2. Windowed detections
Again, the way the model is constructed, we can apply it directly to 5-second windows
even when it was trained on 30-second excerpts. Since log-mean-exp pooling is dependent
on the input length, we need to adjust the threshold to make up for the mismatch. As
we have established a set of bird species in the previous step, we can afford to set the
threshold very low, removing detections that are outside the established set. Optimized
for a single model on the 20 unattended recordings available for training, we found a
threshold of 0.18 to be optimal; when optimizing an ensemble of 18 models, the optimal
threshold was 0.08.
   As an alternative, we can opt to use max pooling, not distinguishing single detections
from repeated detections within a 5-second window and relying on the established set
of species to filter out false positives. For the same 18-model ensemble, the optimal
threshold on the 20 unattended recordings using max pooling is 0.55.
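
A sketch of the second stage, filtering low-threshold window detections by the
established species set:

```python
import torch

def windowed_detections(logits_5s: torch.Tensor, present: torch.Tensor,
                        threshold: float = 0.18):
    """logits_5s: (n_windows, n_classes) pooled logits per 5 s window;
    present: boolean species mask from the first stage. Detections outside
    the established set are removed."""
    return (torch.sigmoid(logits_5s) >= threshold) & present
```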

4.3. Ensembling
To combine results from multiple models, we average their predicted logits directly after
pooling, and apply thresholds afterwards. Applying thresholds first and combining models
by vote counts per species performed worse.
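
A sketch of the combination rule:

```python
import torch

def ensemble_logits(per_model_logits):
    """per_model_logits: list of (n_windows, n_classes) pooled logit tensors,
    one per model. Averaging logits before thresholding worked better than
    majority voting on thresholded detections."""
    return torch.stack(per_model_logits).mean(dim=0)
```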

4.4. Implementation
For improved performance, instead of splitting up the audio recording into overlapping
30-second and non-overlapping 5-second windows and applying the models to each
excerpt, we apply the model to the full 10-minute recording up to the global pooling.
Using information on the model’s receptive field size, padding and stride, we compute a
timestamp for every prediction time step. This way we can extract windows from the
series of logits and pool them as needed. For an 18-model ensemble, inference takes 6
seconds for a 10-minute file on an NVIDIA GTX 1080 Ti. In addition to improving
computational efficiency, this method also limits potential artifacts from zero-padding
the audio excerpts in the two pretrained models.
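
The timestamp computation can be sketched as follows, assuming the standard relation
between a CNN's output index and its input frames:

```python
def prediction_times(n_outputs: int, stride: int, receptive_field: int,
                     padding: int, frames_per_second: float):
    """Center time (in seconds) of each local prediction, given the model's
    total stride, receptive field and zero-padding in spectrogram frames."""
    return [(i * stride + receptive_field / 2 - padding) / frames_per_second
            for i in range(n_outputs)]
```

Windows are then pooled over all local predictions whose timestamps fall into the
respective 30-second or 5-second interval.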


5. Experiments
As mentioned in Section 3.3, we formed a train/validation split, grouped by recordist,
using about 90% of the 62 874 recordings for training and the remainder for validation.
We use the 20 provided unattended recordings (which are labeled in 5-second windows) as
additional validation files, separated into the two recording locations Costa Rica (COR)
and Sapsucker Woods (SSW).
  Evaluation is done in terms of F-score: The numbers of true positives, false positives
and false negatives are determined and added up across all 397 species as well as a
“nocall” class for 5-second windows without any audible bird vocalization. From the total
numbers, precision, recall and F-score are computed.
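
From the pooled totals, this amounts to the standard micro-averaged F-score:

```python
def micro_f1(tp: int, fp: int, fn: int) -> float:
    """F-score from true/false positive and false negative counts pooled
    over all 397 species and the nocall class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```
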
   We trained several combinations of frontend, predictor, global pooling, and data
augmentation and evaluated them both on the xeno-canto recordings left for validation
(using the central 30 seconds only), and the 20 unattended recordings (using the full
inference procedure of Section 4 with a second-level threshold of 0.18). Table 1, columns
“F-score” show the results. The variations were chosen in search of good models for the
challenge, not for evaluating the effect of any particular measure. Thus, we can only
draw limited conclusions from these results:
   • Unsurprisingly, results vary when repeating an experiment (rows 5 and 6 differ by
     up to 0.01 F-score with the same hyperparameters). Comparing single experiments
     with close results is thus meaningless.
   • Comparing the predictors, on the xeno-canto recordings, Cnn14 performs best,
     followed by ResNet38, the small ResNet, and Vanilla ConvNet.
   • On the unattended recordings (from locations COR and SSW), the order is the
     same, except that ResNet38 performs much worse than all others. A possible
     explanation is its large receptive field and large stride, which may make the series
     of logits too inaccurate for 5-second window predictions.
   • None of the augmentations (except for adding noise) has a clear positive or negative
     effect when tested in isolation. It is possible that the augmentations diversify the
     set of models in a way that is helpful for ensembling.
   For the challenge, we formed ensembles of multiple models. An initial ensemble of 5
models was picked by hand (Table 1, column “5-model ensemble”). It performs better
than all Cnn14 models combined, even when adding all small ResNets and all Vanilla
ConvNets, and achieved an F-score of 0.667 on the official challenge data. Finally,
a slightly better ensemble of 18 models was found in three trials of starting from an
all-model ensemble (except ResNet38) and greedily removing randomly chosen single
models until no improvement on SSW and COR combined was observed. It obtained an
F-score of 0.672 on the challenge dataset, the 12th best entry. After observing that the
second inference stage performs better using max pooling (Section 4.2), another three
trials found an 11-model ensemble slightly improving on this with an F-score of 0.676
submitted after the end of the challenge.


6. Failed ideas
For each xeno-canto recording and for each unattended recording (both in the training
and test set), the geographic location is known. It would thus be possible to focus on
predicting those species that are likely to be present at each recording site. Exploring
the data, we saw that most species present in the 20 labeled recordings from COR and
SSW have also been uploaded to xeno-canto within short distance of the recording site,
and only a few species have not been uploaded within 60 km of the site. We tried to make
use of this in three ways:
   1. Computing the set of species uploaded within 60 km of sites COR and SSW results
      in reduced lists of 133 and 89 species, respectively. We filtered the predictions of a
      397-species model using these site-specific lists, but it did not improve results on
      the 20 recordings.
   2. Using the reduced species lists, we trained site-specific prediction models. Both
      for single models and for a 5-model ensemble, this worked worse than the generic
      397-species models.
   3. We weighted xeno-canto recordings by their distance to a particular site during
      training, giving close recordings higher importance in the loss function. We com-
      puted these weights as 1/√(1 + (d/500)^k), where d is the distance in kilometers,
      setting k = 2 for a softer and k = 3 for a harder distance dependence (inspired by
      the shape of a Butterworth filter response). These site-specific models performed
      comparably to unspecific models, not warranting the extra effort.
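
For reference, a direct transcription of this weighting (a hypothetical helper):

```python
import math

def distance_weight(d_km: float, k: int = 2) -> float:
    """Loss weight for a recording d_km kilometers from the target site;
    k = 2 gives a softer, k = 3 a harder distance dependence."""
    return 1.0 / math.sqrt(1.0 + (d_km / 500.0) ** k)
```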


7. Conclusions
We obtained competitive results in the BirdCLEF 2021 challenge following a common
recipe for such competitions: Training a lot of prediction models with varying settings,
then blending them into an ensemble using some means of automatic model selection.
   For practical purposes, we deem the following to be important: (1) Training on long
enough excerpts to increase the chance that the weak labels are correct, (2) including a
form of global pooling in the prediction model that encourages proper credit assignment
to local predictions, (3) augmenting data by adding noise, (4) performing inference on
two timescales to filter short-term detections using long-term information, (5) using
pretrained audio models as a basis.
   While these aspects were followed in our work, each offers room for improvement. For
example, the global log-mean-exp pooling is oblivious to the typical frequency of calls.
Even with a trainable a hyperparameter per species, it is forced to use the same pooling
for different call types. The noise used for augmentation in our work is limited in diversity,
as it stems from only two locations. Extending it requires manual screening of data for
the absence of birds, or a highly accurate bird detector. The pretrained Cnn14 performs
comparably well, but has an overly large receptive field in frequency dimension, and high
computational demands. Another interesting route for future work is to make more use
of long-term temporal structure. The two-stage inference procedure only captures the
assumption that a particular bird (or bird species) will be audible multiple times during
a 10-minute recording, but some bird calls have more complex regularities that could
help in detecting or distinguishing them.


Acknowledgments
First of all, thanks a lot to the organizers of BirdCLEF 2021: Stefan Kahl, Tom Denton,
Holger Klinck, Hervé Glotin, Hervé Goëau, Willem-Pier Vellinga, Robert Planqué and
Alexis Joly. Apart from that, thanks to Paul Primus for suggesting the datasets used
for background noises, and to Khaled Koutini for the idea of using a batched matrix
multiplication to allow different pitch shifts for each minibatch item.
References
 [1] T. Heittola, A. Mesaros, T. Virtanen, Acoustic scene classification in DCASE
     2020 challenge: generalization across devices and low complexity solutions, in:
     Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020
     Workshop (DCASE2020), 2020. URL: https://arxiv.org/abs/2005.14623.
 [2] Macaulay library, 2003. URL: https://www.macaulaylibrary.org, accessed: 2021-07-
     02.
 [3] Tierstimmenarchiv, 1951. URL: https://www.tierstimmenarchiv.de, accessed: 2021-
     07-02.
 [4] xeno-canto, 2005. URL: https://www.xeno-canto.org, accessed: 2021-07-02.
 [5] S. Kahl, T. Denton, H. Klinck, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué,
     A. Joly, Overview of BirdCLEF 2021: Bird call identification in soundscape record-
     ings, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation
     Forum, 2021.
 [6] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean,
     R. Ruiz De Castañeda, I. Bolon, H. Glotin, R. Planqué, W.-P. Vellinga, A. Dorso,
     H. Klinck, T. Denton, I. Eggel, P. Bonnet, H. Müller, Overview of LifeCLEF
     2021: a system-oriented evaluation of automated species identification and species
     distribution prediction, in: Proceedings of the Twelfth International Conference of
     the CLEF Association (CLEF 2021), 2021.
 [7] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training
     by reducing internal covariate shift, in: Proceedings of the 32nd International
     Conference on Machine Learning (ICML), volume 37 of Proceedings of Machine
     Learning Research, PMLR, Lille, France, 2015, pp. 448–456. URL: http://proceedings.
     mlr.press/v37/ioffe15.html.
 [8] J. Schlüter, Bird identification from timestamped, geotagged audio recordings, in:
     Working Notes of CLEF, Avignon, France, 2018.
 [9] Y. Wu, K. He, Group normalization, in: Proceedings of the European Conference
     on Computer Vision (ECCV), 2018.
[10] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, M. D. Plumbley, PANNs: Large-
     scale pretrained audio neural networks for audio pattern recognition, IEEE/ACM
     Transactions on Audio, Speech, and Language Processing 28 (2020) 2880–2894.
     doi:10.1109/TASLP.2020.3030497.
[11] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore,
     M. Plakal, M. Ritter, Audio Set: An ontology and human-labeled dataset for audio
     events, in: 2017 IEEE International Conference on Acoustics, Speech and Signal
     Processing (ICASSP), 2017, pp. 776–780. doi:10.1109/ICASSP.2017.7952261.
[12] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov,
     Improving neural networks by preventing co-adaptation of feature detectors, arXiv
     e-prints abs/1207.0580 (2012). URL: http://arxiv.org/abs/1207.0580.
[13] H. Inoue, Multi-sample dropout for accelerated training and better general-
     ization, CoRR abs/1905.09788 (2019). URL: http://arxiv.org/abs/1905.09788.
     arXiv:1905.09788.
[14] P. O. Pinheiro, R. Collobert, From image-level to pixel-level labeling with convolu-
     tional networks, in: Proceedings of the 28th IEEE Conference on Computer Vision
     and Pattern Recognition (CVPR), Boston, MA, USA, 2015, pp. 1713–1721.
[15] P. Kendrick, L. Barçante, N. Beresford, S. Gashchak, M. Wood, Bird Vocalisation
     Activity (BiVA) database: annotated soundscapes from the Chernobyl Exclusion
     Zone, 2018. URL: https://doi.org/10.5285/be5639e9-75e9-4aa3-afdd-65ba80352591.
     doi:10.5285/be5639e9-75e9-4aa3-afdd-65ba80352591.
[16] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, J. P. Bello, BirdVox-full-
     night: A dataset and benchmark for avian flight call detection, in: 2018 IEEE
     International Conference on Acoustics, Speech and Signal Processing (ICASSP),
     IEEE, 2018, pp. 266–270. URL: https://doi.org/10.1109/ICASSP.2018.8461410.
     doi:10.1109/ICASSP.2018.8461410.
[17] M. Padgham, Reverberation and frequency attenuation in forests—implications
     for acoustic communication in animals, The Journal of the Acoustical Society of
     America 115 (2004) 402–410. URL: https://doi.org/10.1121/1.1629304. doi:10.1121/
     1.1629304.
Table 1
Results for several variants on the validation data (columns “F-score”). The last six
columns show model selections for six ensembles (columns “ensemble”). The last three
rows display ensemble results on the two validation set locations and the challenge test
set. Abbreviations: MSD = multi-sample dropout, med. = median subtraction, tr. pool =
trainable pooling, XC = xeno-canto validation set; noise only and freq. damp are given
in % of items, mag. warp in ±%; ensemble columns: 5m = 5-model, aC = all Cnn14,
aCR = all Cnn14+ResNet, ¬R38 = all except ResNet38, 18m = 18-model, 11m = 11-model.

| row | predictor | MSD | mel bands | max. freq. (kHz) | mag. compr. | med. | tr. pool | noise only | noise scaled | downmix | pitch shift | mag. warp | freq. damp | backgr. target | XC | COR | SSW | 5m | aC | aCR | ¬R38 | 18m | 11m |
|----|----------|---|----|----|-----|---|---|----|---|---|---|----|----|-----|------|------|------|---|---|---|---|---|---|
| 1  | Vanilla  |   | 80 | 10 | pow | x |   |    | x |   |   |    |    | 1   | .640 | .715 | .629 |   |   |   | x | x |   |
| 2  | Vanilla  |   | 80 | 10 | pow | x |   |    | x | x |   |    |    | 1   | .637 | .684 | .615 |   |   |   | x | x | x |
| 3  | Vanilla  |   | 80 | 15 | pow | x |   |    | x | x |   |    |    | 1   | .641 | .703 | .650 | x |   |   | x | x | x |
| 4  | Vanilla  | x | 80 | 15 | pow | x |   |    | x | x |   |    |    | 1   | .658 | .691 | .642 |   |   |   | x | x | x |
| 5  | ResNet   |   | 80 | 10 | pow | x |   |    |   |   |   |    |    | 1   | .655 | .683 | .697 | x |   | x | x | x |   |
| 6  | ResNet   |   | 80 | 10 | pow | x |   |    |   |   |   |    |    | 1   | .647 | .686 | .691 |   |   | x | x | x | x |
| 7  | ResNet   |   | 80 | 10 | pow | x |   |    |   |   |   |    |    | 1   | .656 | .690 | .714 | x |   | x | x | x | x |
| 8  | ResNet   |   | 80 | 15 | pow | x |   |    |   | x |   |    |    | 1   | .661 | .695 | .674 |   |   | x | x | x |   |
| 9  | ResNet   | x | 80 | 15 | pow | x |   |    |   | x |   |    |    | 1   | .666 | .704 | .701 |   |   | x | x | x | x |
| 10 | ResNet   |   | 80 | 15 | pow | x |   |    |   | x | x |    |    | 1   | .622 | .673 | .545 |   |   | x | x | x | x |
| 11 | Cnn14    |   | 64 |  8 | log |   |   |    |   |   |   |    |    | 1   | .676 | .770 | .693 | x | x | x | x | x | x |
| 12 | Cnn14    |   | 64 |  8 | log |   |   | 1  |   |   |   |    |    | 1   | .678 | .736 | .688 | x | x | x | x |   | x |
| 13 | Cnn14    |   | 64 |  8 | log |   |   | 1  |   | x | x |    |    | 1   | .680 | .766 | .673 |   | x | x | x |   |   |
| 14 | Cnn14    |   | 64 |  8 | log |   |   | 1  |   | x | x | 40 |    | 1   | .683 | .704 | .705 |   | x | x | x | x | x |
| 15 | Cnn14    |   | 64 |  8 | log |   |   | 1  |   | x | x |    | 50 | 1   | .684 | .723 | .707 |   | x | x | x | x |   |
| 16 | Cnn14    | x | 64 |  8 | log |   |   | 1  |   | x | x |    |    | 1   | .682 | .771 | .689 |   | x | x | x | x |   |
| 17 | Cnn14    | x | 64 |  8 | log |   |   |    |   | x | x |    |    | 1   | .681 | .806 | .695 |   | x | x | x |   |   |
| 18 | Cnn14    | x | 64 |  8 | log |   | x |    |   |   |   |    |    | 1   | .682 | .772 | .549 |   | x | x | x | x | x |
| 19 | Cnn14    |   | 64 |  8 | log |   |   | 1  | x |   |   |    |    | 1   | .678 | .767 | .646 | x | x | x | x |   |   |
| 20 | Cnn14    |   | 64 |  8 | log |   |   | 1  | x | x | x |    |    | .6  | .675 | .719 | .658 |   | x | x | x | x |   |
| 21 | Cnn14    | x | 64 |  8 | log |   |   | .1 | x | x | x | 20 | 5  | 1   | .683 | .753 | .679 |   | x | x | x |   |   |
| 22 | Cnn14    |   | 64 |  8 | log | x |   | 1  |   | x | x |    |    | 1   | .686 | .778 | .612 |   | x | x | x | x |   |
| 23 | Cnn14    | x | 64 |  8 | log | x |   | 1  |   | x | x | 50 |    | 1   | .684 | .768 | .699 |   | x | x | x | x |   |
| 24 | Cnn14    |   | 64 |  8 | log | x |   | 1  | x | x | x | 50 |    | 1   | .690 | .758 | .627 |   | x | x | x |   |   |
| 25 | ResNet38 |   | 64 | 14 | log |   |   |    |   | x | x |    |    | 1   | .679 | .611 | .553 |   |   |   |   |   |   |
| 26 | ResNet38 | x | 64 | 14 | log |   |   |    |   | x | x |    |    | 1   | .677 | .667 | .561 |   |   |   |   |   |   |
| 27 | ResNet38 |   | 64 | 14 | log | x |   | 1  | x | x | x | 50 |    | 1   | .678 | .610 | .529 |   |   |   |   |   |   |

| F-score   | 5-model | all Cnn14 | all Cnn14+ResNet | all except ResNet38 | 18-model | 11-model |
|-----------|---------|-----------|------------------|---------------------|----------|----------|
| COR       | .802    | .789      | .796             | .800                | .805     | .804     |
| SSW       | .723    | .699      | .717             | .724                | .731     | .732     |
| Challenge | .667    | —         | —                | —                   | .672     | .676     |