<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recognizing bird species in audio recordings using deep convolutional neural networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Karol J. Piczak</string-name>
          <email>K.Piczak@stud.elka.pw.edu.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Electronic Systems, Warsaw University of Technology</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper summarizes a method for purely audio-based bird species recognition through the application of convolutional neural networks. The approach is evaluated in the context of the LifeCLEF 2016 bird identification task - an open challenge conducted on a dataset containing 34 128 audio recordings representing 999 bird species from South America. Three different network architectures and a simple ensemble model are considered for this task, with the ensemble submission achieving a mean average precision of 41.2% (official score) and 52.9% for foreground species.</p>
      </abstract>
      <kwd-group>
        <kwd>bird species identification</kwd>
        <kwd>convolutional neural networks</kwd>
        <kwd>audio classification</kwd>
        <kwd>BirdCLEF 2016</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Reliable systems that would allow for large-scale bird species recognition from
audio recordings could become a very valuable tool for researchers and
governmental agencies interested in ecosystem monitoring and biodiversity preservation.
In contrast to field observations made by expert and hobbyist ornithologists,
automated networks of acoustic sensors [1–4] are not limited by environmental
and physiological factors, tirelessly delivering vast amounts of data far surpassing
human resources available for manual analysis.</p>
      <p>
        Over the years, there have been numerous efforts to develop and evaluate
methods of automatic bird species recognition based on auditory data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Unfortunately, with more than 500 species in the EU itself [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and over 10 000
worldwide [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], most experiments and competitions in this area seemed rather
limited when compared to the scope of real-world problems. The NIPS 2013
multi-label bird species classification challenge [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] encompassed 87 sound classes,
whereas the ICML 2013 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and MLSP 2013 [10] counterparts were even more
constrained (35 and 19 species respectively).
      </p>
      <p>The annual BirdCLEF challenge [11], part of the LifeCLEF lab [12] organized
by the Conference and Labs of the Evaluation Forum, vastly expanded on this topic
by evaluating competing approaches on a real-world sized dataset comprising
audio recordings of 501 (BirdCLEF 2014) and 999 bird species from South
America (BirdCLEF 2015-2016). The richness of this dataset, built from field
recordings gathered through the Xeno-canto project [13], provides a benchmark
which is much closer to actual practical applications.</p>
      <p>Past BirdCLEF submissions have evaluated a plethora of techniques based
on statistical features and template matching [14, 15], mel-frequency cepstral
coefficients (MFCC) [16, 17] and spectral features [18], unsupervised feature
learning [19–21], as well as deep neural networks with MFCC features [22].
However, to the best of the author's knowledge, neural networks with convolutional
architectures have not yet been applied in the context of bird species
identification, apart from visual recognition tasks [23]. Therefore, the goal of this work is
to verify whether an approach utilizing deep convolutional neural networks for
classification could be suitable for analyzing audio recordings of singing birds.</p>
      <p>Bird identification with deep convolutional neural networks</p>
      <sec id="sec-1-1">
        <title>Data pre-processing</title>
        <p>The BirdCLEF 2016 dataset consists of three parts. In the training set, there are
24 607 audio recordings with a duration varying between less than a second and
up to 45 minutes. The training set was annotated with a single encoded label
for the main species and potentially with a less uniform list of additional species
which are most prominently present in the background. The main part of the
evaluation set has been left unchanged when compared to BirdCLEF 2015 - 8 596
test recordings (1 second to 11 minutes each) of a dominant species with others
in the background. The new part of the 2016 challenge comprises 925 soundscape
recordings (MP3 files, mostly 10 minutes long) that are not targeting a specific
dominant species and may contain an arbitrary number of singing birds.</p>
        <p>The approach presented in this paper concentrated solely on evaluating
single-label classifiers suitable for recognition of the foreground (main) species present in
the recording. At the beginning, all recordings were converted to a unified WAV
format (44 100 Hz, 16 bit, mono) from which mel-scaled power spectrograms were
computed using the librosa [24] package with an FFT window length of 2048 samples,
a hop length of 512 and 200 mel bands (HTK formula) with a maximum frequency cap at
16 kHz. Perceptual weighting using peak power as reference was performed on all
spectrograms. Subsequently, all spectrograms were processed and normalized with
some simple scaling and thresholding to enhance foreground elements. The 25 lowest
and 5 highest bands were discarded, leaving 170 bands. Additionally, total variation denoising was
applied with a weight of 0.1 to achieve further smoothing of the spectrograms
(the implementation of Chambolle's algorithm [25] provided by scikit-image [26]
was used for this purpose). An example of the results of this processing pipeline
can be seen in Figure 1.</p>
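        <p>The figures quoted above fix the geometry of the spectrograms. The following is only a small arithmetic sketch of that geometry, not the author's code; the constant and function names are hypothetical, while the values are the ones given in the text.</p>
        <preformat><![CDATA[
```python
# Hypothetical names; the numeric values are those quoted in the text.
SAMPLE_RATE = 44_100   # Hz, after conversion to the unified WAV format
HOP_LENGTH = 512       # samples between consecutive spectrogram frames
N_MELS = 200           # mel bands computed per frame
DROPPED_LOW = 25       # lowest bands discarded
DROPPED_HIGH = 5       # highest bands discarded


def bands_kept(n_mels, dropped_low, dropped_high):
    """Number of mel bands that remain after discarding both ends."""
    return n_mels - dropped_low - dropped_high


def hop_duration_ms(hop_length, sample_rate):
    """Time step between two consecutive frames, in milliseconds."""
    return 1000.0 * hop_length / sample_rate


kept = bands_kept(N_MELS, DROPPED_LOW, DROPPED_HIGH)
hop_ms = hop_duration_ms(HOP_LENGTH, SAMPLE_RATE)
print(kept, round(hop_ms, 2))
```
]]></preformat>
        <p>With these values each spectrogram keeps 170 bands and advances by roughly 11.6 ms per frame.</p>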
        <p>80% of training recordings were randomly chosen for network learning, while
20% of the dataset was set aside for local validation purposes. Each recording was
then split into shorter segments with percentile thresholding in order to discard
silent parts. As a final outcome of this process, 85 712 segments of varying length
were created for training - each labeled with a single target species. In order to
accommodate a fixed input size expectation of most network architectures, all the
segments were adjusted on-the-fly during training by either trimming or padding
so as to achieve a desired segment length of 430 frames (5 seconds). This also
allowed for some significant data augmentation - shorter segments were inserted
with a random offset and padded with -1 values, while longer segments were trimmed
at random points to get a 5-second-long excerpt. Finally, the input vectors were
standardized.</p>
        <p>Fig. 1: Example of the pre-processing results for LIFECLEF2014_BIRDAMAZON_XC_WAV_RN1.wav / ruficapillus (image not recovered)</p>
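        <p>The on-the-fly trimming and padding described above could look roughly like the sketch below. It is an illustration under stated assumptions rather than the original implementation: a segment is represented as a plain list of rows (one row per mel band), and the function name and its parameters are hypothetical. Only the target length of 430 frames and the padding value of -1 come from the text.</p>
        <preformat><![CDATA[
```python
import random

SEGMENT_FRAMES = 430   # target length quoted in the text (about 5 seconds)
PAD_VALUE = -1.0       # padding value quoted in the text


def fit_segment(segment, target=SEGMENT_FRAMES, pad_value=PAD_VALUE, rng=random):
    """Return a copy of `segment` with exactly `target` frames per row.

    `segment` is a list of rows (one per mel band), all of equal length.
    Longer segments are cropped at a random position; shorter ones are
    placed at a random offset and padded on both sides with `pad_value`.
    """
    n_frames = len(segment[0])
    if n_frames >= target:
        start = rng.randint(0, n_frames - target)      # inclusive bounds
        return [list(row[start:start + target]) for row in segment]
    offset = rng.randint(0, target - n_frames)
    tail = target - n_frames - offset
    return [[pad_value] * offset + list(row) + [pad_value] * tail
            for row in segment]


# Usage: a short 3-band segment of 100 frames becomes 430 frames long.
example = fit_segment([[0.5] * 100 for _ in range(3)], rng=random.Random(0))
print(len(example), len(example[0]))
```
]]></preformat>
        <p>Because the offset is drawn anew on every call, the same source segment yields a differently positioned input each time it is seen, which is the augmentation effect mentioned in the text.</p>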
      </sec>
      <sec id="sec-1-2">
        <title>Network architectures</title>
        <p>Numerous convolutional architectures loosely based on the author's previous
work in environmental sound classification [27] were evaluated, with 3 models
being chosen for final submissions (schematically compared in Table 1). All the
models were implemented using the Keras Deep Learning library [28]. Each
architecture processed input segments of spectrograms (170 bands × 430 frames)
into a softmax output of 999 units (one-hot encoding all the target species in the
dataset) providing a probability prediction of the dominant species present in
the analyzed segment. Final prediction for a given audio recording was computed
by averaging the decisions made across all segments of a single file. The
multi-label character of the evaluation data was simplistically addressed in the final
submission by providing a ranked list of the most probable dominant species
encountered for each file, thresholded at a probability of 1%.</p>
        <p>Table 1 legend: DROP - dropout, CONV-N - convolutional layer with N filters of given size, LReLU -
Leaky Rectified Linear Units, M-P - max-pooling with pooling size (and stride size),
FC - fully connected layer, PReLU - Parametric Rectified Linear Units, SOFTMAX -
output softmax layer</p>
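        <p>The file-level decision rule (average the per-segment softmax outputs, then keep a ranked list above the 1% threshold) can be sketched as follows. The function name is hypothetical, and treating the threshold as inclusive is an assumption, since the text does not say how ties at exactly 1% were handled.</p>
        <preformat><![CDATA[
```python
def rank_species(segment_probs, threshold=0.01):
    """Average per-segment class probabilities and rank the survivors.

    `segment_probs` holds one probability vector per segment of a file.
    Returns (class_index, mean_probability) pairs, most probable first,
    keeping only classes whose mean probability reaches `threshold`
    (inclusive comparison is an assumption of this sketch).
    """
    n_segments = len(segment_probs)
    n_classes = len(segment_probs[0])
    mean = [sum(seg[c] for seg in segment_probs) / n_segments
            for c in range(n_classes)]
    kept = [(c, p) for c, p in enumerate(mean) if p >= threshold]
    return sorted(kept, key=lambda cp: cp[1], reverse=True)


# Usage with a toy 3-class problem and two segments from one file.
ranking = rank_species([[0.7, 0.295, 0.005],
                        [0.5, 0.495, 0.005]])
print([c for c, _ in ranking])
```
]]></preformat>
        <p>In the toy example the third class averages 0.5% and is therefore dropped from the ranked list.</p>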
        <sec id="sec-1-2-1">
          <title>Run 1 - Submission-14.txt</title>
          <p>This model was inspired by recent work of Phan et al. [29] which considered
shallow architectures with 1-Max pooling. The main idea here is to use a single
convolutional layer with numerous filters that would allow learning specialized
templates of sound events, and then to use their maximum activation value
throughout the whole time span of the recording.</p>
          <p>The actual model consists of a single convolutional layer of 600 rectangular
filters (170 × 5) with LeakyReLUs (rectifier activation with a small non-active
gradient, α = 0.3) and dropout probability of 5%. The activation values are then
1-max pooled (pooling size of 1 × 426) into a chain of 600 single scalar values
representing the maximum activation of each learned filter over the entire input
segment. Further processing is achieved through a fully connected layer of 3 000
units with dropout probability of 30% and Parametric ReLU [30] activations. The
output softmax layer (999 fully connected units) also has a dropout probability
of 30%. All layer weights are initialized with a uniform scaled distribution [30]
(denoted in Keras by he_uniform) with biases of the initial layer set to 1.</p>
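          <p>To make the two non-standard ingredients of this model concrete, the sketch below shows a leaky rectifier with α = 0.3 and 1-max pooling over time in plain Python. It is an illustration, not the Keras implementation used in the experiments, and all function names are hypothetical. The helper for the output length assumes an unpadded convolution with stride 1, which is consistent with the 1 × 426 pooling size quoted above for 430-frame inputs and filters 5 frames wide.</p>
          <preformat><![CDATA[
```python
def leaky_relu(x, alpha=0.3):
    """Leaky rectifier: identity for positive inputs, slope `alpha` otherwise."""
    return x if x > 0 else alpha * x


def one_max_pool(activation_maps):
    """Reduce each filter's activations over time to their single maximum.

    `activation_maps` holds one sequence of activations per filter.
    """
    return [max(activations) for activations in activation_maps]


def valid_output_length(input_length, filter_length):
    """Output length of an unpadded, stride-1 convolution (an assumption)."""
    return input_length - filter_length + 1


# Usage: two toy filters, raw responses passed through the rectifier
# and then collapsed to one value per filter.
raw = [[-1.0, 2.0, 0.5], [-0.4, -2.0, -0.1]]
pooled = one_max_pool([[leaky_relu(v) for v in row] for row in raw])
print(pooled, valid_output_length(430, 5))
```
]]></preformat>
          <p>With 600 filters, the pooled vector has 600 entries regardless of where in the segment each template matched best.</p>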
        </sec>
        <sec id="sec-1-2-2">
          <title>Run 2 - Submission-6.txt</title>
          <p>This submission was based on a model with 4 convolutional layers and some
small regularization:</p>
          <list list-type="bullet">
            <list-item><p>Convolutional layer of 80 filters (167 × 6) with L1 regularization of 0.001 and LeakyReLU (α = 0.3) activation,</p></list-item>
            <list-item><p>Max-pooling layer with 4 × 6 pooling size and stride size of 1 × 3,</p></list-item>
            <list-item><p>Convolutional layer of 160 filters (1 × 2) with L2 regularization of 0.001 and LeakyReLU (α = 0.3) activation,</p></list-item>
            <list-item><p>Max-pooling layer with 1 × 2 pooling size and stride size of 1 × 2,</p></list-item>
            <list-item><p>Convolutional layer of 240 filters (1 × 2) with L2 regularization of 0.001 and LeakyReLU (α = 0.3) activation,</p></list-item>
            <list-item><p>Max-pooling layer with 1 × 2 pooling size and stride size of 1 × 2,</p></list-item>
            <list-item><p>Convolutional layer of 320 filters (1 × 2) with L2 regularization of 0.001 and LeakyReLU (α = 0.3) activation,</p></list-item>
            <list-item><p>Max-pooling layer with 1 × 2 pooling size and stride size of 1 × 2,</p></list-item>
            <list-item><p>Output softmax layer (999 units) with dropout probability of 50% and L2 regularization of 0.001.</p></list-item>
          </list>
          <p>Weight initializations are performed in the same manner as already described. The
smaller vertical size of filters in the first layer allows for some minor invariance
in the frequency domain. No further dense (fully connected) layers are utilized
between the output layer and the last convolutional layer.</p>
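          <p>One way to check how the layer sizes listed for this run fit together is to trace the feature-map shape through the stack. The sketch below does this under assumptions that the text does not state explicitly: convolutions are unpadded with stride 1, pooling discards incomplete windows, and sizes are given as (frequency × time). The function names are hypothetical.</p>
          <preformat><![CDATA[
```python
def conv_out(size, kernel):
    """Output size of an unpadded, stride-1 convolution (assumption)."""
    return size - kernel + 1


def pool_out(size, pool, stride):
    """Output size of max-pooling that drops incomplete windows (assumption)."""
    return (size - pool) // stride + 1


# (kind, (height, width), stride) for the layers listed for Run 2.
RUN2_LAYERS = [
    ("conv", (167, 6), None),
    ("pool", (4, 6), (1, 3)),
    ("conv", (1, 2), None),
    ("pool", (1, 2), (1, 2)),
    ("conv", (1, 2), None),
    ("pool", (1, 2), (1, 2)),
    ("conv", (1, 2), None),
    ("pool", (1, 2), (1, 2)),
]


def trace_shapes(layers, height=170, width=430):
    """Return the (height, width) of the feature map after every layer."""
    shapes = []
    for kind, size, stride in layers:
        if kind == "conv":
            height = conv_out(height, size[0])
            width = conv_out(width, size[1])
        else:
            height = pool_out(height, size[0], stride[0])
            width = pool_out(width, size[1], stride[1])
        shapes.append((height, width))
    return shapes


shapes = trace_shapes(RUN2_LAYERS)
print(shapes[0], shapes[-1])
```
]]></preformat>
          <p>Under these assumptions the first convolution leaves 4 positions along the frequency axis, which the first pooling layer collapses to 1, and the last pooling layer outputs a 1 × 16 map per filter, so the softmax layer would see 320 × 16 = 5 120 values.</p>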
        </sec>
        <sec id="sec-1-2-3">
          <title>Run 3 - Submission-9.txt</title>
          <p>This run was also performed by a model with 4 convolutional layers and the same
initialization technique; however, the filters learned are considerably
wider, and more filters are utilized in each layer:</p>
          <list list-type="bullet">
            <list-item><p>Convolutional layer of 320 filters (167 × 10) with dropout of 5% and LeakyReLU (α = 0.3) activation,</p></list-item>
            <list-item><p>Max-pooling layer with 4 × 10 pooling size and stride size of 1 × 5,</p></list-item>
            <list-item><p>Convolutional layer of 640 filters (1 × 2) with dropout of 5% and LeakyReLU (α = 0.3) activation,</p></list-item>
            <list-item><p>Max-pooling layer with 1 × 2 pooling size and stride size of 1 × 2,</p></list-item>
            <list-item><p>Convolutional layer of 960 filters (1 × 2) with dropout of 5% and LeakyReLU (α = 0.3) activation,</p></list-item>
            <list-item><p>Max-pooling layer with 1 × 2 pooling size and stride size of 1 × 2,</p></list-item>
            <list-item><p>Convolutional layer of 1280 filters (1 × 2) with dropout of 5% and LeakyReLU (α = 0.3) activation,</p></list-item>
            <list-item><p>Max-pooling layer with 1 × 2 pooling size and stride size of 1 × 2,</p></list-item>
            <list-item><p>Output softmax layer (999 units) with dropout probability of 25%.</p></list-item>
          </list>
        </sec>
        <sec id="sec-1-2-4">
          <title>Run 4 - Submission-ensemble.txt</title>
          <p>The final run consisted of a simple meta-model averaging the predictions of the
aforementioned submissions.</p>
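          <p>A meta-model of this kind can be as small as the sketch below, which averages the per-class probabilities produced by several models for the same recording. Equal weighting of the models is an assumption, as the text only speaks of averaging; the function name is hypothetical.</p>
          <preformat><![CDATA[
```python
def ensemble_average(model_predictions):
    """Unweighted mean of several models' probability vectors.

    `model_predictions` holds one probability vector per model, all of
    the same length and all referring to the same recording.
    """
    n_models = len(model_predictions)
    return [sum(per_class) / n_models for per_class in zip(*model_predictions)]


# Usage: three toy models voting on a 2-class problem.
blended = ensemble_average([[0.6, 0.4], [0.2, 0.8], [0.4, 0.6]])
print([round(p, 3) for p in blended])
```
]]></preformat>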
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>Training procedure</title>
        <p>All network models were trained using a categorical cross-entropy loss function
with a stochastic gradient descent optimizer (learning rate of 0.001, Nesterov
momentum of 0.9). Training batches contained 100 segments each. Validation was
performed locally on the hold-out set (20% of the original training data available)
by selecting a random subset on each epoch (approximately 2 500 files each time)
and calculating the model's prediction accuracy. This metric was assumed as
a proxy for the expected mean average precision without background species
category which was reported as MAP2 in BirdCLEF 2015 results.</p>
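        <p>For readers unfamiliar with Nesterov momentum, the sketch below applies one common formulation of the update to a toy one-dimensional problem, using the learning rate and momentum quoted above. It is not taken from the author's training code, the function names are hypothetical, and the exact update rule used by the library version in the experiments is assumed rather than confirmed here.</p>
        <preformat><![CDATA[
```python
def nesterov_sgd_step(param, velocity, grad, lr=0.001, momentum=0.9):
    """One SGD step with Nesterov momentum (a common formulation).

    The velocity is updated first; the parameter then moves by the
    momentum-scaled new velocity plus a plain gradient step.
    """
    velocity = momentum * velocity - lr * grad
    param = param + momentum * velocity - lr * grad
    return param, velocity


def minimise_quadratic(steps, start=0.0, target=3.0):
    """Minimise f(x) = (x - target)^2, whose gradient is 2 * (x - target)."""
    x, v = start, 0.0
    for _ in range(steps):
        x, v = nesterov_sgd_step(x, v, 2.0 * (x - target))
    return x


print(round(minimise_quadratic(5000), 6))
```
]]></preformat>
        <p>On this toy objective the iterate settles at the minimum; in the actual training the gradient would come from the categorical cross-entropy loss averaged over a batch of 100 segments.</p>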
        <p>Each model was trained for a number of epochs (30–102). The training time
for a single model on a single GTX 980 Ti card was in the range of 30–60 hours.
The results of final validation for each of the trained models are presented in
Table 2, whereas Figure 2 depicts a small selection of filters learned by one of
the models.</p>
        <p>Fig. 2: Example of filters learned in the first convolutional layer</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Submission results &amp; discussion</title>
      <p>The official results of the BirdCLEF 2016 challenge are presented in Table 3
and Figure 3. There were 6 participating groups which submitted 18 runs in
total. The submission described in this work resulted in a 3rd place among
participating teams with individual runs achieving 6th, 8th, 9th and 10th official
score (1st column - MAP with background species and soundscape files). The
analysis of these results and the experience gathered during the BirdCLEF 2016
challenge allows for the following remarks:</p>
      <list list-type="bullet">
        <list-item><p>With almost 1 000 bird species, the BirdCLEF dataset creates a demanding
challenge for any machine audition system. In this context, an approach
based on convolutional neural networks seems to be valid and promising
for the analysis of bioacoustical data. Looking at comparable results from
the very last year, surpassing a foreground only MAP of 50% is definitely
a success. However, this year's top performing submission was still able to
remarkably improve on this evaluation metric.</p></list-item>
        <list-item><p>The performance of the described networks is quite consistent between models.
It seems that a decent convolutional architecture with proper training and
initialization regime should be able to learn a reasonable approximation of
the classifying function based on the provided data, and minor architectural
decisions may not be of the utmost importance in this case.</p></list-item>
        <list-item><p>Very poor performance in the soundscape category confirms that the presented
approach has a strong bias against multi-label scenarios - a thing which is not
surprising when considering the applied learning scheme, which was rather
forcefully extended to the multi-label case. Not only does learning on a single
target label for each recording impose some constraints in this process, but
the whole pre-processing step may also be detrimental in this situation. Thus
it seems that further work should concentrate more on what is learned (data
segmentation and pre-processing, labeling, input/output layers) than how
(internal network architecture).</p></list-item>
        <list-item><p>A promising feature of the dataset lies in the good correspondence between
results obtained through local validation and evaluation of the private ground
truth by the organizers. This means that the dataset is both rich and uniform
enough for such estimations to be of value - an aspect which should help in
further efforts in improving the described solution.</p></list-item>
        <list-item><p>A very simple ensembling method was quite beneficial in the case of the
evaluated models. This shows that more sophisticated approaches could yield
some additional gains - both when it comes to meta-model blending and
in-model averaging. A progressive increase of the dropout rate was one of the
facets which was actually considered during the experiments. Unfortunately,
these attempts had to be preemptively stopped due to the time constraints
encountered in the final stage of the competition.</p></list-item>
      </list>
      <p>The top results achieved this year in the foreground category of the BirdCLEF
challenge are very promising - a MAP of almost 70% with 1 000 species is definitely
something which could be called an expert system. The presented method based
on convolutional neural networks has a slightly weaker, yet still very decent
performance of 52.9%, warranting further investigation of this approach.</p>
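      <p>Since all of the scores above are mean average precision (MAP) values, a short sketch of the textbook definition may help in reading them. It is not the organizers' evaluation script, whose exact protocol is not described in this paper; the function names are hypothetical and each recording is treated as one query with a ranked list of predicted species.</p>
      <preformat><![CDATA[
```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of true labels."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this hit
    return precision_sum / len(relevant)


def mean_average_precision(queries):
    """Mean of the average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Usage: one recording ranked perfectly, one with the true species second.
score = mean_average_precision([
    (["species_a", "species_b"], {"species_a"}),
    (["species_b", "species_a"], {"species_a"}),
])
print(score)
```
]]></preformat>
      <p>When every recording has a single foreground species, average precision reduces to the reciprocal of the rank at which that species appears, which is why a well-ordered ranked list matters more than the raw probabilities.</p>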
      <p>At the same time, the performance of all teams in the soundscape category
is not overwhelming, to say the least. This raises some interesting questions:
Is this kind of problem so hard and conceptually different that it would require
a completely overhauled approach? Considering that uniform ground-truth
labeling is much harder in this case, what is the impact of this aspect on the whole
evaluation process?</p>
      <p>One thing is certain though - there is still a lot of room for improvement, and
despite a constant stream of enhancements presented by new submissions, the
bar is set even higher in every consecutive BirdCLEF challenge.</p>
      <sec id="sec-2-1">
        <title>Acknowledgments</title>
        <p>I would like to thank the organizers of BirdCLEF and the Xeno-canto Foundation
for an interesting challenge and a remarkable collection of publicly available audio
recordings of singing birds.</p>
        <p>Fig. 3: BirdCLEF 2016 results (image not recovered; series: with background, foreground only, soundscapes only; horizontal axis: Submission)</p>
        <p>10. Briggs, F. et al.: The 9th annual MLSP competition: New methods for acoustic
classification of multiple simultaneous bird species in a noisy environment. Proceedings
of the IEEE International Workshop on Machine Learning for Signal Processing
(MLSP), IEEE, 2013.</p>
        <p>11. Goeau, H. et al.: LifeCLEF bird identification task 2016. CLEF working notes 2016.</p>
        <p>12. Joly, A. et al.: LifeCLEF 2016: multimedia life species identification challenges.
Proceedings of CLEF 2016.</p>
        <p>13. Xeno-canto project. http://www.xeno-canto.org (accessed 24/05/2016).</p>
        <p>14. Lasseck, M.: Improved automatic bird identification through decision tree based
feature selection and bagging. CLEF working notes 2015.</p>
        <p>15. Lasseck, M.: Large-scale identification of birds in audio recordings. CLEF working
notes 2014.</p>
        <p>16. Joly, A., Leveau, V., Champ, J. and Buisson, O.: Shared nearest neighbors match kernel
for bird songs identification - LifeCLEF 2015 challenge. CLEF working notes 2015.</p>
        <p>17. Joly, A., Champ, J. and Buisson, O.: Instance-based bird species identification with
undiscriminant features pruning. CLEF working notes 2014.</p>
        <p>18. Ren, L. Y., Dennis, J. W. and Dat, T. H.: Bird classification using ensemble
classifiers. CLEF working notes 2014.</p>
        <p>19. Stowell, D.: BirdCLEF 2015 submission: Unsupervised feature learning from audio.
CLEF working notes 2015.</p>
        <p>20. Stowell, D. and Plumbley, M. D.: Audio-only bird classification using unsupervised
feature learning. CLEF working notes 2014.</p>
        <p>21. Stowell, D. and Plumbley, M. D.: Automatic large-scale classification of bird sounds
is strongly improved by unsupervised feature learning. PeerJ 2:e488, 2014.</p>
        <p>22. Koops, H. V., Van Balen, J. and Wiering, F.: A deep neural network approach to
the LifeCLEF 2014 bird task. CLEF working notes 2014.</p>
        <p>23. Branson, S., Van Horn, G., Belongie, S. and Perona, P.: Bird species categorization
using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952, 2014.</p>
        <p>24. McFee, B. et al.: librosa: 0.4.1. Zenodo. 10.5281/zenodo.32193, 2015.</p>
        <p>25. Chambolle, A.: An algorithm for total variation minimization and applications.
Journal of Mathematical Imaging and Vision, 20 (1-2), 89–97, 2004.</p>
        <p>26. van der Walt, S. et al.: scikit-image: Image processing in Python. PeerJ 2:e453, 2014.</p>
        <p>27. Piczak, K. J.: Environmental sound classification with convolutional neural networks.
Proceedings of the IEEE International Workshop on Machine Learning for Signal
Processing (MLSP), IEEE, 2015.</p>
        <p>28. Chollet, F.: Keras. https://github.com/fchollet/keras (accessed 24/05/2016).</p>
        <p>29. Phan, H., Hertel, L., Maass, M. and Mertins, A.: Robust audio event recognition with
1-Max pooling convolutional neural networks. arXiv preprint arXiv:1604.06338, 2016.</p>
        <p>30. He, K., Zhang, X., Ren, S. and Sun, J.: Delving deep into rectifiers: Surpassing
human-level performance on ImageNet classification. Proceedings of the IEEE
International Conference on Computer Vision, IEEE, 2015.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          et al.:
          <article-title>Sensor network for the monitoring of ecosystem: Bird species recognition</article-title>
          .
          <source>Proceedings of the 3rd IEEE International Conference on Intelligent Sensors, Sensor Networks and Information. IEEE</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mporas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          et al.:
          <article-title>Integration of temporal contextual information for robust acoustic recognition of bird species from real-field data</article-title>
          .
          <source>International Journal of Intelligent Systems and Applications</source>
          ,
          <volume>5</volume>
          (
          <issue>7</issue>
          ),
          <fpage>9</fpage>
          –
          <lpage>15</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Wimmer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          et al.:
          <article-title>Sampling environmental acoustic recordings to determine bird species richness</article-title>
          .
          <source>Ecological Applications</source>
          ,
          <volume>23</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1419</fpage>
          –
          <lpage>1428</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. BirdVox. https://wp.nyu.edu/birdvox/ (accessed 24/05/
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Stowell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Plumbley</surname>
          </string-name>
          , M. D.:
          <article-title>Birdsong and C4DM: A survey of UK birdsong and machine recognition for music researchers</article-title>
          .
          <source>Centre for Digital Music</source>
          , Queen Mary University of London,
          <source>Technical report C4DM-TR-09-12</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. http://ec.europa.eu/environment/nature/legislation/birdsdirective/ (accessed 24/05/
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. IOC World Bird List. http://www.worldbirdnames.org/ (accessed 30/06/
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Glotin</surname>
          </string-name>
          , H. et al.:
          <source>Proceedings of Neural Information Processing Scaled for Bioacoustics. NIPS</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Glotin</surname>
          </string-name>
          , H. et al.:
          <source>Proceedings of the first workshop on Machine Learning for Bioacoustics. ICML</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>