<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Baseline for Large-Scale Bird Species Identification in Field Recordings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefan Kahl</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Wilhelm-Stein</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Holger Klinck</string-name>
          <email>Holger.Klinck@cornell.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danny Kowerko</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maximilian Eibl</string-name>
          <email>maximilian.eibl@informatik.tu-chemnitz.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bioacoustics Research Program, Cornell Lab of Ornithology</institution>
          ,
          <addr-line>159 Sapsucker Woods Road, Ithaca, NY 14850</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Chair Media Informatics, Chemnitz University of Technology</institution>
          ,
          <addr-line>D-09107 Chemnitz</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Junior Professorship Media Computing, Chemnitz University of Technology</institution>
          ,
          <addr-line>D-09107 Chemnitz</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>5</lpage>
      <abstract>
        <p>The LifeCLEF bird identification task poses a difficult challenge in the domain of acoustic event classification. Deep learning techniques have greatly impacted the field of bird sound recognition in recent years. We discuss our attempt at large-scale bird species identification using the 2018 BirdCLEF baseline system.</p>
      </abstract>
      <kwd-group>
        <kwd>Bioacoustics</kwd>
        <kwd>Bird Sounds</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>BirdCLEF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Motivation</title>
      <p>
        Large-scale bird sound identification in audio recordings is the foundation of
long-term species diversity monitoring. Aiding this labor-intensive task with
automated systems that can recognize several hundred species has been a
focus of research in recent years. As part of the 2018 LifeCLEF workshop [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the BirdCLEF
bird identification challenges [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] provide large datasets containing almost 50,000
recordings to assess the performance of various systems attempting to push the
boundaries of automated bird sound recognition.
In 2016, Sprengel et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] demonstrated the superior performance of
convolutional neural networks (CNNs) for the classification of bird sounds. Following
that approach, we were able to improve the performance on a larger dataset
containing 1500 different species with our 2017 BirdCLEF participation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This
year, we present an implementation of a streamlined workflow built on the most
fundamental principles of visual classification using CNNs. We published the code
repository as a baseline system complementing the 2018 BirdCLEF challenge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
The following workflow design, training scheme and submission results are
entirely based on that system, establishing a good overall baseline for future
comparisons and improvements.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Workflow</title>
      <p>The key stages of our workflow include dataset pre-processing, spectrogram
extraction, CNN training and evaluation. We adapted last year's approach and
focused mainly on basic deep learning techniques, keeping the code base as simple
and comprehensible as possible while maintaining good overall performance.</p>
      <sec id="sec-2-1">
        <title>Dataset Handling</title>
        <p>
          Using convolutional neural networks for the classification of acoustic events
has proved to be very effective, even though these techniques are tailor-made
for visual recognition. Representing audio recordings as spectrograms bridges
the gap between the two domains of audio and image. We decided to use
mel-scale log-amplitude spectrograms, which have been used effectively in similar
approaches (e.g. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]). A more detailed description of the extraction and
pre-processing steps can be found in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
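        <p>The idea of a mel-scale log-amplitude spectrogram can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions and not the extraction code of the baseline system: the function names, the Hann window, the FFT size, the hop length and the number of mel bands are chosen here for readability only, and the naive DFT is practical only for short signals.</p>
        <code language="python"><![CDATA[
```python
import cmath
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to the mel scale (common HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters: one row per mel band, one column per DFT bin."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge points, equally spaced on the mel axis
    edges_hz = [mel_to_hz(i * mel_max / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_freqs = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for band in range(n_mels):
        lo, mid, hi = edges_hz[band], edges_hz[band + 1], edges_hz[band + 2]
        row = []
        for f in bin_freqs:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def magnitude_spectrum(frame):
    """Magnitude of the one-sided DFT of a Hann-windowed frame (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def log_mel_spectrogram(signal, sample_rate, n_fft=256, hop=128, n_mels=16):
    """Return a list of frames, each holding n_mels log-amplitude values in dB."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        mag = magnitude_spectrum(signal[start:start + n_fft])
        mel = [sum(w * m for w, m in zip(row, mag)) for row in bank]
        # a small floor avoids log(0) for silent bands
        frames.append([20.0 * math.log10(max(e, 1e-10)) for e in mel])
    return frames
```
]]></code>
        <p>Each returned frame can be treated as one column of an image that is then fed to a CNN. For a 1 kHz test tone sampled at 8 kHz, the strongest band of every frame should be the one whose triangular filter covers 1 kHz.</p>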
      </sec>
      <sec id="sec-2-2">
        <title>Training</title>
        <p>Our baseline training process supports multiple shallow and deep model
architectures, extensive dataset augmentation, learning rate scheduling, model
pretraining and result pooling. We implemented two basic CNN concepts:
fully convolutional architectures with simple layer sequences and ResNet variations
with shortcut connections. We also provide eBird<sup>4</sup> checklist metadata for both
soundscape locations in Peru and Colombia along with the baseline repository.</p>
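        <p>Result pooling can be illustrated by combining the per-chunk scores of one recording into a single score per species. The sketch below assumes simple mean or max pooling; which pooling strategy the baseline system actually applies is not stated in this paper, and <monospace>pool_predictions</monospace> is a hypothetical helper introduced only for this example.</p>
        <code language="python"><![CDATA[
```python
def pool_predictions(chunk_scores, mode="mean"):
    """Pool per-chunk species scores into one score per species.

    chunk_scores: list of chunks, each a list with one score per species.
    mode: "mean" averages over chunks, "max" keeps the strongest chunk.
    """
    if not chunk_scores:
        raise ValueError("need at least one chunk")
    n_species = len(chunk_scores[0])
    pooled = []
    for species in range(n_species):
        column = [chunk[species] for chunk in chunk_scores]
        if mode == "mean":
            pooled.append(sum(column) / len(column))
        elif mode == "max":
            pooled.append(max(column))
        else:
            raise ValueError("unknown pooling mode: " + mode)
    return pooled
```
]]></code>
        <p>Mean pooling favors species that are present throughout a recording, whereas max pooling also keeps species that vocalize only once; the better choice depends on the evaluation metric.</p>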
      </sec>
      <sec id="sec-2-3">
        <title>Model Distillation</title>
        <p>Most CNN implementations are computationally expensive and rely on
power-hungry hardware. Future applications of automated bird sound recognition will
include field recorders capable not only of recording, but also of analyzing audio
data in real time. In those cases, battery life becomes an issue. In recent years,
(semi-)mobile hardware, mostly used for IoT applications, has been designed
to aid this task. However, those hardware platforms are not yet suited for deep
learning inference using complex models.</p>
        <p>
          In 2015, Hinton et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] presented an approach to distill knowledge in neural
networks. We followed that scheme of model distillation and implemented a
basic variant of teacher-student learning. Our baseline system allows replacing
binary training targets with log-probability predictions of either single models or
entire ensembles. We designed a simple shallow model that can predict species
probabilities of one-second audio chunks in less than one second running on a
Raspberry Pi 3+. The resulting scores are slightly lower than those of large
single models, but still above the initial capabilities of the tiny CNN model.
        </p>
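        <p>The teacher-student idea can be sketched in a few lines. The example assumes the temperature-softened softmax targets described by Hinton et al. [<xref ref-type="bibr" rid="ref7">7</xref>] and averages the predictions of several teachers; the function names and the temperature value are illustrative assumptions and do not reproduce the exact target format of the baseline system.</p>
        <code language="python"><![CDATA[
```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(teacher_logits, temperature=1.0):
    """Average the softened class probabilities of all teacher models."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n_teachers = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_teachers for c in range(n_classes)]


def distillation_loss(student_logits, soft_targets, temperature=1.0):
    """Cross-entropy between teacher soft targets and student predictions."""
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(max(s, 1e-12))
                for t, s in zip(soft_targets, student))
```
]]></code>
        <p>During training, the soft targets take the place of the binary labels, so the student also learns how the teachers rank the species they consider unlikely.</p>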
        <p>The prediction performance of this approach is promising and model distillation
may have significant impact on the field of mobile real-time species diversity
assessment.</p>
        <p><sup>4</sup> www.ebird.org/explore</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>We tried to cover different basic training and prediction schemes with our run
submissions, including single baseline models, large and diverse model ensembles,
metadata-assisted attempts with species pre-selection and knowledge distillation
training of tiny models. Table 1 provides an overview of selected results from
our submissions.</p>
      <p>The results show that our baseline attempt yields competitive results
considering the complex evaluation task. Most results matched our expectations
for the audio-only classification of field recordings. The key takeaways of the
analysis of the submission results are:</p>
      <list list-type="bullet">
        <list-item>
          <p>Diverse model ensembles covering different net architectures and dataset
splits outperform single neural nets by a significant margin. This comes as
no surprise in the domain of metric-centered competitions, but might not be
applicable to real-world scenarios due to increased computational costs.</p>
        </list-item>
        <list-item>
          <p>Pre-selecting species did not improve the overall performance as expected.
In some cases, selecting species based on time of the year and location helps
to reduce training time. Using metadata as a post-filter to eliminate false
detections or as input during model training might lead to better results.</p>
        </list-item>
        <list-item>
          <p>Model distillation is a powerful tool to increase the classification
performance of tiny neural networks. The results show comparable performance
in the soundscape domain despite much smaller model architectures, when
compared to model ensembles.</p>
        </list-item>
      </list>
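      <p>The metadata post-filter suggested in the takeaways above could look like the following sketch. It was not part of our submissions; the function name, the species identifiers and the use of a plain set as checklist are assumptions made for illustration only.</p>
      <code language="python"><![CDATA[
```python
def filter_by_checklist(predictions, checklist, floor=0.0):
    """Suppress scores of species that are not on the local checklist.

    predictions: dict mapping species name to prediction score.
    checklist: set of species expected at the recording location and season.
    floor: score assigned to species missing from the checklist.
    """
    return {species: (score if species in checklist else floor)
            for species, score in predictions.items()}
```
]]></code>
      <p>Setting <monospace>floor</monospace> to a small positive value instead of zero would keep rare or unexpected species in the ranking at a low priority rather than removing them entirely.</p>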
      <p>We published our entire code repository<sup>5</sup> and encourage future participants
and interested research groups to build upon our results and improve the
performance for the analysis of complex soundscapes, the most crucial aspect of
species diversity monitoring.</p>
    </sec>
    <sec id="sec-4">
      <title>Future Work</title>
      <p>Assessing high-quality field recordings for the presence of bird species using
convolutional neural networks is an effective application of deep learning techniques
to the domain of acoustic event detection. Considering our own scores and those
of other participants, current machine learning algorithms yield very strong
results for this task. However, the 2018 BirdCLEF evaluation showed that the
transfer of knowledge extracted from monophonic community recordings to the
domain of long-term soundscape recordings is still very difficult. Hardly any
improvements over last year's results have been accomplished. Future research
should specifically focus on this task. Additionally, power-hungry hardware and
computationally expensive algorithms are not well suited for real-world
applications such as mobile recorders. Improving techniques to shrink the size of neural
networks while maintaining the overall performance will greatly help the field of
long-term species diversity assessment.</p>
      <p><sup>5</sup> https://github.com/kahst/BirdCLEF-Baseline</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Goeau, H.,
          <string-name>
            <surname>Botella</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planque</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.-P.</given-names>
          </string-name>
          , Muller, H.:
          <article-title>Overview of LifeCLEF 2018: a large-scale evaluation of species identi - cation and recommendation algorithms in the era of AI</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          <year>2018</year>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Goeau, H.,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planque</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.-P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kahl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of BirdCLEF 2018: monophone vs. soundscape bird identification</article-title>
          .
          <source>In: CLEF working notes 2018</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Sprengel</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaggi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kilcher</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hofmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Audio based bird species identification using deep learning techniques</article-title>
          .
          <source>In: Working notes of CLEF 2016</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kahl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilhelm-Stein</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hussein</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klinck</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kowerko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ritter</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eibl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Large-Scale Bird Sound Classification using Convolutional Neural Networks</article-title>
          .
          <source>In: CLEF working notes 2017</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kahl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilhelm-Stein</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klinck</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kowerko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eibl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Recognizing Birds from Sound - The 2018 BirdCLEF Baseline System</article-title>
          .
          <source>arXiv preprint arXiv:1804.07177</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Grill</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , Schluter, J.:
          <article-title>Two convolutional neural networks for bird detection in audio signals</article-title>
          .
          <source>In: 2017 25th European Signal Processing Conference (EUSIPCO)</source>
          , pp.
          <fpage>1764</fpage>
          -
          <lpage>1768</lpage>
          .
          <publisher-name>IEEE</publisher-name>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distilling the knowledge in a neural network</article-title>
          .
          <source>arXiv preprint arXiv:1503.02531</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>