<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Multi-modal Deep Neural Network approach to Bird-song identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Botond Fazekas</string-name>
          <email>botond.fazekas@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Schindler</string-name>
          <email>alexander.schindler@ait.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Lidy</string-name>
          <email>lidy@ifs.tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Rauber</string-name>
          <email>rauber@ifs.tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Austrian Institute of Technology, Center for Digital Safety and Security</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vienna University of Technology, Institute of Software Technology and Interactive Systems</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a multi-modal Deep Neural Network (DNN) approach for bird song identification. The presented approach takes both audio samples and metadata as input. The audio is fed into a Convolutional Neural Network (CNN) using four convolutional layers. The additionally provided metadata is processed using fully connected layers. The flattened convolutional layers and the fully connected layer of the metadata are joined and fed into a fully connected layer. The resulting architecture achieved ranks 2, 3, and 4 in the BirdCLEF 2017 task in various training configurations.</p>
      </abstract>
      <kwd-group>
        <kwd>bird song</kwd>
        <kwd>deep neural network</kwd>
        <kwd>exponential linear unit</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        We present our multi-modal Deep Neural Network (DNN) submission to the
BirdCLEF 2017 task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] which is part of the LifeCLEF [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] Multimedia Retrieval
of biodiversity data evaluation campaign. The presented system is an adaptation
of the approach introduced in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] which was extensively evaluated in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The
presented approach extends the originally audio-only model to include further
modalities as input. The original part is based on the provided samples of field-recorded
audio content. This information is converted to the frequency domain
and sequentially processed before being fed into a custom Convolutional
Neural Network (CNN) using four convolutional layers. The additionally provided
metadata is processed using fully connected layers. The flattened convolutional
layers and the fully connected layer of the metadata are joined and fed into a
large fully connected layer.
      </p>
      <p>To achieve better convergence of the neural networks and to improve model
accuracy, several pre-processing steps are applied to the provided data. The
field recordings are split into bird-song and noise parts. For the training of the
models a random audio segment is selected from the sound file. Various
data augmentation steps detailed below are applied to the Mel-scaled spectrograms,
which are then fed into the network. From the provided metadata, longitude, latitude,
elevation and the part of the day are used as additional information. Each feature
is flagged with an extra bit in case of missing data.</p>
      <p>For the final calculation of the results, sequential audio segments with 50%
overlap are taken from the sound files and predictions are retrieved from the
trained model. To assess the final classification, the predictions for all
segments of a sound file are averaged.</p>
    </sec>
    <sec id="sec-2">
      <title>Preprocessing</title>
        <p>This section describes data transformations, especially in the audio domain,
including data manipulation methods to augment the provided training data.</p>
      <sec id="sec-2-1">
        <title>Sound preprocessing</title>
        <p>
          For the sound preprocessing, a method similar to the one formulated in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is applied. The
audio recordings are split into sound, noise, and irrelevant segments. To do this,
we compute the spectrogram of the sound file using the short-time Fourier
transform (STFT) with a Hanning window function (size 512, 74% overlap). We
normalize the resulting spectrogram to the interval [0, 1]. The spectrogram is
then treated as a grayscale image.
        </p>
        <p>
          As in most of the recordings the foreground bird singing/calling has a higher
amplitude than the background noise, in order to distinguish the relevant sound
from the background noise, each STFT frequency bin is set to 1 if it is above
three times the median of the corresponding row and three times the median of
its corresponding column; otherwise it is set to 0. However, as this step results in
a noisy spectrogram, a binary erosion and dilation filter is applied to it. We
used a 4 by 4 filter as suggested in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. A one-dimensional indicator vector is
created from this image in which the i-th element is set to 1 if the corresponding
column in the spectrogram contains at least one 1; otherwise it is set to 0. This vector
is then binary-dilated twice. The indicator vector is scaled to the original
length of the recording and used as a mask to extract the relevant sound
part. For separating the noise the same method is applied with a threshold of 2.5
instead of 3 for the median clipping, and the resulting image is then inverted.
Columns containing pixels whose amplitude is neither larger than 3 times nor
smaller than 2.5 times the row and column median are considered irrelevant,
as in this case they cannot be distinguished clearly from the sound or noise part.
However, with very noisy recordings, or with ones that contain only bird songs
without any quiet parts, this approach can result in very short or even empty
segments, as few or no pixels will be above the median threshold. To
overcome the problem of short segments, a minimum segment length of 32,768
samples is selected, as this is the minimum sound chunk size in our network
architecture. The noise/sound separation threshold is iteratively lowered by 0.1
until the length of the sound part exceeds this limit.
        </p>
        <p>Since the Deep Learning network needs a fixed-size input during training, for
composing the batches we randomly select 16 (our batch size) files from which
we randomly select segments. If the files contain fewer than 32,768 samples, instead
of padding we loop the files. The selected segments are then converted to the
frequency domain using the Short-Time Fourier Transform (STFT). In a subsequent step
a log-normalized Mel-scale transform with 80 Mel bands is applied.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Data Augmentation</title>
        <p>
          Most of the data augmentation steps are similar to the ones used in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. However,
we found that in addition to the proposed data augmentation steps, small
variations in the amplitude and overlaying other birds from neighboring areas can further
improve the accuracy, leading to the data augmentation process described below.
Noise overlay: During the training up to 4 random noise samples are taken
from the noise files of the training set and each is added with
75% probability to the sound sample. This results in some
segments containing no noise overlay at all, while others
have four different ones. Adding more noise to the sound samples
results in worse performance. We found that greatly
dampening the volume of the noise (as described in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]) reduces the
accuracy. Thus the overlay volume is only changed by 10%.
        </p>
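        <p>A sketch of the noise-overlay step under the stated probabilities; the helper overlay_noise and its argument names are illustrative, not from the original implementation:</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng()

def overlay_noise(sound, noise_files, p=0.75, max_overlays=4, jitter=0.1):
    # Add up to 4 random noise samples, each with 75% probability,
    # varying the overlay volume by only +/-10% (heavier damping hurt accuracy).
    out = sound.copy()
    for _ in range(max_overlays):
        if p > rng.random():
            noise = noise_files[rng.integers(len(noise_files))]
            gain = 1.0 + rng.uniform(-jitter, jitter)
            n = min(len(out), len(noise))
            out[:n] += gain * noise[:n]
    return out
```
        </preformat>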
        <p>Combining same-class audio files: With a probability of 70%, recordings of
birds from the same class are overlaid with a random damping
factor between 20% and 60%.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Combining birds from the neighboring area</title>
        <p>In addition to the noise, with
a probability of 30%, a bird singing/calling of a different class
that can be found within a distance of 1° East/West/North/South
is overlaid on the sample with 30% ± 5% damping.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Random cut</title>
        <p>After applying one of the overlays described above, the
spectrogram is randomly cut into two parts, which are
concatenated again in switched order.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Volume shift</title>
        <p>The volume of the input audio is randomly changed by 5%.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Pitch shift</title>
        <p>The pitch of the input audio is randomly changed by 5%.</p>
      </sec>
      <sec id="sec-2-7">
        <title>Metadata preprocessing</title>
        <p>To incorporate the available metadata in the model, some preprocessing is
required due to missing or inapplicable values. For the missing values we use other
instances of the same species where these attributes are available: we calculate
the mean and the variance of the respective attribute distribution and generate
a normally distributed random value.</p>
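        <p>The imputation scheme can be sketched as follows (impute is a hypothetical helper; species_values holds the attribute for other recordings of the same species, with None for missing entries):</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng()

def impute(value, species_values):
    # Missing attribute: draw from a normal distribution fitted to the
    # values observed for other recordings of the same species.
    if value is not None:
        return value
    observed = np.array([v for v in species_values if v is not None],
                        dtype=float)
    return rng.normal(observed.mean(), observed.std())
```
        </preformat>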
        <p>
          Apart from the date and the geo-coded coordinates, the time of the day is
available. If this information is missing it is randomly generated as above. It
has been shown that bird song intensity correlates with the melatonin levels
in the birds and thus with the daylight [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
          ]. As the time of the sunrise and
sunset varies during the year, the time of the day is not directly related to the
amount of light. Thus, instead of directly using the time values we decided to
divide the day into six different categories of sunlight exposure corresponding to
different positions of the sun in the sky. The time of the sunrise and the sunset
depends on the coordinates and on the day of the year (and partially on the
elevation, which is however ignored in our implementation) and is approximated
with the algorithm formulated in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We define the following parts of a day:
- Night1 - from midnight until the sun is 9° below the horizon (BTH)
- Dawn - from 9° BTH until 4° above the horizon (ATH)
- Forenoon - from 4° ATH until noon
- Afternoon - from noon until 4° ATH
- Dusk - from 4° ATH to 9° BTH
- Night2 - from 9° BTH until midnight
9° BTH is selected because it lies between the nautical twilight (i.e. the horizon
is visible) and the civil twilight (i.e. terrestrial objects are visible to the human
eye); 4° ATH is selected arbitrarily as a point where the sun is already clearly
above the horizon.
        </p>
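        <p>Assuming the solar altitude from the algorithm in [5] is already available, the six categories could be assigned like this (part_of_day is an illustrative helper, not the authors' code):</p>
        <preformat>
```python
def part_of_day(sun_altitude_deg, is_before_noon):
    # Map solar altitude (degrees, negative below the horizon) and the
    # local half of the day to one of the six sunlight categories.
    if -9.0 > sun_altitude_deg:
        return "Night1" if is_before_noon else "Night2"
    if 4.0 > sun_altitude_deg:
        return "Dawn" if is_before_noon else "Dusk"
    return "Forenoon" if is_before_noon else "Afternoon"
```
        </preformat>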
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Network architecture</title>
      <p>The network has two types of input: one for the spectrogram and one for the
metadata. The metadata input is a vector of 7 elements:
1. Coordinates available (1 if available, 0 if not)
2. Latitude (normalized to 0..1)
3. Longitude (normalized to 0..1)
4. Elevation available (1 if available, 0 if not)
5. Elevation (normalized to 0..1)
6. Part of day available (1 if available, 0 if not)
7. Part of day (normalized to 0..1)</p>
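      <p>A sketch of how the 7-element metadata vector might be assembled. The normalization ranges (e.g. the 5000 m elevation cap) and the mapping of the six day parts to 0..1 are our assumptions:</p>
      <preformat>
```python
def metadata_vector(lat=None, lon=None, elevation=None, day_part=None):
    # 7 elements: availability flag plus normalized value for coordinates,
    # elevation, and part of day; 0 marks missing data.
    def norm(v, lo, hi):
        return (v - lo) / (hi - lo)
    coords_ok = lat is not None and lon is not None
    elev_ok = elevation is not None
    day_ok = day_part is not None
    return [
        1.0 if coords_ok else 0.0,
        norm(lat, -90.0, 90.0) if coords_ok else 0.0,
        norm(lon, -180.0, 180.0) if coords_ok else 0.0,
        1.0 if elev_ok else 0.0,
        norm(elevation, 0.0, 5000.0) if elev_ok else 0.0,  # assumed cap
        1.0 if day_ok else 0.0,
        day_part / 5.0 if day_ok else 0.0,  # six categories, 0..5
    ]
```
      </preformat>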
      <p>The metadata input is fed into a fully connected layer of 100 neurons. The
spectrogram input layer (80 × 512 units) is followed by four convolutional layers
with Exponential Linear Unit (ELU) activation, each followed by a max-pooling
layer. We found that ELUs yield the same results as using a rectifying activation
function, but without the need for batch normalization and with training times
reduced by a factor of about 3. A dropout of 0.2 is used on the input layer, after flattening
the convolutional layers (0.4) and after the fully connected layer (0.4).</p>
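      <p>For reference, the ELU activation mentioned above (a standard formulation, not code from the submission):</p>
      <preformat>
```python
import numpy as np

def elu(x, alpha=1.0):
    # Exponential Linear Unit: identity for positive inputs, a smooth
    # exponential saturating at -alpha for negative ones.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))
```
      </preformat>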
      <p>We use either an FFT window of 256 with a 32,768-sample-long sound
segment or an FFT window of 512 with a 65,536-sample-long sound segment.
For both of them the Mel scale is calculated with 80 bands. Thus, the input layer
is a matrix of 80 × 512. It is important to note that having 80 Mel bands with
an FFT window size of 256 means that some Mel bands are empty. However,
these are filtered out anyway in the first convolutional / max-pooling layer. Thus,
instead of using differently configured input layers for FFT sizes 256 and 512, we
use a single input layer configuration covering all bands. For the training
we use a batch size of 16 and a learning rate of 0.001, with Nesterov momentum
of 0.9.</p>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Cynapse Run 1:</title>
        <p>In this run we use a 256 FFT window, and we train the
network with 90% of the training set for 4 days.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Cynapse Run 2:</title>
        <p>We use a 512 FFT window, and the training is kept
running for 3 days with 90% of the training set. Then, for 1
day it is trained on the whole training set.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Cynapse Run 3:</title>
        <p>The network from Cynapse Run 1 is kept training for an
additional day with the whole training set.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Cynapse Run 4:</title>
        <p>The predictions of Cynapse Run 2 and Cynapse Run 3
were taken and averaged for each class.</p>
        <p>For detailed scenario descriptions and a full report on the BirdCLEF
evaluation campaign results, please refer to the BirdCLEF 2017 web page
(http://www.imageclef.org/lifeclef/2017/bird). We also
tested a more complicated architecture with three inputs: an FFT-256
spectrogram, an FFT-512 spectrogram and the metadata, which are then joined in
a fully connected layer. This architecture yielded better results than the FFT-512-only
network, but worse than the FFT-256 network; however, the training time
is considerably longer.</p>
        <p>The presented approach harnesses information deriving from multiple
modalities. Cynapse Run 3 performed best for the time-coded soundscapes, which
may contain longer parts without any relevant sound. However, for the traditional
records with less noise it had the worst performance. On the other hand,
Cynapse Run 4, which incorporated the results of Cynapse Run 3 and the
higher-frequency-resolution results of Cynapse Run 2, performed best for
traditional records, but only third best for the time-coded soundscapes. A possible
explanation could be that the higher frequency resolution is a more
distinguishing feature than the higher temporal resolution, but only if long enough
sound segments are available. Future work will focus on regarding ornithological
relationships instead of treating each species as an isolated class.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Goëau</surname>
            ,
            <given-names>Hervé</given-names>
          </string-name>
          , Hervé Glotin, Robert Planqué, Willem-Pier Vellinga, and Alexis Joly (
          <year>2017</year>
          )
          <article-title>LifeCLEF Bird Identification Task 2017</article-title>
          .
          <source>CLEF Working Notes 2017</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Elias</given-names>
            <surname>Sprengel</surname>
          </string-name>
          , Martin Jaggi, Yannic Kilcher, and Thomas Hofmann (
          <year>2016</year>
          )
          <article-title>Audio Based Bird Species Identification using Deep Learning Techniques</article-title>
          .
          <source>CLEF 2016 Working Notes</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ball</surname>
            ,
            <given-names>G.F.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Balthazart</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2002</year>
          )
          <article-title>Neuroendocrine mechanisms regulating reproductive cycles and reproductive behavior in birds</article-title>
          .
          <source>In Hormones, Brain, and Behavior</source>
          .
          <volume>2</volume>
          :
          <fpage>649</fpage>
          -
          <lpage>798</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bentley</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Van't Hof</surname>
            ,
            <given-names>T.J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ball</surname>
            ,
            <given-names>G.F.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Seasonal neuroplasticity in the songbird telencephalon: A role for melatonin</article-title>
          .
          <source>In Proceedings of the National Academy of Sciences of the United States of America. Proceedings of the National Academy of Sciences</source>
          <volume>96</volume>
          (
          <issue>8</issue>
          ):
          <fpage>4674</fpage>
          -
          <lpage>4679</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Van Flandern</surname>
            ,
            <given-names>T. C.</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>K. F.</given-names>
            <surname>Pulkkinen</surname>
          </string-name>
          . (
          <year>1979</year>
          ).
          <article-title>Low-precision formulae for planetary positions</article-title>
          .
          <source>In The Astrophysical Journal Supplement Series</source>
          <volume>41</volume>
          :
          <fpage>391</fpage>
          -
          <lpage>411</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goëau</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spampinato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <source>LifeCLEF</source>
          <year>2016</year>
          :
          <article-title>Multimedia life species identification challenges</article-title>
          .
          <source>In International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          (pp.
          <fpage>286</fpage>
          -
          <lpage>310</lpage>
          ). Springer.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Schindler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lidy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rauber</surname>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Comparing shallow versus deep neural network architectures for automatic music genre classification</article-title>
          .
          <source>In Proceedings of the 9th Forum Media Technology (FMT2016)</source>
          , St. Poelten, Austria.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>T.</given-names>
            <surname>Lidy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schindler</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>CQT-based convolutional neural networks for audio scene classification</article-title>
          .
          <source>In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)</source>
          , (pages
          <fpage>60</fpage>
          -
          <lpage>64</lpage>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>