<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Construction and Improvements of Bird Songs' Classification System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haiwei Wu</string-name>
          <email>wuhaiweideyouxiang@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ming Li</string-name>
          <email>ming.li369@dukekunshan.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Duke Kunshan University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sun Yat-sen University</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Detecting bird species from their songs is a challenging and meaningful task. Two scenarios are presented in this year's BirdCLEF challenge: monophone and soundscape. We trained convolutional neural networks with both spectrograms extracted from recordings and the additionally provided metadata. Focusing on the soundscape scenario, we applied bird event detection to reduce false alarms. We also rescored the retrievals using masks designed for each species being identified, and took context information into consideration. Our system was evaluated in BirdCLEF 2018, achieving an official mean average precision (MAP) score of 0.6548 for monophone classification without background bird songs and 0.5882 for identification with background bird songs. For soundscape, we achieved 0.1196 in classification mean average precision (C-MAP).</p>
      </abstract>
      <kwd-group>
        <kwd>sound detection</kwd>
        <kwd>bird song</kwd>
        <kwd>convolutional neural network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The BirdCLEF challenge is hosted by the LifeCLEF lab [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. The aim of the competition is to train models that can classify different bird species by their songs. The bird song data in this challenge are collected and displayed on www.xenocanto.org. This year, a training set of 36,496 bird song recordings covering 1500 species is provided. For evaluation, two scenarios are considered [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The first scenario is the identification of bird species in given monophone recordings, each of which mainly contains one bird's song. For this scenario, 12,347 unlabeled bird song recordings are provided for evaluation. The second scenario is the detection of species in soundscape recordings, where participants are required to find the most likely species for each 5-second segment. This year, a well-labeled soundscape evaluation set of 20 minutes comprising 240 5-second segments and a test set of 6 hours comprising 4382 5-second segments are provided. In this note, we introduce the construction of our basic system for the first scenario and our improvements focusing on the soundscape scenario.
      </p>
      <p>
        The training features of our model consist of two parts. The original part is the frequency information of each recording, and the additional part is the metadata [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which includes latitude, longitude, elevation and time information. For the original part, audio is converted into features in the frequency domain: every 5-second segment of a recording is turned into a time-frequency image with a resolution of 512 x 256 pixels [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. The problem of audio classification is thus transformed into a problem of image classification, where convolutional neural networks perform very well [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
        ]. In our system, the spectrograms are fed into a multi-layer convolutional neural network. The additional metadata are provided in the given XML files. Before the last fully connected layer, the additional features are concatenated to the flattened convolutional layer output, and the concatenated features are then used to compute the remaining layers. Besides a regular multi-layer convolutional neural network, we also tried out ResNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        The above describes our model training. Based on these models, we made several improvements focusing on the soundscape problem in the test period. Firstly, a simple bird event detection [
        <xref ref-type="bibr" rid="ref10 ref4 ref9">4, 9, 10</xref>
        ] was applied before spectrograms were classified by our trained neural network. Secondly, we designed a mask for each species: every time we obtained a list of bird species from the neural networks, we sorted it and rescored the top 3 or 5 species with our model after applying the masks. Thirdly, we considered the information of the previous and next 5-second segments for the current evaluation using a simple mechanism.
      </p>
      <p>PyTorch was used for our model training and evaluation.</p>
    </sec>
    <sec id="sec-2">
      <title>Feature preparation</title>
      <p>
        We transform the problem of bird song classification into image classification. Each 5-second segment of the given audio is turned into a spectrogram with a resolution of 512 x 256 pixels [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. A sliding window with an overlap of 4 seconds is used to segment the audio. Because some spectrograms contain mostly noise, a simple approach introduced by [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ] is used to separate the spectrograms into training samples and noise samples. The noise samples are also used for data augmentation later. Data imbalance is a severe problem in this dataset: for bird species with fewer spectrograms than a given number, over-sampling [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] using augmented data is applied.
      </p>
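      <p>
        The sliding-window segmentation above can be sketched as follows; the sample rate and the handling of trailing audio are illustrative assumptions, not details given in the text.
      </p>
      <p>
```python
import numpy as np

def segment_audio(samples, sr, win_s=5.0, overlap_s=4.0):
    """Slice a waveform into 5-second windows with 4 seconds of overlap
    (i.e. a 1-second hop). Trailing audio shorter than one full window
    is dropped in this sketch."""
    win = int(win_s * sr)
    hop = int((win_s - overlap_s) * sr)
    return [samples[i:i + win] for i in range(0, len(samples) - win + 1, hop)]

sr = 16000                      # assumed sample rate, for illustration only
audio = np.zeros(10 * sr)       # a dummy 10-second recording
segments = segment_audio(audio, sr)
# a 10-second clip yields 6 windows, starting at 0, 1, 2, 3, 4 and 5 seconds
```
      </p>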
      <p>Data augmentation is necessary for building robust models and handling data imbalance. Adding noise is a commonly used augmentation method, and we add two kinds of noise to the spectrograms: in each training epoch, Gaussian noise is added to 10 percent of the data and noise samples are added to another 10 percent.</p>
      <p>
        Gaussian noise: Gaussian noise [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is commonly used for augmentation and is a standard way to build robust classification networks, since models learn to ignore this kind of noise during training. We add the noise with randomly chosen weights to our spectrograms and re-normalize the results.
      </p>
      <p>
        Noise samples: Besides Gaussian noise, noise samples are also added to our spectrograms. Noise in audio recorded by similar equipment under similar environments often shares common patterns, so adding similar noise helps improve performance. During data processing, we obtained many spectrograms that are considered noise samples [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. We randomly choose some of them and add them to the current features with random weights, again re-normalizing after the addition.
      </p>
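      <p>
        A minimal sketch of both augmentation steps is shown below; the weight ranges are assumptions, since the text only says the weights are chosen randomly.
      </p>
      <p>
```python
import numpy as np

rng = np.random.default_rng(0)

def renormalize(spec):
    # Rescale a spectrogram back to [0, 1] after an additive augmentation.
    lo, hi = spec.min(), spec.max()
    return (spec - lo) / (hi - lo + 1e-8)

def add_gaussian_noise(spec, max_weight=0.1):
    # Add Gaussian noise with a randomly chosen weight, then re-normalize.
    w = rng.uniform(0.0, max_weight)
    return renormalize(spec + w * rng.standard_normal(spec.shape))

def add_noise_sample(spec, noise_specs, max_weight=0.5):
    # Mix in a randomly chosen noise spectrogram with a random weight.
    noise = noise_specs[rng.integers(len(noise_specs))]
    w = rng.uniform(0.0, max_weight)
    return renormalize(spec + w * noise)

spec = rng.random((256, 512))                        # one 512 x 256 spectrogram
noises = [rng.random((256, 512)) for _ in range(4)]  # separated noise samples
aug1 = add_gaussian_noise(spec)
aug2 = add_noise_sample(spec, noises)
```
      </p>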
      <p>
        Researchers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] noted that considering metadata benefits model performance. As metadata, we consider the latitude, longitude, elevation, and time of a recording, simplifying the metadata processing of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. From the provided metadata, we obtain a vector of 7 elements [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], whose values are defined as follows:
1. Latitude and longitude provided: 1 if available, 0 if not;
2-3. Latitude and longitude, normalized between 0 and 1;
4. Elevation provided: 1 if available, 0 if not;
5. Elevation, normalized between 0 and 1;
6. Time of recording provided: 1 if available, 0 if not;
7. Time information, directly normalized between 0 and 1.
      </p>
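      <p>
        The 7-element vector can be built as below; the normalization ranges (latitude over [-90, 90], an assumed elevation ceiling, hour of day) are our assumptions, since the exact scaling is not specified.
      </p>
      <p>
```python
def metadata_vector(lat=None, lon=None, elev=None, time_h=None):
    """Build the 7-element metadata vector described above. The
    normalization ranges are illustrative assumptions."""
    v = [0.0] * 7
    if lat is not None and lon is not None:
        v[0] = 1.0                        # 1: coordinates available
        v[1] = (lat + 90.0) / 180.0       # 2: latitude mapped to [0, 1]
        v[2] = (lon + 180.0) / 360.0      # 3: longitude mapped to [0, 1]
    if elev is not None:
        v[3] = 1.0                        # 4: elevation available
        v[4] = min(max(elev, 0.0), 9000.0) / 9000.0  # 5: assumed 9000 m ceiling
    if time_h is not None:
        v[5] = 1.0                        # 6: recording time available
        v[6] = time_h / 24.0              # 7: hour of day mapped to [0, 1]
    return v

v = metadata_vector(lat=31.3, lon=120.9, elev=10.0, time_h=6.0)
# v[0], v[3], v[5] are 1.0; all other entries lie in [0, 1]
```
      </p>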
    </sec>
    <sec id="sec-3">
      <title>Model construction</title>
      <p>
        We use a relatively shallow convolutional neural network architecture as our basic model [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ]. Finding the best network architecture is very time-consuming, so we instead sought new methods to improve performance in the test period. Our basic network consists of 6 convolutional layers and 3 fully connected layers, with a max pooling layer after each convolutional layer. Each convolutional and fully connected layer is followed by a batch normalization [12] layer to keep parameters from becoming too extreme and to speed up convergence. Dropout [13] is also used after each fully connected layer to reduce overfitting. As the activation function, we select the exponential linear unit (ELU) [14], which is thought to be a proper choice [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ].
      </p>
      </p>
      <p>As the problem can be viewed as a multi-class identification problem, the cross entropy loss is minimized. We use Adam [15] as our optimizer. Adam can be regarded as RMSprop [16] with momentum, making use of both the first and second moments of the gradient, so parameters can be updated more stably.</p>
      <p>A learning rate decay technique [17] is used in our training process. At the very beginning, the learning rate is set to 0.0001; after roughly 15 epochs of training, it is lowered to 0.00001 to refine the updates. We stop training when the accuracy converges.</p>
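      <p>
        The step-decay schedule described above amounts to the following sketch; the exact switch epoch is taken from the text, everything else is standard.
      </p>
      <p>
```python
def learning_rate(epoch, base_lr=1e-4, decayed_lr=1e-5, decay_epoch=15):
    """Step decay used in training: 1e-4 for roughly the first 15 epochs,
    then 1e-5 until the accuracy converges."""
    if epoch >= decay_epoch:
        return decayed_lr
    return base_lr
```
      </p>
      <p>
        With PyTorch, this would typically be realized by updating the Adam optimizer's learning rate between epochs (e.g. via a step scheduler).
      </p>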
      <p>
        As mentioned above, metadata are also used for training in our system. Spectrograms are flattened to a vector of 512 elements by our convolutional neural network. We construct an additional fully connected layer for the metadata [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which transforms the 7-element vectors into vectors of 100 elements. Due to time constraints, the output dimension of this layer was not explored further. We then concatenate the 512 and 100 elements and feed them into the next fully connected layers. Finally, a softmax layer [18] of 1500 elements outputs the predicted probability for each bird species.
      </p>
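      <p>
        A NumPy sketch of this fusion step for a single segment is given below; the weights are random stand-ins for the learned layers, and a ReLU stands in for the network's ELU activation.
      </p>
      <p>
```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Random stand-ins for learned weights, for illustration only.
W_meta = 0.1 * rng.standard_normal((7, 100))     # metadata FC layer
W_out = 0.1 * rng.standard_normal((612, 1500))   # final layer: 512 + 100 -> 1500

cnn_features = rng.random(512)   # flattened CNN output for one segment
metadata = rng.random(7)         # the 7-element metadata vector

meta_features = np.maximum(0.0, metadata @ W_meta)     # activation (ReLU here)
fused = np.concatenate([cnn_features, meta_features])  # 612-dim joint feature
probs = softmax(fused @ W_out)   # distribution over the 1500 species
```
      </p>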
    </sec>
    <sec id="sec-4">
      <title>Improvements</title>
      <p>Last year's competition showed that performance on soundscape recordings still had large room for improvement. The quality of the model has a great impact on the final result, but due to limited hardware resources and time, we did not concentrate on model training. Instead, we tried to find methods that make the best use of our current models. The methods we applied are introduced below.</p>
      <p>
        Bird event detection: False alarms on target species hurt the C-MAP metric, and introducing bird event detection [19] can reduce false alarms and improve the final performance. At first, we planned to train a neural network on the soundscape evaluation set, but with such limited labeled data its performance was not good enough for use. In the end, we directly used the method mentioned above [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ] to separate bird songs from noise. If a spectrogram is regarded as noise, no classification is performed on it.
      </p>
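      <p>
        A crude stand-in for this gating step is sketched below; the energy heuristic and the ratio are our assumptions, not the actual separation method of [4, 5, 6].
      </p>
      <p>
```python
import numpy as np

def is_noise(spec, ratio=3.0):
    """Flag a spectrogram as noise when no frequency band is markedly
    more energetic than the median band. The heuristic and the ratio
    are illustrative assumptions only."""
    band_energy = spec.mean(axis=1)          # mean energy per frequency bin
    peak = band_energy.max()
    return not (peak >= ratio * np.median(band_energy))

def classify_segment(spec, model):
    # Skip classification entirely for segments flagged as noise.
    if is_noise(spec):
        return None
    return model(spec)

rng = np.random.default_rng(2)
noise_spec = rng.random((256, 512))          # flat background noise
bird_spec = noise_spec.copy()
bird_spec[100:110, :] += 5.0                 # strong energy in one band

dummy_model = lambda s: np.zeros(1500)       # stand-in classifier
```
      </p>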
      <p>
        Masking and rescoring: For birds of a specific species, the frequency of their songs always falls within a certain range; outside this range, any other information, whether environmental sound or the songs of other species, can be considered noise. Inspired by this idea, we designed a mask for each species: we accumulated the spectrograms of a species along the frequency axis, normalized the result, and masked the bins with values under 0.6, which we consider a reasonably proper threshold. The masks for all classified species can be viewed as band-pass filters [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. For each 5-second segment, bird species are sorted by the probabilities output by the neural network; the top 3 or 5 species are selected, and the band-pass filter of each chosen species is applied to the spectrogram separately. The masked spectrograms are then rescored by the neural network. With this method, we can reduce interference and obtain a more accurate result with our current model. In our experiments, we rescored the top 3 retrievals. Fig. 1 illustrates the whole process in detail.
      </p>
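      <p>
        The masking-and-rescoring loop above can be sketched as follows, with a dummy model standing in for the trained network and tiny dimensions for clarity.
      </p>
      <p>
```python
import numpy as np

def species_mask(species_specs, threshold=0.6):
    """Band-pass mask for one species: accumulate its spectrograms along
    the frequency axis, normalize, and keep only the bins scoring at
    least the 0.6 threshold from the text."""
    profile = np.sum([s.sum(axis=1) for s in species_specs], axis=0)
    profile = profile / profile.max()
    return (profile >= threshold).astype(float)   # one weight per freq bin

def rescore(spec, scores, masks, model, top_k=3):
    """Re-run the model on the spectrogram filtered by each of the top-k
    candidates' band-pass masks, keeping each candidate's new score."""
    top = np.argsort(scores)[::-1][:top_k]
    new_scores = scores.copy()
    for sp in top:
        masked = spec * masks[sp][:, None]        # apply the species filter
        new_scores[sp] = model(masked)[sp]
    return new_scores

# Tiny worked example: 5 species, 8 frequency bins, 4 time frames.
spec = np.ones((8, 4))
scores = np.array([0.5, 0.4, 0.3, 0.2, 0.1])
masks = np.ones((5, 8))                           # pass-everything masks
dummy_model = lambda s: np.full(5, s.sum())       # stand-in network
new_scores = rescore(spec, scores, masks, dummy_model)
```
      </p>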
      <p>Considering context: We found that, most of the time, a bird song lasts for more than 5 seconds. For a 5-second segment in a soundscape, the final result is therefore strongly related to the results of the previous and next 5-second segments. This context information is considered in the monophone scenario through overlapping, but seldom in the soundscape scenario. Here, we simply add the outputs of the previous and next 5-second segments to the current output with a given weight, such as 0.2 or 0.3; we set this value to 0.3, which gave a relatively better result on the validation set. In this way, the context is taken into consideration during classification.</p>
      <p>We trained 4 models in total for our classification task. The data augmentation methods and the addition of metadata were introduced above. Besides the basic convolutional neural network, we also trained a ResNet to further improve the final fused results.</p>
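      <p>
        The context weighting described above, with the 0.3 weight from the text, can be sketched as:
      </p>
      <p>
```python
import numpy as np

def smooth_context(outputs, weight=0.3):
    """Add the previous and next 5-second segments' outputs to the
    current segment's output with the given weight (0.3 in our system)."""
    smoothed = []
    last = len(outputs) - 1
    for i, out in enumerate(outputs):
        mixed = out.copy()
        if i > 0:
            mixed += weight * outputs[i - 1]
        if i != last:
            mixed += weight * outputs[i + 1]
        smoothed.append(mixed)
    return smoothed

outputs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
smoothed = smooth_context(outputs)
# the middle segment becomes [0.0, 1.0] + 0.3*[1.0, 0.0] + 0.3*[1.0, 0.0]
```
      </p>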
      <p>1. ConvNet with data augmentation, without metadata;
2. ConvNet with metadata, without data augmentation;
3. ConvNet with data augmentation and metadata;
4. ResNet with data augmentation, without metadata.</p>
      <p>This year, a labeled soundscape evaluation set is given, so we are able to test our improvements on it. Model 3 is used to test the effect of our methods. From Table 1, we can see that both the masking-and-rescoring method and context consideration improve C-MAP.</p>
      <p>From Table 2, we find that as more systems are fused, the performance improves. As expected, the run5 system has the highest MAP without background species among our submissions.</p>
      <p>Soundscape scenario:</p>
      <p>DKU SMIIP run1: The output of model 3;
DKU SMIIP run2: Fusion of model 2 and 3;
DKU SMIIP run3: Fusion of model 1, 2 and 3;
DKU SMIIP run4: Fusion of model 1, 2, 3, 4.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and future work</title>
      <p>In this competition, there are two scenarios: monophone and soundscape. We trained convolutional neural network models on bird song spectrograms. Besides regular model training, we applied data augmentation to improve robustness and added metadata to further improve performance.</p>
      <p>Focusing on the soundscape scenario, we made several improvements based on our current models in the test period. Firstly, bird event detection was introduced to reduce false alarms. Secondly, masks were designed for each species, and the top 3 or 5 species of the sorted list were rescored after masking. Thirdly, context was considered by adding the outputs of the previous and next 5-second segments to the current output.</p>
      <p>The above methods still leave much room for improvement. Bird event detection [19] could be done with neural network models if enough labeled data were provided, and the band-pass filters for each species could be made more refined. In our work, context information is considered using a relatively simple method; during the evaluation, we found that this kind of information can clearly improve performance, so further investigation is needed in this direction.</p>
      <p>In addition, due to limited hardware resources and time, the performance of our basic models still has room for improvement. In future work, more model structures and fusion methods will be explored.</p>
      <p>12. Ioffe, S., &amp; Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
13. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., &amp; Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958.
14. Clevert, D. A., Unterthiner, T., &amp; Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289.
15. Kingma, D. P., &amp; Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
16. Tieleman, T., &amp; Hinton, G. (2012). Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 26-31.
17. Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
18. Hinton, G. E., &amp; Salakhutdinov, R. R. (2009). Replicated softmax: an undirected topic model. In: Advances in Neural Information Processing Systems (pp. 1607-1614).
19. Stowell, D., Wood, M., Stylianou, Y., &amp; Glotin, H. (2016). Bird detection in audio: a survey and a challenge. In: Machine Learning for Signal Processing (MLSP), 2016 IEEE 26th International Workshop on (pp. 1-6). IEEE.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Joly</surname>
          </string-name>
          , Alexis and Goeau, Herve and Botella, Christophe, Glotin, Herve and Bonnet, Pierre and Planque, Robert and Vellinga, Willem-Pier and Muller, Henning. (
          <year>2018</year>
          ).
          <source>Overview of LifeCLEF</source>
          <year>2018</year>
          :
          <article-title>a large-scale evaluation of species identification and recommendation algorithms in the era of AI</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Alexis and Goeau, Herve and Glotin, Herve and Spampinato, Concetto and Bonnet, Pierre and Vellinga, Willem-Pier and Lombardo, Jean-Christophe and Planque, Robert and Palazzo, Simone</article-title>
          and Muller, Henning. (
          <year>2017</year>
          ).
          <article-title>LifeCLEF 2017 lab overview: multimedia species identification challenges</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Goeau, Herve and Glotin, Herve and Planque, Robert and Vellinga,
          <string-name>
            <surname>Willem-Pier</surname>
          </string-name>
          , and
          <string-name>
            <surname>Stefan</surname>
            , Kahl, Joly,
            <given-names>Alexis.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Overview of BirdCLEF 2018: monophone vs. soundscape bird identification</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fazekas</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schindler</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lidy</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>A multi-modal deep neural network approach to bird-song identification</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sevilla</surname>
            , Antoine,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Bessonne</surname>
            , and
            <given-names>H. Glotin.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Audio bird classification with inception-v4 extended with time and time-frequency attention mechanisms</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kahl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilhelm-Stein</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hussein</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klinck</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kowerko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ritter</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Eibl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Large-scale bird sound classification using convolutional neural networks</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fritzler</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koitka</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C. M.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Recognizing bird species in audio files using transfer learning</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In: Proceedings of CVPR</source>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sprengel</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin Jaggi</surname>
            ,
            <given-names>Y. K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hofmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Audio based bird species identification using deep learning techniques</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lasseck</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Bird song classification in field recordings: winning solution for NIPS4B 2013 competition</article-title>
          .
          <source>In Proc. of int. symp. Neural Information Scaled for Bioacoustics</source>
          , sabiod.org/nips4b, joint to NIPS, Nevada (pp.
          <fpage>176</fpage>
          -
          <lpage>181</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Japkowicz</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Stephen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>The class imbalance problem: A systematic study</article-title>
          .
          <source>Intelligent data analysis</source>
          ,
          <volume>6</volume>
          (
          <issue>5</issue>
          ),
          <fpage>429</fpage>
          -
          <lpage>449</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>