<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bird sound classification using a bidirectional LSTM</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lukas Muller</string-name>
          <email>lukas@roraro.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Marti</string-name>
          <email>mario.marti@outlook.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ZHAW Zurich University of Applied Sciences, School of Engineering</institution>
          ,
          <addr-line>Winterthur</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>While RNNs are widely applied to many similar tasks, such as automatic speech recognition and transcription, no one has yet succeeded in using them for classification in the BirdCLEF challenge. Recent work at ZHAW on the topic of speaker clustering has yielded very promising results using a bidirectional LSTM. We set out to apply the LSTM network mentioned above to the BirdCLEF challenge. By doing so, we hope to make a valuable contribution to the ongoing research in this field. Unfortunately, our approach did not perform as well as we had hoped. In the following, we detail the approach taken and some issues a bachelor student with little experience in the field of machine learning will encounter when participating in such a challenging competition.</p>
      </abstract>
      <kwd-group>
        <kwd>LSTM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recognising birds by their song in the wild is a challenging task. One reason for this is that, according to recent research by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], there are approximately 18'043 different bird species. Since birds do not just vocalise in bird songs but also in various types of calls, such as alarm calls or contact calls [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], this leads to a huge classification problem.
      </p>
      <p>The Cross-Language Evaluation Forum (CLEF) hosts an annual challenge for bird classification. This BirdCLEF challenge [5] allows researchers from all over the world to test their approaches against the state of the art in bird voice recognition. The organisers have built a large dataset based on contributions to Xeno Canto (https://www.xeno-canto.org), a network for sharing bird songs from all over the world.</p>
      <p>The Datalab at Zurich University of Applied Sciences has had great success over the past years in applying recurrent neural networks, specifically LSTMs [4], to speaker recognition tasks such as speaker clustering.</p>
      <p>Since the BirdCLEF challenge provides an exciting opportunity to test those approaches on a big dataset, we set out to show that an LSTM can perform as well on the task as a CNN-based approach.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>We participated in the challenge in the context of our bachelor's thesis. As such, we did not have much experience in machine learning, let alone signal processing. For that reason, we decided to combine established preprocessing methods with the LSTM mentioned earlier. This way we would not need to build up the full knowledge required to build a different approach from scratch.</p>
      <sec id="sec-2-1">
        <title>Preprocessing</title>
        <p>When the baseline [7] [6] was released, we had already started to implement a
pipeline in Python 3.5, matching the software versions used in [10]. We integrated
and subsequently used the preprocessing and augmentation steps found in the
baseline [7] [6].</p>
        <p>Differences to the baseline: As the signal-to-noise threshold value, we used 1e-3, as that value was mentioned in the readme of [6]. We did not experimentally validate that threshold.</p>
        <p>The LSTM model was originally designed to work on spectrograms of 0.5 second long segments with a resolution of 50 pixels wide by 128 pixels high. With the intention of keeping the input similar, we changed the corresponding settings compared to the baseline.</p>
        <p>This results in an identical FFT window length of 1024 but increases the hop length from 173 to 450. Effectively, the baseline system used windows overlapped by 83.2%, while our variant shrinks this overlap to 56.0%.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Model</title>
        <p>The model we used for our participation in BirdCLEF was based on [10]. As mentioned above, we use inputs of 50 pixels width for segments that are 0.5 seconds long. Unlike the paper above, we use more hidden units in each bidirectional LSTM layer. This change was derived during one of our experiments.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Training</title>
        <p>Datasets: Before we started training, 5% of the provided training set was set aside as a validation set. We used stratified sampling to ensure that at least one recording for each class was present in the validation split.</p>
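        <p>The split described above can be sketched as follows (a minimal sketch using only the standard library; the file names, the helper name, and the exact splitting rule are our own illustration, while the 5% ratio and the at-least-one-per-class guarantee are from the text):</p>
        <preformat>
```python
import random
from collections import defaultdict

def stratified_split(recordings, val_fraction=0.05, seed=42):
    """Split (filename, species) pairs so that every species
    contributes at least one recording to the validation set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for filename, species in recordings:
        by_class[species].append(filename)

    train, val = [], []
    for species, files in by_class.items():
        rng.shuffle(files)
        # at least one recording per class goes to validation
        n_val = max(1, int(len(files) * val_fraction))
        val.extend((f, species) for f in files[:n_val])
        train.extend((f, species) for f in files[n_val:])
    return train, val

recs = [(f"rec{i}.wav", f"species{i % 3}") for i in range(60)]
train, val = stratified_split(recs)
print(len(train), len(val))  # 57 3
```
        </preformat>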
        <p>We then went on to generate spectrograms from these recordings, as described in 2.1.</p>
        <p>Caches: Since the dataset was stored on network storage, we encountered a bottleneck when loading the samples. To allow for more sequential reads, we first shuffled the generated dataset and then created caches containing 12800 samples and labels each. For this we made use of Python's pickle module.</p>
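        <p>The caching step might look roughly like this (a sketch; the file naming and function name are invented, and the cache size is shrunk here for illustration, whereas the real caches held 12800 pairs each):</p>
        <preformat>
```python
import pickle
import random

def write_caches(samples, labels, cache_size, prefix="cache"):
    """Shuffle the dataset once, then dump fixed-size chunks of
    (sample, label) pairs to disk for fast sequential reads."""
    pairs = list(zip(samples, labels))
    random.shuffle(pairs)
    paths = []
    for i in range(0, len(pairs), cache_size):
        chunk = pairs[i:i + cache_size]
        path = f"{prefix}_{i // cache_size}.pkl"
        with open(path, "wb") as f:
            pickle.dump(chunk, f)
        paths.append(path)
    return paths

paths = write_caches(list(range(10)), list("abcdefghij"), cache_size=4)
print(paths)  # three cache files holding 4, 4 and 2 pairs
```
        </preformat>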
        <p>During training, a producer-consumer pattern was applied, with multiple threads loading samples from the caches and augmenting them for training. Every time a cache was loaded, the contained sample and label pairs were shuffled to safeguard against the composition of batches repeating each epoch.</p>
        <p>Augmentation: For augmentation, we settled on adding noise samples to the training samples. As in [7], those are byproducts of the preprocessing steps. We also tested adding some Gaussian noise and applying vertical roll. Those augmentation steps did not improve the generalisation of our model, so we dismissed them. Based on the baseline, we did augmentation on the fly with a probability of 0.5.</p>
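        <p>The producer-consumer pattern described above can be sketched with Python's queue module (a simplified single-producer sketch; the real pipeline loaded pickled caches and applied audio augmentations, which the stand-in line below only mimics):</p>
        <preformat>
```python
import queue
import random
import threading

def producer(caches, batch_queue, augment_prob=0.5):
    """Shuffle each cache on load, optionally augment, enqueue pairs."""
    for cache in caches:
        pairs = list(cache)
        random.shuffle(pairs)  # reshuffle on every cache load
        for sample, label in pairs:
            if augment_prob > random.random():
                sample = sample + 0.01  # stand-in for adding a noise sample
            batch_queue.put((sample, label))
    batch_queue.put(None)  # sentinel: no more data

def consumer(batch_queue):
    """Drain the queue until the sentinel arrives."""
    seen = []
    while True:
        item = batch_queue.get()
        if item is None:
            break
        seen.append(item)
    return seen

caches = [[(float(i), i % 5) for i in range(10)],
          [(float(i), i % 5) for i in range(10, 20)]]
q = queue.Queue(maxsize=64)
t = threading.Thread(target=producer, args=(caches, q))
t.start()
pairs = consumer(q)
t.join()
print(len(pairs))  # 20
```
        </preformat>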
        <p>Training duration and batch size: While experimenting, we could not make out a difference in the performance of our model between training with batch sizes of 64, 128 or 256 samples per batch. The time it took to train, however, was significantly lower using a higher batch size. For this reason, we usually used a batch size of 256 samples.</p>
        <p>Time permitting, we trained for at most 100 epochs without applying early
stopping. Throughout this paper, we will refer to the time or number of batches
it takes until the model was trained on each sample once as one epoch. After
each epoch, we saved a snapshot of the trained model. Training one epoch took
anywhere from 30 minutes to a couple of hours.</p>
        <p>During training, Adam [8] was used as the optimiser with a learning rate of 1e-4.</p>
        <p>We evaluated the snapshots for each epoch on the provided soundscape validation set. Time permitting, we also calculated an MRR score on our 5% validation split.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>To find the settings used for our runs, we conducted multiple experiments. Out of these, we describe the two most relevant for our submission.</p>
      <p>Hidden units: Since the model was originally used on TIMIT [3], a relatively small dataset of studio recordings, its capacity was lacking for the task at hand.</p>
      <p>Our experiments showed that increasing the number of hidden units from 256 to 512 and 768 for each bidirectional LSTM layer led to a significantly better MRR score. On the soundscape validation set, only a minor increase in performance was seen. While more hidden units led to a better score, the performance gains diminished at higher numbers. At the same time, the training time increased. For this reason, we did not push past 768 hidden units.</p>
      <p>Augmentation: As [7] shows, not every form of augmentation is necessarily good for a model. For this reason, we conducted an experiment to show how augmentation through noise samples, adding Gaussian noise and applying vertical roll influences the model's performance. To do so, we trained one model using all of these methods and one using none of them as baselines. For each of the augmentation methods, we trained one model where that method was not used while the other two were. Due to time limitations, we were not able to test all permutations of these augmentation methods.</p>
      <p>Since the models that were trained either without adding Gaussian noise or without applying vertical roll performed better than the baseline using all augmentation methods, we decided not to use those two methods for our submission.</p>
    </sec>
    <sec id="sec-4">
      <title>Submitted Runs</title>
      <p>We submitted four runs, one of which was an ensemble of two other runs. For both the soundscape task and the monophone task, we reported the 100 classes that were predicted with the highest confidence. For MRR, wrong predictions ranked after the correct class do not affect the score. For the soundscape task, we expected that there would be a sweet spot in reporting fewer classes for each segment. However, experiments on the soundscape validation set showed that the score increased with the number of classes reported.</p>
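      <p>The mean reciprocal rank (MRR) metric indeed behaves as described above: only the rank of the correct class matters, and whatever follows it is ignored. A minimal sketch:</p>
      <preformat>
```python
def mean_reciprocal_rank(ranked_predictions, true_labels):
    """ranked_predictions: per sample, class ids ordered by confidence.
    A sample whose true class is missing from the list scores 0."""
    total = 0.0
    for ranked, truth in zip(ranked_predictions, true_labels):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(true_labels)

preds = [["a", "b", "c"],   # correct class first  -> contributes 1
         ["b", "a", "c"],   # correct class second -> contributes 1/2
         ["b", "c", "a"]]   # correct class third  -> contributes 1/3
print(mean_reciprocal_rank(preds, ["a", "a", "a"]))  # (1 + 1/2 + 1/3) / 3
```
      </preformat>
      <p>Swapping the entries that come after the correct class leaves the score unchanged, which is why reporting up to 100 classes cannot hurt the MRR.</p>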
      <p>In the following, we present the settings used to train the models for our submissions. The number of hidden units refers to the hidden units in both bidirectional LSTM layers, while dropout is the setting we used for the first dropout layer. The second dropout layer was set to use half as much dropout as the first. The augmentation probability is the probability determining whether any augmentation would be done. The Gaussian noise parameter gives the maximum intensity of Gaussian noise that will be added to the sample. Lastly, we controlled how much downsampling was done by setting an upper limit on the number of samples per class.</p>
      <p>Run 1: The model used for Run 1 was trained for 94 epochs on an old dataset which initially had a faulty separation of the training and validation splits, letting it train on audio files that were also used in validation. For the submission, the model was trained for an additional 34 epochs on a new, corrected dataset which did not have that issue.</p>
      <p>Run 2: This run was trained for 34 epochs on a dataset which was downsampled to 1000 samples per class instead of the 1500 used in the experiments and other runs. It was primarily intended as a test run to evaluate whether a new update to our pipeline would work. Due to its performance and the lack of better models, we submitted it as our second run. Some settings for this run were shown in our experiments to be beneficial (hidden units, learning rate); the influence of others (increased augmentation probability) was not experimentally determined.</p>
      <p>Run 3: This run was an ensemble of runs 1 and 2. The ensemble was generated by a Python script that averages the values reported in both runs. We did not use a weighted average.</p>
      <p>Run 4: For run 4, the newly generated dataset was used. The selection of the parameters was based on the results of the experiments described earlier in this thesis.</p>
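      <p>The unweighted average used for the ensemble run could look like this (a sketch; the dictionary layout and the species names are our own illustration, while the real submission files map classes to confidences per segment):</p>
      <preformat>
```python
def ensemble_average(run_a, run_b):
    """Unweighted average of two prediction dicts {class_id: confidence}.
    Classes missing from one run count as confidence 0 there."""
    classes = set(run_a) | set(run_b)
    return {c: (run_a.get(c, 0.0) + run_b.get(c, 0.0)) / 2 for c in classes}

run1 = {"turdus_merula": 0.8, "parus_major": 0.1}
run2 = {"turdus_merula": 0.6, "erithacus_rubecula": 0.2}
print(ensemble_average(run1, run2))
```
      </preformat>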
      <p>Since the results of the first three submissions were lower than expected, we reworked the process used for making predictions. As we used all spectrograms for a given monophone sample or soundscape segment, it is likely that some of them did not contain a usable signal. For these spectrograms, the results would be close to random. Once the predictions for the segment or monophone sample are pooled, these random predictions would inevitably pull down correct predictions.</p>
      <p>As we used a signal-to-noise threshold of 0.001 to decide whether to train on any given spectrogram, we tested using the same threshold to filter out unsuitable spectrograms during prediction.</p>
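      <p>The filtering step might look roughly like this. Note that <monospace>signal_ratio</monospace> below is a hypothetical stand-in for the baseline's signal-to-noise measure, which we do not reproduce here; only the thresholding and pooling logic reflects the text:</p>
      <preformat>
```python
import statistics

def signal_ratio(spectrogram):
    """Hypothetical stand-in for a signal-to-noise measure: the
    fraction of values well above the spectrogram's median."""
    flat = [v for row in spectrogram for v in row]
    cutoff = statistics.median(flat) + statistics.pstdev(flat)
    return sum(1 for v in flat if v > cutoff) / len(flat)

def predict_segment(spectrograms, model_predict, snr_threshold=1e-3):
    """Pool class scores only over spectrograms passing the threshold."""
    usable = [s for s in spectrograms if signal_ratio(s) > snr_threshold]
    if not usable:
        usable = spectrograms  # fall back rather than predict nothing
    scores = [model_predict(s) for s in usable]
    n = len(scores)
    return [sum(col) / n for col in zip(*scores)]
```
      </preformat>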
      <p>Briefly validating this hypothesis on the soundscape validation set confirmed that the scores could be improved this way. Unfortunately, we only gained an additional 0.02 in score on the validation set on average.</p>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>The results we achieved on both subtasks were underwhelming. Compared to other submissions, our approach consistently performed the worst. For the monophone task, we can see that the increased capacity of the network and the changes made for run 4 did have a positive effect on the results. On the soundscapes, that does not seem to be the case.</p>
    </sec>
    <sec id="sec-6">
      <title>Challenges we faced</title>
      <p>As mentioned before, we participated in BirdCLEF in the context of our bachelor's thesis. It was the first scientific challenge we took part in. Being bachelor students, we were fairly inexperienced in the field of artificial intelligence. In the previous semester, we had attended lectures on the subject, but we had little practical experience and were interested in gathering some by doing a bigger project. The BirdCLEF challenge was an ideal candidate for this, as recognising bird calls is an interesting topic with room for improvement.</p>
      <p>While we were aware at the beginning that it is a challenging task, we overestimated our abilities and ultimately did not achieve the results we strived for. Handling such a large dataset brought additional, unexpected difficulties. In the following section, we briefly describe the experience of participating in the challenge and pick out some lessons we have learned.</p>
      <sec id="sec-6-1">
        <title>Deadline, pressure and code quality</title>
        <p>While we had experienced stressful times and worked towards deadlines both during our studies and in our careers, we were surprised by the amount of pressure we felt close to the deadline of this challenge. This pressure arose partly because we had high expectations of ourselves and partly because at times we hit so many problems that it looked like we would not even be able to submit any runs to the challenge.</p>
        <p>We realised that the closer the deadline came, the more mistakes we began to make. Since the turnaround time for preprocessing a dataset was long, even small mistakes turned out to be costly. To get back on track and bring our code quality back to an acceptable level, we started to either do code reviews or make use of pair programming. Even though it sometimes seemed to take a lot of time, it was ultimately worth it since, as mentioned above, even small mistakes cost more time.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Time management</title>
        <p>Initially, we planned to use most of our time to experiment with different approaches. In reality, an unexpectedly large part of our time went into debugging numerous steps, from the generation of spectrograms to the output of training and validation values by the network.</p>
        <p>We were a bit naive and assumed that we could debug the pipeline by training
a model and analysing what did not work as expected. Unfortunately, we had
made some mistakes that led to a bad performance of the model, which we could
not debug this way.</p>
        <p>An example of such an error was what we dubbed the "reshape bug". Initially, we used a tried-and-true CNN to verify the pipeline was working. Once satisfied it worked correctly, we replaced the CNN with the LSTM mentioned before. Compared to the CNN, the LSTM expected the input axes to be swapped. (Figure: (a) the original input; (b) the input after the reshape operation.)</p>
        <p>Instead of swapping the axes, we mistakenly used numpy.reshape, an operation that rearranges the input to fit the desired shape. The resulting spectrograms did not resemble the initial data at all. With the new input, the network still learned some features but was useless overall. In the figure, you can see the original image and the output after the reshape operation.</p>
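        <p>The difference is easy to demonstrate on a tiny array: numpy.reshape reflows the values row by row into the new shape, while swapping the axes preserves every time-frequency pair. A small sketch of the bug:</p>
        <preformat>
```python
import numpy as np

spec = np.arange(6).reshape(2, 3)  # pretend shape: (freq=2, time=3)

swapped = np.swapaxes(spec, 0, 1)  # what the LSTM input needed
wrong = spec.reshape(3, 2)         # the "reshape bug"

print(spec)
print(swapped)  # columns become rows: [[0 3] [1 4] [2 5]]
print(wrong)    # values reflowed row-major: [[0 1] [2 3] [4 5]]
```
        </preformat>
        <p>Both results have the same shape, which is exactly why the bug passed unnoticed: downstream code saw valid dimensions, just scrambled content.</p>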
        <p>Ultimately, we used a small dataset to debug our whole pipeline and find this error. In retrospect, it would have been beneficial to carry out such a test immediately after the pipeline's completion and after introducing the input transformation for the LSTM. Because training a model and preprocessing the dataset take a very long time, we lost much time by trying to find the error while training the model on a complete dataset. Even worse, we conducted some experiments while the pipeline was still broken. These experiments were rendered completely unusable, which we only realised after they had been carried out. Repeating the experiments cost time again. Doing a thorough test of the pipeline is particularly important with such large datasets and turnaround times of several days for a complete training run.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Machine learning basics</title>
        <p>Validation: This is a rather basic lesson that we have learnt: it is important to have meaningful measurements for validation. Since a validation set for the soundscape task was provided, we were a bit lazy in fixing the issues outlined below. In hindsight, we should have given fixing them a higher priority.</p>
        <p>At first, our split between validation and test samples divided all generated spectrograms without taking into account that spectrograms from the same recording would end up in both splits. Needless to say, our models performed very well on those validation samples.</p>
        <p>We later fixed that by first splitting the recordings and then generating the spectrograms. However, for the validation split, we kept all spectrograms, regardless of whether they contained only noise or indeed a bird. Since no model can predict a bird from background noise alone, unless there was a lot of overfitting on a very similar training sample, this resulted in a situation where we would get quite a low score (approx. 22% top-1 accuracy) on the validation samples. As we did not know how many of the generated spectrograms even contained a usable signal, this was not very meaningful.</p>
        <p>The best way to avoid these problems is to use validation methods that closely mirror the measurements used for the submissions. Unfortunately, we did not have enough time to implement such a validation function in Keras for use during training. Instead, we loaded the snapshots of the model and used a script to make predictions on the validation split and subsequently calculate an MRR score.</p>
        <p>To validate the model's performance for the soundscape task, we relied heavily on the provided soundscape validation set and the official script to calculate c-mAP scores. As it turned out, this validation set was not representative of the soundscape test set, which led to a marked difference between the score on this set and the scores achieved in the competition.</p>
        <p>All in all, we have learned to use measurements that are indicative of the
performance in the real task and to ensure that the validation set is representative
of the test set used in the challenge.</p>
        <p>Experiments: When conducting experiments, it is important that they are repeatable and well documented. Again, this is nothing new, and we were told early on that this is important.</p>
        <p>We made sure that for each experiment we have a configuration or a script which sets it up. That ensures repeatability up to a point. However, we realised over the course of this challenge that we had made use of Python's keyword arguments with default values in various places. This, in turn, led us to omit some "default" parameters in the scripts. Later on, we changed some default values in the method signatures. Hence, the previous configurations were rendered obsolete, and it became hard to be sure we had used the correct settings. For that reason, we had to repeat some experiments.</p>
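        <p>The pitfall is that a saved configuration only pins down the arguments it mentions; anything left to a default silently changes when the default in the signature changes. A toy illustration (the function and parameter names are invented for this sketch):</p>
        <preformat>
```python
def make_spectrograms(recordings, hop_length=173):
    """Toy stand-in for a pipeline step with a defaulted parameter."""
    return {"recordings": recordings, "hop_length": hop_length}

# An experiment "config" that omits hop_length relies on the default:
config = {"recordings": ["a.wav", "b.wav"]}
first_run = make_spectrograms(**config)  # hop_length == 173

# Later the default in the signature is changed; rerunning the very
# same config now silently produces a different dataset:
def make_spectrograms(recordings, hop_length=450):
    return {"recordings": recordings, "hop_length": hop_length}

rerun = make_spectrograms(**config)      # hop_length == 450
print(first_run["hop_length"], rerun["hop_length"])  # 173 450
```
        </preformat>
        <p>Writing every parameter explicitly into the configuration, even when it matches the current default, avoids this ambiguity.</p>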
        <p>While using configuration files or scripts to get reproducible experiments is great when the above pitfall is avoided, we have learned that it is best to also document the reasoning behind the experiment somewhere. Additionally, a consistent naming scheme for artefacts such as snapshots of the trained model, TensorBoard logs, graphs or other analyses of the data can help to keep some order in the experiments. Effectively, all time spent on documenting ongoing experiments will save time when writing up the results in a paper or thesis.</p>
        <p>For long-running experiments, it is also advisable to form a robust hypothesis and do some research to validate it before running an experiment. This will ensure that the precious time it takes to run the experiments is well spent.</p>
        <p>Dataset size: The training set for 2018 contained 113 gigabytes worth of audio recordings. The resulting training sets contained over 1,200,000 spectrograms, which were saved in the PNG format. Additionally, some spectrograms were saved to be used for augmentation. Converting all recordings into a ready-to-train dataset took upwards of 12 hours.</p>
        <p>Since the turnaround time for generating a dataset is so long, it is well worth creating a smaller dataset to test the generation process. Since it is not necessary to do any training on that dataset, this can be done by merely choosing recordings from a subset of the 1500 species. Ideally, the quality of the recordings should range from very noisy to very clean.</p>
        <p>Such a dataset can be used to quickly see the effect of different settings for the transformations, gain a better understanding of the influence of different signal detection methods, or perform sanity checks as described above. Due care needs to be taken to ensure no hyper-parameters are tuned on such a set, as it is not representative of the whole dataset.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Future Work</title>
      <sec id="sec-7-1">
        <title>Using longer segments for training</title>
        <p>One experiment showed that using spectrograms which represented only 0.2 seconds of audio performed significantly worse than using the 0.5 seconds used for our submissions. Additionally, participants in the past seem to have used longer segments (e.g. lengths of 3 seconds were used by [9]).</p>
        <p>Based on this information, the hypothesis that longer samples improve the performance of the model can be formed and should be experimentally tested.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Preprocessing</title>
        <p>For this thesis, the preprocessing provided by the baseline system [7] was used. These steps were all originally used in combination with a CNN and might not be suitable for use with an RNN. It could be beneficial to evaluate whether there are RNN-specific preprocessing steps that help the network better understand the data provided.</p>
      </sec>
      <sec id="sec-7-3">
        <title>Use of variable input lengths for the BLSTM</title>
        <p>At the moment, the model is limited to a fixed input length. In principle, an LSTM supports variable input lengths. Using variable input lengths could allow for more complex and entirely different preprocessing methods. On the other hand, the whole training sample could be fed to the network, hoping that it will learn to discern noise from signal by itself.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Conclusion</title>
      <p>Due to multiple technical issues and the limited time frame, we were not able
to show that an LSTM can perform as well as a state of the art CNN in the
BirdCLEF challenge.</p>
      <p>While the results show that the network did indeed learn something usable, there certainly is room for improvement. We assume that there is a significant issue either within the preprocessing or the hyper-parameters used. Unfortunately, we were not able to determine what the issue is, partly because time lost to technical issues kept us from running sufficiently extensive and systematic experiments. For this reason, we cannot conclusively say whether or not an LSTM-based approach can perform as well as a CNN on this task.</p>
      <p>However, the application of an LSTM to the complex task posed by BirdCLEF warrants further research, as LSTMs have been applied with great success in similar domains.</p>
      <p>While we did not achieve the results we had hoped for, participating in BirdCLEF was interesting and a tremendous opportunity to gain practical experience with artificial intelligence. We certainly learned a lot.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgements</title>
      <p>We would like to thank all the people that supported us through this journey. Namely, we want to thank Thilo Stadelmann, Martin Braschler and Mohammadreza Amirian from the ZHAW Datalab for their great support and supervision of our project. Additionally, we would like to thank Niclas Simmler and Amin Trabi for discussions and inputs. Also, we want to thank Stefan Kahl for answering questions regarding their 2017 submission. Finally, we want to thank the organisers for organising such a great challenge and providing us with all the necessary information for the task.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>Types Of Bird Songs And Calls, https://www.britishbirdlovers.co.uk/identifyingbirds/types-of-bird-songs-and-calls</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>Barrowclough, G.F., Cracraft, J., Klicka, J., Zink, R.M.: How Many Kinds of Birds Are There and Why Does It Matter? PLOS ONE 11(11), e0166307 (Nov 2016). https://doi.org/10.1371/journal.pone.0166307</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>Fisher, W., Doddington, G., Goudie-Marshall, K.: The DARPA speech recognition research database: Specifications and status. In: Proc. DARPA Workshop on Speech Recognition. pp. 93-99 (1986)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Comput. 9(8), 1735-1780 (Nov 1997). https://doi.org/10.1162/neco.1997.9.8.1735</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>Joly, A., Goeau, H., Botella, C., Glotin, H., Bonnet, P., Planque, R., Vellinga, W.P., Muller, H.: Overview of LifeCLEF 2018: a large-scale evaluation of species identification and recommendation algorithms in the era of AI. In: Proceedings of CLEF 2018 (2018)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>Kahl, S., Wilhelm-Stein, T., Hussein, H., Klinck, H., Kowerko, D., Ritter, M., Eibl, M.: [GIT] Source code of the TUCMI submission to BirdCLEF2017 (Jan 2018), https://github.com/kahst/BirdCLEF2017</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>Kahl, S., Wilhelm-Stein, T., Klinck, H., Kowerko, D., Eibl, M.: Recognizing Birds from Sound - The 2018 BirdCLEF Baseline System. arXiv:1804.07177 [cs] (Apr 2018), http://arxiv.org/abs/1804.07177</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs] (Dec 2014), http://arxiv.org/abs/1412.6980</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>Sprengel, E., Jaggi, M., Kilcher, Y., Hofmann, T.: Audio Based Bird Species Identification using Deep Learning Techniques. In: CLEF (Working Notes). pp. 547-559 (2016)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>Stadelmann, T., Glinski-Haefeli, S., Gerber, P., Duerr, O.: Capturing Suprasegmental Features of a Voice with RNNs for Improved Speaker Clustering. In: Proceedings of the 8th IAPR TC 3 Workshop on Artificial Neural Networks for Pattern Recognition (ANNPR'18)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>