<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Deep Neural Network Approach to the LifeCLEF 2014 Bird task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hendrik Vincent Koops</string-name>
          <email>h.v.koops@uu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan van Balen</string-name>
          <email>j.m.h.vanbalen@uu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frans Wiering</string-name>
          <email>f.wiering@uu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Utrecht University Department of Information and Computing Sciences</institution>
        </aff>
      </contrib-group>
      <fpage>634</fpage>
      <lpage>642</lpage>
      <abstract>
        <p>This paper describes the methods used in our submission to the LifeCLEF 2014 Bird task. A segmentation algorithm is created that is capable of segmenting the audio files of the Bird task dataset. These segments are used to select relevant Mel-Frequency Cepstral Coefficient (MFCC) frames from the MFCC dataset. Three datasets are created: 48, containing only the mean MFCCs per segment; 96, containing the mean and variance of the MFCCs in a segment; and 240, containing the mean, variance and the means of three subsections of each segment. These datasets are shuffled and split into a test and a train set to train Deep Neural Networks with several topologies, which classify the segments of the datasets. It was found that the best network correctly classified 73% of the segments. The results of a run from our system placed us 6th among the 10 participating teams. In follow-up research it is found that shuffling the data before splitting introduces overfitting, which can be reduced by not shuffling the datasets prior to splitting and by using dropout networks.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Neural Networks</kwd>
        <kwd>Feature Learning</kwd>
        <kwd>Birdsong Recognition</kwd>
        <kwd>Bioacoustics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The LifeCLEF 2014 Bird task is a bird identification task based on different types of audio recordings of 501 species from South America, centred on Brazil. This paper describes the methods used to create the "Utrecht Univ." run that was submitted to the task.</p>
      <p>Features are predominantly handcrafted in audio information retrieval research. Handcrafting features relies on a significant amount of domain and engineering knowledge to translate insights into algorithmic methods. Building features on heuristics makes good feature extraction a difficult problem, as it hinges on the assumption that the feature designer knows what a good representation of a signal must be to solve a problem. Feature design is thus constrained by what a designer can conceive and comprehend. Furthermore, manual optimisation of handcrafted features is a slow and costly process.</p>
      <p>A research area that tries to solve some of the aforementioned problems in feature design is deep learning, which uses deep architectures to learn feature hierarchies. The features higher up in deep hierarchies are formed by the composition of features on lower levels. These multi-level representations allow a deep architecture to learn the complex functions that map input (such as digital audio) to output (e.g. classes), without depending on manually handcrafted features.</p>
      <p>In our submission, we used deep neural networks (neural networks with several hidden layers) to facilitate hierarchical feature learning and classification of birdsongs. Our system roughly consists of three steps:</p>
      <list list-type="order">
        <list-item><p>Automatic segmentation of noisy bird sounds</p></list-item>
        <list-item><p>Extracting relevant Mel-Frequency Cepstral Coefficient (MFCC) features</p></list-item>
        <list-item><p>Feature learning and classification with Deep Neural Networks</p></list-item>
      </list>
      <p>The following sections will describe these steps.</p>
    </sec>
    <sec id="sec-2">
      <title>Automatic Segmentation of Noisy Bird Sounds</title>
      <p>A dataset containing audio data normalised to 44.1 kHz, .wav format (16 bits), is provided for the Bird task. In addition to the audio recordings, a dataset containing Mel-Frequency Cepstral Coefficients (MFCC) is provided. This MFCC dataset contains comma-separated value files corresponding to the audio files, with 48 coefficients per line, computed on windows of 11.6 ms taken every 3.9 ms. The 48 coefficients are the first 16 MFCCs and their speed and acceleration.</p>
      <p>To facilitate the learning of features on only the relevant birdsong parts of the audio files, a segmentation algorithm is created that is based on the assumption that the loudest parts of an audio recording are the most relevant. The algorithm consists of three major sections:</p>
    </sec>
    <sec id="sec-4">
      <list list-type="order">
        <list-item><p>Decimating and dynamic filtering of the input signal</p></list-item>
        <list-item><p>Detection of segments</p></list-item>
        <list-item><p>Clustering of segments into higher temporal structures</p></list-item>
      </list>
      <sec id="sec-4-1">
        <title>Decimating and Dynamic Filtering</title>
        <p>The elements of the BirdCLEF2014 dataset are normalised to a bit depth of 16 bits and a sample rate of 44100 Hz. This input signal is downsampled by a factor of 4, resulting in a signal with a sample rate of 11025 Hz. Decimation of a signal is common practice in speech recognition, as it reduces the amount of information by removing the top part of the spectrum, which we know cannot hold the most important information. The spectral energy of song birds is typically concentrated in a very narrow area in the range of 1 to 6 kHz [<xref ref-type="bibr" rid="ref2">2</xref>]. By decimating the input signal by a factor of 4, we reduce the highest representable frequency to 5512.5 Hz. Although some bird song overtone information could exist beyond 5 kHz, this is never the loudest frequency.</p>
        <p>After decimation, the signal is passed through a high-pass filter F1 with a passband frequency of 1 kHz, to filter out unwanted low-frequency noise such as continuous noise from recording equipment, low-frequency animal sounds and mechanical sounds. This filter is a 10th-order high-pass filter with a passband frequency of 1 kHz and a stopband attenuation of 80 dB. From the resulting signal, the fundamental frequency is estimated by finding the maximum value of the spectrogram. This value, f0, is then used in an adaptive filter.</p>
        <p>The signal is filtered by another high-pass filter that adapts its passband frequency to 0.6 · f0. An adaptive filter such as this can account for loud sounds that occur below the bird sound in the spectrogram. This filter is also a 10th-order high-pass filter, with a passband frequency of 0.6 · f0 Hz and a stopband attenuation of 80 dB.</p>
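        <p>As an illustration, a minimal sketch of this preprocessing stage is given below. It assumes scipy is available and uses a Chebyshev type II design for the 10th-order, 80 dB high-pass filters; these design choices are assumptions made for the example rather than a description of our exact implementation.</p>
        <preformat><![CDATA[
# Sketch of the decimation and dynamic high-pass filtering stage (assumed
# Chebyshev type II designs for the 10th-order, 80 dB high-pass filters).
import numpy as np
from scipy.signal import decimate, cheby2, sosfilt, spectrogram

def dynamic_filter(signal, fs=44100):
    # 1. Decimate by a factor 4: 44100 Hz -> 11025 Hz.
    x = decimate(signal, 4)
    fs = fs // 4

    # 2. Fixed 10th-order high-pass filter (80 dB stopband) with a 1 kHz edge.
    sos = cheby2(10, 80, 1000, btype='highpass', fs=fs, output='sos')
    x = sosfilt(sos, x)

    # 3. Estimate f0 as the frequency of the spectrogram maximum.
    f, t, S = spectrogram(x, fs=fs)
    f0 = f[np.unravel_index(np.argmax(S), S.shape)[0]]

    # 4. Adaptive 10th-order high-pass filter with its edge at 0.6 * f0.
    sos = cheby2(10, 80, 0.6 * f0, btype='highpass', fs=fs, output='sos')
    return sosfilt(sos, x), fs
]]></preformat>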
      </sec>
      <sec id="sec-4-2">
        <title>Detection of Segments</title>
        <p>This part of the algorithm uses an energy-based method somewhat similar to [<xref ref-type="bibr" rid="ref3">3</xref>] to detect the loudest elements in the spectrogram of the decimated and filtered input signal. In the spectrogram |S(f, t)| of a signal, the position (fn, tn) of the maximum value is found. At this position, the frequency parameter ωn(0) = fn and the amplitude value an(0) = 20 log10 |S(fn, tn)| [dB] are stored. From this peak, the trajectory of S(f, t) is traced to the left and to the right, for t &lt; tn and t &gt; tn, until an(t - tn) &lt; an(0) - T dB. From a visual inspection of a subset of the BirdCLEF2014 dataset, a stopping criterion T of 17 dB for marking the boundary of a section was found to create the best positions. The obtained frequency and amplitude trajectories corresponding to the nth annotated part are stored in the functions ωn(τ) and an(τ), where τ = tn - ts, . . . , tn + te. By removing the annotated part from starting position ts until end position te, S(f, [ts, ts + 1, . . . , te]), from the spectrogram, reanalysis of the same area is prevented. These steps are repeated until no maximum value above the stopping criterion is found in the spectrogram, resulting in N annotated segments with corresponding start and end time values.</p>
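        <p>A compact sketch of this detection loop is shown below. It assumes the spectrogram is available as a two-dimensional array in dB, and it stops once the loudest remaining peak has dropped more than T dB below the loudest peak of the recording; this stopping rule and the variable names are illustrative assumptions rather than a description of our exact implementation.</p>
        <preformat><![CDATA[
import numpy as np

def detect_segments(S_db, T=17.0):
    """Iteratively extract the loudest segments from a dB spectrogram.

    S_db: 2-D array (frequency bins x time frames) in dB.
    T:    stopping criterion in dB (17 dB in our experiments).
    Returns a list of (t_start, t_end) frame indices.
    """
    S = S_db.copy()
    floor = S.min() - 1.0            # value used to "erase" analysed frames
    global_max = S.max()
    segments = []
    while True:
        f0, t0 = np.unravel_index(np.argmax(S), S.shape)
        a0 = S[f0, t0]
        # Assumed stopping rule: quit once the loudest remaining peak has
        # dropped more than T dB below the loudest peak of the recording.
        if a0 <= floor or a0 < global_max - T:
            break
        # Trace to the right and to the left until the per-frame maximum
        # falls more than T dB below the peak amplitude a0.
        te = t0
        while te + 1 < S.shape[1] and S[:, te + 1].max() >= a0 - T:
            te += 1
        ts = t0
        while ts - 1 >= 0 and S[:, ts - 1].max() >= a0 - T:
            ts -= 1
        segments.append((ts, te))
        S[:, ts:te + 1] = floor      # prevent reanalysis of this region
    return segments
]]></preformat>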
      </sec>
      <sec id="sec-4-3">
        <title>Clustering of Segments into Higher Temporal Structures</title>
        <p>A problem with the aforementioned segmentation algorithm, especially in recordings with little background noise, is that it detects many small areas of only a few milliseconds in length. Bird songs are better described at a higher temporal level, which is richer in information. The smaller sections are therefore merged into larger chunks, representing larger parts of the bird song.</p>
        <p>After experimenting with several clustering techniques, it was found that simple gap-wise merging of smaller sections was the most successful in creating meaningful larger sections. Larger chunks are created by analysing the silences between sections and merging subsequent sections separated by silences shorter than m milliseconds into one annotated section. Because the segmentation algorithm in the previous step finds segments in order of loudness rather than in temporal order, the segments first need to be sorted by start time. In this sorted list of n segments Sn, each gap between Si and Si+1 is analysed. If the difference between the beginning of Si+1 and the end of Si is smaller than m milliseconds, the two are merged (Si = Si + Si+1) and each subsequent section is shifted one position down in the sorted list, i.e. Si+1 = Si+2, and so on. From an evaluation experiment in which handcrafted segments were compared to automatically generated segments, m = 800 milliseconds was found to create good segments.</p>
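        <p>A minimal sketch of this merging step, assuming segments are given as (start, end) pairs in milliseconds, is shown below.</p>
        <preformat><![CDATA[
# Gap-wise merging of segments; m = 800 ms as found in our evaluation.
def merge_segments(segments, m=800):
    """Merge segments whose silent gap is shorter than m milliseconds."""
    merged = []
    for start, end in sorted(segments):               # sort by start time
        if merged and start - merged[-1][1] < m:
            merged[-1][1] = max(merged[-1][1], end)   # extend previous segment
        else:
            merged.append([start, end])
    return [tuple(s) for s in merged]

# Example: two calls separated by a 200 ms silence collapse into one segment.
print(merge_segments([(0, 300), (500, 900), (2500, 3000)]))
# -> [(0, 900), (2500, 3000)]
]]></preformat>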
        <p>The start and end boundaries of the segments of each file are stored in comma-separated value (csv) files, with each line representing one segment. These annotations will be used to select the relevant parts of the files from the BirdCLEF2014 dataset.</p>
        <sec id="sec-4-3-1">
          <title>Extracting Relevant Mel-Frequency Cepstral</title>
        </sec>
        <sec id="sec-4-3-2">
          <title>Coe cients (MFCC) Features</title>
          <p>The start and end boundaries of the segments created by our algorithm are used to select MFCC features from the MFCC dataset that was included in the BirdCLEF2014 training set. Using the calculated segments, three datasets are created:</p>
          <list list-type="bullet">
            <list-item><p>48: only the mean of the MFCCs of a segment is calculated</p></list-item>
            <list-item><p>96: the mean and variance of the MFCCs of a segment are calculated</p></list-item>
            <list-item><p>240: the mean, variance and the mean of each of three equal parts of the segment are calculated (the 48 coefficients per frame already include the speed and acceleration of the MFCCs)</p></list-item>
          </list>
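          <p>A sketch of how the three feature vectors can be assembled for one segment is shown below; it assumes the MFCC frames of a file are available as an n x 48 array and that the segment boundaries have already been converted to frame indices, with each segment spanning at least three frames.</p>
          <preformat><![CDATA[
import numpy as np

def segment_features(frames, start, end):
    """Build the 48-, 96- and 240-dimensional feature vectors of a segment.

    frames: (n_frames x 48) array of MFCC features for one audio file.
    start, end: segment boundaries expressed as frame indices.
    """
    seg = frames[start:end]                        # MFCC frames of the segment
    mean = seg.mean(axis=0)                        # 48 values
    var = seg.var(axis=0)                          # 48 values
    thirds = np.array_split(seg, 3)                # three equal parts
    part_means = [p.mean(axis=0) for p in thirds]  # 3 x 48 values

    feat48 = mean
    feat96 = np.concatenate([mean, var])
    feat240 = np.concatenate([mean, var] + part_means)
    return feat48, feat96, feat240
]]></preformat>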
        </sec>
        <sec id="sec-4-3-3">
          <title>Feature Learning and Classi cation With Deep Neural</title>
        </sec>
        <sec id="sec-4-3-4">
          <title>Networks</title>
          <p>Artificial Neural Networks (ANNs) are a class of machine learning techniques that take inspiration from the physical structure and workings of the animal brain. Recent developments in neural network research unveiled ways to train neural networks with more than one hidden layer, which gave rise to the research area now known as deep learning [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
          <p>Deep neural networks with several topologies are used in the feature learning and classification experiments. All networks are implemented in Python, a widely used general-purpose, high-level programming language, using the numerical computation library Theano [<xref ref-type="bibr" rid="ref4">4</xref>]. Training and testing errors of these topologies can be found in Table 2.</p>
          <p>[Figure 1: Schematic of a deep neural network with an input layer, three hidden layers and a classification layer, connected by weight matrices w1, w2, w3 and w4.]</p>
          <p>The neural networks take as input the elements from the 48, 96 and 240 datasets, and are trained to classify the segments as one of the 501 species. Figure 1 shows an abstract version of a three-hidden-layer Deep Artificial Neural Network used in the experiments. The implementations in this experiment use mini-batch optimisation, in which the updates of the parameters of the network are based on the summed gradients measured on a rotating subset of the complete training set. To achieve mini-batch optimisation, a selection mechanism is implemented that randomly selects a subset of size n from the dataset, where n is a parameter of the main call to the network. During early testing and implementation, it was found that a batch size of 250 examples returned favourable results. In all of the experiments in this research, a batch size of 250 is used, which means that at every iteration of the training phase, 250 randomly selected training examples are passed through the network, from which a gradient and parameter updates are computed.</p>
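          <p>The sketch below illustrates the mini-batch selection and a forward pass through a 240-350-350-501 network. It is not the Theano implementation used in our experiments; the hidden-layer nonlinearity, the weight initialisation and the placeholder data are assumptions made for the example.</p>
          <preformat><![CDATA[
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)

# A 240-350-350-501 topology (input, two hidden layers, classification layer).
layers = [init_layer(240, 350), init_layer(350, 350), init_layer(350, 501)]

def forward(x):
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.tanh(x)                          # assumed hidden nonlinearity
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)         # softmax over 501 classes

# Mini-batch optimisation: 250 randomly selected training examples per update.
X_train = rng.normal(size=(10000, 240))             # placeholder data
batch = rng.choice(len(X_train), size=250, replace=False)
probs = forward(X_train[batch])                     # (250, 501) class probabilities
]]></preformat>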
          <p>The input layer size of a network corresponds to the dataset that is used: networks with input layer size 48 are trained and tested with the 48-dataset, and the same holds for the 96 and 240 input layer sizes. For the hidden layers, two approaches are used in the experiments: one in which the hidden layer size is smaller than the input layer, and one in which it is larger than the input layer. For the networks with input layer size 48, experiments are carried out with hidden layer sizes of 48 and 40. For the networks with input layer size 96, hidden layer sizes of 64 and 84 are used. For the networks with input layer size 240, experiments are carried out with hidden layer sizes of 128 and 350. All of these networks contain two hidden layers of equal size, except for the networks with input layer size 240, for which two additional experiments are carried out with three hidden layers. The classification layer is of size 501 in every network, simply because all the audio files in the BirdCLEF2014 dataset belong to one of 501 species.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>Voting</title>
        <p>The input to the networks consists of segments, and the networks therefore perform segment-based classification. The goal of the LifeCLEF 2014 Bird task is to classify files, which means a mechanism is needed to turn segment classifications into file classifications. Several voting schemes are implemented to accomplish this, of which we will discuss the one with the best performance: activation-weighted voting.</p>
        <p>In activation-weighted voting, the activations of the classification layer of the network are used to create a weighted vector of classes from which the mode is chosen. Let a file F be segmented into five segments {s1, s2, s3, s4, s5}, and assume there are three classes C1, C2 and C3. Now let [0.1, 0.1, 0.8] be the class activations for s1. Create a new vector w1 and add C1 one time, C2 one time and C3 eight times. Repeat for all segments in the file, and concatenate every wn for n = 1 . . . 5 into w. This results in a weighted vector w from which the mode can be taken, which will be the class of F. The result of this voting mechanism on the output of several neural networks can be found in the third column of Table 2.</p>
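        <p>A sketch of this voting scheme for a single file is given below; the resolution of ten votes per unit of activation is inferred from the example above and is only illustrative.</p>
        <preformat><![CDATA[
import numpy as np

def activation_weighted_vote(activations, resolution=10):
    """Return the file-level class as the mode of a weighted class vector.

    activations: (n_segments x n_classes) classification-layer outputs.
    resolution:  assumed number of votes per unit of activation.
    """
    votes = []
    for seg_act in activations:
        # Each class receives round(activation * resolution) votes, so the
        # example activations [0.1, 0.1, 0.8] yield 1, 1 and 8 votes.
        counts = np.round(seg_act * resolution).astype(int)
        for cls, n in enumerate(counts):
            votes.extend([cls] * n)
    return np.bincount(votes).argmax()              # mode of the weighted vector

# Example from the text: one segment with activations [0.1, 0.1, 0.8].
print(activation_weighted_vote(np.array([[0.1, 0.1, 0.8]])))   # -> 2 (class C3)
]]></preformat>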
        <p>The advantage of this approach is that the complete layer activations are used as probabilities, thereby taking advantage of the network's classification uncertainty. A drawback of this scheme is that it can become slow for a large number of species and a large dataset. Fortunately, the Python package Numpy has optimisation strategies for this kind of vector processing that minimise computation time.</p>
        <sec id="sec-4-4-1">
          <title>Results and Submission</title>
          <p>with hidden layer size 84, with and without voting. For both the 48 and 96 networks it is found that increasing the hidden layer size decreases the classification accuracy. This is not the case for the networks with input layer size 240, of which four networks with different topologies are created.</p>
          <p>In the 240-networks with two hidden layers, the network with hidden layers of size 350 outperforms the network with hidden layers of size 128, with a dramatic increase of 62.5%. The poor performance of the network with two hidden layers of size 128 could be attributed to the relatively large difference between the input layer and hidden layer size. Then again, in the network with three hidden layers of size 128, a classification accuracy is obtained that is close to the accuracy of the network with three hidden layers of size 350. The reason for this might be that by adding more hidden layers, the network is capable of learning a higher-level representation of the input data. The networks with three hidden layers again show an increase in accuracy when the hidden layer size is increased.</p>
          <p>Overall, it shows that when using voting, the classification accuracy increases on average by 0.02 percentage points. If the big drop in accuracy due to voting in the 48-48-48-501 network is excluded, the average accuracy increases by 4.1 percentage points when using voting.</p>
          <p>The best performing neural network (240-350-350-501) was able to classify the segments with 27% error (i.e. 73% accuracy). This network was chosen to calculate the submitted run. The network was retrained using the entire segmented training set, and its classification results are transformed into the BirdCLEF comma-separated value file format. This run is submitted to the BirdCLEF submission system as 1400099635524 clef 240350350v under the team name "Utrecht Univ. Run 1".</p>
          <p>Based on the Mean Average Precision (MAP), our run ranks 15th out of the 29 submitted runs. Comparing the best runs from each group, we rank 6th among the 10 participating teams. The MAP of our run with background species is 0.123 and the MAP without background species is 0.14.</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>Conclusion and Discussion</title>
          <p>After receiving the results, and observing the large difference between our results and the results calculated by the LifeCLEF 2014 Bird task committee, we re-evaluated our system. It is believed that overfitting of the neural network occurs when the obtained segments from the elements of the BirdCLEF dataset are shuffled prior to splitting the dataset into a train and a test set. Shuffling before dividing the dataset dramatically raises the probability of segments of one file being present in both the test and the train set, which could result in the neural networks learning a wrong underlying relationship (such as microphone characteristics, auditory location characteristics, recorder noise or a particular individual bird) from the training set. This in turn results in poorer performance on the test set.</p>
          <p>Therefore, after receiving the results from all the teams, the datasets are re-created and split without shuffling. In addition to the new datasets, new networks are created that incorporate dropout [<xref ref-type="bibr" rid="ref1">1</xref>] to prevent overfitting. The results obtained so far are closer to the results obtained by other teams in the LifeCLEF 2014 Bird task, with test set classification accuracies ranging between 10% and 20%.</p>
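          <p>A minimal sketch of such a file-level split, in which all segments of a recording end up in either the training set or the test set but never in both, is given below; the data layout is hypothetical.</p>
          <preformat><![CDATA[
# File-level split: segments of one recording never appear in both sets.
# `segments` is a hypothetical list of (file_id, feature_vector, label) tuples.
def split_by_file(segments, test_fraction=0.2):
    file_ids = sorted({file_id for file_id, _, _ in segments})
    n_test = int(len(file_ids) * test_fraction)
    test_files = set(file_ids[:n_test])             # no shuffling across files
    train = [s for s in segments if s[0] not in test_files]
    test = [s for s in segments if s[0] in test_files]
    return train, test
]]></preformat>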
        </sec>
        <sec id="sec-4-4-3">
          <title>Future Work</title>
          <p>At the time of writing this paper, it is not clear where exactly the overfitting of the neural networks occurs. It would be interesting to compare the metadata included in the BirdCLEF2014 datasets with the classification results of the neural networks. This might show which unwanted relationships are learned by the neural networks, which in turn could help prevent overfitting in the future.</p>
          <p>It would also be interesting to see how the performance of other deep neural network architectures, such as stacked autoencoders or convolutional neural networks, compares to our implementation.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Geo</surname>
          </string-name>
          rey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and
          <string-name>
            <surname>Ruslan R Salakhutdinov.</surname>
          </string-name>
          <article-title>Improving neural networks by preventing co-adaptation of feature detectors</article-title>
          .
          <source>CoRR</source>
          , volume abs/1207.0580,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Aki</given-names>
            <surname>Harma</surname>
          </string-name>
          .
          <article-title>Automatic identi cation of bird species based on sinusoidal modeling of syllables</article-title>
          .
          <source>In Acoustics, Speech, and Signal Processing</source>
          ,
          <year>2003</year>
          . Proceedings.
          <source>(ICASSP'03)</source>
          . 2003 IEEE International Conference on, volume
          <volume>5</volume>
          ,
          <string-name>
            <surname>pages</surname>
            <given-names>V</given-names>
          </string-name>
          {
          <article-title>545</article-title>
          . IEEE,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Panu</given-names>
            <surname>Somervuo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Aki</given-names>
            <surname>Harma</surname>
          </string-name>
          .
          <article-title>Bird song recognition based on syllable pair histograms</article-title>
          .
          <source>In Acoustics, Speech, and Signal Processing</source>
          ,
          <year>2004</year>
          . Proceedings.
          <source>(ICASSP'04)</source>
          . IEEE International Conference on, volume
          <volume>5</volume>
          ,
          <string-name>
            <surname>pages</surname>
            <given-names>V</given-names>
          </string-name>
          {
          <article-title>825</article-title>
          . IEEE,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>James</given-names>
            <surname>Bergstra</surname>
          </string-name>
          , Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Theano: a CPU and GPU math expression compiler</article-title>
          .
          <source>In Proceedings of the Python for Scienti c Computing Conference (SciPy)</source>
          ,
          <year>June 2010</year>
          . Oral Presentation.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Geo</surname>
          </string-name>
          rey E Hinton,
          <string-name>
            <surname>Simon Osindero</surname>
          </string-name>
          , and
          <string-name>
            <surname>Yee-Whye Teh</surname>
          </string-name>
          .
          <article-title>A fast learning algorithm for deep belief nets</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>18</volume>
          (
          <issue>7</issue>
          ):
          <volume>1527</volume>
          {
          <fpage>1554</fpage>
          , MIT Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Go</surname>
          </string-name>
          <article-title>eau, Herve and Glotin, Herve and Vellinga, Willem-Pier and Rauber, Andreas</article-title>
          .
          <source>LifeCLEF Bird Identi cation Task</source>
          <year>2014</year>
          . CLEF working notes
          <year>2014</year>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>