<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recognizing Bird Species in Audio Files Using Transfer Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andreas Fritzler</string-name>
          <email>andreas.fritzler@stud.fh-dortmund.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sven Koitka</string-name>
          <email>sven.koitka@fh-dortmund.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph M. Friedrich</string-name>
          <email>christoph.friedrich@fh-dortmund.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Dortmund University, Department of Computer Science</institution>
          <addr-line>Otto-Hahn-Str. 14, 44227 Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Applied Sciences and Arts Dortmund (FHDO), Department of Computer Science</institution>
          ,
          <addr-line>Emil-Figge-Strasse 42, 44227 Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, a method to identify bird species in audio recordings is presented. For this purpose, a pre-trained Inception-v3 convolutional neural network was used. The network was fine-tuned on 36,492 audio recordings representing 1,500 bird species in the context of the BirdCLEF 2017 task. Audio records were transformed into spectrograms and further processed by applying bandpass filtering, noise filtering, and silent region removal. For data augmentation purposes, time shifting, time stretching, pitch shifting, and pitch stretching were applied. This paper shows that fine-tuning a pre-trained convolutional neural network performs better than training a neural network from scratch. Domain adaptation from the image to the audio domain could be successfully applied. The networks' results were evaluated in the BirdCLEF 2017 task and achieved an official mean average precision (MAP) score of 0.567 for traditional records and a MAP score of 0.496 for records with background species on the test dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Bird Species Identification</kwd>
        <kwd>BirdCLEF</kwd>
        <kwd>Audio</kwd>
        <kwd>Short-Term Fourier Transform</kwd>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>Transfer Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Since 2014, a competition called BirdCLEF has been hosted every year by the LifeCLEF
lab [5]. The LifeCLEF lab is part of the "Conference and Labs of the Evaluation
Forum" (CLEF). The goal of the competition is to identify bird species in audio
recordings. The difficulty of the competition increases every year. This year, in
the BirdCLEF 2017 task [2], 1,500 bird species had to be identified. The training
dataset was built from the Xeno-canto collaborative database3 and consists of
36,492 audio recordings. These records are highly diverse with respect to sample
rate, length, and the quality of their content. The test dataset comprises 13,272
audio recordings.</p>
      <p>In 2016, a deep learning approach was applied by [17] to the bird identification
task and outperformed the other competitors. In this research, a similar method,
inspired by last year's winner, is used with an additional extension. Transfer
learning [11] is applied by using a pre-trained Inception-v3 [19] convolutional
neural network. Related work on identifying bird species in audio recordings in
the BirdCLEF 2016 task [3] can be found in [8, 12, 14, 17, 20].</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>To solve the BirdCLEF 2017 task, a convolutional neural network on audio
spectrograms was used. The main methodology was oriented on the winner [17]
of the BirdCLEF 2016 task, whose preprocessing concept was
partially adopted. The following sections describe the workflow and parameters in
an abstract way; details on the parameters for the runs are given in Section 3.</p>
      <sec id="sec-2-1">
        <title>Overview</title>
        <p>First, the whole BirdCLEF 2017 training dataset was split into two parts. One
part consisted of 90% of the training files and was used to train a convolutional
neural network; the other part consisted of the remaining 10% and was used as
an independent validation set for model selection. For the rest of
this paper, the whole BirdCLEF 2017 training dataset shall be referred to as
"full training set", the 90% subset shall be referred to as "reduced training set",
and the 10% subset shall be referred to as "validation set". The whole pipeline
that creates a model that is ready to solve the BirdCLEF 2017 task can be seen
in Figure 1.</p>
        <p>Next, the audio files were preprocessed. The preprocessing step transforms
audio files (.wav, .mp3) into picture files (.png). One audio file typically produces
several picture files depending on the length of the audio file and its content.</p>
        <p>Then, the generated picture files that were transformed from the reduced
training set were used to fine-tune a pre-trained Inception-v3 convolutional
neural network. Pre-training was done on the ILSVRC-2012-CLS [15] image
classification dataset by the contributors of the Tensorflow Slim model repository, and a
checkpoint file of the model was provided4. By using the provided checkpoint,
the model's knowledge was transferred to the BirdCLEF 2017 task. For
fine-tuning, Tensorflow Slim5 version 1.0.1 was used. For each picture, an adapted
data augmentation was applied that includes time shifting, time stretching
using factors in the range [0.85, 1.15), pitch shifting, and pitch stretching using
percentages in the set {0, ..., 8}.
3 http://www.xeno-canto.org/ (last access: 31.05.2017)
4 http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz (last access: 27.03.2017)
5 https://github.com/tensorflow/models/tree/master/slim (last access: 23.05.2017)</p>
        <p>[Fig. 1: Training pipeline: the reduced training set, the full training set, and the validation set (audio files) are preprocessed into picture files; Inception-v3 is trained with Tensorflow Slim and data augmentation, first on the reduced training set with continuous validation every few epochs using the MAP score on the validation set; the best model according to the MAP score is selected and then trained further on the full training set.]</p>
        <p>The whole training was done in three phases. In the first phase, the top layers
of the pre-trained model were deleted6 and trained from scratch, leaving the rest
of the model fixed. The reason for this is to adjust the number of output classes
from the 1,000 classes of the pre-trained network to 1,500 species. Afterward,
the second phase was started, and the whole model was fine-tuned including
all trainable weights. Throughout the second phase,
snapshots of the model were validated every few epochs with pictures that were
transformed from the validation set. This way, the model's progress according to
the MAP score was monitored. It was done to recognize overfitting. After the
second phase, the snapshot with the best-monitored MAP score was selected for
a third training phase. In this phase, image files from the full training set were
used to fine-tune the model further. When the third phase was finished, the model
was ready to classify test files.
6 scopes InceptionV3/Logits and InceptionV3/AuxLogits</p>
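        <p>A minimal sketch of this phase split with Tensorflow Slim is given below; the checkpoint path is hypothetical and the loss is assumed to be collected via tf.losses, so the snippet illustrates the idea rather than reproducing the authors' code.</p>
        <preformat>
import tensorflow as tf
slim = tf.contrib.slim

TOP_SCOPES = ['InceptionV3/Logits', 'InceptionV3/AuxLogits']

# Phase 1: restore all Inception-v3 weights except the deleted top layers,
# which are re-initialized to output 1,500 classes instead of 1,000.
variables_to_restore = slim.get_variables_to_restore(exclude=TOP_SCOPES)
init_fn = slim.assign_from_checkpoint_fn('inception_v3.ckpt',   # hypothetical path
                                         variables_to_restore)

# Train only the new top layers; the rest of the model stays fixed.
variables_to_train = []
for scope in TOP_SCOPES:
    variables_to_train += slim.get_variables(scope)

total_loss = tf.losses.get_total_loss()  # assumes the model graph registered its losses
train_op = slim.learning.create_train_op(
    total_loss, tf.train.RMSPropOptimizer(learning_rate=0.01),
    variables_to_train=variables_to_train)

# Phase 2 would call create_train_op without variables_to_train so that
# all trainable weights are fine-tuned.
        </preformat>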
        <p>Finally, the BirdCLEF 2017 test dataset was preprocessed in a similar but
not identical manner as the full training dataset; details are described later in
this Section. During preprocessing, every audio file was transformed into many
picture files. In the prediction phase, a fixed region was cropped from the center
of every picture file and was predicted by the fully trained model. The predictions
were combined by averaging over all image segments per audio file for the final results.
In addition, time-coded soundscapes were grouped in ranges of 5 seconds. The
predictions were ordered in descending order per audio file; furthermore,
predictions in time-coded soundscapes were ordered per 5-second region. In the end,
a result file was generated.</p>
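        <p>The combination step can be sketched as follows, assuming per-segment class probabilities and segment start times as inputs (the function and variable names are illustrative, not taken from the original implementation):</p>
        <preformat>
import numpy as np

def combine_audio_file(segment_preds):
    """Average the predictions of all image segments of one audio file.

    segment_preds: array of shape (n_segments, n_classes)
    """
    return np.mean(segment_preds, axis=0)

def combine_soundscape(segment_preds, segment_times, window=5.0):
    """Group time-coded soundscape segments into 5-second ranges and
    average the predictions inside each range (assumed behavior)."""
    groups = {}
    for pred, start in zip(segment_preds, segment_times):
        groups.setdefault(int(start // window), []).append(pred)
    return {k: np.mean(v, axis=0) for k, v in sorted(groups.items())}
        </preformat>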
        <sec id="sec-2-1-1">
          <title>6 scopes InceptionV3/Logits and InceptionV3/AuxLogits</title>
          <p>2.2</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Preprocessing for Training</title>
        <p>The preprocessing steps described in the following can be seen in Figure 2.</p>
        <p>[Fig. 2: Preprocessing of a spectrogram after bandpass filtering (900 Hz - 15,100 Hz), length 9 s: noise filtering, silent region removal, and segmentation.]</p>
      </sec>
      <sec id="sec-2-3">
        <title>Extracting Frequency Domain Representation</title>
        <p>A frequency domain representation was generated for all of the audio files using the Short-Term Fourier
Transform (STFT) [1]. For this purpose, the Java library "Open Intelligent
Multimedia Analysis for Java" (OpenIMAJ)7 [4] version 1.3.5 was used. It is available
under the New BSD License, and it is able to process .wav as well as .mp3 audio
files. Unfortunately, OpenIMAJ does not support sample overlapping in an easy
way by itself, so it had to be implemented. Furthermore, OpenIMAJ is
not capable of processing audio files with a bit depth of 24 bits. Two time-coded
soundscape audio files8 in the test dataset were converted from a bit depth of
24 bits to 16 bits with the Python library "librosa" version 0.5.0 [9], which is
available9 under the ISC License.
7 http://openimaj.org/ (last access: 20.05.2017)</p>
          <p>Audio files in the BirdCLEF 2017 datasets have different sample rates; thus, the
window size (amount of samples) that was used for the STFT depended on the
file's sample rate. For a sample rate of 44.1 kHz, a length of 512 samples was
used to create a slice of 256 frequency bands (later on the vertical axis of an
image). One slice represents a time interval of approximately 11.6 ms. For a file
with a different sample rate, the size of the window was adjusted to match the
time interval of 11.6 ms. Audio files were padded with zeros if their last window
had fewer samples than were needed for the transform.</p>
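          <p>For illustration, the window-size adjustment can be computed as follows (a sketch consistent with the stated parameters; the function name is ours):</p>
          <preformat>
def stft_window_size(sample_rate, reference_rate=44100, reference_window=512):
    """512 samples at 44.1 kHz cover ~11.6 ms; scale the window for other rates."""
    return int(round(reference_window * sample_rate / reference_rate))

assert stft_window_size(44100) == 512   # 256 frequency bands, ~11.6 ms
assert stft_window_size(22050) == 256   # same ~11.6 ms time interval
          </preformat>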
          <p>The extracted frequency domain representation is a matrix. Its elements were
normalized to the range [0, 1]. Every element of this matrix represents a pixel in
the exported image. The logarithm of the elements was not taken; instead,
the values were processed in a linear manner. The matrix was further processed
using different methods to remove unnecessary information and to reduce its size.</p>
          <p>Bandpass Filtering A frequency histogram of the full training set is shown
in Figure 3. Most of the frequencies below 500 Hz are dominated by noises, for
example, wind or mechanical vibration. This circumstance explains the peak in
the lower frequency range. It was determined by manually examining 20 files
that were randomly selected from the full training set.</p>
          <p>One previous work [10] removed frequencies under 1 kHz; its audio
recordings were in 16 kHz PCM format. The authors in [20] participated in the
BirdCLEF 2016 task and used a low-pass filter with a cutoff frequency of 6,250 Hz.</p>
          <p>In this research, a lower frequency limit of 1,000 Hz and an upper frequency
limit of 12,025 Hz were used for bandpass filtering. This reduced the 256 frequency
bands by half to 128 bands.</p>
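          <p>In terms of matrix rows, this corresponds to a simple slice of the spectrogram (a sketch; the exact rounding of the band indices is our assumption):</p>
          <preformat>
def bandpass(spec, sample_rate=44100, n_bands=256, lo_hz=1000, hi_hz=12025):
    """Keep only the rows (frequency bands) between lo_hz and hi_hz."""
    band_hz = (sample_rate / 2) / n_bands    # ~86.1 Hz per band at 44.1 kHz
    lo = int(round(lo_hz / band_hz))         # ~12
    hi = int(round(hi_hz / band_hz))         # ~140, so 128 bands remain
    return spec[lo:hi, :]
          </preformat>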
          <p>[Fig. 3: Frequency histogram of the full BirdCLEF 2017 training dataset; the y-axis shows the relative frequency.]</p>
          <p>Noise Filtering Median clipping was applied to reduce noise like wind blowing.
This method was also used by the winner [17] of the BirdCLEF 2016 task and
formerly by [7]. It selects all of the elements in the matrix whose values are
three times larger than their corresponding row (frequency band) median and
three times larger than their corresponding column (time frame) median. All
other elements are set to zero. Afterward, tiny objects were removed: if all of
the 8 neighbor elements of an element were zeros, then the element itself was
also set to zero.</p>
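          <p>A possible implementation of this filtering step with numpy and scipy (a sketch, not the original code):</p>
          <preformat>
import numpy as np
from scipy import ndimage

def median_clip(spec):
    """Keep elements exceeding 3x their row median and 3x their column median."""
    row_med = np.median(spec, axis=1, keepdims=True)   # per frequency band
    col_med = np.median(spec, axis=0, keepdims=True)   # per time frame
    mask = np.logical_and(spec > 3 * row_med, spec > 3 * col_med)
    out = np.where(mask, spec, 0.0)
    # remove tiny objects: zero every element whose 8 neighbors are all zero
    kernel = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]])
    neighbor_count = ndimage.convolve((out > 0).astype(int), kernel, mode='constant')
    out[neighbor_count == 0] = 0.0
    return out
          </preformat>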
          <p>Silent Region Removal The authors in [17] used signal-to-noise separation to
extract bird calls from audio files. In this research, regions with little information
were deleted to retain the bird calls in the following way: if the average of a fixed
region did not reach a threshold, then the region was removed. Every column
was examined on its own. In every column, the number of non-zero elements
was counted and normalized by the total number of elements in the column.
For this procedure, a threshold of 0.01 was used. After this step, the resulting
matrix could have just a few or even zero columns.</p>
          <p>In the end, if the resulting matrix had fewer than 32 columns, the audio file
was completely discarded from training.</p>
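          <p>Expressed as code, the column filter might look like this (a sketch under the stated threshold; returning None to signal a discard is our convention):</p>
          <preformat>
import numpy as np

def remove_silent_columns(spec, threshold=0.01, min_columns=32):
    """Drop columns whose fraction of non-zero elements is below the threshold.

    Returns None when fewer than min_columns columns survive, signalling that
    the audio file should be discarded from training.
    """
    nonzero_fraction = (spec > 0).mean(axis=0)   # per time frame
    kept = spec[:, nonzero_fraction >= threshold]
    return kept if kept.shape[1] >= min_columns else None
          </preformat>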
          <p>Exporting Image Files Images were exported using a fixed resolution. If, after
the previous processing steps, a matrix had fewer columns than the defined target
width of a picture, then the matrix was padded to the desired amount of columns
and its available content was looped into the padded area.</p>
          <p>The completely processed frequency representation was segmented into
equal-sized pieces of a fixed length and a predefined overlapping factor. The matrices'
elements were in the range [0, 1] and were scaled by a constant factor as well as
clamped to the maximum value of 255. The elements were used for all of the
three channels in the final picture. As a result, the three channels contained the
same information.</p>
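        <p>The looping pad and the overlapping segmentation can be sketched as follows (width, overlap, and scale factor vary per run, see Section 3; the helper name is ours):</p>
        <preformat>
import numpy as np

def export_segments(spec, width=256, overlap=7 / 8, scale=255.0):
    """Pad by looping the content, then cut overlapping equal-sized segments."""
    deficit = max(0, width - spec.shape[1])
    if deficit > 0:
        spec = np.pad(spec, ((0, 0), (0, deficit)), mode='wrap')  # loop content
    hop = max(1, int(width * (1 - overlap)))   # e.g. 32 columns for 7/8 overlap
    segments = []
    for start in range(0, spec.shape[1] - width + 1, hop):
        piece = np.clip(spec[:, start:start + width] * scale, 0, 255)
        segments.append(np.dstack([piece] * 3).astype(np.uint8))  # 3 equal channels
    return segments
        </preformat>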
      </sec>
      <sec id="sec-2-4">
        <title>Preprocessing for Prediction</title>
        <p>During the preprocessing of the BirdCLEF 2017 test dataset, one exception was
made for time-coded soundscapes: on these files, silent region removal was not
applied in order to preserve their full length. Furthermore, no audio files were discarded
if they had fewer than 32 columns in their matrix.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Data Augmentation</title>
        <p>Due to the input dimension of Inception-v3 (299x299x3), the generated picture
files were processed at this stage before they were forwarded to train the model.
This was done by cropping a region from the original image. First, a target
cropping location was computed with a jitter for the vertical axis (random y
offset). Next, time shifting was applied by moving the starting x position
randomly along the x-axis. Then, time stretching was used by moving the target
width by a random factor in the range [0.85, 1.15). After that, pitch shifting
was combined with pitch stretching and was calculated by moving the starting y
position randomly. The target height was reduced randomly in the same way. The
maximum amount of pitch stretch was 8% in total. The calculated region was
cropped from the original picture and was scaled with bilinear interpolation to
a size of 299x299 pixels on all of the 3 channels (red, green, blue) to match the
input dimension of Inception-v3. Figure 4 shows this procedure visually.</p>
        <p>[Fig. 4: Data augmentation procedure: original image, random vertical jitter, random time shifting, random time stretching, random pitch shifting/stretching, and cropping with bilinear scaling.]</p>
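        <p>The following sketch shows one way to implement this cropping procedure with Pillow (the base sizes and exact offset handling are assumptions; compare the run-specific values in Section 3):</p>
        <preformat>
import random
from PIL import Image

def augment(img, base_width=128, base_height=128, out_size=299):
    """Randomly jittered, shifted, and stretched crop, scaled to 299x299x3."""
    w, h = img.size
    tw = int(base_width * random.uniform(0.85, 1.15))     # time stretching
    th = base_height - random.choice([0, 3, 6, 9, 12])    # pitch stretching
    x0 = random.randint(0, max(0, w - tw))                # time shifting
    y0 = random.randint(0, max(0, h - th))                # jitter + pitch shifting
    region = img.crop((x0, y0, x0 + tw, y0 + th))
    return region.resize((out_size, out_size), Image.BILINEAR).convert('RGB')
        </preformat>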
        <p>Although more recent network architectures exist, such as Inception-v4 [18] and
Inception-ResNet-v2 [18], which might improve the results in comparison to
Inception-v3, they were not used for this research because they are
slower than Inception-v3. Both are also available as pre-trained
models10 and are potential candidates for future work.
10 http://download.tensorflow.org/models/inception_v4_2016_09_09.tar.gz (last access: 28.05.2017) and
http://download.tensorflow.org/models/inception_resnet_v2_2016_08_30.tar.gz (last access: 28.05.2017)</p>
      </sec>
      <sec id="sec-2-8">
        <title>Runs</title>
        <p>Four runs were submitted in total. Three runs used slightly different methods
of preprocessing, and the fourth run combined the results of the former three
runs by averaging them.</p>
        <p>First, the binary run (Run 2) was created with the preprocessing pipeline
(compare Section 2.2) and binary images. Next, the grayscale run (Run 4) was created
with a few changes compared to the binary run (Run 2) in order to examine the
differences in MAP score. Lastly, the big run (Run 1) was designed by
improving some parts of the previous runs and correcting some mistakes. The
runs were submitted in alphabetical order according to their description names;
thus, the details below do not follow the run numbers but rather
the runs' temporal creation order.</p>
        <p>Training was done on one NVIDIA Tesla K80 graphics card that contains
2 GPUs with 12 GB of RAM each. A mini-batch size of 32 was used per GPU,
which results in an effective batch size of 64. Fine-tuning a single model up to
the stage of prediction took several days. The machine was used non-exclusively.
Predicting was done on one NVIDIA Titan X Pascal GPU.</p>
        <p>Table 1 shows the runs' achieved results measured in MAP score on the
reduced training set and the validation set using all predictions. To show the
advantages of transfer learning, all of the runs were executed twice with identical
parameters: on the one hand, a pre-trained Inception-v3 was used; on the
other hand, the Inception-v3 was trained from scratch. The results in Table 1 show
that fine-tuning a pre-trained convolutional neural network performs better than
training a neural network from scratch, although pre-training was done on
another domain. In addition, the official results of the submitted runs on the
BirdCLEF 2017 test dataset are stated as well.
The following sections describe only additions and differences compared to the
description in Section 2.</p>
      </sec>
      <sec id="sec-2-9">
        <title>Binary Run: Run 2</title>
        <p>Preprocessing STFT used 512 samples without sample overlapping. After the
noise filtering step, all of the elements in the matrix greater than 0 were set to 1
to create a monochrome picture file. After silent region removal, 45 audio files
were discarded from training.</p>
        <p>Images were exported using a resolution of 256 pixels in width and 128 pixels
in height. One image file represents a length of 2.97 s. For this purpose, the
previously generated matrices were segmented into equal-sized fragments of 256
pixels in width with an overlapping factor of 7/8. Before matrices were exported
to pictures, their elements were multiplied by 255. The resulting values were
used for all of the three channels in a picture. The reduced training set led to
1,365,849 picture files (2.5 GiB). From the validation set, 145,724 image files
were generated (282.6 MiB). The test dataset produced 1,583,771 picture files
(2.66 GiB).</p>
        <p>Training and Data Augmentation Learning rates were fixed in this run.
The top layers of Inception-v3 were trained for 1.48 epochs with a learning rate
of 0.01. Training on the reduced training set was done for 15.8 epochs with a
learning rate of 0.0002. A MAP score of 0.487 was achieved on the validation
set. After that, the full training set was used for training for another 4.28 epochs
with a learning rate of 0.0002.</p>
        <p>During data augmentation, a region of 128 pixels in width (±15%) and 128
pixels in height (−8%) should have been randomly cropped.</p>
        <p>Predicting In the predicting phase, a region of 128x128 pixels was cropped from
the center of every picture file. The cropped length of 128 pixels corresponds to
a time interval of 1.49 s.</p>
        <p>Mistakes In this run, data augmentation was implemented incorrectly; no
randomness was actually used. When training started, the parameters for time
shifting, time stretching, and pitch shifting were generated in a random manner,
but these values then stayed the same as long as training was not restarted.</p>
        <p>The model reached a phase of overfitting. Because the best checkpoint
according to the MAP score was not saved, an overfitted version of the model was
used to complete the BirdCLEF task. The best-monitored MAP score of the
lost checkpoint was 0.511 after 8 epochs of training.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Grayscale Run: Run 4</title>
        <p>This run was almost the same as the binary run (Run 2). Here, only differences to
the binary run (Run 2) are described.</p>
        <p>Preprocessing In the preprocessing step, there were only two differences
compared to the binary run (Run 2). First, the frequency domain representation in the
range [0, 1] was used without being transformed into zeros and ones. Second,
before image files were exported, the elements of the matrices were multiplied by
2,000 and cut off at the value 255. This led to picture files that contained grayscale
information. Everything else in the preprocessing pipeline was left unchanged.
The number of files compared to the binary run (Run 2) had not changed, but
the file sizes had increased: the reduced training set had a size of 7.4 GiB, the
validation set consisted of 812 MiB, and the test set amounted to 7.25 GiB.</p>
        <p>Training and Data Augmentation The top layers of Inception-v3 were
trained for 1.74 epochs with a fixed learning rate of 0.02. Afterward, all
layers were trained using an exponential learning rate. The learning rate descended
smoothly; a staircase function was not used. When training started, the
learning rate had a value of 0.005. After 5.4 epochs, the learning rate reached a
value of 0.0003, and a MAP score of 0.541 was achieved on the validation set.
Unfortunately, training was restarted every few epochs to slightly adjust the
learning rate. Afterward, training was continued on the full training set for another
2.6 epochs with an exponential learning rate, starting at 0.0002 and ending at
0.0001.</p>
        <p>Mistakes The same mistakes that were made in the binary run (Run 2) were
also made in this run. Data augmentation was not working properly. This led to
an overfitted model after 6 epochs of training. Training was restarted every few
epochs to correct the learning rate. As a side effect, the model was trained on
more different pictures than the model in the binary run (Run 2).</p>
      </sec>
      <sec id="sec-2-10">
        <title>Big Run: Run 1</title>
        <p>The name big run is derived from the size of pictures that were generated in the
preprocessing step. Pictures were created by processing each channel (red, green,
blue) differently. After 7 epochs of fine-tuning, this model had a MAP score of
0.531. Due to the deadline of the BirdCLEF 2017 task, this model could not be
trained completely as planned. One can assume that, if this model had been trained
for more epochs, the MAP score would have become a little better, because the data
augmentation mistakes of the previously created models were corrected.</p>
        <p>Preprocessing STFT used a window size of 942 samples. A slice of 471
frequency bands was generated this way. This slice represents a time interval of
approximately 21.4 ms. Furthermore, sample overlapping of 75% was used.</p>
        <p>Bandpass filtering used a lower frequency limit of 900 Hz and an upper
frequency limit of 15,100 Hz. This reduced the 471 frequency bands to 303 bands.</p>
        <p>Before the method described in silent region removal was applied, two other
processing steps were executed. First, all of the elements in the first 50 columns
(approximately 0.27 s) were examined, i.e., the arithmetic mean of that
region was calculated. If the calculated value did not reach a threshold of 0.0001,
then the whole region was discarded. Otherwise, the region to be examined was
shifted with 75% overlapping. This was repeated throughout the whole matrix.
Very silent regions of an audio signal were deleted this way. Second, every column
was examined on its own. If the arithmetic mean of a column did not reach a
threshold of 0.0001, then the column was removed using a special treatment: up
to three consecutive columns with averages below the threshold were not deleted;
the following up to three columns were set to zero if each of their averages was
also below the threshold; all subsequent columns with averages below the threshold
were removed. This procedure visually separated parts with much audio
information even further from each other while quiet frames were deleted. After
these two steps, the process described in silent region removal was applied. In
the end, 7 audio files were discarded from training.</p>
        <p>Images were exported using a resolution of 450 pixels in width and 303 pixels
in height. The width of 450 pixels represents a length of approximately 2.4 s.</p>
        <p>The completely processed frequency representation was segmented into
equal-sized pieces with a length of 450 columns and an overlapping factor of 2/3. The
matrices were multiplied by 1,000 and then cut off at 255. The result was copied
to three matrices. Each matrix represents a color channel of the final picture.
One matrix (red channel) was blurred using Gaussian blur [16] with a radius of 4.
Another matrix (blue channel) was sharpened using the CLAHE algorithm [13]. A
block radius of 10 and 32 bins were used. The third matrix (green channel) was
left untouched. An example of the three differently processed channels is shown
in Figure 5.</p>
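        <p>A sketch of the channel generation using Pillow and scikit-image (the CLAHE parameters kernel_size=21 and nbins=32 only approximate the stated block radius of 10 and 32 bins):</p>
        <preformat>
import numpy as np
from PIL import Image, ImageFilter
from skimage import exposure

def make_three_channel(gray):
    """gray: 2D uint8 spectrogram matrix. Returns an HxWx3 uint8 image array."""
    blurred = Image.fromarray(gray).filter(ImageFilter.GaussianBlur(radius=4))
    red = np.asarray(blurred)                               # blurred channel
    clahe = exposure.equalize_adapthist(gray, kernel_size=21, nbins=32)
    blue = (clahe * 255).astype(np.uint8)                   # sharpened channel
    return np.dstack([red, gray, blue])                     # green stays original
        </preformat>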
        <p>The reduced training set was transformed into 816,421 image files (23.3 GiB),
the validation set produced 87,448 image files (2.5 GiB), and the test set
was converted to 932,573 images (24.4 GiB).</p>
        <p>[Fig. 5: Visualization of the generated channels (original = green channel, blurred = red channel, sharpened = blue channel) as well as the final composed image. For better visualization, the spectrogram was not preprocessed.]</p>
        <p>Data Augmentation A target cropping location was computed with a jitter
of 4 pixels (Δy ∈ {0, ..., 4}). At this point, the target region had a shape of
299x299 pixels. Time stretching manipulated the target width. Pitch shifting
and pitch stretching were applied by moving the starting y position randomly
by 0, 3, 6, 9, or 12 pixels (corresponding to percentages in the set {0, ..., 4}).
The target height was manipulated the same way.</p>
        <p>Training During the first phase of training, a learning rate of 0.02 was used for
1 epoch, and a rate of 0.01 was used for a second epoch. After that, the second
phase was started with a learning rate of 0.0008. In the second phase, the learning
rate was exponentially decreased by a staircase function; that means the rate
was adjusted only after an epoch was fully completed. A learning rate decay value
of 0.7 per completed epoch was used. After 7 epochs, the model reached a
learning rate of 0.000066. A MAP score of 0.531 was achieved on the validation
set. The third phase was started using a fixed learning rate of 0.0002 for another
1.98 epochs.</p>
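        <p>As a quick check of this staircase schedule (our arithmetic, consistent with the reported values):</p>
        <preformat>
def staircase_lr(epoch, initial=0.0008, decay=0.7):
    """Learning rate after a whole number of completed epochs."""
    return initial * decay ** epoch

print(round(staircase_lr(7), 6))   # 6.6e-05, i.e. the reported 0.000066
        </preformat>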
        <p>Predicting In the prediction phase, a region of 299x299 pixels was cropped from
the center of every picture file and was predicted by the fully trained model. 299
pixels represent a length of approximately 1.6 s.</p>
      </sec>
      <sec id="sec-2-7">
        <title>Combined Run: Run 3</title>
        <p>Two different methods of combining predictions [6] were tried in every run when
predictions of picture files were combined to create a prediction of an audio
file. Calculating the arithmetic mean was one method. The other method was
majority voting, which can be explained in the following way: a prediction of a
picture is an expert; one asks all of the experts of an audio file to vote for a
single target class, and the class with the maximum number of votes is the predicted
class. Calculating the arithmetic mean always performed better: its MAP score
had a relative difference of 1%-10% compared to the MAP score of majority
voting.</p>
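        <p>The two combination rules side by side (a sketch; the input is assumed to be a matrix of per-picture class probabilities):</p>
        <preformat>
import numpy as np

def combine_mean(preds):
    """Arithmetic mean over all picture predictions of one audio file."""
    return np.mean(preds, axis=0)                    # shape: (n_classes,)

def combine_majority_vote(preds):
    """Every picture votes for its top class; the most-voted class wins."""
    votes = np.argmax(preds, axis=1)
    return np.argmax(np.bincount(votes, minlength=preds.shape[1]))
        </preformat>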
        <p>Run 3 did not have a separate model that was used to predict the test audio files;
rather, the predictions of the other three runs on the test dataset were
combined. This was done by averaging the predictions of every single picture file
that belongs to one audio file. The combination of the results of every model after
the second training phase led to a MAP score of 0.598.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Future Work</title>
      <p>An approach to identify bird species in audio recordings was shown. For this
purpose, a preprocessing pipeline was created and a pre-trained Inception-v3
convolutional neural network was fine-tuned. It could be shown that fine-tuning
a pre-trained convolutional neural network leads to better results than training a
neural network from scratch. It is remarkable that this type of transfer learning
even works from the image to the audio domain.</p>
      <p>Unfortunately, the error-free model was not trained long enough to show
its full potential. The models presented in this paper reached fair results in the
context of the competition and leave room for improvement. A possible
enhancement concerns the preprocessing pipeline and data augmentation. Future work
should consider feeding the preprocessed frequency domain representation directly
into a convolutional neural network, avoiding the use of picture files.</p>
      <p>Furthermore, this research has not focused on identifying bird species in
soundscapes. The winning team of the BirdCLEF 2016 task extracted noisy
parts from audio files and mixed them into other audio files. Additionally, a sound
effects library with many different ambient noises recorded in nature could be
used. This could further increase the diversity of the training files during
data augmentation. This approach was not implemented in this research
due to time limitations.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgement</title>
      <p>The authors gratefully acknowledge the support of NVIDIA Corporation with
the donation of the Titan X Pascal GPU which supported this research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Allen</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          :
          <article-title>Short term spectral analysis, synthesis, and modification by discrete Fourier transform</article-title>
          .
          <source>IEEE Transactions on Acoustics, Speech, Signal Processing</source>
          , vol. ASSP-25 pp.
          <volume>235</volume>
          -
          <issue>238</issue>
          (
          <year>1977</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Goeau, H.,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planque</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>LifeCLEF bird identification task 2017</article-title>
          . In: Working Notes of CLEF 2017 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Dublin, Ireland,
          <fpage>11</fpage>
          -
          <lpage>14</lpage>
          September,
          <year>2017</year>
          . (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Goeau, H.,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planque</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>LifeCLEF bird identification task 2016: The arrival of deep learning</article-title>
          .
          <source>In: Working Notes of CLEF</source>
          <year>2016</year>
          <article-title>- Conference and Labs of the Evaluation forum</article-title>
          , Evora, Portugal,
          <fpage>5</fpage>
          -
          <lpage>8</lpage>
          September,
          <year>2016</year>
          .
          <source>CEUR-WS Proceedings Notes</source>
          , vol.
          <volume>1609</volume>
          , pp.
          <volume>440</volume>
          -
          <issue>449</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hare</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samangooei</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dupplaw</surname>
            ,
            <given-names>D.P.:</given-names>
          </string-name>
          <article-title>OpenIMAJ and ImageTerrier: Java libraries and tools for scalable multimedia analysis and indexing of images</article-title>
          .
          <source>In: Proceedings of the 19th ACM international conference on Multimedia (MM</source>
          <year>2011</year>
          ). pp.
          <volume>691</volume>
          -
          <issue>694</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Alexis and Goeau, Herve and Glotin, Herve and Spampinato, Concetto and Bonnet, Pierre and Vellinga, Willem-Pier and Lombardo, Jean-Christophe and Planque, Robert and Palazzo, Simone and Muller, Henning: LifeCLEF 2017 lab overview: multimedia species identi cation challenges</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          <year>2017</year>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kuncheva</surname>
            ,
            <given-names>L.I.</given-names>
          </string-name>
          :
          <article-title>Combining Pattern Classifiers: Methods and Algorithms, 2nd Edition</article-title>
          . Wiley (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lasseck</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Bird song classification in field recordings: Winning solution for NIPS4B 2013 competition</article-title>
          .
          <source>Proc. of int. symp. Neural Information Scaled for Bioacoustics</source>
          , sabiod.org/nips4b, joint to NIPS pp.
          <volume>176</volume>
          -
          <issue>181</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lasseck</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Improving bird identification using multiresolution template matching and feature selection during training</article-title>
          . In: Working Notes of CLEF 2016 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Evora, Portugal,
          <fpage>5</fpage>
          -
          <lpage>8</lpage>
          September,
          <year>2016</year>
          .
          <source>CEUR-WS Proceedings Notes</source>
          , vol.
          <volume>1609</volume>
          , pp.
          <volume>490</volume>
          -
          <issue>501</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. McFee, B., McVicar, M., Nieto, O., Balke, S., Thome, C., Liang, D., Battenberg, E., Moore, J., Bittner, R., Yamamoto, R., Ellis, D., Stoter, F.R., Repetto, D., Waloschek, S., Carr, C., Kranzler, S., Choi, K., Viktorin, P., Santos, J.F., Holovaty, A., Pimenta, W., Lee, H.: librosa 0.5.0 (feb 2017), https://doi.org/10.5281/zenodo.293021</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Neal, L., Briggs, F., Raich, R., Fern, X.Z.: Time-frequency segmentation of bird song in noisy acoustic environments. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011). pp. 2012-2015 (2011)</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014). pp. 1717-1724 (2014)</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Piczak, K.J.: Recognizing bird species in audio recordings using deep convolutional neural networks. In: Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Evora, Portugal, 5-8 September, 2016. CEUR-WS Proceedings Notes, vol. 1609, pp. 534-543 (2016)</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Pizer, S.M., Amburn, E.P., Austin, J.D., Cromartie, R., Geselowitz, A., Greer, T., Haar Romeny, B.t., Zimmerman, J.B., Zuiderveld, K.: Adaptive histogram equalization and its variations. Computer Vision, Graphics and Image Processing, vol. 39, pp. 355-368 (1987)</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Ricard, J., Glotin, H.: Bag of MFCC-based words for bird identification. In: Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Evora, Portugal, 5-8 September, 2016. CEUR-WS Proceedings Notes, vol. 1609, pp. 544-546 (2016)</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211-252 (2015)</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. Shapiro, L.G., Stockman, G.C.: Computer Vision. Prentice Hall (2001)</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Sprengel, E., Jaggi, M., Kilcher, Y., Hofmann, T.: Audio based bird species identification using deep learning techniques. In: Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Evora, Portugal, 5-8 September, 2016. CEUR-WS Proceedings Notes, vol. 1609, pp. 547-559 (2016)</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Proceedings of the International Conference on Learning Representations Workshop (ICLR 2016) (2016)</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). pp. 2818-2826 (2016), https://arxiv.org/abs/1512.00567v3</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. Toth, B.P., Czeba, B.: Convolutional neural networks for large-scale bird song classification in noisy environment. In: Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Evora, Portugal, 5-8 September, 2016. CEUR-WS Proceedings Notes, vol. 1609, pp. 560-568 (2016)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>