=Paper=
{{Paper
|id=Vol-1609/16090560
|storemode=property
|title=Convolutional Neural Networks for Large-Scale Bird Song Classification in Noisy Environment
|pdfUrl=https://ceur-ws.org/Vol-1609/16090560.pdf
|volume=Vol-1609
|authors=Bálint Pál Tóth,Bálint Czeba
|dblpUrl=https://dblp.org/rec/conf/clef/TothC16
}}
==Convolutional Neural Networks for Large-Scale Bird Song Classification in Noisy Environment==
Bálint Pál Tóth, Bálint Czeba
Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Magyar Tudósok krt. 2., H-1117, Budapest, Hungary
toth.b@tmit.bme.hu, czbalint14@gmail.com

Abstract. This paper describes a convolutional neural network based deep learning approach for bird song classification, developed for the audio-record-based bird identification challenge BirdCLEF 2016. The training and test sets contained about 24k and 8.5k recordings, respectively, belonging to 999 bird species. The recorded waveforms were very diverse in both length and content. We converted the waveforms to the frequency domain and split them into equal segments. The segments were fed into a convolutional neural network for feature learning, followed by fully connected layers for classification. In the official scores our solution reached a MAP score of over 40% for main species and over 33% for main species mixed with background species.

Keywords: Convolutional Neural Network, Deep Learning, Classification, Bird Song, Audio, Waveform

1 Introduction

Identification and classification of bird species can greatly help to explore biodiversity and to monitor unique patterns in different soundscapes [1]. LifeCLEF 2016 is a competition hosted by the CLEF Initiative (Conference and Labs of the Evaluation Forum, formerly known as the Cross-Language Evaluation Forum) [2]. BirdCLEF 2016 [3] is part of the LifeCLEF competition and addresses the classification of 999 different bird species based on audio recordings from the Xeno-canto collaborative database [4]. Whereas the original Xeno-canto database includes about 275,000 audio recordings covering 9,450 bird species from all around the world, BirdCLEF 2016 focuses on South America (Brazil, Colombia, Venezuela, Guyana, Suriname and French Guiana) and contains 24,607 audio recordings belonging to the 999 bird species. The test set included 8,596 recordings from the BirdCLEF 2015 challenge, extended by soundscape recordings. The latter do not focus on specific bird species but contain environmental sounds with an arbitrary number of singing birds. The length of the samples varied widely: in the training set the longest recording was ~45 minutes long and the shortest ~260 milliseconds, while in the test set the longest was about 2 hours and 18 minutes and the shortest ~700 milliseconds.

The LifeCLEF challenge allows manually aided solutions (such as crowdsourcing); however, we chose state-of-the-art deep learning techniques to address the problem. Our solution uses two-dimensional convolutional neural networks trained on preprocessed bird songs transformed to the frequency domain.

The outline of this paper is as follows. Section 2 briefly overviews the application of convolutional neural networks in speech recognition and sound classification, and reviews some solutions from previous BirdCLEF challenges. Section 3 describes the data preparation method we applied. Section 4 introduces the applied deep learning techniques and neural network architectures for bird song classification. Section 5 presents our results and Section 6 draws conclusions.

2 Related Work

Besides image classification, one of the main propelling forces of deep learning is speech recognition.
In speech recognition, different deep learning techniques, such as deep belief networks, deep neural networks and convolutional networks, have been proven to surpass the accuracy of 'traditional' Gaussian Mixture Models [5]. Recurrent architectures, especially Long Short-Term Memory (LSTM) networks, have been successfully applied to speech recognition tasks as well [6]. By combining convolutional and LSTM-based recurrent networks, the accuracy of speech recognition can be further improved [7].

The task of bird song classification with neural networks was investigated as early as 1997 [8]. The authors applied a feedforward neural network with 3-8 hidden neurons to classify 6 bird species from 133 recordings. They achieved 82% accuracy with neural networks; however, Quadratic Discriminant Analysis reached significantly better results, namely 93%. Another approach is presented in [9]. In that work, after noise reduction, 13-dimensional Mel-Frequency Cepstral Coefficient (MFCC) features were extracted and their dynamic counterparts were calculated. These 26-dimensional vectors of the current, preceding and following frames were fed into a feedforward neural network with one hidden layer and 10-160 hidden neurons. They reached 98.7% and 86.8% accuracy when classifying 4 and 14 bird species, respectively. In [10] a random-forest-based segmentation method is shown to select bird calls in noisy environments with 93.6% accuracy. The work introduced in [11] uses binned frequency spectrum, MFCC and Linear Prediction Coefficient (LPC) features, which are classified by an ensemble of logistic regression, random forests and extremely randomized trees; it achieved 4th place in the NIPS4B bird classification challenge hosted on Kaggle.

There have been a number of competitive approaches in the BirdCLEF challenges of previous years; however, deep learning had not been applied in the BirdCLEF competition before. The winning solution of 2014 [13] used robust feature extraction (including MFCC, fundamental frequency, zero crossing rate, energy features, etc., altogether 6669 features per recording), feature selection (reducing the number of features from 6669 to 1277) and template matching. Last year's challenge was won by the same competitor. His work, described in [12], downsamples the spectrograms for faster feature extraction, applies decision trees for feature ranking and selection, and uses bootstrap aggregating for classification.

3 Data preparation

As a first step, we downsampled every audio file to 16 kHz in order to reduce the size of the training data. Following the preprocessing steps of [10], a Hamming window was applied and a short-time FFT was computed with a frame length of 512 samples and an overlap of 256 samples between subsequent frames. Next, we implemented and applied a filtering method to extract the essential parts of the spectrogram, i.e. those that contain bird calls. Some previous work (e.g. [11]) filters out frequencies below 1 kHz; however, in the current dataset we found useful information in this range as well (see Figure 1), so we only applied a low-pass filter with a cutoff frequency of 6250 Hz.

Fig. 1. Example of useful information (bird call) below 1 kHz.

As a result, the vertical dimension (frequency) of the spectrogram was 200, and the horizontal dimension (time) depended on the length of the recording.
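The spectrogram computation described above can be sketched in a few lines of Python. This is a minimal illustration assuming the librosa library; the function name, the window handling and the choice of keeping the lowest 200 STFT bins to realise the 6250 Hz low-pass are our assumptions, not the authors' published code.

```python
# Hedged sketch of the preprocessing: resample to 16 kHz, Hamming-windowed STFT
# with 512-sample frames and 256-sample overlap, low-pass realised by keeping
# only the lowest 200 frequency bins (~0-6.2 kHz at 31.25 Hz per bin).
import numpy as np
import librosa


def waveform_to_spectrogram(path, sr=16000, n_fft=512, hop=256, n_bins=200):
    """Return a magnitude spectrogram with 200 frequency rows."""
    y, _ = librosa.load(path, sr=sr)          # resample every recording to 16 kHz
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop,
                        window="hamming")     # 512-sample frames, 256-sample hop
    mag = np.abs(stft)                        # magnitude spectrum
    return mag[:n_bins, :]                    # keep the lowest 200 bins (low-pass)
```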
In the time domain (horizontal axis) we split the spectrograms into 30-sample-long columns (corresponding to ~0.5 seconds), and in the frequency domain (vertical axis) we split them into 10-sample-high rows. As a result, every spectrogram was split into cells of size 30✕10. We used these cells to remove the irrelevant parts of the spectrogram (those unlikely to contain any bird call) based on their mean and variance. We calculated the mean and variance of every 10-sample-high row (corresponding to a frequency band). If a cell's mean was less than 1.5 times the sum of the mean and variance of its row, the cell was dropped. In the case of Runs 1, 3 and 4 we also removed those parts of the filtered spectrogram where 95% of the column vectors were zeros (see Figure 2). This step was skipped in our second submission (referred to as 'BME TMIT Run 2' in the official results; see Table 1). After these preprocessing steps we split the remaining parts of the spectrogram into five-second-long pieces. Thus the dimensions of the resulting arrays were 200✕310 (310 frames correspond to five seconds). These arrays were used as the input of the convolutional neural network.

Fig. 2. Example of a spectrogram before (above) and after (below) preprocessing, when zero elements were kept (Run 2).

Fig. 3. Example of the same spectrogram after preprocessing, when mostly-zero cells were removed (Runs 1, 3, 4).

4 Deep learning based classification

For classifying the bird songs we used convolutional neural networks. The 200✕310 arrays resulting from data preparation were fed into the convolutional neural network and treated like grayscale images. We used two different CNN architectures: the first one was inspired by the winning architecture of the 2012 ImageNet competition [14] (AlexNet [15]); the second convolutional neural network was inspired by audio recognition systems.

In the first type of neural network we modified the shape of the input and the convolutional layers of AlexNet. We also added batch normalization layers before the max-pooling layers; experiments show that with batch normalization significantly better accuracy can be achieved on the MNIST and ImageNet datasets, with faster convergence [16]. This network is referred to as CNN-Bird-1.

The second type of neural network used a simpler architecture: it consisted of four convolutional layers, and the fully connected layers had fewer neurons. We used ReLUs as activation functions [17], and batch normalization layers were also applied. The number of parameters of the second network was much smaller, thus the network trained faster. This network is referred to as CNN-Bird-2. The proposed networks are shown in Figure 4. To train the models we used the RMSProp adaptive optimizer [18] with mini-batch learning. Early stopping with a patience of 100 epochs was applied.

Fig. 4. CNN-Bird-1 (above) and CNN-Bird-2 (below) convolutional neural networks for bird species identification based on the spectrogram of bird song recordings. (A@BxC refers to A planes of size BxC; DxD refers to the kernel size.)
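The cell-based filtering and segmentation described in Section 3 can be illustrated with a short NumPy sketch. The thresholding rule (drop a cell whose mean is below 1.5 times the sum of its frequency band's mean and variance), the handling of the 95%-zero columns and the helper names are our reading of the text above, not the authors' implementation.

```python
# Hedged sketch: tile the 200 x T spectrogram into 30 (time) x 10 (frequency)
# cells, zero out cells judged irrelevant, optionally drop mostly-zero columns,
# then cut the result into 310-frame (~5 s) segments.
import numpy as np


def filter_and_segment(spec, cell_w=30, cell_h=10, seg_len=310, factor=1.5,
                       drop_zero_columns=True):
    freq, time = spec.shape
    out = spec.copy()
    for r in range(0, freq, cell_h):                    # 10-bin frequency bands
        band = spec[r:r + cell_h, :]
        thresh = factor * (band.mean() + band.var())    # row mean + row variance
        for c in range(0, time, cell_w):                # 30-frame cells along time
            cell = out[r:r + cell_h, c:c + cell_w]
            if cell.mean() < thresh:                    # drop "irrelevant" cells
                cell[:] = 0.0
    if drop_zero_columns:                               # Runs 1, 3, 4: one reading of
        keep = (out == 0).mean(axis=0) < 0.95           # the "95% zeros" criterion
        out = out[:, keep]
    n_seg = out.shape[1] // seg_len                     # tail shorter than 5 s is
    return [out[:, i * seg_len:(i + 1) * seg_len]       # ignored in this sketch
            for i in range(n_seg)]
```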
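As an illustration of the second architecture, the following is a minimal Keras-style sketch approximating CNN-Bird-2: four convolutional layers with ReLU activations, batch normalization, 2✕2 max-pooling, two smaller fully connected layers and a 999-way softmax, trained with RMSProp. The paper used the 2016 Keras/Theano stack, whereas this sketch targets the current tf.keras API; the filter counts, kernel sizes and strides are approximations of Figure 4 rather than the authors' exact configuration.

```python
# Hedged approximation of CNN-Bird-2 (not the authors' exact model).
from tensorflow.keras import layers, models, optimizers


def build_cnn_bird_2(n_classes=999):
    model = models.Sequential([
        layers.Input(shape=(200, 310, 1)),        # spectrogram segment as a grayscale image
        layers.Conv2D(96, (16, 16), strides=(8, 8), activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(384, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(384, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(2048, activation="relu"),    # two smaller fully connected layers
        layers.Dense(2048, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.RMSprop(),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

The early stopping with a patience of 100 epochs described above could be reproduced with a keras.callbacks.EarlyStopping(patience=100) callback passed to model.fit().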
Because we split each audio file into smaller pieces (which were fed to the CNNs), for recordings longer than five seconds we had to combine the multiple predictions of the neural network. In the case of 'BME TMIT Run 1, 2, 3' we simply calculated the mean of the classification results. In the case of 'BME TMIT Run 4' we used a custom method for producing the submitted classification results: if a recording was split into multiple parts, we calculated the variance of the CNN's output for each predicted class across the five-second-long parts. Next, the six predictions with the highest variance were selected, and the predicted bird species were obtained from the mean of these predictions.

5 Evaluation

The hardware used for training consisted of an NVIDIA GTX 970 (4 GB) and an NVIDIA Titan X (12 GB) GPU card hosted in two i7 servers with 32 GB RAM. Ubuntu 14.04 with CUDA 7.5 and cuDNN 4.0 was used as the general software environment. For data preparation, and for training and evaluating the deep neural networks, the Keras framework [19] with Theano backend [20] was used. For calculating area under the precision-recall curve (AUROC) values we used the sklearn Python package. The differences in data preparation (see Section 3), in the architectures, in the combination method of the predictions (see Section 4), and the number of epochs needed to reach the maximum AUROC measured on the validation set are summarized in Table 1. The AUROC values throughout the training of Runs 1, 2, 3 and 4 are shown in Figure 5. The database sizes, the data preparation times and the CNN training times are shown in Table 2.

Table 1. The experimental setup of the submitted runs.
Run | Data preparation | CNN architecture | Combination of the predictions | Epochs
BME TMIT Run 1 | 'Zero' parts removed | CNN-Bird-1 | Mean | 124
BME TMIT Run 2 | 'Zero' parts not removed | CNN-Bird-1 | Mean | 121
BME TMIT Run 3 | 'Zero' parts removed | CNN-Bird-2 | Mean | 104
BME TMIT Run 4 | 'Zero' parts removed | CNN-Bird-2 | Mean of top six predictions with highest variance | 104

Fig. 5. The value of AUROC throughout the training for Run 1, Run 2 and Run 3&4.

We investigated the accuracy of the model on a separate test set. The lowest average precision (AP) values were achieved for the Ochre-rumped Antbird (AP=0.00067), Santa Marta Antpitta (AP=0.00136) and Rufous-breasted Leaftosser (AP=0.0015) bird calls. The Yellow-eared Parrot (AP=0.692), Lesser Woodcreeper (AP=0.796) and Spillmann's Tapaculo (AP=0.899) species scored best in the test. Furthermore, many bird calls were misclassified as Orange-billed Nightingale-Thrush (AP=0.229). Analyzing the waveforms and the spectrograms of these species, we could not find any particular distinguishing feature; hence we suppose that the significant differences in AP and the misclassifications are generally caused by shortcomings of the proposed CNN architectures.

The MAP (Mean Average Precision) values of our submissions in the official results are shown in Table 3. The first MAP value corresponds to recordings in which there is a dominant singing bird in the foreground together with other birds in the background. The second MAP value is for recordings with only one singing bird. The third MAP value is for the soundscape recordings, which do not target specific species and may contain an arbitrary number of singing birds. The results show that the smaller convolutional neural network (CNN-Bird-2; Runs 3 and 4), which was faster to train, performed similarly to the bigger CNN.
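Combining the per-segment CNN outputs into one prediction per recording can be sketched as follows. The mean rule corresponds to Runs 1-3; for Run 4 the sketch encodes one possible reading of the description at the end of Section 4 (rank classes by the variance of their scores across segments, keep the six most variable classes and score them by their mean). It is illustrative, not the authors' exact procedure.

```python
# Hedged sketch of the two prediction-combination schemes.
import numpy as np


def combine_mean(segment_probs):
    """segment_probs: (n_segments, 999) array of per-segment softmax outputs."""
    return segment_probs.mean(axis=0)            # Runs 1-3: simple average


def combine_top_variance(segment_probs, top_k=6):
    """Run 4 (one interpretation): keep the six classes whose scores vary most."""
    scores = np.zeros(segment_probs.shape[1])
    var_per_class = segment_probs.var(axis=0)    # variance of each class across segments
    top = np.argsort(var_per_class)[-top_k:]     # six classes with the highest variance
    scores[top] = segment_probs[:, top].mean(axis=0)   # score them by their mean probability
    return scores
```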
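For the per-species average precision and MAP figures discussed above, a short scikit-learn sketch is given below; the matrix names and shapes (binary species indicators and combined CNN scores per recording) are illustrative assumptions rather than the authors' evaluation script.

```python
# Hedged sketch of per-species AP and MAP with scikit-learn.
import numpy as np
from sklearn.metrics import average_precision_score


def per_species_ap(y_true, y_score):
    """y_true: (n_recordings, 999) binary indicators; y_score: combined CNN scores."""
    return np.array([average_precision_score(y_true[:, c], y_score[:, c])
                     for c in range(y_true.shape[1])])   # assumes every species has positives


def mean_average_precision(y_true, y_score):
    return per_species_ap(y_true, y_score).mean()
```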
The gain in AUROC on the validation database for Runs 3 and 4 is, however, not reflected in the official results (MAP values). Moreover, the difference between the combination methods of Runs 3 and 4 could be measured on the validation set, but in the official results Run 4 did not outperform our other approaches. According to the official results we finished in 4th place out of 6 participants. It should be noted that we joined the competition only in April and had no previous experience with bird call recognition.

Table 2. Database sizes, data preparation times and CNN training times.
Data preparation: Method | Size [GB] | Preparation time [minutes]
'Zero' parts removed | 41.8 | 103
'Zero' parts not removed | 48.76 | 123
CNN training: Run (architecture) | Training time [hours]
Run 1 (CNN-Bird-1) | 37.8
Run 2 (CNN-Bird-1) | 48.8
Run 3 & 4 (CNN-Bird-2) | 27.6

Table 3. Official results: MAP values of our submissions.
Run | MAP (with background species) | MAP (only main species) | MAP ('soundscape' recordings)
BME TMIT Run 1 | 0.323 | 0.407 | 0.054
BME TMIT Run 2 | 0.338 | 0.426 | 0.053
BME TMIT Run 3 | 0.337 | 0.426 | 0.059
BME TMIT Run 4 | 0.335 | 0.424 | 0.053

6 Conclusions

In this paper a deep learning based approach was presented for large-scale bird species identification based on their songs. In the data preparation process the spectrogram of every recording was calculated and the irrelevant parts were removed. The resulting spectrogram was sliced into five-second-long segments, and these segments were used as the input of the CNN. Two different types of CNNs were used that achieved about the same accuracy, while one of them had far fewer parameters. In the final step the predictions of the slices were combined. The results show that the deep learning based approach is well suited for the task; however, further fine-tuning is necessary to reach better accuracy, for example separating time and frequency in the CNN feature learning part and applying recurrent architectures, e.g. Long Short-Term Memory (LSTM) [21].

Acknowledgement

Bálint Pál Tóth gratefully acknowledges the support of NVIDIA Corporation with the donation of an NVIDIA Titan X GPU used for his research.

References

1. Frommolt, K.H., Bardeli, R. and Clausen, M. Computational bioacoustics for assessing biodiversity. In Proceedings of the International Expert Meeting on IT-based Detection of Bioacoustical Patterns, BfN-Skripten, No. 234. (2008)
2. Joly, A., Goëau, H., Glotin, H., Spampinato, C., Bonnet, P., Vellinga, W.P., Champ, J., Planqué, R., Palazzo, S. and Müller, H. LifeCLEF 2016: multimedia life species identification challenges. In Proceedings of CLEF 2016. (2016)
3. Goëau, H., Glotin, H., Planqué, R., Vellinga, W.P. and Joly, A. LifeCLEF Bird Identification Task 2016. CLEF working notes 2016. (2016)
4. Xeno-canto Foundation. Xeno-canto: Sharing bird sounds from around the world. (2012)
5. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N. and Kingsbury, B. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), pp. 82-97. (2012)
6. Graves, A., Mohamed, A.R. and Hinton, G. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013), pp. 6645-6649. (2013)
7. Sainath, T.N., Vinyals, O., Senior, A. and Sak, H. Convolutional, long short-term memory, fully connected deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015), pp. 4580-4584. (2015)
8. McIlraith, A.L. and Card, H.C. Bird song identification using artificial neural networks and statistical analysis. In IEEE Canadian Conference on Electrical and Computer Engineering, Engineering Innovation: Voyage of Discovery, Vol. 1, pp. 63-66. (1997)
9. Cai, J., Ee, D., Pham, B., Roe, P. and Zhang, J. Sensor network for the monitoring of ecosystem: Bird species recognition. In IEEE 3rd International Conference on Intelligent Sensors, Sensor Networks and Information (ISSNIP 2007), pp. 293-298. (2007)
10. Neal, L., Briggs, F., Raich, R. and Fern, X.Z. Time-frequency segmentation of bird song in noisy acoustic environments. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011), pp. 2012-2015. (2011)
11. Leng, Y.R. and Dat, T.H. Multi-label bird classification using an ensemble classifier with simple features. Asia Pacific Signal and Information Processing Association (APSIPA), pp. 1-5. (2014)
12. Lasseck, M. Improved automatic bird identification through decision tree based feature selection and bagging. In Working Notes of CLEF 2015 Conference. (2015)
13. Lasseck, M. Large-scale identification of birds in audio recordings. In Working Notes of CLEF 2014, pp. 643-653. (2014)
14. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C. and Fei-Fei, L. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), pp. 211-252. (2015)
15. Krizhevsky, A., Sutskever, I. and Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105. (2012)
16. Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
17. Nair, V. and Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pp. 807-814. (2010)
18. Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2. (2012)
19. Chollet, F. Keras: Theano-based deep learning library. Code: https://github.com/fchollet. Documentation: http://keras.io. (2015)
20. The Theano Development Team. A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688 (2016)
21. Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8), pp. 1735-1780. (1997)