Improved Automatic Bird Identification through Decision Tree based Feature Selection and Bagging

Mario Lasseck
Animal Sound Archive
Museum für Naturkunde Berlin
Mario.Lasseck@mfn-berlin.de

Abstract. This paper presents a machine learning technique for bird species identification at large scale. It automatically identifies about a thousand different species in a large number of audio recordings and provides the basis for the winning solution to the LifeCLEF 2015 Bird Identification Task. To process the very large amounts of audio data and to achieve results comparable to those of previous identification challenges, new methods were tested and evaluated, e.g. downsampling of spectrogram images for faster feature extraction, advanced feature selection via decision tree based feature ranking, and bootstrap aggregating using averaging and blending.

Keywords: Bird Identification · Content Based Information Retrieval · Template Matching · Decision Tree based Feature Selection · Bagging · Blending

1 Introduction

Automatic identification of species from their sounds can be a very useful computational tool for assessing biodiversity, with many potential applications in ecology, bioacoustic monitoring and behavioral science [1]. Some examples of previous studies on species identification, especially of birds, are given in [2,3,4,5]. The approach towards automatic species identification presented here is a further development of ideas and methods already successfully applied in previous challenges. The NIPS4B 2013 Multi-label Bird Species Classification Challenge [6], hosted by Kaggle, for example asked participants to identify 87 sound classes (songs, calls and instrumental sounds) of more than 60 different species in a large number of wildlife recordings. Last year the LifeCLEF 2014 Bird Identification Task [7] challenged participants to identify 501 different species in almost 5000 audio recordings. This year the number of training and test files as well as the number of species to identify was increased once again. With almost a thousand species and over 33,000 audio files it is the largest computer-based bird identification challenge so far. A detailed description of the task, dataset and experimentation protocol can be found in [8], [20]. The task is, among others, part of the LifeCLEF 2015 evaluation campaign [9,10,11].

The features used for classification and methods to speed up feature extraction are introduced in section 2. In section 3 a two-pass training approach is proposed, including feature ranking and selection. Bagging is used to improve classification results and different methods for aggregating model predictions are compared. Finally, submission results are evaluated in section 4 and briefly discussed in section 5.

2 Feature Engineering

There are two main categories of features used for classification: parametric acoustic features (see openSMILE Audio Statistics) and probabilities of species-specific spectrogram segments (see Segment-Probabilities). The feature sets are briefly described in the following sections. Similar features have already been used successfully in previous identification challenges and additional details can be found in [12,13,14].

2.1 openSMILE Audio Statistics

For each audio file a large number of acoustic features were extracted using the openSMILE Feature Extractor Tool [15].
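As an illustration, batch extraction with openSMILE's command line tool might be scripted as sketched below. The binary location, directory layout and output format are assumptions and not the exact setup used here; the configuration file refers to the published OpenSmileForBirds_v2.conf described in this section.

```python
import subprocess
from pathlib import Path

# Illustrative batch feature extraction with the openSMILE command line tool.
# Paths and output handling are placeholders; the exact behavior depends on the config.
SMILE_BIN = "SMILExtract"                      # openSMILE executable
CONFIG = "OpenSmileForBirds_v2.conf"           # modified emo_large.conf (see below)
AUDIO_DIR = Path("wav")                        # folder containing the audio recordings
OUT_DIR = Path("features")
OUT_DIR.mkdir(exist_ok=True)

for wav in sorted(AUDIO_DIR.glob("*.wav")):
    out_file = OUT_DIR / (wav.stem + ".arff")
    # -C: config file, -I: input audio, -O: feature output (one vector of statistics per file)
    subprocess.run([SMILE_BIN, "-C", CONFIG, "-I", str(wav), "-O", str(out_file)],
                   check=True)
```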
The configuration file emo_large.conf, originally designed by Florian Eyben for emotion detection in human speech, was modified in several ways to better capture the characteristics of bird sounds. The changes relate primarily to the frame-wise calculated low-level descriptors (LLDs). For example, the maximum frequency for pitch and Mel-spectrum was set to 11 kHz (instead of 500 Hz and 8 kHz). Also, the number of Mel Frequency Cepstral Coefficients (MFCC) was increased, as well as the number of frequency bands for energy calculations. Furthermore, pitch- and spectral-related LLDs were added, e.g. harmonics-to-noise ratio, raw F0, spectral skewness, kurtosis, entropy, variance and slope. In total, the 73 LLDs consist of:

- 1 time domain signal feature
  - zero crossing rate (ZCR)
- 39 spectral features
  - Mel-spectrum bins 0-25
  - 25%, 50%, 75% and 90% spectral roll-off points
  - centroid, flux, entropy, variance, skewness, kurtosis and slope
  - relative position of spectral minimum and maximum
- 17 cepstral features
  - MFCC 0-16
- 6 pitch-related features
  - F0 (fundamental frequency, pitch)
  - voicing probability (degree of harmonicity)
  - F0raw (raw F0 candidate without thresholding in unvoiced/noisy segments)
  - HNR (log harmonics-to-noise ratio computed from the ACF)
  - F0env (F0 envelope with exponential decay smoothing)
  - voice quality (fundamental frequency 'quality' (= ZCR of ACF))
- 10 energy features
  - log energy
  - energy in frequency bands: 150-500 Hz, 400-1000 Hz, 800-1500 Hz, 1000-2000 Hz, 1500-4000 Hz, 3000-6000 Hz, 5000-8000 Hz, 7000-10000 Hz and 9000-11000 Hz

LLDs are calculated per audio frame (FrameSize 25 ms, StepSize 10 ms). To describe an entire audio recording, delta (velocity) and delta-delta (acceleration) coefficients were added to each LLD, and finally 39 statistical functionals (e.g. means, extremes, moments, percentiles and linear as well as quadratic regression) were applied after smoothing all 219 feature trajectories via moving average (window length = 3). All in all, this sums up to 8541 (73×3×39) features per audio file. Further details regarding openSMILE and the acoustic features extracted can be found in the openSMILE book (http://www.audeering.com/research/opensmile) and the OpenSmileForBirds_v2.conf configuration file (http://www.animalsoundarchive.org/RefSys/LifeCLEF2015).

2.2 Segment-Probabilities

The second source of features are Segment-Probabilities (SegProbs). For each species a set of representative segments, also referred to as regions of interest (ROIs) or templates, was extracted from spectrogram images representing the acoustic content of audio files. These segments were then used to calculate Segment-Probabilities for each target file by finding the maxima of the normalized cross-correlation [16] between all segments and the target spectrogram image via template matching. A more detailed description regarding preprocessing, segmentation of spectrogram images and extraction of Segment-Probabilities can be found in [12] and [14]. Due to the very large amount of audio data, not all files belonging to a certain species were used as a source for segmentation. In a first session only short, good quality files (metadata: Quality = 1) without background species were selected for segment extraction. If the number of segments was smaller than a given threshold, another file belonging to the same species was selected, and so on.
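A rough sketch of how one Segment-Probability could be computed with OpenCV's template matching is given below. The function and variable names are illustrative, the blur kernel size is an assumption, and TM_CCOEFF_NORMED is only one plausible choice for the normalized cross-correlation of [16]; the full pipeline of [12] and [14] is not reproduced here.

```python
import cv2
import numpy as np

def segment_probability(target_spec: np.ndarray, segment: np.ndarray) -> float:
    """Maximum normalized cross-correlation of one segment (template)
    within a target spectrogram image (both 2D float32 arrays)."""
    # Gaussian blur reduces detail before matching (as in [12, 14]); kernel size assumed.
    target = cv2.GaussianBlur(target_spec.astype(np.float32), (5, 5), 0)
    templ = cv2.GaussianBlur(segment.astype(np.float32), (5, 5), 0)
    # Slide the template over the target image and keep the best match.
    response = cv2.matchTemplate(target, templ, cv2.TM_CCOEFF_NORMED)
    return float(response.max())

# One feature vector per target file: the maximum response of every segment, e.g.
# features = [segment_probability(target_spec, s) for s in all_segments]
```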
To ensure diversity and to capture the entire sound repertoire of a given species, each file was chosen to belong to a different bird individual. To keep track of individuals, an Individual-ID was assigned to each audio file. Two audio files of the same species were assigned the same Individual-ID if, according to the metadata, they were recorded by the same author on the same day. Individual-IDs were also used to accomplish a somewhat individual-independent species classification by creating "individual-independent" folds for cross-validation during training. In the first session 262,232 segments were extracted from 2027 audio files of the training set, with an average of 262 segments and 2 files (individuals) per species.

Fast Template Matching through prior Downsampling. When starting the template matching to collect Segment-Probabilities as described in [12], it quickly became apparent that sliding 262,232 templates over the spectrogram representations of all audio recordings was too time consuming (33,203 files in total). Even with the modifications described in [14] it would have taken too long. Both methods apply a Gaussian blur to segments and target image before the actual template matching. This smoothing is a form of low-pass filtering to reduce detail. Interestingly, Gaussian smoothing is also used when reducing the size of an image. Before downsampling an image, it is common to apply a low-pass filter to ensure that spurious high-frequency information does not appear in the resampled image (aliasing). So if high-frequency information is discarded anyway, why not apply a Gaussian blur and then downsample both template and target image by a factor of 2 prior to the template matching? Together with other speed-related optimizations introduced in [14] (e.g. short-time Fourier transform (STFT) with only 50% overlap, restricting the template matching to a few pixels above and below the original vertical segment position along the frequency axis), this preprocessing reduces calculation time significantly while maintaining comparable results in finding maxima of segments within spectrograms. Proportions of spectrogram images and effects of low-pass filtering are visualized in Fig. 1 for an audio file of about 8 seconds taken from the training set (MediaId: 83).

Fig. 1. (a) Original spectrogram image: STFT with 75% overlap, (b) STFT with 50% overlap, (c) Image for template matching: downsampling of (b) by factor 2, (d) Loss of information: expansion of (c) to the size of the original spectrogram image

3 Training, Feature Selection and Classification

As in previous challenges, classification was split up into several independent classification tasks by training one classifier per species (999 classes), following the binary relevance approach. For each species the classification problem was formulated as a multi-label regression task with the target function set to 1.0 for the dominant species and 0.5 for all background species. For classification the scikit-learn library [17] was used (ExtraTreesRegressor), training ensembles of randomized decision trees [18] with probabilistic outputs. Hyperparameter grid search was performed to improve classification results.
For each classifier the following parameters and variations of tree-specific hyperparameters were used during training:

- number of "individual-independent" folds for cross-validation: 10
- number of estimators (trees in the forest): 500
- number of features to consider when looking for the best split: 10, 30
- minimum number of samples required to split an internal node: 5, 10

The best hyperparameters were chosen separately for each species by evaluating the Area Under the Curve (AUC) on predictions of held-out training files. To obtain a more realistic estimate and to improve generalization, Individual-IDs were used to create "individual-independent" folds for cross-validation. This way, recordings of the same bird were either part of the training or the validation set but not both. For the best runs the probability of occurrence of the species was predicted for all test files and averaged during cross-validation.

Decision Tree based Feature Ranking and Selection. For both feature sets (SegProbs1 & openSMILE) training was performed in two passes. During the first pass, feature importances returned by the classifier were accumulated for each species during hyperparameter variation and saved for later feature ranking. The importance of a feature was computed as the total reduction of the mean squared error brought by that feature. During the second pass, classifiers were trained again, but this time with only a limited number of features, starting with the most important ones. To determine the optimal number of SegProbs1 and openSMILE features for each species, features were added in decreasing order of importance. For SegProbs1 the number of selected features considered for training was 10, 50, 100, 150 and 500, and for openSMILE 50, 100, 150, 500, 3000 and 8541 (no feature selection). In Fig. 2 the frequency distribution of the optimal number of selected features (= best AUC per species) is given as a bar chart for both feature sets.

Fig. 2. Absolute frequency distribution of the optimal number of selected features for left: Segment-Probabilities and right: openSMILE feature set

For Segment-Probabilities this means that using only the 50 most important features is in most cases better than using the 100, 150 or 500 most important features to identify a species. For openSMILE, using the 500 most important features is on average better than using 3000 or all of them. Figure 3 gives an impression of the extent to which the feature selection method described above improves classification results over the entire training set. Each column in the boxplot summarizes the best possible cross-validated AUC score achieved for each species during hyperparameter grid search, and the dotted line shows the improvement of the mean AUC.

Fig. 3. AUC statistics: without (w/o FS) vs. with (w/ FS) feature selection for SegProbs1 and openSMILE features, left: without background species, right: with background species

Bootstrap Aggregating. To further improve classification results, additional feature sets were created by calculating further Segment-Probabilities for smaller parts of the training set (SegProbs2 & SegProbs3). For this purpose, additional segments were extracted from files belonging to species that did not exceed an AUC score of 98% during cross-validation without background species in the previous training steps. Again, only files belonging to individuals not processed before were chosen for further segmentation.
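The per-species training with "individual-independent" folds and the subsequent importance-based feature selection described above can be sketched as follows. This is a simplified illustration using scikit-learn: GroupKFold stands in for the individual-independent folds, the hyperparameter grid is reduced to a single setting, and names such as X, y and individual_ids are assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

def train_one_species(X, y, individual_ids, n_selected=50):
    """Two-pass training for a single species.
    X: (n_files, n_features) feature matrix; y: targets (1.0 dominant,
    0.5 background, 0.0 absent); individual_ids: Individual-ID per file."""
    cv = GroupKFold(n_splits=10)  # "individual-independent" folds

    # Pass 1: accumulate feature importances across folds (and, in the full
    # procedure, across the hyperparameter grid) for later feature ranking.
    importances = np.zeros(X.shape[1])
    for train_idx, _ in cv.split(X, y, groups=individual_ids):
        model = ExtraTreesRegressor(n_estimators=500, max_features=10,
                                    min_samples_split=5, n_jobs=-1)
        model.fit(X[train_idx], y[train_idx])
        importances += model.feature_importances_

    # Pass 2: retrain with only the top-ranked features and evaluate via AUC
    # on the held-out, individual-independent folds.
    top = np.argsort(importances)[::-1][:n_selected]
    aucs = []
    for train_idx, val_idx in cv.split(X, y, groups=individual_ids):
        model = ExtraTreesRegressor(n_estimators=500, max_features=10,
                                    min_samples_split=5, n_jobs=-1)
        model.fit(X[train_idx][:, top], y[train_idx])
        scores = model.predict(X[val_idx][:, top])
        # species present (dominant or background) vs. absent
        aucs.append(roc_auc_score(y[val_idx] > 0, scores))
    return top, float(np.mean(aucs))
```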
Segment-Probabilities calculated for last year's challenge were also reused as features (SegProbsOld). Properties of the different sets are summarized in Table 1.

Table 1. Properties of Segment-Probabilities feature sets (total number of features, average number of features per species, number of species covered)

Feature set    # Features (Segments)    # Features per Species (avg)    # Species
SegProbs1      262,232                  262                             999
SegProbs2      224,852                  557                             404
SegProbs3      245,003                  620                             395
SegProbsOld    492,753                  985                             500

For all sets, feature selection based on prior feature ranking was performed as described above. Finally, four data sets (C1_All, C2_2014, C3_2015 and C4_Old) were created for each species by combining different feature sets. Those subsets can be interpreted as bootstrap data sets where rows represent a subsampling of the training files and columns a subsampling of the feature space. The number of training files, test files and species associated with each data set are listed in the table below.

Table 2. Number of audio files and number of species used in different bootstrap data sets

             BirdCLEF2014                            BirdCLEF2015
Data set     #Train Files  #Test Files  # Species    #Train Files  #Test Files  # Species
C1_All       9596          4299         500          15011         4297         499
C2_2014      7919          4299         404          3956          4297         -
C3_2015      -             -            -            8604          4297         395
C4_Old       9596          4299         500          -              -            -

The bootstrap data sets were used to train somewhat independent predictive models. The predictions of those models were then combined, which is also known as bootstrap aggregating or bagging. In bootstrap aggregating the subsets are usually randomly drawn, but here the data sets were chosen from a pragmatic point of view with regard to BirdCLEF 2014 vs. 2015 data and the extraction of additional features for species with classification results below a certain threshold. Besides faster training, this ensemble method helps to reduce variance and improves stability. For all bootstrap data sets different models were trained using different feature sets and feature set combinations:

- Data subset 1: C1_All
  - openSMILE without and with prior feature selection (w/o & w/ FS)
  - SegProbs1 (w/o & w/ FS)
- Data subset 2: C2_2014
  - SegProbs2 (w/o & w/ FS)
  - SegProbs2+SegProbs1 (w/ FS)
  - SegProbs2+SegProbs1+openSMILE (w/ FS)
- Data subset 3: C3_2015
  - SegProbs3 (w/o & w/ FS)
  - SegProbs3+SegProbs1 (w/ FS)
  - SegProbs3+SegProbs1+openSMILE (w/ FS)
- Data subset 4: C4_Old
  - SegProbsOld (w/o & w/ FS)

Combining the Outputs of Multiple Classifiers via Averaging and Blending. To combine the predictions of the different models two methods were tested: simple averaging and blending (also known as stacking or stacked generalization [19]). In case of blending, the outputs of the different classifiers were used as training data for another classifier approximating the same target function. For this second-level classifier an ordinary least squares linear regression model was trained to figure out the combining mechanism, i.e. the weighting of the individual predictions.

Post-processing of Predictions for Submission. The predictions returned by the classifiers assign a score to each species within each audio file (probability of occurrence as a real value). After blending, some prediction values were not within the requested interval [0,1]. To deal with this, all negative values were clipped to zero. Additionally, all values greater than 0.6 were transformed using a hyperbolic tangent function (tanh). By passing predictions through this non-linear transfer function, all values were kept below 1.0.
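A minimal sketch of the blending step and the subsequent post-processing is given below. The exact tanh mapping applied to values above 0.6 is not spelled out in the text; the variant used here is one possible, assumed choice that keeps transformed values below 1.0 while preserving their order, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def blend(train_preds, y_train, test_preds):
    """Second-level model: ordinary least squares regression on the stacked
    outputs of the individual classifiers.
    train_preds / test_preds: (n_samples, n_models) prediction matrices."""
    blender = LinearRegression()
    blender.fit(train_preds, y_train)
    return blender.predict(test_preds)

def postprocess(p):
    """Clip negative scores and squash values above 0.6 with tanh so that
    all predictions stay within [0, 1)."""
    p = np.maximum(np.asarray(p, dtype=float), 0.0)   # negative values -> 0
    high = p > 0.6
    # Smooth, monotonic mapping of (0.6, inf) into (0.6, 1.0); one possible choice.
    p[high] = 0.6 + 0.4 * np.tanh((p[high] - 0.6) / 0.4)
    return p
```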
This transformation preserves the ranking, especially among the top ranks that are most important when evaluating the Mean Average Precision (MAP). In a final step, predictions of species not part of last year's challenge were set to zero for all files marked with Year = 'BirdCLEF2014'. Figures 4 and 5 show the progress of classification results using simple averaging compared to blending when stepwise aggregating predictions from and within bootstrap data sets, followed by post-processing. Results are presented via AUC statistics (summarized for all species as boxplots) and MAP statistics (evaluated over the entire training set) within the same figure.

Fig. 4. Progress of classification results: aggregating predictions via averaging, left: w/o BS, right: w/ BS

Fig. 5. Progress of classification results: aggregating predictions via blending, left: w/o BS, right: w/ BS

It is worth mentioning that AUC and MAP statistics are not perfectly correlated. Although averaging leads to better AUC statistics, blending yields much better MAP scores. Also, setting predictions of the 499 species new in 2015 to zero during post-processing for BirdCLEF2014 files increases MAP for both averaging and blending if background species are included for evaluation, whereas AUC statistics get worse in both cases (far more outliers below 1.5×IQR). This can be explained by the fact that those species are not among the dominant species in BirdCLEF2014 files, but they may appear as background species.

4 Submission Results

In Table 3 the results of the submitted runs are summarized using two evaluation statistics: mean of the Area Under the Curve (AUC) calculated per species and mean Average Precision (MAP) on the public training and the private test set. All four runs outperformed the results of the other participating teams [20].

Table 3. Performance of submitted runs (without / with background species)

       Public Training Set             Private Test Set
Run    Mean AUC [%]    MAP [%]         MAP [%]
1      95.2 / 90.0     43.3 / 40.8     42.4 / 38.8
2      96.6 / 91.3     67.1 / 62.1     44.2 / 40.5
3      98.1 / 93.5     61.2 / 57.7     44.2 / 41.1
4      96.7 / 91.6     69.3 / 64.0     45.4 / 41.4

For the first run only SegProbs1 and openSMILE features were used for classification. Different models were trained on the entire training set with and without prior feature selection. The predictions of the models were then combined via blending, followed by post-processing as described above. For the second run, additional classifiers were trained on smaller subsets of the training files (C2_2014, C3_2015 and C4_Old). For each subset a distinct set of additional features (SegProbs2, SegProbs3 and SegProbsOld) was used for training. For this run, features from one subset were not part of any other subset. Again, feature selection was performed individually for each species and each bootstrap data set. Predictions of all subset models, including the ones from the first run, were aggregated via blending and post-processed. For the third and fourth run, additional models were trained for each bootstrap data set, also including selected features from other subsets. The difference between these two runs is that averaging was used for run three, whereas blending was used for the final and best performing fourth run to aggregate model predictions. In Fig. 6 the results for all submitted runs are visualized as a combination of AUC boxplots and MAP scores.

Fig. 6. AUC and MAP statistics of submitted runs, left: w/o BS, right: w/ BS
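For reference, the two evaluation statistics can be approximated as follows: mean AUC is computed per species over files, and MAP treats each audio file as a query, ranking species by their predicted scores. This is only a sketch with assumed data structures; the official evaluation may differ in detail.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def mean_auc(y_true, y_score):
    """Mean of the per-species AUC.
    y_true: (n_files, n_species) binary matrix; y_score: predicted scores."""
    aucs = [roc_auc_score(y_true[:, s], y_score[:, s])
            for s in range(y_true.shape[1])
            if y_true[:, s].any() and not y_true[:, s].all()]
    return float(np.mean(aucs))

def mean_average_precision(y_true, y_score):
    """MAP with each audio file as a query: species are ranked by predicted
    score and compared against the species actually present in the file."""
    aps = [average_precision_score(y_true[f], y_score[f])
           for f in range(y_true.shape[0]) if y_true[f].any()]
    return float(np.mean(aps))
```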
5 Discussion

Downsampling spectrogram images prior to template matching significantly reduces calculation time. Fortunately, the resulting loss of information can actually be considered a feature, not a bug. Too much detail is rather disturbing and distracting when comparing call or song elements between different individuals of the same species. For some birds, maybe even a downsampling factor greater than 2 would produce equal or better results. Faster template matching means being able to extract more segments and calculate more Segment-Probabilities, which matters especially when dealing with so many species and such large amounts of audio data. But using more features does not necessarily lead to better identification results. Reducing the number of features to the most important ones has several advantages. Fewer but meaningful features improve prediction accuracy, reduce over-fitting and therefore increase generalization power. Besides faster training, a smaller set of features also reduces, once the model is trained, prediction time for new and unseen audio material due to faster feature extraction. Especially in the case of Segment-Probabilities, from then on only relevant segments need to be considered for template matching.

Due to the large amount of data and limited training time, the search for the optimal number of features per species was not very extensive; only a few manually selected candidates were evaluated. A finer grid search or a more sophisticated way to approach the true optimum, for example using a binary search algorithm, might have led to better results.

When looking at the different MAP scores for training and test files in Fig. 6 it becomes clear that most of the progress achieved by bagging on the training set is due to over-fitting. This could be partly explained by the fact that only for the first data set, C1_All, an "individual-independent" training approach with accordingly selected folds was used, whereas for all other subsets common stratified folds were used for cross-validation. Another reason might be that all bootstrap data sets used similar features and the same classification methods, and therefore model predictions were not independent or uncorrelated enough to significantly boost classification results when combining them. Nevertheless, bagging also increased average prediction scores for test files and clearly improved submission results.

Acknowledgments. I would like to thank Hervé Glotin, Hervé Goëau, Andreas Rauber and Willem-Pier Vellinga for organizing this challenge, Alexis Joly and Henning Müller for coordination, the Xeno-Canto foundation for nature sounds for providing the audio data, and the French projects Pl@ntNet (INRIA, CIRAD, Tela Botanica) and SABIOD Mastodons for supporting this task. I also want to thank Dr. Karl-Heinz Frommolt for supporting my work and providing me with access to resources of the Animal Sound Archive [21].

References

1. Frommolt K-H, Bardeli R, Clausen M, eds. (2008) Computational bioacoustics for assessing biodiversity. Proc. of the int. expert meeting on IT-based detection of bioacoustical patterns
2. Bardeli R et al. (2009) Detecting bird sounds in a complex acoustic environment and application to bioacoustic monitoring. Pattern Recognition Letters: 31, 23, 1524–1534
3. Briggs F et al. (2012) Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach.
The Journal of the Acoustical Society of America 131: 4640.
4. Potamitis I (2014) Automatic Classification of Taxon-Rich Community Recorded in the Wild. PLoS ONE 9(5): e96936. doi: 10.1371/journal.pone.0096936
5. Stowell D, Plumbley MD (2014) Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning. PeerJ 2:e488
6. https://www.kaggle.com/c/multilabel-bird-species-classification-nips2013
7. http://www.imageclef.org/2014/lifeclef/bird
8. Goëau H, Glotin H, Vellinga WP, Rauber A (2015) LifeCLEF Bird Identification Task 2015. In: CLEF working notes 2015
9. Joly A, Müller H, Goëau H, Glotin H et al. (2015) LifeCLEF 2015: multimedia life species identification challenges. In: Proceedings of CLEF 2015
10. Cappellato L, Ferro N, Jones G, San Juan E, eds. (2015) CLEF 2015 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-1391/
11. Mothe J, Savoy J, Kamps J et al. (eds.) Experimental IR meets Multilinguality, Multimodality, and Interaction. Sixth International Conference of the CLEF Association, CLEF'15, Toulouse, September 8-11, 2015. Proceedings. LNCS, vol. 9283. Springer, Heidelberg (2015)
12. Lasseck M (2013) Bird Song Classification in Field Recordings: Winning Solution for NIPS4B 2013 Competition. In: Glotin H et al. (eds.) Proc. of int. symp. Neural Information Scaled for Bioacoustics, sabiod.org/nips4b, joint to NIPS, Nevada, Dec. 2013: 176-181
13. Lasseck M (2014) Large-scale identification of birds in audio recordings. In: Working notes of CLEF 2014 conference
14. Lasseck M (2015) Towards Automatic Large-Scale Identification of Birds in Audio Recordings. In: Proceedings of CLEF 2015
15. Eyben F, Weninger F, Gross F, Schuller B (2013) Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor. In: Proc. ACM Multimedia (MM), Barcelona, Spain, ACM, ISBN 978-1-4503-2404-5, pp. 835-838, October 2013, doi: 10.1145/2502081.2502224
16. Lewis JP (1995) Fast Normalized Cross-Correlation. Industrial Light and Magic
17. Pedregosa F et al. (2011) Scikit-learn: Machine learning in Python. JMLR 12, pp. 2825-2830
18. Geurts P et al. (2006) Extremely randomized trees. Machine Learning, 63(1), 3-42
19. Wolpert DH (1992) Stacked generalization. Neural Networks, 5:241–259
20. http://www.imageclef.org/lifeclef/2015/bird
21. Animal Sound Archive, http://www.animalsoundarchive.org