    Bird Classification using Ensemble Classifiers

        Leng Yi Ren1 , Jonathan William Dennis1 , and Tran Huy Dat1

              Institute for Infocomm Research, A*STAR, Singapore,
       yrleng@i2r.a-star.edu.sg, jonathan-dennis@i2r.a-star.edu.sg,
                             hdtran@i2r.a-star.edu.sg



      Abstract. This working note summarizes our submission to the LifeCLEF
      2014 Bird Task, which combines the outputs from a Python and a Matlab
      classification system. The features used for both systems include
      Mel-Frequency Cepstral Coefficients (MFCC), time-averaged spectro-
      grams and the provided meta-data. The Python subsystem combines
      a large ensemble of different classifiers with different subsets of the fea-
      tures while the Matlab subsystem is an ensemble of the Random Forest
      and Linear Discriminant Analysis (LDA) classifiers using local spectral
      and meta features. By combining this disparate set of features and clas-
      sifiers, we managed to achieve a Mean Average Precision (MAP) score
      that is far superior to what is possible with any single classifier.

      Keywords: ensemble, audio, classification


1    Introduction

The LifeCLEF 2014 Bird Task (BirdCLEF) [1] involves 14027 audio recordings
containing 501 bird species. In comparison, the 9th annual MLSP competition [3]
involves 645 recordings with 19 bird species while the Neural Information Pro-
cessing Scaled for Bioacoustics (NIPS4B) bird song competition[4] involves 1000
recordings with 87 bird species. The large increase in both the amount of available
data and the number of target classes for BirdCLEF introduces new challenges.
    The total size of the audio database for both the MLSP and NIPS4B chal-
lenges amounts to around 1GB while the BirdCLEF data is over 20GB in size.
The winning methods for the MLSP and NIPS4B challenges performed image
segmentation on the audio spectrograms followed by template matching to de-
rive class models. Based on the difference in data size and the number of classes,
it can be extrapolated that using the same methods for BirdCLEF would require at
least a hundredfold increase in computation time. Given that the time between
the release of the training data and the final test submission is only around three
months, it would not be efficient to try to reproduce the above methods given
our limited computing resources.
    All of our experiments on BirdCLEF are performed on standard 4-core
desktop PCs with 8GB of RAM, so there are severe constraints on the machine
learning methods that can be applied. Instead of attempting to build a strong
learner using well-engineered features, we chose to work on building a
number of weak learners with simple features and finding the optimal combina-
tion of such learners.

2     Front-end
We make use of both audio and meta-data for our feature front-end as they
provide complementary information to the classifier. The total memory used in
the computation is heavily dependent on the size of the features used, so it is
important to select a compact representation.

2.1     MFCC
The MFCC features provided by the organizers consist of a single log energy coef-
ficient and 15 cepstral coefficients, appended with the first and second derivatives
for a total of 48 dimensions per time frame. For each audio clip, we perform a
simple energy-based segmentation by dropping all the time frames where the
value of the log energy coefficient is below the median value. The remaining
frames are further reduced by taking the maximum value for each of the 48
dimensions to arrive at a final 48 dimension feature.
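
A minimal numpy sketch of this pooling step is given below; the assumption
that the log energy occupies the first of the 48 columns is ours, as the exact
column ordering is not stated.

    import numpy as np

    def summarize_mfcc(frames):
        """Collapse a (num_frames, 48) MFCC matrix into one 48-d vector:
        drop frames whose log energy is below the clip median, then take
        the per-dimension maximum over the remaining frames."""
        energy = frames[:, 0]  # assumption: log energy in column 0
        kept = frames[energy >= np.median(energy)]
        return kept.max(axis=0)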

2.2     Spectral features
The spectral features we used are similar to the spectrum-summarizing features
described in [3]. The audio clips are converted into spectrograms by windowing
the audio into overlapping 200ms windows at 100ms intervals. The frequency
dimension is binned into n dimensions using a simple mean. The time dimension
is eliminated by taking the maximum value similar to the MFCC feature. Finally,
the natural logarithm is applied to the n-dimension spectral features to reduce
their dynamic range.
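
The sketch below illustrates this front-end with scipy; the default window
type and power spectrogram are our assumptions, as only the window length,
hop and binning are specified above.

    import numpy as np
    from scipy.signal import spectrogram

    def spec_feature(audio, fs, n=10):
        """n-dimensional time-collapsed spectral feature (Section 2.2)."""
        win = int(0.2 * fs)                      # 200 ms window
        _, _, S = spectrogram(audio, fs=fs, nperseg=win, noverlap=win // 2)
        bands = np.array_split(S, n, axis=0)     # bin frequency into n bands
        binned = np.stack([b.mean(axis=0) for b in bands])
        pooled = binned.max(axis=1)              # eliminate time by maximum
        return np.log(pooled + 1e-12)            # log to reduce dynamic range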

2.3     Meta features
In addition to the audio features described above, we also make use of the meta-
data represented by the following eight fields:
    Latitude, Longitude, Elevation, Year, Month, Month+Day, Time, Author
The Month+Day field combines the two-digit Month and Day fields into a
single four-digit field, as comparing the Day fields across different months is
illogical.
    One of the challenges of BirdCLEF, compared to MLSP, which also provides
meta-data in the form of a numerical location index [3], is the presence of
unreliable and missing data. The completion of the meta-data fields by the
BirdCLEF data contributors is optional, with no enforced format. Unreliable
entries such as “afternoon” or “am” in the Time field, as well as completely
missing entries, have to be manually parsed into logical numeric values for use
as features, and their presence will affect the performance of the classifiers.
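
A sketch of one possible parsing of these fields follows; the fallback values
and the pre-built author-to-integer map are illustrative assumptions, since the
parsing described above was done manually.

    import numpy as np

    def parse_time(s):
        """'HH:MM' -> minutes since midnight; unreliable entries such as
        'afternoon' or 'am', and missing values, fall back to -1."""
        try:
            h, m = s.split(":")
            return int(h) * 60 + int(m)
        except (AttributeError, ValueError):
            return -1

    def meta_feature(rec, author_ids):
        """Eight-dimensional meta feature from one metadata record."""
        month = int(rec.get("Month") or 0)
        day = int(rec.get("Day") or 0)
        return np.array([
            float(rec.get("Latitude") or 0.0),
            float(rec.get("Longitude") or 0.0),
            float(rec.get("Elevation") or 0.0),
            int(rec.get("Year") or 0),
            month,
            month * 100 + day,                      # four-digit MMDD field
            parse_time(rec.get("Time")),
            author_ids.get(rec.get("Author"), -1),  # author -> integer map
        ])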




3     Back-end

Although the official labels for the audio data only give a single bird species
as the ground truth (the ClassId field), the BackgroundSpecies field often lists
additional bird species. However, like the meta-data, this field is unreliable, as
there is no indication whether no additional species were present or identified,
or whether the author simply did not complete this particular field. We can
only assume that an empty BackgroundSpecies field indicates that only the
single species given by ClassId is present.
    Another problem that arises due to the BackgroundSpecies field is the pres-
ence of bird species that are outside of the 501 species to be classified. For
the given training data, there are over three hundred additional species, and we
chose to retain the sixteen extra species that have over eleven examples each,
for a total of 517 target species. The remaining labels are ignored since there
are insufficient examples to train proper models for them. By making use of the
extra information from BackgroundSpecies, the number of examples for the
501 original target classes is also increased and the task becomes a multi-label
classification problem.
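
The label construction can be sketched as follows; the record format is
hypothetical, and min_examples = 12 encodes the “over eleven examples”
threshold mentioned above.

    from collections import Counter

    def build_targets(records, min_examples=12):
        """Class list and binary indicator matrix from ClassId and
        BackgroundSpecies; an empty BackgroundSpecies field is taken to
        mean that only the ClassId species is present."""
        primary = {r["ClassId"] for r in records}
        bg = Counter(s for r in records for s in r["BackgroundSpecies"])
        extra = {s for s, c in bg.items()
                 if s not in primary and c >= min_examples}
        classes = sorted(primary | extra)    # 501 + 16 = 517 in our case
        col = {c: i for i, c in enumerate(classes)}
        Y = [[0] * len(classes) for _ in records]
        for i, r in enumerate(records):
            Y[i][col[r["ClassId"]]] = 1
            for s in r["BackgroundSpecies"]:
                if s in col:                 # ignore the rare extra species
                    Y[i][col[s]] = 1
        return classes, Y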


3.1   Python

The classifiers used in the Python subsystem are built from the scikit-learn
toolkit (sklearn)[5]. In order to build an ensemble of classifiers, we require the
classifiers to output probability estimates for each of the target classes so that the
final output is simply a mean of the individual classifier outputs. The subsequent
computation of the Mean Average Precision (MAP) is also straightforward with
such class probability outputs as opposed to max voting. Out of the available
classifiers in the toolkit, we selected the following list to evaluate:

ExtraTreesClassifier, RandomForestClassifier,
KNeighborsClassifier, LogisticRegression, SGDClassifier,
AdaBoostClassifier, GradientBoostingClassifier, SVC, GaussianNB,
BernoulliNB, LDA

From the above list, only ExtraTreesClassifier, RandomForestClassifier
and KNeighborsClassifier are able to handle multi-label classification by de-
fault. The OneVsRestClassifier is used to transform the remaining binary
classifiers into multi-label multi-class classifiers.
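
A minimal sketch of the averaging ensemble with two such members is shown
below; both are wrapped in OneVsRestClassifier so that predict_proba returns
an aligned (n_samples, n_classes) matrix. The import paths are the modern
ones (the 2014-era toolkit exposed LDA as sklearn.lda.LDA), and every class is
assumed to have at least one positive training example.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    members = [
        OneVsRestClassifier(LinearDiscriminantAnalysis()),
        OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    ]

    def ensemble_proba(X_train, Y_train, X_test):
        """Mean of the members' per-class probability estimates;
        Y_train is a binary indicator matrix (multi-label)."""
        probs = []
        for clf in members:
            clf.fit(X_train, Y_train)
            probs.append(clf.predict_proba(X_test))  # (n_samples, n_classes)
        return np.mean(probs, axis=0)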


3.2   Matlab

Two classifiers are used in the Matlab subsystem: Random Forests using the
TreeBagger function and LDA using the ClassificationDiscriminant.fit function,
both of which were trained in a one-vs-rest configuration. Both of these
classifiers output probability estimates through the predict function, and the
final output is the mean of the individual classifier outputs. The TreeBagger
function was modified such that each tree in the forest was trained with a bal-
anced amount of negative data, as this was found to considerably improve the
speed of the classification without affecting performance. In particular, each tree
was trained with all of the positive class data, and a random subset of the neg-
ative data containing twenty times as many examples as the positive data. The number of
variables to sample at each node was set to 3, while the number of trees was set
to 200 for the final evaluation run.
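
The modification itself was made to Matlab's TreeBagger; the following is a
Python sketch of the same per-tree subsampling idea for a single one-vs-rest
class, with DecisionTreeClassifier standing in for Matlab's trees and the
hyperparameters taken from the text.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def balanced_forest(X, y, n_trees=200, neg_ratio=20, n_vars=3, seed=0):
        """Each tree sees all positive examples plus a random negative
        subset containing neg_ratio times as many examples."""
        rng = np.random.default_rng(seed)
        pos = np.flatnonzero(y == 1)
        neg = np.flatnonzero(y == 0)
        trees = []
        for _ in range(n_trees):
            sub = rng.choice(neg, size=min(len(neg), neg_ratio * len(pos)),
                             replace=False)
            idx = np.concatenate([pos, sub])
            tree = DecisionTreeClassifier(max_features=n_vars)  # 3 vars/node
            trees.append(tree.fit(X[idx], y[idx]))
        return trees

    def forest_proba(trees, X):
        """Mean over trees of the estimated P(class = 1)."""
        return np.mean([t.predict_proba(X)[:, list(t.classes_).index(1)]
                        for t in trees], axis=0)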


4      Experiment Setup

To evaluate the performance of a feature and classifier combination, 10-fold
cross-validation is used on the given training data. The objective is to maximize
the MAP score of the system.
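
A sketch of this evaluation loop is given below, assuming the MAP is computed
as the mean over recordings of the average precision of the ranked species list
(in the 2014-era toolkit, KFold lived in sklearn.cross_validation).

    import numpy as np
    from sklearn.metrics import average_precision_score
    from sklearn.model_selection import KFold

    def cv_map(make_clf, X, Y, n_splits=10):
        """10-fold cross-validated MAP for a multi-label classifier
        factory; X is (n_samples, n_dims), Y a binary indicator matrix."""
        scores = []
        folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        for tr, te in folds.split(X):
            clf = make_clf().fit(X[tr], Y[tr])
            P = clf.predict_proba(X[te])
            for y_true, y_score in zip(Y[te], P):
                if y_true.sum() > 0:  # need at least one positive label
                    scores.append(average_precision_score(y_true, y_score))
        return float(np.mean(scores))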


4.1     Python subsystem

From the list of feature front-ends and the selected classifiers from sklearn,
combinations of feature and classifier pairs are evaluated. This step took up
the bulk of our time as it involves optimizing feature settings, classifier
settings and the combinations of the two to build the final ensemble classifier.
Out of the available classifiers, the following six were selected:

 1. Linear Discriminant Analysis (LDA), LDA in sklearn
 2. Logistic Regression (LR), LogisticRegression in sklearn
 3. Support Vector Machines (SVM) with Radial Basis Function (RBF) kernel,
    SVC in sklearn
 4. AdaBoost (AdaB), AdaBoostClassifier in sklearn
 5. K-Nearest Neighbours (KNN), KNeighborsClassifier in sklearn
 6. Random Forests (RanF), RandomForestClassifier in sklearn

      For the feature front-end, three combinations were selected:

 1. Meta + MFCC + ∆MFCC + Spec10 + Spec100_30, used with LDA, LR and
    SVM
 2. Meta + MFCC + Spec100_30, used with AdaB and KNN
 3. Meta, used with RanF

MFCC refers to the first 16 dimensions of the described 48-dimension feature
while ∆MFCC refers to the next 16 dimensions (indices 17 to 32), which
represent the first derivatives or deltas. Spec10 refers to the spectral features
with n = 10 and Spec100_30 to the first 30 dimensions of the spectral features
with n = 100. Meta refers to the eight-dimension meta features.
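
In code, the three combinations amount to simple concatenation of the per-clip
vectors produced by the Section 2 front-ends:

    import numpy as np

    def feature_combos(meta, mfcc, d_mfcc, spec10, spec100):
        """The three Section 4.1 feature combinations for one clip;
        spec100[:30] keeps the first 30 of the n = 100 spectral dims."""
        return {
            "LDA/LR/SVM": np.hstack([meta, mfcc, d_mfcc, spec10, spec100[:30]]),
            "AdaB/KNN": np.hstack([meta, mfcc, spec100[:30]]),
            "RanF": np.asarray(meta),
        }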




4.2    Matlab subsystem

Only one feature configuration is used with the Matlab subsystem. Using the
same terminology as above, the features can be denoted Meta + Spec40, with
n = 40 dimensions used for the spectral feature concatenated with the eight
dimensions of the meta features. However, instead of taking the maximum over
the whole time frame, as with the Python subsystem, local spectral features are
generated by taking the maximum within selected local windows. Each window is
0.5 seconds long, and the step is equivalent to the frame increment. The energy of
each window is calculated simply as the average log-energy over the window, and
is then used as a metric for selecting the windows for classification. Only those
windows that represent local maxima in the energy function are retained, and
the remaining windows are discarded. The idea is that the windows containing
local maxima in the energy will correspond to local instances of the bird song
in the clip, and hence provide a complementary source of information to the
global features used in the Python subsystem.
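
A sketch of the window selection follows; reading the window energy as the
band-averaged log spectrum is our interpretation of “average log-energy”.

    import numpy as np

    def local_spec_features(logspec, frames_per_win):
        """Select 0.5 s windows at local maxima of the window energy and
        max-pool each selected window over time. logspec is an
        (n_bands, T) log spectrogram; frames_per_win is 0.5 s expressed
        in frames, and the window step equals the frame increment."""
        T = logspec.shape[1]
        energy = np.array([logspec[:, t:t + frames_per_win].mean()
                           for t in range(T - frames_per_win + 1)])
        # Retain only windows whose energy is a strict local maximum.
        keep = [t for t in range(1, len(energy) - 1)
                if energy[t] > energy[t - 1] and energy[t] > energy[t + 1]]
        return [logspec[:, t:t + frames_per_win].max(axis=1) for t in keep]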


5     Results and Discussion

Although we train the classifiers on 517 target species, the results presented only
consider the actual 501 target species (MAP with k = 501).


5.1    Python subsystem

The results shown in Table 1 give a rough breakdown of the individual compo-
nents of the Python subsystem. For the sake of brevity, the other combinations
of the six selected classifiers are not shown. The Audio and Meta columns show
how each classifier performs when the complementary meta or audio features
are removed. For all but the KNN classifier, combining both audio and meta
features is shown to improve the classifier performance.


      Table 1. Selected results for 10-fold cross-validation for Python subsystem

          LDA LogR SVM AdaB KNN RanF                   Audio Meta      Both
            x                                          0.191   0.075 0.220
                   x                                   0.197   0.076 0.232
                          x                            0.199   0.085 0.236
                                  x                    0.085   0.110 0.133
                                         x             0.130   0.177 0.164
                                                 x       -     0.192    -
            x      x      x       x      x             0.220   0.195 0.274
            x                            x       x       -     0.195 0.290
            x      x      x       x      x       x       -     0.200 0.313




Although the KNN classifier is shown to give a better MAP score with the meta
features alone, when
combined with the other classifiers in an ensemble, adding the audio features is
found to result in a better score overall. It is interesting to note that using the
8-dimension meta features alone with the chosen classifiers can give a reasonable
score of 0.200, highlighting the value of using meta-data for classification tasks.
     Table 2 shows the approximate computation time for the Python subsystem.
One of our objectives is to design a system with a short turnaround time as
the search space for the system parameters is huge. With each new dataset, it
is likely that the optimal system configuration in terms of feature and classifier
choices will be different, so it will be necessary to retune the entire system from
scratch. There is a large discrepancy in the computation time required for each
of the classifiers chosen, which is partly attributed to the different feature
combinations used.


Table 2. Approximate computation time for 10-fold cross-validation for Python sub-
system components

          LDA       LogR      SVM          AdaB      KNN       RanF      Total
          5 min     15 min    1 h 10 min   11 min    17 sec    8 min     1 h 45 min



The strength of using an ensemble of classifiers is the option
to mix and match different combinations of the components to suit particular
demands for performance and computation cost. The three fastest classifiers
(LDA, KNN and RanF) combine to give an ensemble which runs in less than 15
minutes with a respectable score of 0.290. In contrast, removing RanF from the
six classifier ensemble only saves 8 minutes but reduces the score from 0.313 to
0.274. There are no general rules determining the performance of each specific
combination, especially if the dataset is changed, thus it is usually necessary to
perform an exhaustive search to find the optimal system.
    Another strength of the ensemble classifier is how the individual classifiers
complement each other. For most of the classifiers tested, it is possible to improve
the individual score at the cost of computation time by changing parameters
such as increasing the number of iterations or the number of neighbours to
query. However, these improvements do not necessarily transfer to the ensemble
when the classifiers are combined, as the individual gains are made redundant
by the contributions of the other classifiers. Depending on the classifiers used,
a performance gain that is expensive to obtain from a single classifier can thus
often be replicated cheaply by the ensemble combination.

5.2   Matlab subsystem
The results shown in Table 3 give a breakdown of the individual components
of the Matlab subsystem using the combined Audio + Meta features. It can be
seen that the LDA classifier performs relatively poorly, perhaps due to the
number of false positives in the features.




Table 3. Classification results for 10-fold cross-validation for the Matlab subsystem

                           LDA RanF Audio + Meta
                             x                  0.119
                                    x           0.216
                             x      x           0.222



These false positives arise from using
local features that are extracted from local maxima in the energy, and hence may
represent noise or other information that is not related to the target bird species.
On the other hand, the random forest classifier is able to select the features that
best separate the target classes, and this allows it to handle a large number
of false positives in the labelled data. Combining the two classifiers increases
the score, giving a MAP of 0.222 in cross-validation.

5.3   Combined system


                    Table 4. Final results for proposed system

                                              Python Matlab Combined
        MAP 1 (with BackgroundSpecies)        0.284     0.166    0.289
        MAP 2 (without BackgroundSpecies)     0.267     0.159    0.272



    Due to time constraints, we did not manage to evaluate the performance of
the combined Python and Matlab system through cross-validation. Table 4 shows
the final results for our proposed systems on the evaluation data. Compared to
Table 1, the Python subsystem shows a 10% relative loss with the inclusion
of BackgroundSpecies. This might be due to the use of a lower k value when
evaluating the MAP score or a mismatch between the training and evaluation
data. The combined system is found to improve on the Python subsystem score
slightly despite the poor performance of the Matlab subsystem.
    Our system was trained as a multi-label classifier and this is reflected by the
lower score when the background species are removed (MAP 2). Interestingly,
we are the only group that has a lower MAP 2 score than MAP 1, indicating
that we are the only group that focused on identifying the background species
as well as the dominant species.

6     Conclusion
The challenges presented by BirdCLEF include the large amount of provided
data, the large number of target species to be classified, incomplete or inaccu-
rate multi-class labels, and missing and unreliable meta-data. The limited time
between the release of the training data and the final submission, and the
limited computing resources available to our group, were additional obstacles
we faced.
With these limitations in mind, we designed an ensemble classifier that combines
the output from a number of individually weak learners using simple features
that are fast to train and test. The final system is shown to provide greatly
improved results over the component classifiers and the fast turnaround time
allows the setup to be quickly calibrated for new tasks and datasets.


References
1. Goëau, H., Glotin, H., Vellinga, W., Rauber, A.: LifeCLEF Bird Identification
   Task 2014. CLEF 2014 Working Notes (2014)
2. Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.): CLEF 2014 Labs and
   Workshops, Notebook Papers. CEUR Workshop Proceedings (CEUR-WS.org),
   ISSN 1613-0073, http://ceur-ws.org/Vol-1180/ (2014)
3. Briggs, F., et al.: The 9th annual MLSP competition: New methods for acoustic
   classification of multiple simultaneous bird species in a noisy environment. In:
   2013 IEEE International Workshop on Machine Learning for Signal Processing
   (MLSP), pp. 1-8, Sept. 2013
4. Glotin, H., LeCun, Y., Artières, T., Mallat, S., Tchernichovski, O., et al.: Proc.
   of Neural Information Processing Scaled for Bioacoustics: from Neurons to Big
   Data (NIPS4B), 2013
5. Pedregosa, F., et al.: Scikit-learn: Machine Learning in Python. Journal of
   Machine Learning Research, vol. 12, pp. 2825-2830 (2011)



