<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving Bird Identification using Multiresolution Template Matching and Feature Selection during Training</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mario Lasseck</string-name>
          <email>Mario.Lasseck@mfn-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Animal Sound Archive, Museum für Naturkunde Berlin</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This working note describes methods to automatically identify a large number of different bird species by their songs and calls. It focuses primarily on new techniques introduced for this year's task, like advanced spectrogram segmentation and decision tree based feature selection during training. Considering the identification of dominant species, previous results of the LifeCLEF Bird Identification Task could be further improved by 29%, achieving a mean Average Precision (mAP) of 59%. The proposed approach ranked second among all participating teams and provided the best system to identify birds in soundscape recordings.</p>
      </abstract>
      <kwd-group>
        <kwd>Bird Identification</kwd>
        <kwd>Multimedia Information Retrieval</kwd>
        <kwd>Spectrogram Segmentation</kwd>
        <kwd>Multiresolution Template Matching</kwd>
        <kwd>Feature Selection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Automated acoustic methods of species identification can serve as a useful tool for
biodiversity assessments. Within the scope of the LifeCLEF 2016 Bird Identification
Task researchers are challenged to identify 999 different species in a large and highly
diverse set of audio files. The audio recordings forming the training and test data set
are built from the Xeno-canto collaborative database (www.xeno-canto.org). A
novelty in this year’s challenge is the enrichment of the test data set by including a new set
of soundscape recordings. These soundscapes are not targeting any specific species
during recording and can contain an arbitrary number of singing birds. To establish
reliable acoustic methods for assessing biodiversity it is essential to improve the
automated identification of birds in general but especially within these soundscape
recordings. An overview and further details about the Bird Identification Task are given
in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The task is, among others, part of the LifeCLEF 2016 evaluation campaign [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Some methods referred to in the following sections are further developments of
approaches already successfully applied in previous identification tasks. A more detailed
description of these approaches can be found in [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3,4,5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Feature Engineering</title>
      <p>Two main categories of features (the same as last year) were used for training and
prediction: matching probabilities of species-specific 2D spectrogram segments (see
2.1 Segment-Probabilities) and acoustic features extracted with openSMILE (see 2.2
Parametric Acoustic Features). For this year’s task a large number of new
Segment-Probability features were added for training by extracting new sound segments from
audio files using the following two methods.</p>
      <p>
        Re-segmentation of large segments. Using the automated segmentation method of
spectrograms described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] some of the extracted segments turned out to be quite
large and in some cases not very useful for template matching, especially when
processing audio files with a lot of background noise, a low signal-to-noise ratio or many
overlapping sounds. To overcome this problem, all segments having a duration longer
than half a second or a frequency range greater than 6 kHz were treated as separate
spectrogram images and re-segmented with a slightly different image preprocessing
technique. The preprocessing steps for these too large segments differed in the
following ways from the original preprocessing of the spectrogram image (see the sketch
after this list):
        <list list-type="bullet">
          <list-item><p>transform the spectrogram into a binary image via Median Clipping by setting each pixel to 1 if it is 3.5 (instead of 3) times the median of its corresponding row AND column, otherwise to 0</p></list-item>
          <list-item><p>apply binary closing with a structuring element of size 2x2 (instead of 6x10) pixels</p></list-item>
          <list-item><p>no dilation (instead of binary dilation with a structuring element of size 3x5 pixels)</p></list-item>
          <list-item><p>apply a median filter with a window size of 4x4 (instead of 5x3) pixels</p></list-item>
          <list-item><p>remove small objects if smaller than 10 (instead of 50) pixels</p></list-item>
          <list-item><p>enlarge segments in each direction by 10 (instead of 12) pixels</p></list-item>
        </list>
Basically the image preprocessing was adjusted to be more sensitive and to capture
smaller sound components and species-specific sub-elements within larger song
structures and call sequences. Figure 1 visualizes an example of the new segmentation
method.
      </p>
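      <p>The following is a minimal code sketch of these preprocessing steps, assuming the scientific Python stack (numpy, scipy, scikit-image); the function and variable names are illustrative, and spec stands for a 2D array holding the magnitude spectrogram of one large segment.</p>
      <preformat>
import numpy as np
from scipy import ndimage
from skimage import measure, morphology

def resegment(spec, clip_factor=3.5):
    # Median Clipping: a pixel is foreground only if it exceeds
    # clip_factor times the median of BOTH its row and its column
    row_med = np.median(spec, axis=1, keepdims=True)
    col_med = np.median(spec, axis=0, keepdims=True)
    binary = ((spec &gt; clip_factor * row_med) &amp;
              (spec &gt; clip_factor * col_med))

    # binary closing with a small 2x2 structuring element (no dilation step)
    binary = morphology.binary_closing(binary, np.ones((2, 2)))

    # 4x4 median filter to remove spurious isolated pixels
    binary = ndimage.median_filter(binary, size=(4, 4)).astype(bool)

    # discard connected components smaller than 10 pixels
    binary = morphology.remove_small_objects(binary, min_size=10)

    # bounding boxes of the remaining components, enlarged by 10 pixels
    # in each direction (clipped to the image borders)
    segments = []
    for region in measure.regionprops(measure.label(binary)):
        r0, c0, r1, c1 = region.bbox
        segments.append((max(r0 - 10, 0), max(c0 - 10, 0),
                         min(r1 + 10, spec.shape[0]),
                         min(c1 + 10, spec.shape[1])))
    return segments
      </preformat>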
      <p>Fig. 1. Spectrogram re-segmentation example (MediaID: 8). Top: initial segmentation (large
segments marked in yellow); bottom: additional segments extracted via re-segmentation of
large segments, marked in red.</p>
      <p>With re-segmentation of previously segmented files, 1,671,600 new segments were
extracted and subsequently used for template matching to generate additional
Segment-Probability features.</p>
      <p>Extracting more features by segmenting files with low average precision. Besides
re-segmentation of segmented files, additional files from the training set were chosen
for segment extraction. However, instead of a random selection, a small number of
files were chosen for each species (approx. 2 to 4) by selecting the ones having the
lowest average precision score calculated during cross validation in previous training
steps. This was done in two iterations (with new training and feature selection steps in
between), increasing the number of features by an additional 1,375,928
Segment-Probabilities.</p>
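      <p>As a brief illustration of this selection step (the data layout here is a hypothetical assumption), the files per species could be picked as follows, given per-file average precision scores from cross validation.</p>
      <preformat>
import pandas as pd

# assumed columns: file_id, species_id, ap (average precision from CV)
cv_scores = pd.read_csv("cv_average_precision.csv")

# sort by ascending AP and keep the lowest scoring files per species
files_for_segmentation = (cv_scores.sort_values("ap")
                                   .groupby("species_id")
                                   .head(3))   # approx. 2 to 4 per species
      </preformat>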
      <sec id="sec-2-1">
        <title>Segment-Probabilities</title>
        <p>
          For each species an individual feature set was formed by sweeping all segments
related to that particular species over the spectrogram representations of all training and
test recordings. The features were extracted via multiresolution template matching
followed by selecting the maxima of the normalized cross-correlation [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In this
context, multiresolution has a double meaning. On the one hand it refers to the time and
frequency resolution of the spectrogram image itself. For 492,753 segments (already
used for the BirdCLEF 2014 identification task) a time resolution of Δt = 11.6 ms
(approx. 86 pixels per second) and a frequency resolution of Δf = 43.07 Hz (approx.
23 pixels per kHz) were used. For all other, newly extracted segments both time and
frequency resolution were halved by downsampling the spectrogram image by a
factor of 2. On the other hand the template matching itself can be interpreted as
multiresolution in terms of the time and frequency range, i.e. the size of the different
spectrogram patches. Because large segments were further re-segmented, matching is
performed both for larger sound combinations (song syllables, call sequences) and for
smaller, rather fine-grained sound sub-elements (song elements, single calls).
        </p>
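        <p>A sketch of the extraction of a single Segment-Probability feature is given below; match_template from scikit-image computes the normalized cross-correlation of [<xref ref-type="bibr" rid="ref6">6</xref>], while the function name and the handling of the downsampled case are illustrative assumptions.</p>
        <preformat>
import numpy as np
from skimage.feature import match_template
from skimage.transform import rescale

def segment_probability(template, spectrogram, halve_resolution=False):
    if halve_resolution:
        # newly extracted segments are matched at half the time and
        # frequency resolution (images downsampled by a factor of 2)
        spectrogram = rescale(spectrogram, 0.5, anti_aliasing=True)
        template = rescale(template, 0.5, anti_aliasing=True)
    # the maximum of the normalized cross-correlation serves as the
    # matching probability of this segment for this recording
    return match_template(spectrogram, template).max()
        </preformat>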
      </sec>
      <sec id="sec-2-2">
        <title>Parametric Acoustic Features</title>
        <p>
          Besides Segment-Probabilities, for some models also parametric acoustic features
were used for prediction. To extract these features the openSMILE Feature Extractor
Tool [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] was utilized again. The configuration file originally designed for emotion
detection in speech signals was adapted to capture the characteristics of bird sounds.
It first calculates 73 low-level descriptors (LLDs) per frame, adds delta (velocity) and
delta-delta (acceleration) coefficients to each LLD and finally applies 39 statistical
functionals to all feature trajectories, smoothed via moving average.
Altogether the 73 LLDs consist of: 1 time domain signal feature (zero crossing rate), 39
spectral features (Mel-spectrum bins 0-25; 25%, 50%, 75% and 90% spectral roll-off
points; spectral centroid, flux, entropy, variance, skewness, kurtosis and slope;
relative position of spectral minimum and maximum), 17 cepstral features (MFCC 0-16),
6 pitch-related features (F0, F0 envelope, F0 raw, voicing probability, voice quality,
log harmonics-to-noise ratio computed from the ACF) and 10 energy-related features
(logarithmic energy as well as energy in frequency bands: 150-500 Hz, 400-1000 Hz,
800-1500 Hz, 1000-2000 Hz, 1500-4000 Hz, 3000-6000 Hz, 5000-8000 Hz,
7000-10000 Hz and 9000-11000 Hz). To summarize an entire recording, statistics are
calculated from all LLD, velocity and acceleration trajectories by 39 functionals
including e.g. means, extremes, moments, percentiles and linear as well as quadratic
regression. In total this sums up to 8541 (73·3·39) features per recording. Further details
regarding openSMILE and the features extracted for bird identification can be found
in the openSMILE book [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and the OpenSmileForBirds_v2.conf configuration file
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
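        <p>The following sketch illustrates only the layout of these features (it is not openSMILE itself): each LLD trajectory is augmented with delta and delta-delta coefficients and then summarized per recording; the handful of functionals shown stands in for the full set of 39.</p>
        <preformat>
import numpy as np

def delta(x):
    # first-order difference as a simple velocity approximation
    return np.diff(x, prepend=x[0])

def functionals(x):
    # a few representative functionals (means, extremes, moments, ...)
    return [x.mean(), x.min(), x.max(), x.std(), np.percentile(x, 50)]

def summarize(llds):                    # llds: array of shape (73, n_frames)
    feats = []
    for lld in llds:
        for traj in (lld, delta(lld), delta(delta(lld))):
            feats.extend(functionals(traj))
    return np.asarray(feats)            # 73 * 3 * n_functionals values
        </preformat>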
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Training and Feature Selection</title>
      <p>
        The classification task was transformed into 999 one-vs-rest multi-label regression
tasks. This way the number of selected features could be optimized separately and
independently for each species during training. For each audio file in the training set
the target function was set to 1.0 for the dominant species and 0.5 for all background
species. Ensembles of randomized decision trees (ExtraTreesRegressor [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) of the
scikit-learn machine learning library were used for training and prediction [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
Feature Selection during Training. Feature importance returned by the ensemble of
decision trees was accumulated during training and used to rank individual features. The
importance of each feature is determined by the total reduction of the mean squared
error brought by that particular feature. After a complete training pass, including cross
validation, the number of features was reduced by keeping only the N highest scoring
and therefore most important features. The number N of features kept for the next
training iteration was set to 85% of the features from the previous iteration.
Different percentages were tested (75% to 90%) to find a good compromise between
training time and finding the optimal number of features. After the time-consuming
feature reduction procedure (training iterations were repeated until only 5 features
were left to predict each species) the optimal number and best
performing features per species were selected by finding either the maximum of the Area
Under the Curve (AUC) or alternatively the maximum mAP score calculated over the
entire training set. Figure 2 shows two examples of resulting AUC (with and without
background species), mAP and R2 (coefficient of determination) score trajectories
when successively discarding 15% of the least important features. The maximum of
each evaluation criterion is marked with a red square. The features used in the
corresponding training iteration (maximum of AUC or mAP score) were then chosen for
predicting the test files.
      </p>
      <p>Fig. 2. Progress of AUC, mAP and R2 scores during feature selection for left: Scytalopus
latrans (SpeciesID: lzezgo) and right: Psarocolius decumanus (SpeciesID: cxyhrl)</p>
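      <p>Below is a condensed sketch of the per-species training loop with decision tree based feature selection; the targets y (1.0 for the dominant, 0.5 for background species), the ExtraTreesRegressor and the 85% keep ratio follow the text, while the cross-validation setup and the AUC-only scoring are simplifying assumptions.</p>
      <preformat>
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def select_features(X, y, keep_ratio=0.85, min_features=5):
    selected = np.arange(X.shape[1])
    best_auc, best_subset = -np.inf, selected
    while True:
        model = ExtraTreesRegressor(n_estimators=300, n_jobs=-1)
        # cross-validated predictions to score the current feature subset
        pred = cross_val_predict(model, X[:, selected], y, cv=5)
        auc = roc_auc_score(y &gt;= 1.0, pred)  # dominant species as positives
        if auc &gt; best_auc:
            best_auc, best_subset = auc, selected.copy()
        if len(selected) &lt;= min_features:
            break
        # rank features by their importance (total reduction of the mean
        # squared error) in the fitted ensemble and keep the top 85%
        model.fit(X[:, selected], y)
        order = np.argsort(model.feature_importances_)[::-1]
        n_keep = max(int(len(selected) * keep_ratio), min_features)
        selected = selected[order[:n_keep]]
    return best_subset, best_auc
      </preformat>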
    </sec>
    <sec id="sec-4">
      <title>Submission Results</title>
      <p>In Table 1 results of the submitted runs are summarized using two evaluation
statistics: mean of the Area Under the Curve calculated per species and mean Average
Precision on the public training and the private test sets. For all runs, only audio
features (i.e. features extracted from the audio files) and no external resources were used
for training and prediction.
Run 1. For the best performing first run just a single model was used. This model was
trained using only a small but highly optimized selection of Segment-Probabilities (as
described in the previous section). For this run, features were selected per species by
optimizing the mAP score on the training set. A total of 125,402 features (with a
minimum of 20, a maximum of 1833 and an average of 126 features per species) were
used to predict all species in the test files.</p>
      <p>
        Run 2. The second run was submitted quite early as an interim result and is therefore
not discussed further here. It was actually supposed to be replaced by the
submission of another run averaging the predictions of several different models.
Unfortunately uploading could not be completed before the submission deadline.
Run 3. For the third submitted run blending of different models followed by
postprocessing was used as described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Predictions from all models created during
training as well as predictions from the two best performing models submitted last
year were included (in total 24 models). Some models used Segment-Probabilities or
openSMILE features only, others a combination of both. Also different feature sets
were used with the number of features included for training and prediction optimized
regarding either AUC (with and without using background species) or mAP score.
Run 4. The fourth run also used blending to aggregate model predictions. But unlike
run 3, only those predictions were included that, after blending, resulted in the highest
possible mAP score calculated on the entire training set (13 models including the best
model from 2015).
      </p>
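      <p>As a minimal sketch of the blending step (the actual runs additionally used weighted combinations and post-processing as described in [<xref ref-type="bibr" rid="ref5">5</xref>]), the per-model prediction matrices can simply be averaged:</p>
      <preformat>
import numpy as np

def blend(predictions, weights=None):
    # predictions: list of arrays of shape (n_files, n_species),
    # one per model; returns their (optionally weighted) average
    return np.average(np.stack(predictions), axis=0, weights=weights)
      </preformat>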
      <p>
        Figure 3 visualizes the official scores of the LifeCLEF 2016 Bird Identification Task.
The approach proposed here ranked second among all teams (MarioTsaBerlin
Run 1 &amp; 4) and provided the best system to identify birds in soundscape recordings.
Interestingly, for the best performing submission (Run 1) just a single model was used
with only one category of features. Although this model used a good selection of
features, one would expect blending several models with different feature sets
to perform better than just a single one. One possible explanation for the
comparatively weak results achieved with blending could be the inclusion of the best
combined models from 2015. Those combined and post-processed predictions already
showed a fairly high overfitting on the training set and blending was perhaps done in
favor of these predictions at the cost of the maybe better generalizing new models.
On the other hand, achieving an improvement of almost 30% on the mAP score with a
single model (25% if taking background species into account) clearly shows that the
techniques introduced this year could be applied very successfully. They also seem to
complement each other quite well. Extracting additional, fine-grained spectrogram
segments for template matching by re-segmenting larger segments captures typical
sub-elements of songs or call sequences. Matching these sub-elements can give better
identification performance than matching larger song structures, especially if those
show a high variability between different individuals of the same species. The
downside of the new segmentation method is that it collects many redundant or even
useless segments, e.g. when dealing with noisy recordings or overlapping sounds from
other species or sources. However, the proposed feature selection method can
compensate for that by successively discarding irrelevant features during training.
This year also deep learning techniques were successfully applied to the BirdCLEF
dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. By using convolutional neural networks (CNNs) the best performing
system achieved a mAP score of almost 70% when ignoring background species. It
outperformed the approach described here by 17%, or 7% when also identifying all
background species (see Fig.3 Cube Run 4). For soundscape recordings, however, the
technique proposed in this paper achieved a 76% better performance than the best run
using CNNs. Although identification performance for the newly introduced test set was
generally low among all teams, in the case of soundscapes, template matching seems
to be better suited. The matching of rather small templates is not so much affected by
surrounding sound events (e.g. coming from many simultaneously vocalizing
animals) and therefore can create features more robust to various background noises.
Compared to the black-box architecture of a neural network classifier, using
template matching and decision tree based feature selection also has some additional
advantages. By visually or acoustically examining the most important and best
discriminating sound elements of a species (typical calls, syllables or song phrases) one
can gain a better insight into its sound repertoire and learn more about its call or song
characteristics. The following figures visualize sound elements most suitable to
identify a certain species. Each spectrogram segment is positioned at its original frequency
position within a box representing the frequency range of 0 to 11025 Hz. More figures
and additional material can be found at [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>Fig. 33. Rufous-collared Sparrow / Zonotrichia capensis (ID: yyjrms)</p>
      <p>Acknowledgments. I would like to thank Hervé Glotin, Hervé Goëau, Willem-Pier
Vellinga, Alexis Joly and Henning Müller for organizing this task and the
Xeno-canto foundation for nature sounds as well as the French projects Floris'Tic (INRIA,
CIRAD, Tela Botanica) and SABIOD Mastodons for their support. I also want to
thank the BMUB (Bundesministerium für Umwelt, Naturschutz, Bau und
Reaktorsicherheit), the Museum für Naturkunde and especially Dr. Karl-Heinz Frommolt for
supporting my research and Wolfram Fritzsch for providing me with additional
hardware resources.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Goëau</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glotin</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planqué</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            <given-names>WP</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joly</surname>
            <given-names>A</given-names>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>LifeCLEF Bird Identification Task 2016</article-title>
          , In: CLEF working notes
          <year>2016</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Joly</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goëau</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glotin</surname>
            <given-names>H</given-names>
          </string-name>
          et al. (
          <year>2016</year>
          )
          <article-title>LifeCLEF 2016: multimedia life species identification challenges</article-title>
          ,
          <source>In: Proceedings of CLEF 2016</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lasseck</surname>
            <given-names>M</given-names>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>Bird Song Classification in Field Recordings: Winning Solution for NIPS4B 2013 Competition</article-title>
          , In: Glotin H. et al. (eds.).
          <source>Proc. of int. symp. Neural Information Scaled for Bioacoustics</source>
          , sabiod.org/nips4b, joint to NIPS, Nevada, Dec.
          <year>2013</year>
          :
          <fpage>176</fpage>
          -
          <lpage>181</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lasseck</surname>
            <given-names>M</given-names>
          </string-name>
          (
          <year>2015a</year>
          )
          <article-title>Towards Automatic Large-Scale Identification of Birds in Audio Recordings</article-title>
          ,
          <source>In Lecture Notes in Computer Science</source>
          Vol.
          <volume>9283</volume>
          : pp
          <fpage>364</fpage>
          -
          <lpage>375</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lasseck</surname>
            <given-names>M</given-names>
          </string-name>
          (
          <year>2015b</year>
          )
          <article-title>Improved Automatic Bird Identification through Decision Tree based Feature Selection and Bagging</article-title>
          , In: Working notes of CLEF 2015 conference
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lewis</surname>
            <given-names>JP</given-names>
          </string-name>
          (
          <year>1995</year>
          )
          <article-title>Fast Normalized Cross-Correlation</article-title>
          , Industrial Light and Magic
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Eyben</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weninger</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gross</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuller</surname>
            <given-names>B</given-names>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor</article-title>
          ,
          <source>In: Proc. ACM Multimedia (MM)</source>
          , Barcelona, Spain,
          ACM,
          <source>ISBN 978-1-4503-2404-5</source>
          , pp.
          <fpage>835</fpage>
          -
          <lpage>838</lpage>
          ,
          <year>October 2013</year>
          , doi:10.1145/2502081.2502224
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. http://www.audeering.com/research/opensmile</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. http://www.animalsoundarchive.org/RefSys/LifeCLEF2015</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Geurts</surname>
            <given-names>P</given-names>
          </string-name>
          et al. (
          <year>2006</year>
          )
          <article-title>Extremely randomized trees</article-title>
          ,
          <source>Machine Learning</source>
          ,
          <volume>63</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          -
          <lpage>42</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Pedregosa</surname>
            <given-names>F</given-names>
          </string-name>
          et al. (
          <year>2011</year>
          )
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          ,
          <source>JMLR 12</source>
          , pp.
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Sprengel</surname>
            <given-names>E</given-names>
          </string-name>
          et al. (
          <year>2016</year>
          )
          <article-title>Audio Based Bird Species Identification using Deep Learning Techniques</article-title>
          , In: Working notes of CLEF 2016 conference
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. http://www.animalsoundarchive.org/RefSys/LifeCLEF2016</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>