<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LifeCLEF Bird Identification Task 2014</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Herve Goeau</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Herve Glotin</string-name>
          <email>glotin@univ-tln.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Willem-Pier Vellinga</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Planque</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Rauber</string-name>
          <email>rauber@ifs.tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexis Joly</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aix Marseille Univ., ENSAM, CNRS LSIS, Univ. Toulon, Institut Univ. de France</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Inria ZENITH team</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LIRMM</institution>
          ,
          <addr-line>Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Vienna University of Technology</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Xeno-canto Foundation</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <fpage>585</fpage>
      <lpage>597</lpage>
      <abstract>
        <p>The LifeCLEF bird identification task provides a testbed for a system-oriented evaluation of 501 bird species identification. The main originality of this data is that it was specifically built through a citizen science initiative conducted by Xeno-Canto, an international social network of amateur and expert ornithologists. This makes the task closer to the conditions of a real-world application than previous, similar initiatives. This overview presents the resources and the assessments of the task, summarizes the retrieval approaches employed by the participating groups, and provides an analysis of the main evaluation results. With a total of ten groups from seven countries and with a total of twenty-nine runs submitted, involving distinct and original methods, this first year task confirms the interest of the audio retrieval community for biodiversity and ornithology, and highlights further challenging studies in bird identification.</p>
      </abstract>
      <kwd-group>
        <kwd>LifeCLEF</kwd>
        <kwd>bird</kwd>
        <kwd>song</kwd>
        <kwd>call</kwd>
        <kwd>species</kwd>
        <kwd>retrieval</kwd>
        <kwd>audio</kwd>
        <kwd>collection</kwd>
        <kwd>identification</kwd>
        <kwd>fine-grained classification</kwd>
        <kwd>evaluation</kwd>
        <kwd>benchmark</kwd>
        <kwd>bioacoustics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Accurate knowledge of the identity, the geographic distribution and the
evolution of bird species is essential for a sustainable development of humanity as
well as for biodiversity conservation. Unfortunately, such basic information is
often only partially available for professional stakeholders, teachers, scientists
and citizens. In fact, it is often incomplete for ecosystems that possess the
highest diversity, such as tropical regions. A noticeable cause and consequence of
this sparse knowledge is that identifying birds is usually impossible for the
general public, and often a difficult task for professionals like park rangers, ecology
consultants, and of course, the ornithologists themselves. This "taxonomic gap"
[
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] was actually identified as one of the main ecological challenges to be solved
during the United Nations Conference in Rio de Janeiro, Brazil, in 1992.
      </p>
      <p>
        The use of multimedia identification tools is considered to be one of the most
promising solutions to help bridge this taxonomic gap [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
With the recent advances in digital devices, network bandwidth and
information storage capacities, the collection of multimedia data has indeed become an
easy task. In parallel, the emergence of "citizen science" and social networking
tools has fostered the creation of large and structured communities of nature
observers (e.g. eBird<sup>6</sup>, Xeno-canto<sup>7</sup>, etc.) that have started to produce outstanding
collections of multimedia records. Unfortunately, the performance of
state-of-the-art multimedia analysis techniques on such data is still not well understood
and is far from meeting real-world requirements for identification
tools. Most existing studies or available tools typically identify a few tens of
species with moderate accuracy, whereas they should be scaled up by one,
two or three orders of magnitude in terms of number of species.
      </p>
      <p>
        The LifeCLEF Bird task proposes to evaluate one of these challenges [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
based on big and real-world data and defined in collaboration with biologists
and environmental stakeholders so as to reflect realistic usage scenarios.
      </p>
      <p>
        Using audio records rather than bird pictures is justified by current practices
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Birds are actually not easy to photograph; audio calls and
songs have proven to be easier to collect and sufficiently species-specific.
      </p>
      <p>
        Only three notable previous worldwide initiatives on bird species
identification based on their songs or calls have taken place, all three in 2013. The first
one was the ICML4B bird challenge, held jointly with the International Conference on
Machine Learning in Atlanta, June 2013 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It was initiated by the SABIOD
MASTODONS CNRS group<sup>8</sup>, the University of Toulon and the National
Natural History Museum of Paris [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. It included 35 species, and 76 participants
submitted their 400 runs on the Kaggle interface. The second challenge was
conducted by F. Briggs at the MLSP 2013 workshop, with 15 species and 79
participants, in August 2013. The third challenge, and the biggest in 2013, was organised
by the University of Toulon, SABIOD and Biotope [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], with 80 species from
Provence, France. More than thirty teams participated, reaching an average
AUC of 92%. Descriptions of the best systems of the ICML4B and NIPS4B bird
identification challenges are given in the on-line books [
        <xref ref-type="bibr" rid="ref1 ref2">2,1</xref>
        ] including, in some cases,
references to useful scripts.
      </p>
      <p>In collaboration with the organizers of these previous challenges, BirdCLEF 2014
goes one step further by (i) significantly increasing the number of species by almost
an order of magnitude, (ii) working on real-world data collected by hundreds of
recordists, and (iii) moving to a more usage-driven and system-oriented benchmark
by allowing the use of meta-data and defining information retrieval oriented
metrics. Overall, the task is expected to be much more difficult than previous
benchmarks because of the higher confusion risk between the classes, the higher
background noise and the higher diversity in the acquisition conditions (devices,
recordists' practices, diversity of contexts, etc.). It will therefore probably produce
substantially lower scores and offer a better progression margin towards building
real-world generalist identification tools.</p>
      <p><sup>6</sup> http://ebird.org/; <sup>7</sup> http://www.xeno-canto.org/; <sup>8</sup> http://sabiod.univ-tln.fr</p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>The training and test data of the bird task are composed of audio recordings
hosted on Xeno-canto (XC). Xeno-canto is a web-based community of bird sound
recordists worldwide with about 1800 active contributors that have already
collected more than 175,000 recordings of about 9040 species. 501 species from
Brazil are used in the BirdCLEF dataset. They represent the species of that
country with the highest number of recordings on XC, totalling 14,027
recordings recorded by hundreds of users. The dataset has between 15 and 91 recordings
per species, recorded by between 10 and 42 recordists.</p>
      <p>
        To avoid any bias in the evaluation related to the audio devices used, each
audio file has been normalized to a constant sampling rate of 44.1 kHz and coded
over 16 bits in .wav mono format (the right channel was selected by default).
The conversion from the original Xeno-canto data set was done using ffmpeg, sox
and matlab scripts. An optimized set of 16 Mel Filter Cepstrum Coefficients for bird
identification (according to an extended benchmark [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) has been computed
with their first and second temporal derivatives on the whole set. These features were
used in the best systems of the ICML4B and NIPS4B challenges [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Audio records are associated with various meta-data including the species
of the most active singing bird, the species of the other birds audible in the
background, the type of sound (call, song, alarm, flight, etc.), the date and
location of the observations (from which rich statistics on species distribution
can be derived), common names and collaborative quality ratings. All of them
were produced collaboratively by the Xeno-canto community.</p>
    </sec>
    <sec id="sec-3">
      <title>Task Description</title>
      <p>Participants were asked to determine the species of the most active singing birds
in each query file. The background noise can be used like any other meta-data,
but it is forbidden to correlate the test set of the challenge with the original
annotated Xeno-canto database (or with any external content, as many such copies
are circulating on the web). More precisely, the whole BirdCLEF dataset has
been split in two parts, one for training (and/or indexing) and one for testing.
The test set was built by randomly choosing 1/3 of the observations of each
species, whereas the remaining observations were kept in the reference training
set. Recordings of the same species made by the same person on the same day are
considered as being part of the same observation and cannot be split across the
test and training sets. The XML files containing the meta-data of the query
recordings were purged so as to erase the foreground and background species names
(the ground truth), the vernacular names (common names of the birds) and the
collaborative quality ratings (that would not be available at query stage in a
real-world mobile application). Meta-data of the recordings in the training set
are kept unaltered.</p>
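      <p>The splitting rule described above (one third of the observations of each species held out, with an observation never split between the two sets) can be sketched as follows. This is only an illustration under stated assumptions, not the organizers' actual script: the record keys (id, species, recordist, date), the function name split_by_observation, the fixed random seed and the rounding of the test share are all hypothetical choices made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import defaultdict


def split_by_observation(recordings, test_fraction=1 / 3, seed=0):
    """Split recordings into (train, test) without splitting an observation.

    `recordings` is a list of dicts with the hypothetical keys
    'id', 'species', 'recordist' and 'date'.
    """
    rng = random.Random(seed)

    # An observation groups the recordings of the same species made by the
    # same person on the same day.
    observations = defaultdict(list)
    for rec in recordings:
        key = (rec["species"], rec["recordist"], rec["date"])
        observations[key].append(rec)

    # Collect the observation keys of each species.
    by_species = defaultdict(list)
    for key in observations:
        by_species[key[0]].append(key)

    train, test = [], []
    for species in sorted(by_species):
        keys = sorted(by_species[species])
        rng.shuffle(keys)
        # Assumption: the share of test observations is rounded per species.
        n_test = round(len(keys) * test_fraction)
        for i, key in enumerate(keys):
            target = test if i < n_test else train
            target.extend(observations[key])
    return train, test
```
]]></preformat>
      <p>Because whole observations are assigned to one side, the share of test recordings per species can deviate slightly from one third when observations contain several recordings.</p>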
      <p>The groups participating in the task were asked to produce up to 4 runs
containing a ranked list of the most probable species for each record of the test
set. Each species had to be associated with a normalized score in the range [0, 1]
reflecting the likelihood that this species was singing in the sample. For each
submitted run, participants had to say whether the run was performed fully
automatically or with human assistance in the processing of the queries, and whether they
used a method based only on audio analysis or also on the metadata.
The metric used to compare the runs was the Mean Average Precision averaged
across all queries. Since the audio records contain a main species and often some
background species belonging to the set of 501 species in the training set, we
decided to use two metrics, one focusing on all species (MAP1) and a second one
focusing only on the main species (MAP2).</p>
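      <p>The two metrics can be illustrated with the usual information retrieval definition of average precision. The sketch below is an assumption about the exact formula, which this overview does not spell out: the function names, the dictionary layout of the runs and the toy species labels are hypothetical. MAP2 is obtained by passing only the main species of each query as ground truth, and MAP1 by passing the main and background species together.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def average_precision(ranked_species, relevant_species):
    """Average precision of one ranked list against the true species."""
    relevant = set(relevant_species)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, species in enumerate(ranked_species, start=1):
        if species in relevant:
            hits += 1
            # Precision measured at the rank of each correct species.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, ground_truth):
    """Mean of the per-query average precision.

    `runs` maps a query id to its ranked species list; `ground_truth` maps a
    query id to the species counted as correct for that query.
    """
    scores = [
        average_precision(runs.get(query, []), truth)
        for query, truth in ground_truth.items()
    ]
    return sum(scores) / len(scores)


# Toy example with two queries and hypothetical species labels.
runs = {"q1": ["A", "B", "C"], "q2": ["B", "A"]}
main_only = {"q1": ["A"], "q2": ["A"]}              # MAP2-style ground truth
with_background = {"q1": ["A", "C"], "q2": ["A"]}   # MAP1-style ground truth

map2 = mean_average_precision(runs, main_only)
map1 = mean_average_precision(runs, with_background)
```
]]></preformat>
      <p>In this toy example the background species of the first query is only retrieved at rank 3, so the MAP1-style score is lower than the MAP2-style score.</p>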
    </sec>
    <sec>
      <title>Participants and methods</title>
      <p>87 research groups worldwide registered for the task and downloaded the data
(from a total of 127 groups that registered for at least one of the three LifeCLEF
tasks). 42 of the 87 registered groups were registered exclusively for the bird task
and not for the other LifeCLEF tasks. This shows the high attractiveness of the
task both in the multimedia community (presumably interested in several tasks)
and in the audio and bioacoustics community (presumably registered only for
the bird songs task). Finally, 10 of the 87 registrants, coming from 9 distinct
countries, crossed the finish line by submitting runs (with a total of 29 runs). These
10 were mainly academics, specialized in bioacoustics, audio processing or
multimedia information retrieval. We list them hereafter in alphabetical order and
give a brief overview of the techniques they used in their runs. We would like
to point out that the LifeCLEF benchmark is a system-oriented evaluation and
not a deep or fine evaluation of the underlying algorithms. Readers interested in
the scientific and technical details of the implemented methods should refer to
the LifeCLEF 2014 working notes or to the research papers of each participant
(referenced below).</p>
      <p>
        BiRdSPec, Brazil/Spain, 4 runs: The 4 runs submitted by this group were
based on audio features extracted by the Marsyas framework<sup>9</sup> (Time
Zero-Crossings features, Spectral Centroid, Flux and Rolloff, and Mel-Frequency Cepstral
Coefficients). The runs then differ in two major respects: (i) flat vs.
hierarchical multi-class Support Vector Machine (i.e. using a multi-class Support Vector
Machine at each node of the taxonomy, as discussed in a research paper by the
authors [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]) and (ii) classification of full records vs. classification of automatically
detected segments (with majority voting on the resulting local predictions). The
runs are as follows:
BirdSPec Run 1: flat classifier, no segmentation;
BirdSPec Run 2: flat classifier, segmentation;
BirdSPec Run 3: hierarchical classifier, no segmentation;
BirdSPec Run 4: hierarchical classifier, segmentation.
Their results (see section 5) show that (i) the segment-oriented classification
approach brings slight improvements and (ii) using the hierarchical classifier does not
improve performance over the flat one (at least with our flat evaluation
measure). Note that in every submitted run, only one species was proposed for
each query, which leads to lower performance than could be expected with several
species proposals.
      </p>
      <p><sup>9</sup> http://marsyas.info/</p>
      <p>
        Golem, Mexico, 3 runs [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]: The audio-only classification method used by
this group consists of four stages: (i) pre-processing of the audio signal based
on down-sampling and bandpass filtering (between 500 Hz and 4500 Hz), (ii)
segmentation into syllables, (iii) candidate species generation based on HOG features
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] extracted from the syllables and a Support Vector Machine, and (iv) final
identification using a sparse representation-based classification of HOG features [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
or LBP features [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Runs Golem Run 1 and Golem Run 2 differ only in the
number of candidate species kept at the third stage (100 vs. 50). Golem Run 3
uses LBP features rather than HOG features for the last step. Best performances
were achieved by Golem Run 1.
      </p>
      <p>
        HTL, Singapore, 3 runs [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]: This group experimented with several ensembles
of classifiers on spectral audio features (filtered MFCC features &amp;
spectrum-summarizing features) and metadata features (using 8 fields: Latitude,
Longitude, Elevation, Year, Month, Month + Day, Time, Author). The 3 runs mainly
differ in the ensemble of classifiers and the features used:
HLT Run 1: LDA on audio features locally pooled within 0.5-second
windows, Random Forest on metadata (matlab implementation);
HLT Run 2: LDA, Logistic Regression, SVM, AdaBoost and k-NN classifiers
on metadata and audio features globally pooled with a max pooling strategy,
Random Forest on metadata only (sklearn implementation);
HLT Run 3: combination of HLT Run 1 and HLT Run 2.
Interestingly, in further experiments reported in their working note [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], the
authors show that using only the metadata features can perform as well as using
only the audio features they experimented with.
      </p>
      <p>
        Inria Zenith, France, 3 runs [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]: This group experimented with a fine-grained
instance-based classification scheme based on the dense indexing of individual
26-dimensional MFCC features and the pruning of the non-discriminant ones.
To make this strategy scalable to the 30M MFCC features extracted from
the tens of thousands of audio recordings of the training set, they used
high-dimensional hashing techniques coupled with an efficient approximate nearest
neighbors search algorithm with controlled quality. Further improvements were
obtained by (i) using a sliding classifier with max pooling, (ii) weighting the
query features according to their semantic coherence, and (iii) making use of the
metadata to post-filter incoherent species (geo-location, altitude and
time-of-day). Runs INRIA Zenith Run 1 and INRIA Zenith Run 2 differ in whether the
post-filtering based on metadata is used or not.
      </p>
      <p>
        MNB TSA, Germany, 4 runs [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]: This participant first used the
openSMILE audio feature extraction tool [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to extract 57-dimensional low-level
audio features per frame (35 spectral features, 13 cepstral features, 6 energy
features, 3 voicing-related features) and then described an entire audio recording by
calculating statistics from the low-level feature trajectories (as well as their
velocity and acceleration trajectories) through 39 functionals including e.g. means,
extremes, moments, percentiles and linear as well as quadratic regression. This
sums up to 6669-dimensional global features (57 x 3 x 39) per recording that were
reduced to 1277-dimensional features through an unsupervised dimension
reduction technique. A second type of audio features, namely segment-probabilities,
was then extracted. This method consists in using the matching probabilities
of segments as features (or more precisely the maxima of the normalized
cross-correlation between segments and spectrogram images using a template matching
approach). The details of the different steps, including the audio signal
preprocessing, the segmentation process and the template matching, can be found in
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In addition, they extracted 8 features from the metadata (Year, Month,
Time, Latitude, Longitude, Elevation, Locality Index, Author Index). The
final classification was done by first selecting the most discriminant features per
species (from 100 to 300 features per class) and using the scikit-learn library
(ExtraTreesRegressor) for training ensembles of randomized decision trees with
probabilistic outputs. The parameter settings used in each
run are detailed in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. On average the use of Segment-Probabilities
outperforms the other feature sets but for some species the openSMILE and in rare
cases even the Metadata feature set was a better choice.
      </p>
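      <p>The way per-frame descriptors are turned into one fixed-size vector per recording (57 descriptors x 3 trajectories x 39 functionals = 6669 values) can be sketched as follows. This is a simplified illustration and not the openSMILE configuration used by the participant: only four functionals are implemented instead of 39, velocity and acceleration are approximated by plain first-order differences, and the names deltas, FUNCTIONALS and global_descriptor are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from statistics import mean, pstdev


def deltas(trajectory):
    """First-order differences, padded with 0.0 so the length is unchanged.

    Assumption: a plain difference stands in for the velocity and
    acceleration trajectories mentioned in the text.
    """
    diffs = [b - a for a, b in zip(trajectory, trajectory[1:])]
    return [0.0] + diffs


# Hypothetical, reduced set of functionals (the text mentions 39 of them).
FUNCTIONALS = {
    "mean": mean,
    "std": pstdev,
    "min": min,
    "max": max,
}


def global_descriptor(frames):
    """Summarise per-frame feature vectors into one fixed-size vector.

    `frames` is a list of equally sized lists of floats, one per audio frame.
    The output has len(frames[0]) * 3 * len(FUNCTIONALS) values.
    """
    n_features = len(frames[0])
    descriptor = []
    for j in range(n_features):
        trajectory = [frame[j] for frame in frames]
        velocity = deltas(trajectory)
        acceleration = deltas(velocity)
        for series in (trajectory, velocity, acceleration):
            for functional in FUNCTIONALS.values():
                descriptor.append(functional(series))
    return descriptor


# Dimension reported in the text: 57 descriptors, 3 trajectories, 39 functionals.
full_dimension = 57 * 3 * 39
```
]]></preformat>
      <p>The length of the resulting vector does not depend on the duration of the recording, which is what allows recordings of different lengths to be fed to a standard classifier.</p>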
      <p>
        QMUL, UK, 4 runs [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]: This group focused on unsupervised feature learning
in order to learn regularities in spectro-temporal content without reference to
the training labels and further help the classifier to generalise to further content
of the same type. MFCC features and several temporal variants are first
extracted from the audio signal after a median-based thresholding pre-processing.
The extracted low-level features are then reduced through PCA whitening and
clustered via spherical k-means (and a two-layer variant of it) to build the
vocabulary. During classification, MFCC features are pooled by projecting them
on the vocabulary with different temporal pooling strategies. The final supervised
classification is performed by a random forest classifier. This method is the
subject of a full-length article which can be read at [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The parameter
settings used in each run are detailed in the working note [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Randall, France, 1 run: The score of Randall Run 1 is below that of the
random classifier, which can be explained by errors in the use of the
labels and also by the fact that only one species was proposed for each query.
As a result, this participant did not submit a working note.
      </p>
      <p>
        SCS, UK, 3 runs [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]: By participating in the LifeCLEF 2014 Bird Task this
participant was hoping to demonstrate that spectrogram correlation as
implemented in the Ishmael v2.3 library can be very useful for the automatic
detection of certain bird calls. Using this method, each test audio record required
approximately 12 hours to be processed. The submitted run was consequently
restricted to only 14 of the 4339 test audio records, explaining the close to zero
evaluation score. This demonstrates the limitation of the approach in the context
of large-scale classification.
      </p>
      <p>
        Utrecht Univ., The Netherlands, 1 run [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]: This participant is the only one
who experimented with a deep neural network within the task (for the last steps
of the method, i.e. feature learning and classification). Their whole framework
first includes decimation and dynamic filtering of the audio signal followed by
an energy-based segment detection. Detected segments are then clustered into
higher temporal structures through a simple gap-wise merging of smaller
sections. MFCC features and several extended variants are then extracted from
the consolidated segments before being fed individually to the deep
neural network for training. At query time, an activation-weighted voting strategy is finally
used to pool the predictions of the different segments into a final strong classifier.
      </p>
      <p>
        Yellow Jackets, USA, 1 run: As this participant did not submit a working
note, we do not have any meaningful information about the submitted run Yellow
Jackets Run 1. We only know that it achieved very low performance, close to
that of the random classifier. Note that only one species was proposed for each query,
which also explains this low performance.
      </p>
      <p>[Table: overview of the techniques used by the participating groups (pre-processing, segmentation, features, classification, use of metadata); the cell contents could not be recovered from the source.]</p>
      <p>[...] the methods making use of the metadata from the purely audio-based methods.</p>
      <p>
        The first main outcome is that the two best performing methods were
already among the best performing methods in previous bird identification
challenges [
        <xref ref-type="bibr" rid="ref1 ref10 ref2 ref3">2,10,1,3</xref>
        ] although the LifeCLEF dataset is much bigger and more complex.
This clearly demonstrates the generic nature and the stability of the underlying
methods. The best performing runs of the MNB TSA group notably confirmed
that using matching probabilities of segments as features was once again a good
choice. In their working note [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Lassek et al. actually show that the use of such
Segment-Probabilities clearly outperforms the other feature sets they used (0.49
mAP compared to 0.30 for the openSMILE features [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and 0.12 for the metadata
features). The approach however remains very time consuming, as several days
on 4 computers were required to process the whole LifeCLEF dataset.
Then, the best performing (purely) audio-based runs of QMUL confirmed that
unsupervised feature learning is a simple and effective method to boost
classification performance by learning spectro-temporal regularities in the data. They
show in their working note [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] that their pooling method based on
spherical k-means produces much more effective features than the raw
initial low-level features (MFCC based). The principal practical issue with such
unsupervised feature learning is that it requires large data volumes to be
effective. However, this exhibits a synergy with the large data volumes used within
LifeCLEF. This might also explain the rather good performance obtained by
the runs of the Inria ZENITH group, who used hash-based indexing techniques of
MFCC features and approximate nearest neighbours classifiers. The underlying
hash-based partition and embedding method actually works as an unsupervised
feature learning method.
      </p>
      <p>As could be expected, the MAP1 scores (with the background
species) are generally lower than the MAP2 scores (without the
background species). Only the HTL group did not observe this, which demonstrates
the ability of their method to perform multi-label classification.</p>
      <p>A last interesting remark we can derive so far from the results comes from the
runs submitted by the BirdSPec group. As their first two runs were based on
flat SVM classifiers whereas the 3rd and 4th runs were based on a
hierarchical multi-class SVM classifier, it is possible to assess the contribution of
using the taxonomy hierarchy within the classification process. Unfortunately,
their results show that this rather tends to slightly degrade the results, at least
when using a flat classification evaluation measure such as the one we are using. On
the other hand, we cannot conclude whether the mistakes made by the flat
classifier are further from the correct species than those of the hierarchical one.
This would require using a hierarchical evaluation measure (such as the Tree
Induced Error) and might be considered in future campaigns.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>This paper presented the overview and the results of the first LifeCLEF bird
identification task. With 87 registrants, it showed a high interest of
the multimedia and bioacoustics communities in applying their technologies
to real-world environmental data such as those collected by Xeno-canto. The
main outcome of this evaluation is a snapshot of the performance of
state-of-the-art techniques that will hopefully serve as a guideline for developers interested
in building end-user applications. One important conclusion of the campaign is
that the two best performing methods were already among the best performing
methods in previous bird identification challenges, although the LifeCLEF dataset
is much bigger and more complex. This clearly demonstrates the generic nature
of the underlying methods as well as their stability. On the other hand, the size
of the data was a problem for many registered groups, who were not able to
produce results within the allocated time and finally gave up. Even the best
performing method of the task (used in the best run) was run on only 96.8%
of the test data and had to be complemented by an alternative, faster solution to
identify the remaining recordings. For the coming years, we believe it is
important to continue working at such large scales and even try to scale up the
challenge to a thousand species. Maintaining the pressure on the training set size
is actually the only way to guarantee that the evaluated technologies can
soon be integrated into real-world applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <source>Proc. of Neural Information Processing Scaled for Bioacoustics: from Neurons to Big Data</source>
          , joint to NIPS (
          <year>2013</year>
          ), http://sabiod.univ-tln.fr/NIPS4B2013_book.pdf
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <source>Proc. of the first workshop on Machine Learning for Bioacoustics</source>
          , joint to ICML (
          <year>2013</year>
          ), http://sabiod.univ-tln.fr/ICML4B2013_book.pdf
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bas</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dufour</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Overview of the nips4b bird classification</article-title>
          .
          <source>In: Proc. of Neural Information Processing Scaled for Bioacoustics: from Neurons to Big Data, joint to NIPS</source>
          . pp.
          <fpage>12</fpage>
          -
          <lpage>16</lpage>
          (
          <year>2013</year>
          ), http://sabiod.univ-tln.fr/NIPS4B2013_book.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Briggs</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lakshminarayanan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neal</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fern</surname>
            ,
            <given-names>X.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hadley</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hadley</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Betts</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          :
          <article-title>Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach</article-title>
          .
          <source>The Journal of the Acoustical Society of America</source>
          <volume>131</volume>
          ,
          <issue>4640</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ee</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roe</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Sensor network for the monitoring of ecosystem: Bird species recognition</article-title>
          .
          <source>In: Intelligent Sensors, Sensor Networks and Information, 2007. ISSNIP 2007. 3rd International Conference on</source>
          . pp.
          <fpage>293</fpage>
          -
          <lpage>298</lpage>
          (Dec
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Dalal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Triggs</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Histograms of oriented gradients for human detection</article-title>
          .
          <source>In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>886</fpage>
          -
          <lpage>893</lpage>
          .
          <publisher-name>IEEE</publisher-name>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dufour</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artieres</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giraudet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Clusterized mel filter cepstral coefficients and support vector machines for bird song identification</article-title>
          .
          <source>In: Soundscape Semiotics - Localization and Categorization</source>
          , Glotin (Ed.) (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Eyben</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , Wollmer,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          :
          <article-title>Opensmile: the munich versatile and fast open-source audio feature extractor</article-title>
          .
          <source>In: Proceedings of the international conference on Multimedia</source>
          . pp.
          <fpage>1459</fpage>
          -
          <lpage>1462</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gaston</surname>
            ,
            <given-names>K.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Neill</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Automated species identification: why not?</article-title>
          <source>Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences</source>
          <volume>359</volume>
          (
          <issue>1444</issue>
          ),
          <fpage>655</fpage>
          -
          <lpage>667</lpage>
          (
          <year>2004</year>
          ), http://rstb.royalsocietypublishing.org/content/359/1444/655.abstract
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sueur</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Overview of the 1st int'l challenge on bird classification</article-title>
          .
          <source>In: Proc. of the first workshop on Machine Learning for Bioacoustics</source>
          , joint to ICML. pp.
          <fpage>17</fpage>
          -
          <lpage>21</lpage>
          (
          <year>2013</year>
          ), http://sabiod.univ-tln.fr/ICML4B2013_book.pdf
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Champ</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buisson</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Instance-based bird species identification with undiscriminant features pruning - lifeclef2014</article-title>
          . In: Working notes of CLEF 2014 conference (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Muller, H., Goeau, H.,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spampinato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fisher</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Lifeclef 2014: multimedia life species identification challenges</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lasseck</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Large-scale identification of birds in audio recordings</article-title>
          .
          <source>In: Working notes of CLEF 2014 conference (</source>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schoenberger</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shiozawa</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Contour matching for a fish recognition and migration-monitoring system</article-title>
          .
          <source>In: Optics East</source>
          . pp.
          <fpage>37</fpage>
          -
          <lpage>48</lpage>
          .
          <publisher-name>International Society for Optics and Photonics</publisher-name>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Martinez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silvan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villarreal</surname>
            ,
            <given-names>E.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuentes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meza</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Svm candidates and sparse representation for bird identification</article-title>
          .
          <source>In: Working notes of CLEF 2014 conference (</source>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Northcott</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Overview of the lifeclef 2014 bird task</article-title>
          .
          <source>In: Working notes of CLEF 2014 conference (</source>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>L.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>William Dennis</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huy Dat</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Bird classification using ensemble classifiers</article-title>
          .
          <source>In: Working notes of CLEF 2014 conference (</source>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Silla</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A survey of hierarchical classification across different application domains</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <volume>22</volume>
          ,
          <fpage>31</fpage>
          -
          <lpage>72</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Stowell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plumbley</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          :
          <article-title>Audio-only bird classification using unsupervised feature learning</article-title>
          .
          <source>In: Working notes of CLEF 2014 conference (</source>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Stowell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plumbley</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          :
          <article-title>Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning</article-title>
          .
          <source>arXiv preprint arXiv:1405.6524</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Towsey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planitz</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nantes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wimmer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roe</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A toolbox for animal call recognition</article-title>
          .
          <source>Bioacoustics</source>
          <volume>21</volume>
          (
          <issue>2</issue>
          ),
          <fpage>107</fpage>
          -
          <lpage>125</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Trifa</surname>
            ,
            <given-names>V.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kirschel</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vallejo</surname>
            ,
            <given-names>E.E.</given-names>
          </string-name>
          :
          <article-title>Automated species recognition of antbirds in a mexican rainforest using hidden markov models</article-title>
          .
          <source>The Journal of the Acoustical Society of America</source>
          <volume>123</volume>
          ,
          <issue>2424</issue>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Vincent Koops</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Balen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiering</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A deep neural network approach to the lifeclef 2014 bird task</article-title>
          .
          <source>In: Working notes of CLEF 2014 conference (</source>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          :
          <article-title>Texture classification using texture spectrum</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>23</volume>
          (
          <issue>8</issue>
          ),
          <fpage>905</fpage>
          -
          <lpage>910</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Wheeler</surname>
            ,
            <given-names>Q.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raven</surname>
            ,
            <given-names>P.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>E.O.</given-names>
          </string-name>
          :
          <article-title>Taxonomy: Impediment or expedient?</article-title>
          <source>Science</source>
          <volume>303</volume>
          (
          <issue>5656</issue>
          ),
          <fpage>285</fpage>
          (
          <year>2004</year>
          ), http://www.sciencemag.org/content/303/5656/285.short
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>