
LifeCLEF Bird Identification Task 2016

Herve Goeau

Herve Glotin

glotin@univ-tln.fr 0

Willem-Pier Vellinga

Robert Planque

Alexis Joly

alexis.joly@inria.fr 2 3

0 Aix Marseille Univ., ENSAM, CNRS LSIS, Univ. Toulon, Institut Univ. de France
1 IRD, UMR AMAP, Montpellier, France
2 Inria ZENITH team, Montpellier, France
3 LIRMM, Montpellier, France
4 Xeno-canto Foundation, The Netherlands

The LifeCLEF bird identification challenge provides a large-scale testbed for the system-oriented evaluation of bird species identification based on audio recordings. One of its main strengths is that the data used for the evaluation is collected through Xeno-canto, the largest network of bird sound recordists in the world. This makes the task closer to the conditions of a real-world application than previous, similar initiatives. The main novelty of the 2016 edition of the challenge was the inclusion of soundscape recordings in addition to the usual Xeno-canto recordings that focus on a single foreground species. This paper reports the methodology of the evaluation, an overview of the systems tested by the 6 participating research groups, and a synthetic analysis of the results.

Keywords: LifeCLEF, bird, song, call, species, retrieval, audio, collection, identification, fine-grained classification, evaluation, benchmark, bioacoustics, ecological monitoring

Accurate knowledge of the identity, geographic distribution and evolution of bird species is essential for the sustainable development of humanity as well as for biodiversity conservation. The general public as well as professionals such as park rangers, ecological consultants and, of course, ornithologists themselves are potential users of an automated bird identification system, typically in the context of wider initiatives related to ecological surveillance or biodiversity conservation. The LifeCLEF Bird challenge proposes to evaluate the state of the art of audio-based bird identification systems at a very large scale. Before LifeCLEF started in 2014, three previous initiatives on the evaluation of acoustic bird species identification took place, including two from the SABIOD group (Scaled Acoustic Biodiversity, http://sabiod.univ-tln.fr) [4,3,1]. In collaboration with the organizers of these previous challenges, the BirdCLEF 2014, 2015 and 2016 challenges went one step further by (i) increasing the number of species by an order of magnitude, (ii) working on real-world social data built from thousands of recordists, and (iii) moving to a more usage-driven and system-oriented benchmark by allowing the use of meta-data and defining information-retrieval-oriented metrics. Overall, the task is much more difficult than previous benchmarks because of the higher risk of confusion between classes, the higher background noise and the greater diversity of acquisition conditions (different recording devices, diverse contexts, etc.). It therefore produces substantially lower scores and offers a better progression margin towards building real-world generalist identification tools.

The main novelty of the 2016 edition of the challenge with respect to the two previous years was the inclusion of soundscape recordings in addition to the usual Xeno-canto recordings that focus on a single foreground species (usually thanks to mono-directional recording devices). Soundscapes, on the other hand, are generally based on omnidirectional recording devices that continuously monitor a specific environment over a long period. This new kind of recording better fits the (possibly crowdsourced) passive acoustic monitoring scenario, which could increase the number of collected records by several orders of magnitude. In this paper, we report the methodology of the evaluation as well as a synthetic analysis of the results achieved by the 6 participating groups.

Dataset

The training and test data of the challenge consist of audio recordings collected by Xeno-canto (XC). Xeno-canto is a web-based community of bird sound recordists worldwide with about 3,000 active contributors who have already collected more than 300,000 recordings of about 9,550 species (numbers for June 2016). Nearly 1000 (in fact 999) species were used in the BirdCLEF dataset, representing the 999 species with the highest number of recordings in October 2014 (14 or more) from the combined area of Brazil, French Guiana, Surinam, Guyana, Venezuela and Colombia, totalling 33,203 recordings produced by thousands of users. This dataset includes the entire dataset from the 2015 BirdCLEF challenge [5], which contained about 33,000 recordings. The test data newly introduced in 2016 contains 925 soundscapes provided by 7 Xeno-canto members, sometimes working in pairs. Most of the soundscapes are about 10 minutes long, and each often comes from a set of 10-12 successive recordings made at one location. The total duration of new test data to process and analyse is thus equivalent to approximately 6 days of continuous sound recording. The number of known species (i.e. belonging to the 999 species in the training dataset) varies from 1 to 25 species, with an average of 10.1 species per soundscape.
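The duration estimate above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes that every one of the 925 soundscapes lasts exactly 10 minutes, whereas the text only gives this as an approximate length for most of them.

```python
# Rough check of the total soundscape duration.
# Assumption (not from the source): each soundscape is exactly 10 minutes long.
NUM_SOUNDSCAPES = 925
MINUTES_PER_SOUNDSCAPE = 10  # nominal length, assumed

total_minutes = NUM_SOUNDSCAPES * MINUTES_PER_SOUNDSCAPE
total_hours = total_minutes / 60
total_days = total_hours / 24

print(f"{total_minutes} min = {total_hours:.1f} h = {total_days:.2f} days")
```

Under that assumption the total falls between 6 and 7 days, which is consistent with the "approximately 6 days" quoted above.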

To avoid any bias in the evaluation related to the audio devices used, each audio file was normalized to a constant sampling rate of 44.1 kHz and coded with 16 bits in mono wav format (the right channel is selected by default). The conversion from the original Xeno-canto data set (http://www.xeno-canto.org/) was done using ffmpeg, sox and matlab scripts. The optimized 16 Mel Filter Cepstrum Coefficients for bird identification (according to an extended benchmark [2]) were computed with their first and second temporal derivatives on the whole set. They were used in the best systems run in the ICML4B and NIPS4B challenges. However, due to some technical limitations, the soundscapes were not normalized and were provided directly to the participants in mp3 format (as shared on the Xeno-canto website, the original raw files not being available).

All audio records are associated with various meta-data including the species name of the most active singing bird, the species of the other birds audible in the background, the type of sound (call, song, alarm, flight, etc.), the date and location of the observations (from which rich statistics on species distribution may be derived), some textual comments by the authors, multilingual common names and collaborative quality ratings. All of them were produced collaboratively by the Xeno-canto community.

Task Description

Participants were asked to determine all the actively singing bird species in each query file. It was forbidden to correlate the test set of the challenge with the original annotated Xeno-canto database (or with any external content, as many copies of it are circulating on the web). The whole data set was split in two parts, one for training (and/or indexing) and one for testing. The test set was composed of (i) all the newly introduced soundscape recordings and (ii) the entire test set used in 2015 (equal to about 1/3 of the observations in the whole 2015 dataset). The training set was exactly the same as the one used in 2015 (i.e. the remaining 2/3 of the observations). Note that recordings of the same species made by the same person on the same day are considered part of the same observation and cannot be split across the test and training sets. The XML files containing the meta-data of the query recordings were purged so as to erase the taxon name (the ground truth), the vernacular name (common name of the bird) and the collaborative quality ratings (which would not be available at query stage in a real-world mobile application). Meta-data of the recordings in the training set were kept unaltered.
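The observation constraint described above (recordings of the same species by the same person on the same day stay together) can be illustrated with a small sketch. It is not the organizers' actual procedure, which the text does not detail: the function name, the dictionary keys (`id`, `species`, `recordist`, `date`) and the random assignment of whole observations are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_observation(recordings, test_fraction=1 / 3, seed=0):
    """Split recordings into (train, test) without splitting an observation.

    `recordings` is a list of dicts with the hypothetical keys
    'id', 'species', 'recordist' and 'date'.
    """
    # Group recordings that share species, recordist and day: one observation.
    observations = defaultdict(list)
    for rec in recordings:
        key = (rec["species"], rec["recordist"], rec["date"])
        observations[key].append(rec)

    # Shuffle the observation keys reproducibly, then cut at the requested fraction.
    keys = sorted(observations)
    random.Random(seed).shuffle(keys)
    n_test = round(len(keys) * test_fraction)

    test = [rec for key in keys[:n_test] for rec in observations[key]]
    train = [rec for key in keys[n_test:] for rec in observations[key]]
    return train, test


# Toy example: recordings 1 and 2 form a single observation.
recordings = [
    {"id": 1, "species": "A", "recordist": "x", "date": "2014-01-01"},
    {"id": 2, "species": "A", "recordist": "x", "date": "2014-01-01"},
    {"id": 3, "species": "A", "recordist": "y", "date": "2014-01-01"},
    {"id": 4, "species": "B", "recordist": "x", "date": "2014-01-02"},
]
train, test = split_by_observation(recordings)
```

Because the split is made over observation keys rather than individual files, the fraction of recordings ending up in the test set only approximates `test_fraction`.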

The groups participating in the task were asked to produce up to 4 runs, each containing a ranked list of the most probable species for each query record of the test set. Each species was associated with a normalized score in the range [0, 1] reflecting the likelihood that this species is singing in the sample. For each submitted run, participants had to state whether the run was performed fully automatically or with human assistance in the processing of the queries, and whether the method was based only on audio analysis or also used the metadata.

The primary metric used was the mean Average Precision (mAP) averaged across all queries, considering each audio file of the test set as a query and computed as:

$$\mathrm{mAP} = \frac{\sum_{q=1}^{Q} \mathrm{AveP}(q)}{Q},$$

where $Q$ is the number of test audio files and $\mathrm{AveP}(q)$ for a given test file $q$ is computed as

$$\mathrm{AveP}(q) = \frac{\sum_{k=1}^{n} P(k)\,\mathrm{rel}(k)}{\text{number of relevant documents}}.$$

Here $k$ is the rank in the sequence of returned species, $n$ is the total number of returned species, $P(k)$ is the precision at cut-off $k$ in the list and $\mathrm{rel}(k)$ is an indicator function equaling 1 if the item at rank $k$ is a relevant species (i.e. one of the species in the ground truth).
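The metric defined above can be sketched in a few lines of Python. This is an illustrative implementation, not the official evaluation script: the function names and the dictionary-based input format are assumptions, ranked lists are assumed to contain no duplicate species, and a query with an empty ground truth is (arbitrarily) scored 0.

```python
def average_precision(ranked_species, relevant_species):
    """AveP(q): sum over ranks k of P(k) * rel(k), divided by the number of relevant species."""
    relevant = set(relevant_species)
    if not relevant:
        return 0.0  # assumption: queries without ground truth contribute 0
    hits = 0
    precision_sum = 0.0
    for k, species in enumerate(ranked_species, start=1):
        if species in relevant:  # rel(k) = 1
            hits += 1
            precision_sum += hits / k  # P(k) = relevant items in top k / k
    return precision_sum / len(relevant)


def mean_average_precision(runs, ground_truth):
    """mAP: mean of AveP over all queries; both arguments are dicts keyed by query id."""
    total = sum(average_precision(runs.get(q, []), ground_truth[q]) for q in ground_truth)
    return total / len(ground_truth)


# Toy example with two queries.
runs = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
ground_truth = {"q1": {"a", "c"}, "q2": {"y"}}
score = mean_average_precision(runs, ground_truth)
```

In this toy example, AveP is (1/1 + 2/3) / 2 = 5/6 for the first query and 1/2 for the second, so the mAP is 2/3.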

Participants and methods

84 research groups worldwide registered for the task and downloaded the data (from a total of 130 groups that registered for at least one of the three LifeCLEF tasks). This shows the high attractiveness of the challenge in both the multimedia community (presumably interested in several tasks) and the audio and bioacoustics community (presumably registered only for the bird song task). In the end, 6 of the registrants crossed the finish line by submitting runs and 5 of them submitted working notes explaining their runs in detail. We list them hereafter in alphabetical order and give a brief overview of the techniques they used in their runs. We would like to point out that the LifeCLEF benchmark is a system-oriented evaluation and not a deep or fine-grained evaluation of the underlying algorithms. Readers interested in the scientific and technical details of the implemented methods should refer to the LifeCLEF 2016 working notes or to the research papers of each participant (referenced below):

BME TMIT, Hungary, 4 runs [11]: BME TMIT is one of the three teams, together with CUBE and WUT, that used a convolutional neural network (CNN). As pre-processing, they first downsampled each audio file to 16 kHz and applied a low-pass filter with a cutoff frequency of 6250 Hz in order to reduce the size of the training data. They then subdivided the spectrograms into cells of 0.5 seconds x 10 frequency bands and removed the cells carrying little information (according to their mean and variance). After these pre-processing steps, they assembled the remaining parts of the spectrograms and re-split them into five-second pieces, obtaining arrays of 200 x 310 (where 310 samples correspond to five seconds), which were used as input to the CNN. They used two distinct CNN architectures: the well-known AlexNet [6] with the addition of batch normalisation (runs 1 and 2), and a CNN more inspired by audio recognition systems, based on 4 convolutional layers, one fully connected layer, ReLU activation functions and batch normalisation (runs 3 and 4).

CUBE, Switzerland, 4 runs: This system is based on a CNN architecture of 5 convolutional layers combined with a rectifier activation function followed by a max-pooling layer. Based on spectrogram analysis and morphological operations, silent and noisy parts were first detected and separated from the call and song parts. Spectrograms were then split into chunks of 3 seconds that were used as inputs to the CNN after several data augmentation techniques. Each chunk identified as containing a singing bird was first combined with 3 randomly selected chunks of background noise. Time shift, pitch shift and mixes of audio files from the same species were then used as complementary data augmentation techniques. For a given test record, the predictions from all its distinct chunks are finally averaged. Run 1 was an intermediate result obtained after only one day of training. Run 2 differs from run 3 in that it uses spectrograms that are 50% smaller (in pixels), which allows the batch size to be doubled and thus more iterations to be completed in the same training time (4 days). Run 4 is the average of the predictions from runs 2 and 3 and reaches the best performance, showing the benefit of bagging.

DYNI LSIS, France, 1 run [10]: The algorithm presented here is quite standard and was initially used on smaller datasets to improve, in a late fusion scheme, a classifier based on pairs of spectrogram peaks, described in the context of audio fingerprinting. The method is based on the bag-of-words approach: first, the 44.1 kHz audio files were split into 0.2 s segments with 50% overlap, and only the segments with energy values higher than a threshold relative to the whole audio file and spectral flatness values smaller than an absolute threshold were kept for Mel Frequency Cepstral Coefficient (MFCC) computation. A k-means clustering with k=500 was performed on all the MFCCs and their derivatives, in order to extract for every file the normalized histogram of MFCC-based words (i.e. the 500 clusters), using only the segments kept by the filtering step. The resulting feature vectors were then fed to a random forest classifier.
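The histogram step of the bag-of-words approach described above can be sketched as follows. This is a simplified illustration rather than the participants' code: it assumes that the cluster centroids have already been obtained by k-means, that each retained segment is represented by one feature vector, and that the nearest centroid is found with the squared Euclidean distance. The function names and the toy 2-dimensional vectors are hypothetical.

```python
def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to `vector` (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for i, centroid in enumerate(centroids):
        dist = sum((v - c) ** 2 for v, c in zip(vector, centroid))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def bag_of_words_histogram(frames, centroids):
    """Normalized histogram of 'words' (centroid indices) for the frames of one file."""
    counts = [0] * len(centroids)
    for frame in frames:
        counts[nearest_centroid(frame, centroids)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(centroids)  # file with no retained segment
    return [count / total for count in counts]


# Toy example: 2 'words' instead of 500, 2-dimensional features instead of MFCCs.
centroids = [[0.0, 0.0], [10.0, 10.0]]
frames = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [0.0, 0.0]]
histogram = bag_of_words_histogram(frames, centroids)
```

Here three of the four frames fall closest to the first centroid, so the histogram is [0.75, 0.25]. Each file yields a fixed-length vector of this kind regardless of its duration, which is what makes it usable as input to a standard classifier.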

MNB TSA, Germany, 4 runs [8]: As in 2014 and 2015, this participant used two kinds of hand-crafted features: parametric acoustic features and probabilities of species-specific spectrogram segments in a template matching approach. Long segments extracted during BirdCLEF 2015 were re-segmented with a more sensitive algorithm. The segments were then used to extract Segment-Probabilities for each file by calculating the maxima of the normalized cross-correlation between all segments and the target spectrogram image via template matching. Due to the very large amount of audio data, not all files were used as a source for segmentation (i.e. only good-quality files without background species were used). The classification problem was then formulated as a multi-label regression task solved by training ensembles of randomized decision trees with probabilistic outputs. The training was performed in 2 passes, one selecting a small subset of the most discriminant features by optimizing the internal mAP score on the training set, and one training the final classifiers on the selected features. Run 1 used one single model on a small but highly optimized selection of Segment-Probabilities. Bagging was then applied by calculating further Segment-Probabilities from additional segments and combining them by blending (24 models in Run 3). Run 4 also used blending to aggregate model predictions, but included only the predictions that, after blending, resulted in the highest mAP score calculated on the entire training set (13 models including the best model from 2015).
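The idea of taking the maximum of a normalized cross-correlation between a segment and a spectrogram can be illustrated with the sketch below. It is a toy version under stated assumptions, not the participant's implementation: spectrograms are plain lists of lists, the template is slid over every position, and the zero-mean variant of the normalized cross-correlation is used (the text does not say which variant was applied). The function names are hypothetical.

```python
import math


def _flatten_patch(image, top, left, height, width):
    """Return the values of a height x width patch of `image` as a flat list."""
    return [image[top + r][left + c] for r in range(height) for c in range(width)]


def _ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-length vectors, in [-1, 1]."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    if denom == 0.0:
        return 0.0  # constant patch or template: correlation undefined, treated as 0
    return sum(x * y for x, y in zip(da, db)) / denom


def max_template_match(spectrogram, template):
    """Maximum correlation of `template` over all positions in `spectrogram`."""
    th, tw = len(template), len(template[0])
    sh, sw = len(spectrogram), len(spectrogram[0])
    template_vec = _flatten_patch(template, 0, 0, th, tw)
    best = -1.0  # also returned if the template does not fit in the spectrogram
    for top in range(sh - th + 1):
        for left in range(sw - tw + 1):
            patch = _flatten_patch(spectrogram, top, left, th, tw)
            best = max(best, _ncc(patch, template_vec))
    return best


# Toy example: the template appears exactly once in the spectrogram.
template = [[0.0, 1.0], [1.0, 0.0]]
spectrogram = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
score = max_template_match(spectrogram, template)
```

Since the template occurs verbatim in the toy spectrogram, the maximum correlation is 1. Computing one such value per segment gives each file a feature vector with as many entries as there are segments, which is the kind of input the decision tree ensembles above are trained on.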

WUT, Poland, 4 runs [9]: Like the CUBE and BME TMIT teams, they used a convolutional neural network learning framework. Starting from denoised spectrograms, silent parts were removed with percentile thresholding, yielding around 86,000 training segments of varying length, each associated with a single main species. As a data augmentation technique and to fit the fixed 5-second input size of the CNN, segments were adjusted by either trimming or padding. The first 3 runs were produced with progressively deeper and/or wider filters. Run 4 is an ensemble of neural networks averaging the predictions of the first 3 runs.
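Two of the steps mentioned above, fitting variable-length segments to a fixed input size and averaging the predictions of several networks, can be sketched as follows. The details are assumptions made for illustration: segments are trimmed at the end and right-padded with zeros, and a species missing from one model's output counts as a score of 0. The function names are hypothetical and the source does not specify how trimming, padding or averaging were actually implemented.

```python
def fit_to_length(segment, target_length, pad_value=0.0):
    """Trim or right-pad a sequence of frames so that it has exactly `target_length` items."""
    if len(segment) >= target_length:
        return list(segment[:target_length])
    return list(segment) + [pad_value] * (target_length - len(segment))


def average_predictions(predictions):
    """Average several {species: score} dicts, one per model, into a single dict."""
    species = set().union(*predictions)
    return {
        name: sum(p.get(name, 0.0) for p in predictions) / len(predictions)
        for name in species
    }


# Toy example: a short and a long segment, then a two-model ensemble.
short = fit_to_length([1.0, 2.0, 3.0], 5)
long_ = fit_to_length([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 5)
ensemble = average_predictions([{"a": 0.8, "b": 0.2}, {"a": 0.6}])
```

The same averaging scheme applies to any run that combines the outputs of several models, since it only needs the per-species scores and not the models themselves.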

Results

Figure 1 reports the performance measured for the 18 submitted runs. For each run (i.e. each evaluated system), we report the overall mean Average Precision (the official metric) as well as the mAP for the two categories of queries: the soundscape recordings (newly introduced) and the common observations (the same as those used in 2015). To measure the progress over last year, we also plot the performance of last year's best system [7] (orange dotted line).

The first noticeable conclusion is that, after two years of resistance from bird song identification systems based on engineered features, convolutional neural networks finally managed to outperform them (as in many other domains). The best CNN-based run (Cube Run 4) reached an impressive mAP of 0.69 on the 2015 testbed, compared to 0.45 and 0.58 respectively for the best systems based on hand-crafted features evaluated in 2015 and 2016. To our knowledge, BirdCLEF is the first comparative study reporting such a large performance gap in large-scale bioacoustic classification. A second important remark is that this CNN performance was achieved without any fine-tuning, contrary to most computer vision challenges, in which the CNN is generally pretrained on a large training set such as ImageNet. Thus, we could hope for even better performance, e.g. by transferring knowledge from other bioacoustic contexts or other domains. It is important to note, however, that the other CNN-based systems (WUT and BME TMIT) did not perform as well as the Cube system and did not outperform the TSA system based on hand-crafted features. Looking at the detailed description of the three CNN architectures and their learning frameworks, it appears that the way in which audio segment extraction and data augmentation are performed plays a crucial role. The Cube system notably includes a randomized background noise addition phase, which makes it much more robust to the diversity of noise encountered in the test data.

If we now look at the scores achieved by the evaluated systems on the soundscape recordings only (yellow plot), we can draw very different conclusions. First of all, the performance on the soundscapes is much lower than on the classical queries, whatever the system. Although the classical recordings also include multiple species singing in the background, the soundscapes appear to be much more challenging. Several tens of species, and many more individual birds, can actually be singing simultaneously. Separating all these sources seems to be beyond the scope of state-of-the-art audio representation learning methods. Interestingly, the best system on the soundscape queries was that of TSA, based on the extraction of very short species-specific spectrogram segments and a template matching approach. This very fine-grained approach makes the extracted audio patterns more robust to the species overlap problem. In contrast, the CNNs of the Cube and WUT systems were optimized for the mono-species segment classification problem. In particular, the data augmentation method of the Cube system was designed only for the single-species case. It addressed the problem of several individual birds of the same species singing together (by mixing different segments of the same class) but did not address the multi-label issue (i.e. several species singing simultaneously).

To study in more detail how identification performance varies across the diversity of species, Figure 2 presents the scores achieved by the best system of each team on a selection of 3x10 species: (i) the 10 best-recognized species (according to the performance of the best system, Cube Run 4), (ii) 10 species of intermediate difficulty and (iii) the 10 worst-recognized species (still based on the performance of Cube Run 4). For a better interpretation of the chart, we also included, for each of the 30 selected species, the number of audio recordings in the training set (ranging from 10 to 37 recordings). The graph first shows that there is a huge performance gap between the best-recognized species and the worst cases. Some species are perfectly classified by 4 of the 6 systems, whereas others are never recognized by any of the systems. Interestingly, the performance does not seem to be correlated with the number of training samples. Similarly, we observed that it is not correlated with the average length of the recordings in the class. This means that the high variability in performance is more related to other factors such as (i) the variability of bird sounds (some birds are more audible than others), (ii) the acquisition difficulty (some birds are easier to record than others) and (iii) the degree of confusion across close species. Another interesting remark is that two of the species that are not recognized at all by the CNNs are comparatively well recognized by the template matching kernel approach of MNB TSA. It would thus be interesting to study in more detail the kind of audio patterns that were matched by this method, so as to understand what the CNNs missed and how such patterns could be learned automatically as well.

Conclusion

This paper presented the overview and the results of the LifeCLEF bird identification challenge 2016. The main outcome was that, after two years of resistance from bird song identification systems based on engineered features, convolutional neural networks finally managed to outperform them by a significant margin. It is notable that the best-performing CNN did not use any fine-tuning, so it did not benefit from the transfer learning capacities of that technique. We could thus expect even better performance. Also, the CNN architecture used was mostly inspired by those that perform best on computer vision tasks. Our detailed analysis of the results tends to show that some audio patterns might not be learned accurately by such networks, whereas they are detected by template matching techniques. In any case, it is clear that, as in many domains before, deep learning is redefining the boundaries of the state of the art and opens the door to further progress in the coming years.

1. Briggs, F., Huang, Y., Raich, R., Eftaxias, K., et al., Z.L.: The 9th MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in noisy environment. In: IEEE Workshop on Machine Learning for Signal Processing (MLSP). pp. 1-8 (2013)

2. Dufour, O., Artieres, T., Glotin, H., Giraudet, P.: Clusterized mel filter cepstral coefficients and support vector machines for bird song identification. In: Soundscape Semiotics - Localization and Categorization, Glotin (Ed.) (2014), http://www.intechopen.com/books/soundscape-semiotics-localisation-and-categorisation

3. Glotin, H., Clark, C., LeCun, Y., Dugan, P., Halkias, X., Sueur, J.: Bioacoustic challenges in ICML4B. In: Proc. of 1st workshop on Machine Learning for Bioacoustics. No. USA, ISSN 979-10-90821-02-6 (2013), http://sabiod.org/ICML4B2013_proceedings.pdf

4. Glotin, H., Dufour, O., Bas, Y.: Overview of the 2nd challenge on acoustic bird classification. In: Proc. Neural Information Processing Scaled for Bioacoustics. NIPS Int. Conf., Ed. Glotin H., LeCun Y., Artieres, Mallat, Tchernichovski, Halkias, USA (2013), http://sabiod.univ-tln.fr/nips4b

5. Goeau, H., Glotin, H., Vellinga, W.P., Planque, R., Rauber, A., Joly, A.: LifeCLEF bird identification task 2015. In: CLEF working notes 2015 (2015)

6. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems. pp. 1097-1105 (2012)

7. Lasseck, M.: Improved automatic bird identification through decision tree based feature selection and bagging. In: Working notes of CLEF 2015 conference (2015)

8. Lasseck, M.: Improving bird identification using multiresolution template matching and feature selection during training. In: Working notes of CLEF conference (2016)

9. Piczak, K.: Recognizing bird species in audio recordings using deep convolutional neural networks. In: Working notes of CLEF 2016 conference (2016)

10. Ricard, J., Glotin, H.: Bag of MFCC-based words for bird identification. In: Working notes of CLEF 2016 conference (2016)

11. Toth, B.P., Czeba, B.: Convolutional neural networks for large-scale bird song classification in noisy environment. In: Working notes of CLEF conference (2016)