
LifeCLEF Bird Identification Task 2015

Herve Goeau

Herve Glotin

glotin@univ-tln.fr 0

Willem-Pier Vellinga

Robert Planque

Andreas Rauber

rauber@ifs.tuwien.ac.at 3

Alexis Joly

0 Aix Marseille Univ., ENSAM, CNRS LSIS, Univ. Toulon, Institut Univ. de France
1 Inria ZENITH team, France
2 LIRMM, Montpellier, France
3 Vienna University of Technology, Austria
4 Xeno-canto Foundation, The Netherlands

The LifeCLEF bird identification task provides a testbed for a system-oriented evaluation of the identification of 999 bird species. The main originality of this data is that it was specifically built through a citizen science initiative conducted by Xeno-canto, an international social network of amateur and expert ornithologists. This makes the task closer to the conditions of a real-world application than previous, similar initiatives. This overview presents the resources and the assessments of the task, summarizes the retrieval approaches employed by the participating groups, and provides an analysis of the main evaluation results.

LifeCLEF, bird, song, call, species, retrieval, audio, collection, identification, fine-grained classification, evaluation, benchmark, bioacoustics

Accurate knowledge of the identity, the geographic distribution and the evolution of bird species is essential for a sustainable development of humanity as well as for biodiversity conservation. Unfortunately, such basic information is often only partially available for professional stakeholders, teachers, scientists and citizens. In fact, it is often incomplete for ecosystems that possess the highest diversity, such as tropical regions. A noticeable cause and consequence of this sparse knowledge is that identifying birds is usually impossible for the general public, and often a difficult task for professionals like park rangers, ecology consultants, and of course, the ornithologists themselves. This "taxonomic gap" [21] was actually identified as one of the main ecological challenges to be solved during the United Nations Conference in Rio de Janeiro, Brazil, in 1992.

The use of multimedia identification tools is considered to be one of the most promising solutions to help bridge this taxonomic gap [14], [8], [6], [20], [19], [12]. With the recent advances in digital devices, network bandwidth and information storage capacities, the collection of multimedia data has indeed become an easy task. In parallel, the emergence of "citizen science" and social networking tools has fostered the creation of large and structured communities of nature observers (e.g. eBird6, Xeno-canto7, iSpot8, etc.) that have started to produce outstanding collections of audio and/or visual records. Unfortunately, the performance of state-of-the-art multimedia analysis techniques on such data is still not well understood and it is far from meeting real-world requirements for identification tools. Most existing studies or available tools typically identify a few tens of species with moderate accuracy, whereas they should be scaled up by one, two or three orders of magnitude in terms of number of species.

The LifeCLEF Bird task proposes to evaluate one of these challenges [?] based on large, real-world data and defined in collaboration with biologists and environmental stakeholders so as to reflect realistic usage scenarios.

Using audio records rather than bird pictures is justified by current practices [6], [20], [19], [5]. Birds are actually not easy to photograph; audio calls and songs have proven to be easier to collect and sufficiently species-specific.

Only three notable previous worldwide initiatives on bird species identification based on their songs or calls have taken place, all three in 2013. The first one was the ICML4B bird challenge, held jointly with the International Conference on Machine Learning in Atlanta, June 2013 [2]. It was initiated by the SABIOD MASTODONS CNRS group9, the University of Toulon and the National Natural History Museum of Paris [9]. It included 35 species, and 76 participants submitted their 400 runs on the Kaggle interface. The second challenge was conducted by F. Briggs at the MLSP 2013 workshop in August 2013, with 15 species and 79 participants. The third challenge (NIPS4B), the biggest in 2013, was organised by the University of Toulon, SABIOD and Biotope [4], with 80 species from Provence, France. More than thirty teams participated, reaching an average AUC of 92%. Descriptions of the best systems of the ICML4B and NIPS4B bird identification challenges are given in the on-line books [2], [1] including, in some cases, references to useful scripts.

In collaboration with the organizers of these previous challenges, BirdCLEF 2014 and 2015 go one step further by (i) increasing the number of species by almost an order of magnitude, (ii) working on real-world data collected by hundreds of recordists, and (iii) moving to a more usage-driven and system-oriented benchmark by allowing the use of meta-data and defining information retrieval oriented metrics. Overall, the task is expected to be much more difficult than previous benchmarks because of the higher confusion risk between the classes, the higher background noise and the higher diversity in the acquisition conditions (devices, recordist practices, diversity of contexts, etc.). It will therefore probably produce substantially lower scores and offer a better progression margin towards building real-world generalist identification tools.

6 http://ebird.org/
7 http://www.xeno-canto.org/
8 http://www.ispotnature.org/communities/global
9 http://sabiod.univ-tln.fr

Dataset

The training and test data of the bird task are composed of audio recordings hosted on xeno-canto.org (XC). Xeno-canto is a web-based community of bird sound recordists worldwide with more than 2300 active contributors that have already collected more than 240,000 recordings of about 9330 species (May 2015). 999 species from Brazil are used in the BirdCLEF dataset. They represent the species of that country with the highest number of recordings on XC, totalling 33,862 recordings contributed by hundreds of users. The dataset has between 13 and 234 recordings per species, recorded by between 1 and 72 recordists. This dataset also contains the entire dataset from the 2014 BirdCLEF challenge [10], which contained about 14,000 recordings from 501 species.

To avoid any bias in the evaluation related to the audio devices used, each audio file has been normalized to a constant sampling rate of 44.1 kHz and coded over 16 bits in .wav mono format (the right channel was selected by default). The conversion from the original Xeno-canto data set was done using ffmpeg, sox and matlab scripts. An optimized set of 16 Mel Filter Cepstrum Coefficients (MFCC) for bird identification (according to an extended benchmark [7]) has been computed, together with their first and second temporal derivatives, on the whole set. These features were used in the best systems run in the ICML4B and NIPS4B challenges [2], [1], [4], [9].
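As an illustration of the "first and second temporal derivatives" mentioned above, the following minimal sketch appends regression-based delta and delta-delta coefficients to a sequence of per-frame cepstral vectors. The source does not state how the derivatives were computed; the regression window (`width=2`) and the edge handling (repeating the first and last frame) are assumptions, and the function names are hypothetical.

```python
def delta(frames, width=2):
    """Regression-based temporal derivative of a list of feature vectors.

    Assumption: a +/- `width` frame window, with the first/last frame
    repeated at the edges. The paper does not specify these choices.
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        vec = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, width + 1):
                prev = frames[max(t - k, 0)][d]
                nxt = frames[min(t + k, n - 1)][d]
                num += k * (nxt - prev)
            vec.append(num / denom)
        out.append(vec)
    return out


def add_deltas(frames, width=2):
    """Concatenate static coefficients with their first and second derivatives."""
    d1 = delta(frames, width)
    d2 = delta(d1, width)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# 10 frames of 16 coefficients, each coefficient increasing by 1 per frame.
frames = [[float(t)] * 16 for t in range(10)]
features = add_deltas(frames)
print(len(features[5]))  # 16 static + 16 delta + 16 delta-delta values
```

With 16 static coefficients per frame, each output vector has 48 values.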

Audio records are associated with various meta-data including the species of the most active singing bird, the species of the other birds audible in the background, the type of sound (call, song, alarm, flight, etc.), the date and location of the observations (from which rich statistics on species distribution can be derived), common names and collaborative quality ratings. All of them were produced collaboratively by the Xeno-canto community.

Task Description

Participants were asked to determine the species of the most active singing birds in each query file. The background noise can be used like any other meta-data, but it is forbidden to correlate the test set of the challenge with the original annotated Xeno-canto database (or with any external content, as many copies are circulating on the web). More precisely, the whole BirdCLEF dataset has been split into two parts, one for training (and/or indexing) and one for testing. The test set was built by randomly choosing 1/3 of the observations of each species, whereas the remaining observations were kept in the reference training set. Recordings of the same species made by the same person on the same day are considered as being part of the same observation and cannot be split across the test and training sets. The XML files containing the meta-data of the query recordings were purged so as to erase the foreground and background species names (the ground truth), the vernacular names (common names of the birds) and the collaborative quality ratings (that would not be available at query stage in a real-world mobile application). Meta-data of the recordings in the training set are kept unaltered.
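The observation-level split described above can be sketched as follows. This is only an illustration under stated assumptions: the record fields (`id`, `species`, `recordist`, `date`), the fixed random seed and the rounding of the 1/3 fraction are hypothetical choices, not the organizers' actual script.

```python
import random
from collections import defaultdict


def split_by_observation(recordings, test_fraction=1 / 3, seed=0):
    """Split recording ids into (train, test) without splitting an observation.

    An observation groups all recordings of one species made by the same
    recordist on the same day. Field names are assumptions for this sketch.
    """
    rng = random.Random(seed)
    # species -> (recordist, date) -> list of recording ids
    observations = defaultdict(lambda: defaultdict(list))
    for rec in recordings:
        key = (rec["recordist"], rec["date"])
        observations[rec["species"]][key].append(rec["id"])

    train, test = [], []
    for species in sorted(observations):
        keys = sorted(observations[species])
        rng.shuffle(keys)
        n_test = round(len(keys) * test_fraction)
        for i, key in enumerate(keys):
            target = test if i < n_test else train
            target.extend(observations[species][key])
    return train, test


recordings = [
    {"id": f"a{i}", "species": "A", "recordist": f"r{i}", "date": "2015-01-01"}
    for i in range(6)
]
# A second recording belonging to the same observation as "a0".
recordings.append(
    {"id": "a0b", "species": "A", "recordist": "r0", "date": "2015-01-01"}
)
train_ids, test_ids = split_by_observation(recordings)
```

Grouping before sampling is what guarantees that two recordings of the same observation never end up on opposite sides of the split.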

The groups participating in the task were asked to produce up to 4 runs containing a ranked list of the most probable species for each record of the test set. Each species had to be associated with a normalized score in the range [0, 1] reflecting the likelihood that this species was singing in the sample. For each submitted run, participants had to state whether the run was performed fully automatically or with human assistance in the processing of the queries, and whether the method was based on audio analysis only or also used the meta-data. The metric used to compare the runs was the Mean Average Precision averaged across all queries. Since the audio records contain a main species and often some background species belonging to the set of 999 species in the training set, we decided to use two metrics, one focusing on all species (MAP1) and a second one focusing only on the main species (MAP2).
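For reference, Mean Average Precision over ranked species lists can be computed as in the minimal sketch below. The data layout (dictionaries keyed by query id) and the function names are assumptions made for illustration; the two variants of the metric differ only in which species are placed in the relevant set of each query (main species only, or main plus background species).

```python
def average_precision(ranked_species, relevant):
    """Average precision of one ranked list against a set of relevant species."""
    hits = 0
    total = 0.0
    for rank, species in enumerate(ranked_species, start=1):
        if species in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(run, ground_truth):
    """run: query id -> list of (species, score) pairs.
    ground_truth: query id -> set of relevant species."""
    aps = []
    for query_id, relevant in ground_truth.items():
        scored = sorted(run.get(query_id, []), key=lambda p: p[1], reverse=True)
        aps.append(average_precision([s for s, _ in scored], relevant))
    return sum(aps) / len(aps)


run = {
    "q1": [("a", 0.9), ("b", 0.5), ("c", 0.1)],
    "q2": [("b", 0.8), ("a", 0.2)],
}
main_only = {"q1": {"a"}, "q2": {"a"}}
with_background = {"q1": {"a", "c"}, "q2": {"a"}}
print(mean_average_precision(run, main_only))
print(mean_average_precision(run, with_background))
```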

Participants and methods

137 research groups worldwide registered for the task and downloaded the data (from a total of 189 groups that registered for at least one of the three LifeCLEF tasks). This shows the high attractiveness of the challenge in both the multimedia community (presumably interested in several tasks) and in the audio and bioacoustics community (presumably registered only for the bird songs task). Finally, 6 of the registrants crossed the finish line by submitting runs and 5 of them submitted working notes explaining their runs in detail. We list them hereafter in alphabetical order and give a brief overview of the techniques they used in their runs. We would like to point out that the LifeCLEF benchmark is a system-oriented evaluation and not a deep or fine evaluation of the underlying algorithms. Readers interested in the scientific and technical details of the implemented methods should refer to the LifeCLEF 2015 working notes or to the research papers of each participant (referenced below):

CHIN. AC. SC., China, 3 runs: This participant experimented with a baseline audio classification system based on the classification of Mel-bands representations and their scattering refinements [3] using a Gaussian Mixture Model. The first run used only MFCC features with 128 Gaussian mixtures, the second run used the scattering refinements with 32 Gaussian mixtures, and the third run used the scattering refinements with 128 Gaussian mixtures.

Golem, Mexico, 3 runs [15]: This participant experimented with a simple yet highly scalable system based on the classification of Mel-bands representations using a random forest. The extracted Mel bands per recording were pooled through simple statistics (i.e. mean, standard deviation, median and skewness), resulting in time- and space-efficient 320-dimensional features used to train the classifier.
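The statistical pooling used by the Golem group can be sketched as follows: each Mel band is summarized over time by four statistics, and the results are concatenated into one fixed-length vector per recording. The number of Mel bands is not given in this overview; 80 bands is an assumption chosen only because 80 bands times 4 statistics gives the reported 320 dimensions. The function names are hypothetical.

```python
import statistics


def skewness(values):
    """Population skewness (third standardized moment); 0.0 for constant input."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def pool_mel_bands(frames):
    """Pool per-frame Mel-band vectors into one fixed-length feature vector.

    frames: list of frames, each a list with one energy value per Mel band.
    Output layout: all means, then all standard deviations, medians, skewnesses.
    """
    bands = list(zip(*frames))  # one time series per Mel band
    feature = []
    for stat in (statistics.fmean, statistics.pstdev, statistics.median, skewness):
        feature.extend(stat(band) for band in bands)
    return feature


# Assumed 80 Mel bands over 5 frames (toy values).
frames = [[float(t + b) for b in range(80)] for t in range(5)]
feature = pool_mel_bands(frames)
print(len(feature))  # 4 statistics x 80 bands
```

The resulting vector has the same length whatever the recording duration, which is what makes such features cheap to store and to feed to a random forest.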

Inria Zenith, France, 3 runs [11]: Inspired by recent works on fine-grained image classification, this group introduced a new match kernel based on the shared nearest neighbors of the low-level audio features extracted at the frame level. To make this strategy scalable to the tens of millions of MFCC features extracted from the training set, they made use of high-dimensional hashing techniques coupled with an efficient approximate nearest neighbors search algorithm with controlled quality. Further improvements were obtained by (i) using a sliding window for the temporal pooling of the raw matches and (ii) weighting each low-level feature according to the semantic coherence of its nearest neighbors. The final classification was then completed by a support vector machine trained on top of the resulting matching-based representations.

MNB TSA, Germany, 4 runs [13]: This participant combined two main categories of features for the classification: parametric acoustic features (see openSMILE Audio Statistics) and probabilities of species-specific spectrogram segments (see Segment-Probabilities). This second source of information, which performs the best, consists in extracting, for each species, a set of representative segments from spectrogram images. These segments are then used to extract Segment-Probabilities for each file by calculating the maxima of the normalized cross-correlation between all segments and the target spectrogram image via template matching. Due to the very large amount of audio data, not all files belonging to a certain species were used as a source for segmentation (i.e. only good quality files without background species were used). Additionally, to further reduce the computation time, the spectrogram images were downsampled before computing the template matching. The classification problem was then formulated as a multi-label regression task, completed by training ensembles of randomized decision trees with probabilistic outputs. The training was performed in two passes, one selecting a small subset of the most discriminant features, and one training the final classifiers on the selected features (Run 1). To further improve classification results, a bagging approach was used, consisting in calculating further Segment-Probabilities from additional segments and combining them either by averaging (Run 2) or by blending (Run 3 and Run 4 with more blends).

QMUL, UK, 1 run [17]: This group focused on unsupervised feature learning in order to learn regularities in spectro-temporal content without reference to the training labels and further help the classifier to generalise to further content of the same type. MFCC features and several temporal variants are first extracted from the audio signal after a median-based thresholding pre-processing. Extracted low-level features were then reduced through PCA whitening and clustered via spherical k-means (and a two-layer variant of it) to build the vocabulary. During classification, MFCC features are pooled by projecting them on the vocabulary with different temporal pooling strategies. Final supervised classification is achieved with a random forest classifier. This method is the subject of a full-length article which can be read at [18]. The parameter settings used in each run are detailed in the working note [?].

MARF, Canada, 4 runs: These participants mainly attempted to transpose a speech processing method they developed earlier to the birds case (Modular Audio Recognition Framework (MARF)'s API, [16]). The first run used only 20 LPC coefficients as features and the Chebyshev distance. The second run used only the meta-data features, using the MARFCAT approach [16] to represent the XML meta-data as a wave form without pre-processing, with 512-window FFT features and a cosine similarity measure. The third run was a concatenation of Run 1 and Run 2. The fourth run used the same setup as Run 1 but split the training data by quality rating attributes.
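The two comparison measures named for the MARF runs, the Chebyshev distance and the cosine similarity, are standard and can be written in a few lines. The ranking helper below is only an illustrative assumption (one reference vector per species, ranked by increasing Chebyshev distance); it is not MARF's actual API, and the feature values are toy numbers.

```python
import math


def chebyshev_distance(a, b):
    """Largest absolute coordinate-wise difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_species(query, references):
    """Rank species by increasing Chebyshev distance to the query vector.

    references: species name -> reference feature vector (hypothetical layout).
    """
    return sorted(references, key=lambda s: chebyshev_distance(query, references[s]))


references = {"species_a": [1.0, 2.0, 3.0], "species_b": [4.0, 0.0, 3.0]}
print(rank_species([1.5, 2.0, 2.5], references))
```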

Results

The main outcome of the evaluation is that the use of matching-based scores as high-dimensional features to be classified by supervised classifiers (as done by MNB TSA and INRIA ZENITH) provides the best results, with a Mean Average Precision of up to 0.454 for the fourth run of the MNB TSA group. These approaches notably outperform the unsupervised feature learning framework of the QMUL group as well as the baseline method of the Golem group. The matching of all the audio recordings, however, remains a very time-consuming process that had to be carefully designed in order to process a large-scale dataset such as the one deployed within the challenge. The MNB TSA group notably reduced as much as possible the number of audio segments to be matched thanks to an effective audio pre-processing and segmentation framework. They also restricted the extraction of these segments to the files having the best quality according to the user ratings and that do not have background species. On the other hand, the INRIA ZENITH group did not use any segmentation but attempted to speed up the matching through the use of a hash-based approximate k-nearest neighbors search scheme (on top of MFCC features). The better performance of the MNB TSA runs shows that cleaning the audio segments vocabulary before applying the matching is clearly beneficial. But using a scalable knn-based matching such as the one of the INRIA ZENITH runs could be a complementary way to speed up the matching phase.

It is interesting to notice that the first run of the MNB TSA group is roughly the same method as the one they used within the BirdCLEF challenge of the previous year [10], which achieved the best results (with a MAP1 equal to 0.511 vs. 0.424 this year). This shows that the impact of the increasing difficulty of the challenge (with twice the number of species) is far from negligible. The performance loss is notably not compensated by the bagging extension of the method, which resulted in a MAP1 equal to 0.454 for MNB TSA run 4.

Conclusion

This paper presented the overview and the results of the LifeCLEF bird identification challenge 2015. With more than a hundred registrants, it showed a high interest of the multimedia and the bio-acoustic communities in applying their technologies to real-world environmental data such as the ones collected by Xeno-canto. The main outcome of this evaluation is a snapshot of the performances of state-of-the-art techniques that will hopefully serve as a guideline for developers interested in building end-user applications. One important conclusion of the campaign is that the two best performing methods were based on matching approaches attempting to construct high-dimensional representations of the audio recordings based on their matching scores in a large vocabulary of audio segments. The results of the evaluation clearly show the superiority of these approaches in terms of effectiveness but also point out the underlying scalability issues in terms of efficiency. The increasing complexity of the challenge over the previous year, in terms of the number of species and items, notably led to a consistent loss of the raw identification performance despite the progress of the underlying methods. Considering that the number of bird species on earth is more than 10,000 and that the number of singing insects is even much larger, we believe it is important to continue working on such large-scale identification issues in the next years.

1. Proc. of Neural Information Processing Scaled for Bioacoustics: from Neurons to Big Data, joint to NIPS (2013), http://sabiod.univ-tln.fr/NIPS4B2013_book.pdf

2. Proc. of the first workshop on Machine Learning for Bioacoustics, joint to ICML (2013), http://sabiod.univ-tln.fr/ICML4B2013_book.pdf

3. Anden, J., Mallat, S.: Multiscale scattering for audio classification. In: ISMIR. pp. 657-662 (2011)

4. Bas, Y., Dufour, O., Glotin, H.: Overview of the NIPS4B bird classification. In: Proc. of Neural Information Processing Scaled for Bioacoustics: from Neurons to Big Data, joint to NIPS. pp. 12-16 (2013), http://sabiod.univ-tln.fr/NIPS4B2013_book.pdf

5. Briggs, F., Lakshminarayanan, B., Neal, L., Fern, X.Z., Raich, R., Hadley, S.J., Hadley, A.S., Betts, M.G.: Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach. The Journal of the Acoustical Society of America 131, 4640 (2012)

6. Cai, J., Ee, D., Pham, B., Roe, P., Zhang, J.: Sensor network for the monitoring of ecosystem: Bird species recognition. In: Intelligent Sensors, Sensor Networks and Information, 2007. ISSNIP 2007. 3rd International Conference on. pp. 293-298 (Dec 2007)

7. Dufour, O., Artieres, T., Glotin, H., Giraudet, P.: Clusterized mel filter cepstral coefficients and support vector machines for bird song identification. In: Soundscape Semiotics - Localization and Categorization, Glotin (Ed.) (2014)

8. Gaston, K.J., O'Neill, M.A.: Automated species identification: why not? Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 359(1444), 655-667 (2004), http://rstb.royalsocietypublishing.org/content/359/1444/655.abstract

9. Glotin, H., Sueur, J.: Overview of the 1st int'l challenge on bird classification. In: Proc. of the first workshop on Machine Learning for Bioacoustics, joint to ICML. pp. 17-21 (2013), http://sabiod.univ-tln.fr/ICML4B2013_book.pdf

10. Goeau, H., Glotin, H., Vellinga, W.P., Rauber, A.: LifeCLEF bird identification task 2014

11. Joly, A., Champ, J., Buisson, O.: Shared nearest neighbors match kernel for bird songs identification - LifeCLEF 2015 challenge. In: Working notes of CLEF 2015 conference (2015)

12. Joly, A., Goeau, H., Bonnet, P., Bakic, V., Barbe, J., Selmi, S., Yahiaoui, I., Carre, J., Mouysset, E., Molino, J.F., et al.: Interactive plant identification based on social image data. Ecological Informatics 23, 22-34 (2014)

13. Lasseck, M.: Improved automatic bird identification through decision tree based feature selection and bagging. In: Working notes of CLEF 2015 conference (2015)

14. Lee, D.J., Schoenberger, R.B., Shiozawa, D., Xu, X., Zhan, P.: Contour matching for a fish recognition and migration-monitoring system. In: Optics East. pp. 37-48. International Society for Optics and Photonics (2004)

15. Meza, I., Espino-Gamez, A., Solano, F., Villarreal, E.:

16. Mokhov, S.A.: Study of best algorithm combinations for speech processing tasks in machine learning using median vs. mean clusters in MARF. In: Proceedings of the 2008 C3S2E conference. pp. 29-43. ACM (2008)

17. Stowell, D.: BirdCLEF 2015 submission: Unsupervised feature learning from audio. In: Working notes of CLEF 2015 conference (2015)

18. Stowell, D., Plumbley, M.D.: Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning. arXiv preprint arXiv:1405.6524 (2014)

19. Towsey, M., Planitz, B., Nantes, A., Wimmer, J., Roe, P.: A toolbox for animal call recognition. Bioacoustics 21(2), 107-125 (2012)

20. Trifa, V.M., Kirschel, A.N., Taylor, C.E., Vallejo, E.E.: Automated species recognition of antbirds in a Mexican rainforest using hidden Markov models. The Journal of the Acoustical Society of America 123, 2424 (2008)

21. Wheeler, Q.D., Raven, P.H., Wilson, E.O.: Taxonomy: Impediment or expedient? Science 303(5656), 285 (2004), http://www.sciencemag.org/content/303/5656/285.short