Bag of MFCC-based words for bird identification

Julien Ricard and Hervé Glotin
LSIS/DYNI, University of Toulon, France
http://www.lsis.org/dyni

Abstract. The algorithm used by the authors in the bird identification task of LifeCLEF 2016 consists in creating a dictionary of MFCC-based words using k-means clustering, computing histograms of these words over short audio segments, and feeding these histograms to a random forest classifier. The official score achieved is 0.15 MAP.

Keywords: bird identification, MFCC, k-means, bag-of-words, random forest

1 Foreword

The algorithm presented here is fairly standard and was initially used on smaller datasets to improve, in a late fusion scheme, a classifier based on pairs of spectrogram peaks, described in the context of audio fingerprinting in [1]. Because of memory issues, we were not able to run this latter algorithm on the challenge dataset. In the discussion we propose some options that might overcome this issue.

2 Algorithm

The method is based on the bag-of-words approach, initially used in text analysis to model the long-term distribution of words [2] and more recently applied to audio signal analysis for tasks such as spoken language identification [3] and urban soundscape identification [4]. The steps of the algorithm are the following (illustrative sketches of the main steps are given after the list):

1. The original 44.1 kHz audio files were split into 0.2 s segments with 50% overlap.
2. Only the segments whose energy was higher than a relative threshold (computed over the whole audio file) and whose spectral flatness was lower than an absolute threshold were kept. This method assumes that a segment containing bird vocalizations is voiced (i.e. made of stable time-frequency components, or partials) and has high energy compared to the environmental noise. Even though not all bird vocalizations are voiced, this simple technique proved to give high precision (close to 1) in a bird presence/absence classification. (The recall was not great, about 0.5, but we were more interested in making sure that the detected segments actually contained bird vocalizations than in detecting all of them.)
3. The Mel-Frequency Cepstral Coefficients (MFCC) were computed with the following parameters:
   – analysis window size: 11.6 ms
   – analysis window overlap: 50%
   – min frequency: 0 Hz
   – max frequency: 22050 Hz
   – number of mel bands: 32
   – number of MFCC: 15 (the first coefficient, related to the energy, was removed)
4. A k-means clustering was performed on all the MFCC and their derivatives, with k=500.
5. For each file, the normalized histogram of MFCC-based words (i.e. the 500 clusters) was computed, using only the segments kept in step 2.
6. The resulting feature vectors were fed to a random forest classifier with the following parameters:
   – number of trees: 400
   – minimum number of samples required to split an internal node: 10

The algorithm was implemented in Python, using Numpy, PySoundFile (https://github.com/bastibe/PySoundFile), librosa (https://github.com/librosa/librosa) and scikit-learn (http://scikit-learn.org).
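A minimal sketch of the segment detection of steps 1 and 2 is given below, using PySoundFile and librosa. The use of RMS as the energy measure, the multiplicative factor for the relative energy threshold, and the threshold values themselves are illustrative assumptions, not the exact settings used in our experiments.

import numpy as np
import soundfile as sf   # PySoundFile
import librosa

def detect_segments(path, seg_dur=0.2, overlap=0.5,
                    rel_energy_factor=1.5, flatness_thresh=0.3):
    """Return (start, end) sample indices of segments likely to contain
    voiced bird vocalizations. Threshold values are placeholders."""
    y, sr = sf.read(path)
    if y.ndim > 1:
        y = y.mean(axis=1)                   # mix down multi-channel files
    seg_len = int(seg_dur * sr)
    hop = int(seg_len * (1 - overlap))       # 50% overlap
    global_rms = np.sqrt(np.mean(y ** 2))    # whole-file energy reference
    segments = []
    for start in range(0, len(y) - seg_len + 1, hop):
        seg = y[start:start + seg_len]
        rms = np.sqrt(np.mean(seg ** 2))
        flatness = librosa.feature.spectral_flatness(y=seg).mean()
        # Keep loud segments (relative to the whole file) that are voiced,
        # i.e. dominated by stable partials (low spectral flatness).
        if rms > rel_energy_factor * global_rms and flatness < flatness_thresh:
            segments.append((start, start + seg_len))
    return segments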
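The MFCC front-end of step 3 can be sketched with librosa as follows. At 44.1 kHz, an 11.6 ms window corresponds to 512 samples (a hop of 256 samples gives 50% overlap); computing 16 coefficients and dropping the first, energy-related one, so that 15 remain, is one reading of the parameters listed above.

import numpy as np
import librosa

def mfcc_features(y, sr=44100):
    """MFCC and their first derivatives for one audio signal, stacked as a
    (frames, 30) array: 15 coefficients plus 15 deltas per frame."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=16, n_fft=512,
                                hop_length=256, n_mels=32,
                                fmin=0, fmax=22050)[1:]  # drop energy coeff.
    delta = librosa.feature.delta(mfcc)                  # short-term changes
    return np.vstack([mfcc, delta]).T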
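Steps 4 to 6 can then be sketched as below. The names features_per_file and labels are hypothetical placeholders for the per-file feature arrays produced by the previous sketch (over the segments kept in step 2) and the corresponding species labels; on a dataset of this size, scikit-learn's MiniBatchKMeans could replace KMeans.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

K = 500  # size of the MFCC-word dictionary

def word_histogram(features, codebook):
    """Normalized histogram of MFCC-based words for one file."""
    words = codebook.predict(features)
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / max(hist.sum(), 1.0)

def train(features_per_file, labels):
    # Learn the dictionary on the pooled frames of all training files.
    codebook = KMeans(n_clusters=K).fit(np.vstack(features_per_file))
    # One normalized word histogram per file.
    X = np.array([word_histogram(f, codebook) for f in features_per_file])
    clf = RandomForestClassifier(n_estimators=400, min_samples_split=10)
    clf.fit(X, labels)
    return codebook, clf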
3 Results

Our method achieved the following results (MAP):

– with background species: 0.149
– only species: 0.183
– soundscape: 0.037

The full list of results is given at http://www.imageclef.org/lifeclef/2016/bird.

4 Discussion

The proposed method is quite naive and achieves low performance compared to the other participants. The implementation of the random forest algorithm we used (scikit-learn's RandomForestClassifier, http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) requires the whole training dataset to be held in memory at once, which, given the large amount of data, led to memory issues. We had to limit the number and depth of the trees, which probably decreased the performance of the models. An online (mini-batch) implementation, such as the ones proposed in [5] or [6], could be used to overcome this problem (a naive approximation of this idea is sketched in the appendix).

As mentioned in the foreword, the predictions were supposed to be fused with those of another algorithm. Some preliminary tests on the training set (splitting it into a new 70% training / 30% test set) showed that this fusion improved the performance from 0.13 to 0.2 MAP, using the same random forest parameters as above. Combining this fusion with a mini-batch implementation of random forests, allowing more and deeper trees, might have improved the performance further.

We believe that the weaknesses of our method lie in two main aspects. First, apart from the MFCC derivatives, which describe only very short-term changes, we did not encode any temporal information. It has been shown in [7] that extracting some signal modulations helps in automatic bird identification, and we assume that the analysis of acoustic sequences, as described for instance in [8], might also help. (We have not yet run any experiment proving that sequence analysis helps in automatic bird identification, nor have we found studies showing it. However, it has been shown that some acoustic sequences in bird songs can be explained by simple hidden Markov processes [9]; while this does not prove that sequence analysis can help in our task, it does show that some sequencing rules exist, and we assume they might be useful.) Secondly, while we focused here on the voiced parts of the signal, some bird vocalizations are unvoiced and should also be taken into account, by adding features that describe them and by modifying our detection of segments of interest accordingly.

References

1. Wang, A.: An Industrial Strength Audio Search Algorithm. In: ISMIR, pp. 7–13 (2003)
2. Sebastiani, F.: Machine learning in automated text categorization. In: ACM Computing Surveys 34, pp. 1–47 (2002)
3. Ma, B., Li, H.: Spoken language identification using bag-of-sounds (2005)
4. Aucouturier, J.-J., Defreville, B., Pachet, F.: The bag-of-frames approach to audio pattern recognition: a sufficient model for urban soundscapes but not for polyphonic music. In: Journal of the Acoustical Society of America 122.2, pp. 881–891 (2007)
5. Saffari, A. et al.: Online random forests. In: 3rd IEEE ICCV Workshop on On-line Computer Vision (2009). Code: https://github.com/amirsaffari/online-multiclass-lpboost
6. Lakshminarayanan, B. et al.: Mondrian forests: efficient online random forests. In: Advances in Neural Information Processing Systems (2014). Code: https://github.com/balajiln/mondrianforest
7. Stowell, D., Plumbley, M.D.: Large-scale analysis of frequency modulation in birdsong databases. In: Methods in Ecology and Evolution 5.9, pp. 901–912 (2014)
8. Kershenbaum, A. et al.: Acoustic sequences in non-human animals: a tutorial review and prospectus. In: Biological Reviews (2014)
9. Katahira, K. et al.: Complex sequencing rules of birdsong can be explained by simple hidden Markov processes. In: PLoS ONE 6.9, e24516 (2011)
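Appendix: a memory-friendly training sketch

The following sketch is not an implementation of the online random forests of [5] or [6]; it is only a naive approximation of the mini-batch idea mentioned in Section 4: train a small forest on each chunk of training files and average the class probabilities of the per-chunk forests at prediction time. The chunks iterable is a hypothetical source of (X, y) pairs that each fit in memory, and every chunk is assumed to contain examples of every class so that the forests agree on the class ordering.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_chunked(chunks, trees_per_chunk=50):
    """Train one small forest per data chunk."""
    forests = []
    for X, y in chunks:
        clf = RandomForestClassifier(n_estimators=trees_per_chunk,
                                     min_samples_split=10)
        forests.append(clf.fit(X, y))
    return forests

def predict_proba(forests, X):
    """Average the class probabilities of the per-chunk forests."""
    return np.mean([f.predict_proba(X) for f in forests], axis=0)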