SVM Candidates and Sparse Representation for Bird Identification

Rodrigo Martinez1, Laura Silva2, Esau Villarreal2,3, Gibran Fuentes3, and Ivan Meza3

1 Facultad de Ciencias (FC) http://ciencias.unam.mx
2 Facultad de Estudios Superiores - Zaragoza (FES-Zaragoza) http://www.zaragoza.unam.mx
3 Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas (IIMAS) http://www.iimas.unam.mx
Universidad Nacional Autonoma de Mexico (UNAM) http://www.unam.mx

Abstract. We present a description of our approach for the Bird Identification task of LifeCLEF 2014. Our approach consists of four stages: (1) a filtering stage for the audio bird recordings; (2) a segmentation stage for the extraction of syllables; (3) a candidate generation stage based on HOG features from the syllables using an SVM; and (4) a species identification stage using Sparse Representation-based Classification of HOG and LBP features. Our approach ranked seventh team-wise in the challenge, with its poor performance traced mainly to the fourth stage.

1 Introduction

In this work we present the description of our system submitted to the LifeCLEF 2014 Bird task [2], part of the LifeCLEF 2014 Laboratory [3]. This task is concerned with the identification of bird species based on their singing. This setting has potential applications in ecological surveillance and biodiversity conservation. This year the task was formally defined as:

The task will be focused on bird identification based on different types of audio records over 501 species from South America centered on Brazil. Additional information includes contextual meta-data (author, date, locality name, comment, quality rates). The main originality of this data is that it was specifically built through a citizen sciences initiative conducted by Xeno-canto, an international social network of amateur and expert ornithologists.
This makes the task closer to the conditions of a real-world application: (i) audio records of the same species come from distinct birds living in distinct areas; (ii) audio records come from different users who might not use the same combination of microphones and portable recorders; (iii) audio records are taken at different periods of the year and different hours of the day, involving different background noise (other bird species, insect chirping, etc.).1

At the core of our approach is Sparse Representation-based Classification (SRC) [5], a methodology that has been quite successful in face recognition. We adapted SRC to work at the syllable level. In addition, our approach is composed of filtering, syllable extraction, and candidate generation stages. In the filtering stage, the audio recordings are uniformly processed to be on the same bandwidth and to eliminate stationary noise. In the syllable extraction stage, the system identifies a set of syllables based on a short-time energy filter. Finally, we generate a set of candidate species based on the syllable information: for a recording, the sets of candidates per syllable are ranked to generate a unique set.

The outline of this paper is as follows. Section 2 presents the architecture of our approach. Section 3 explains the preprocessing stage, Section 4 the extraction of syllables, Section 5 the candidate generation stage, and Section 6 the identification stage. Section 7 presents our results. Finally, Section 8 presents some conclusions and discusses future work.

2 Architecture of the approach

Our approach is composed of four stages, represented in Figure 1. The first stage filters the recording in the frequency domain. The second stage extracts bird syllables from the filtered signal; we have two settings for this, a coarse and a fine-grained setting. The third stage has the goal of creating a set of n candidates given a syllable.
A final set of candidates for the recording is produced by combining the candidate sets of each syllable. The model for the candidates is generated using the fine-grained syllables. Finally, the fourth stage has the goal of performing the identification using Sparse Representation-based Classification of the syllables. It tries to select, from the candidate set, the species with the most resemblance to the examples from a dictionary based on hand-picked syllables. These syllables come from the coarse segmentation, and the stage relies on the representation of the syllable as a visual feature.

Figure 1. System architecture for the identification of species of bird singing.

3 Filtering

There are two aspects to our filtering of the signal. First we re-sample the original recordings from 44100 Hz to 16000 Hz. After this we apply a bandpass FIR filter between the 500 Hz and 4500 Hz frequencies. Empirically we identified that most of the singing frequencies were located in this bandwidth; however, the low performance of our approach suggests this assumption should be reviewed. Figure 2 shows the effect of this filtering on one of the recordings.

Figure 2. Example of the filtering of a recording. The spectrogram above represents the original recording; the spectrogram below represents the recording after filtering.

4 Segmentation of syllables

The segmentation of syllables is done using a short-time energy filter. A threshold defines what is considered activity. Each recording in the database is segmented after being filtered in the previous stage.

1 From http://www.imageclef.org/node/180 (May, 2014)
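The filtering and short-time energy segmentation described above can be sketched as follows. This is a minimal illustration assuming SciPy; the 513-tap filter length, the 10 ms frame size, and the 0.1 relative energy threshold are illustrative choices, not values reported in the paper.

```python
import numpy as np
from math import gcd
from scipy.signal import firwin, lfilter, resample_poly

def preprocess(signal, sr=44100, target_sr=16000, low=500.0, high=4500.0):
    """Resample to 16 kHz and bandpass-filter between 500 and 4500 Hz."""
    g = gcd(target_sr, sr)
    x = resample_poly(signal, target_sr // g, sr // g)  # 44100 Hz -> 16000 Hz
    taps = firwin(513, [low, high], pass_zero=False, fs=target_sr)
    return lfilter(taps, 1.0, x)

def segment_syllables(x, sr=16000, frame_ms=10, threshold=0.1):
    """Short-time energy segmentation; threshold is relative to peak energy."""
    frame = int(sr * frame_ms / 1000)
    n = len(x) // frame
    energy = np.array([np.sum(x[i*frame:(i+1)*frame] ** 2) for i in range(n)])
    active = energy > threshold * energy.max()
    # Group consecutive active frames into (start, end) sample spans.
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start * frame, i * frame))
            start = None
    if start is not None:
        segments.append((start * frame, n * frame))
    return segments
```

Each returned span can then be cut from the spectrogram and resized to the 100x100 normalized syllable image used in the later stages.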
Figure 3 shows the syllables identified by this stage.

Figure 3. Segmented syllables for a recording using a short-time energy filter.

Syllables are normalized by resizing them to a specific size in the frequency domain (100x100 pixels). Figure 4 shows a syllable after the normalization process. We have defined two thresholds, for fine and coarse segmentation. Table 1 summarizes the number of syllables extracted for each type of segmentation in both databases.

Table 1. Number of syllables extracted from the recordings.

                Train    Test
Fine grained    52,666   23,953
Coarse grained  47,899

5 Candidates generation

A set of candidates for the possible species is generated using a Support Vector Machine. For this we extract visual features from the segmented syllables. In particular, we experimented with Histograms of Oriented Gradients (HOG) [1] and Local Binary Patterns (LBP) [4]. For each syllable in a recording we extract the candidates; these candidate sets are agglomerated into a final set. Figure 5 shows the effect of the size of the set on how often it contains the target species. For our experiments we define the size of the set to be 200 candidate species.

6 Sparse Representation-based Classification

For the identification stage we follow the method of Wright et al. [5], originally proposed for the face recognition problem, where it has been successful. We adapted the methodology to work at the syllable level for this bird task.

Figure 4. Normalized syllable.

Figure 5. Percentage of times the target species belongs to the candidates (blue curve). Percentage of times the target species is the first candidate (red line).

The method relies on a dictionary representation of the syllables of the i species.
Each species is represented by j instances of the species' syllables, each given by a vector of arity m. Together, candidates and instances define the dictionary matrix A of dimension m x N (where N = i x j). Given an unknown instance of a syllable y, the goal of the method is to identify the vector x which represents the contributions of the elements of the dictionary A needed to generate the syllable y; in other words, the contribution of each dictionary syllable to generating the unknown syllable. Once the contribution of each element of the dictionary is identified, it is a matter of quantifying the contribution made by each candidate and deciding if that contribution is enough to conclude that they represent the same species.

In order to identify the contribution of each candidate, SRC uses the ℓ1 minimization:

x = argmin ||x||1 subject to Ax = y    (1)

This minimizes the sum of the individual contributions of the candidates such that the multiplication of the dictionary matrix A and the contributions x generates the unknown instance. To perform this minimization we use the homotopy method, since it quickly produces a good approximation of the vector x [6]. Once the vector x is identified, the method calculates the residual per candidate in the dictionary:

ri = ||y − A xi||2    (2)

in which xi is the contribution vector with the values for the candidates different from i zeroed. In this way, ri represents a score of the contribution of the instances of candidate i. After calculating the residual per candidate, we identify the candidate with the lowest residual; this is the one least different from the unknown syllable, and it is our identified species.

We adapted this setting for the identification of bird species. First, the candidate species were extracted from the third stage of our system. Given the list of candidates from this stage we generated the matrix A. For our experiments i was variable, but j was set to 5.
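The minimization (1) and the residual rule (2) can be sketched as below. The paper solves (1) with the homotopy method of [6]; here scikit-learn's Lasso is used as a stand-in sparse solver, so this is an approximation of the procedure, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(A, y, labels, alpha=0.01):
    """Sparse Representation-based Classification (sketch).

    A      : (m, N) dictionary, one column per training syllable
    y      : (m,)   unknown syllable feature vector
    labels : (N,)   species id of each dictionary column
    """
    # Approximate the l1 minimization of Eq. (1); Lasso is a stand-in
    # for the homotopy solver used in the paper.
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    lasso.fit(A, y)
    x = lasso.coef_

    # Residual per candidate species, Eq. (2): zero out every
    # coefficient not belonging to species i, then take ||y - A x_i||_2.
    residuals = {}
    for i in np.unique(labels):
        xi = np.where(labels == i, x, 0.0)
        residuals[i] = np.linalg.norm(y - A @ xi)

    # The identified species is the one with the lowest residual.
    return min(residuals, key=residuals.get), residuals
```

With j = 5 as in our experiments, each species contributes five columns to A, and the residual aggregates their joint contribution.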
In particular, the instances of A were hand-picked by experts in the field as good syllable examples for a species. The size m of the vector depended on the representation, HOG or LBP features. At this stage we used the coarse syllables extracted in the second stage. We performed the methodology explained above for each syllable. We collected the identified species and ranked them by the probability obtained in the candidate generation stage. This sorted list was used to produce the output required by the challenge.

7 Experimental Results

We submitted three configurations of our system:

100 HOG + HOG The sparse system used the 100 top candidates generated with HOG features, and the HOG features to identify the species.
50 HOG + HOG The sparse system used the 50 top candidates generated with HOG features, and the HOG features to identify the species.
50 HOG + LBP The sparse system used the 50 top candidates generated with HOG features, and the LBP features to identify the species.

Table 2. Mean average precision of the identification of bird species in testing.

                With background species   Without background species
100 HOG + HOG   10.5%                     12.9%
50 HOG + HOG    10.4%                     12.8%
50 HOG + LBP     7.4%                      8.9%

As can be noticed, the performance of our system was poor. Reducing the number of candidates did not produce a significant change, while the use of LBP hurt the performance. In order to analyse the labellings produced by our system, we analysed the results over a subset of the training corpus: the recordings marked with more than one bird singing. We found that only 209 bird species were correctly recovered. Table 3 shows the 10 most successful species. However, our performance is so poor that it is hard to account for the errors on the rest of the species at the moment.

Table 3. List of the best identified species in recordings with more than one species registered.
Laterallus viridis
Emberizoides herbicola
Cercomacra melanaria
Cyanocorax cristatellus
Nyctibius griseus
Setopagis parvulus
Hypocnemis hypoxantha
Procnias nudicollis
Synallaxis cinerascens
Crypturellus tataupa

8 Conclusions and Future work

These working notes present our system proposal for the identification of bird species through their singing. This proposal was built in the context of the LifeCLEF 2014 Bird task [2], part of the LifeCLEF 2014 Laboratory [3]. Our approach first generates candidates using an SVM and then identifies the species at the syllable level using a sparse representation. As a result of the challenge we identified several problems with our setting, which had a poor performance in the challenge (25% of the best result). We have several hypotheses about what could have gone wrong. First, our filtering of the recordings was too aggressive. Second, the segmentation of the syllables was not at the right level and in many cases segmented more than one syllable. Third, the identification stage failed to identify the species at a good rate (17% with 100 candidates). Fourth, we did not use information from the metadata or at the song level of the species.

In the future we aim to generate a better setting for the filtering and syllable segmentation, maybe by incorporating elements of other approaches. We would also like to continue experimenting with the identification setting through SRC, either to fully discard it or to find the correct way to set it up for the task of bird identification.

References

1. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. Conference on Computer Vision and Pattern Recognition, San Diego, USA (June 2005)
2. Goëau, H., Glotin, H., Vellinga, W.P., Rauber, A.: Lifeclef bird identification task 2014. In: CLEF working notes 2014 (2014)
3. Joly, A., Müller, H., Goëau, H., Glotin, H., Spampinato, C., Rauber, A., Bonnet, P., Vellinga, W.P., Fisher, B.: Lifeclef 2014: multimedia life species identification challenges.
In: Proceedings of CLEF 2014 (2014)
4. Wang, L., He, D.: Texture classification using texture spectrum. Pattern Recognition (8), 905–910 (1990)
5. Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y.: Robust face recognition via sparse representation. Pattern Analysis and Machine Intelligence, IEEE Transactions on 31(2), 210–227 (2009)
6. Yang, A., Zhou, Z., Balasubramanian, A., Sastry, S., Ma, Y.: Fast ℓ1-minimization algorithms for robust face recognition. Image Processing, IEEE Transactions on 22(8), 3234–3246 (Aug 2013)