Bird-Species Audio Identification: Ensembling 1D + 2D Signals

Gyanendra Das¹, Saksham Aggarwal¹
¹ Indian Institute of Technology, Dhanbad, India

Abstract
In this paper, we describe a method for recognizing bird species in audio recordings. We experimented with four different approaches. The spectrogram- and waveform-domain approaches each consist of two main models: (1) a binary classifier that predicts whether a bird call is present in the audio; (2) a multiclass classifier that predicts which bird is present. Combining these two signal types, 1D and 2D, gives strong results. We also experiment with ATDemucs, which extends Demucs by replacing the BiLSTM with self-attention. In this approach, we first perform source separation of the multiple birds, together with noise separation, as universal source separation. We then classify each separated source, both with a 1D waveform model based on ReSE-2-Multi with self-attention and with a 2D spectrogram model. We also discuss how we handle the different best thresholds of different models through a post-processing technique. Ensembling techniques such as voting, scaling, and direct averaging gave a good boost to our results. Our combined architecture using 1D and 2D signals achieves a micro-averaged F1 score of 0.6179 on the task of classifying 397 bird species.

Keywords
Deep Learning, Bird Species Classification, Transfer Learning, Attention Mechanism, Sound Detection, Audio Source Separation, Demucs, ResNet-50, EfficientNet, Ensembling, Multi-Domain Meta Training

1. Introduction
There are about 10,000 different bird species in the world, and they all play an important role in the natural world. They serve as good indicators of declining habitat quality and pollution, and it is often easier to hear birds than it is to see them. BirdCLEF 2021 [3] - Birdcall Identification is a Kaggle competition organized by the Cornell Lab of Ornithology in collaboration with LifeCLEF 2021 [1]; the challenge is to identify which birds are calling in long recordings, given training data generated in meaningfully different contexts. This paper first describes the competition and its data, so that the challenges posed by the train and test sets are clear, and then details each of our approaches, including data preparation, augmentations, model building, training procedures, and post-processing techniques.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
gyanendralucky9337@gmail.com (G. Das); sakshamaggarwal20@gmail.com (S. Aggarwal)
https://luckygyana.github.io/Portfolio/ (G. Das); https://github.com/saksham20aggarwal (S. Aggarwal)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Data
This section gives a brief overview of the data provided in the competition. Training on the data posed several challenges, since the train and test data are of different types.

2.1. Training Data
The training data mainly comprises two types of audio recordings.
Train short audio: The bulk of the training data consists of short recordings of individual bird calls generously uploaded by users of xeno-canto.org. These files have been downsampled to 32 kHz where applicable to match the test set audio and converted to the ogg format. Recordings of 397 unique species are provided. Along with the audio files, metadata is also provided, consisting of primary label, secondary labels, type, latitude, longitude, scientific name, common name, author, date, filename, license, rating, time, and URL.
Train soundscapes: There is a distinct shift in acoustic domain between the training and test sets, so some soundscape recordings from the test-set distribution were provided for training and validation. These 20 recordings represent 2 of the 4 test recording locations and are 10 minutes long each. The metadata indicates which birds are present in each 5-second window of the training soundscapes; a nocall label is assigned when no bird is present.
All labels for the train short audio had to be treated as weak labels: we knew which species are audible in a recording, but not the exact timestamps of the vocalizations. Training with weakly labeled data was one of the core challenges of this competition. The secondary labels list the audible background species as annotated by the author; these lists may be incomplete and are not very reliable. The training data also has a long-tail distribution, making the dataset highly imbalanced: some head-class species have more than 500 training recordings, whereas some tail-class species have a mere 10–20.

2.2. Test Data
The test data consists of approximately 80 audio recordings similar to the train soundscapes, each 10 minutes long and drawn from 4 locations. We need to identify the birds present in each 5-second window throughout the audio.

3. Our Approach
We used four different approaches to train our models:
• Model on Spectrograms
• Model on Waveform Domain
• Multi-Domain Meta Training
• ATDemucs

4. Model on Spectrograms
In this approach, we trained models on mel-spectrograms. We trained two types of models: Model A and Model B. Model A was trained to predict whether a bird is present in an audio clip, i.e., it is a binary classification model; to train it, we used the external Freefield1010 [2] dataset along with the competition data. Model B was trained to classify the bird species. Official competition data [3] was used for this model, and we tried not to feed it any nocall examples, making use of the weak labels generated by running Model A on the competition dataset.

4.1. Data Preparation
• Resample the audio to a 22,050 Hz sampling rate.
• Let time_d be the accepted duration of an audio sample. We choose a random chunk of length time_d from the audio sample.
• Let min_s be the accepted minimum duration of the sub-image. If a chunk is shorter than min_s, we pad it back to length min_s.
• Compute three mel-spectrograms M_i(x) with window sizes W_i ∈ {128, 512, 2048}.
• Stack the three M_i(x) into one 3-channel RGB multiscale image I (a sketch follows this list).
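The following is a minimal sketch of this multiscale input, not the authors' exact code. The hop length (512), mel-bin count (128), dB scaling, fixed FFT size, and the file name are our assumptions; the paper specifies only the 22,050 Hz sample rate and the window sizes {128, 512, 2048}.

```python
import numpy as np
import librosa

def multiscale_melspec(chunk: np.ndarray, sr: int = 22050,
                       window_sizes=(128, 512, 2048),
                       hop_length: int = 512, n_mels: int = 128) -> np.ndarray:
    """Return the 3-channel multiscale image I with shape (3, n_mels, frames)."""
    channels = []
    for w in window_sizes:
        # Vary the analysis window while keeping n_fft and hop_length fixed,
        # so all three mel-spectrograms share the same (n_mels, frames) shape.
        m = librosa.feature.melspectrogram(y=chunk, sr=sr, n_fft=2048,
                                           win_length=w, hop_length=hop_length,
                                           n_mels=n_mels)
        channels.append(librosa.power_to_db(m, ref=np.max))
    return np.stack(channels, axis=0)

# Example: a random chunk of time_d = 5 seconds from a longer recording.
y, sr = librosa.load("XC12345.ogg", sr=22050)   # hypothetical file name
start = np.random.randint(0, max(1, len(y) - 5 * sr))
image = multiscale_melspec(y[start:start + 5 * sr])  # I: (3, 128, frames)
```

Keeping one hop length across all three scales is what lets the spectrograms be stacked as channels of a single image.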
4.2. Model Building
We transfer learning from state-of-the-art ImageNet models to sound classification. For Model A we took three pretrained models: EfficientNet-B0 [5], ResNet-50 [4], and DenseNet [6]. We noticed that SpecAugment [7] was not giving good results, but SpecChannelShuffle increased model performance by 0.07. EfficientNet-B0 gave the highest F1 score of 0.91, and by blending the three models we reached an F1 score of 0.93 on the Freefield1010 [2] dataset for classifying whether any bird is present.
For Model B we experimented with many pretrained models, including EfficientNet B0–B4, ResNet-50, NFNet [8], and ResNeXt-101 WSL [9]; the results are reported in Section 9 (Table 1). Here SpecAugment worked very well.

4.3. Augmentation
We applied data augmentation during the training stage.
• SpecAugment: SpecAugment is a popular augmentation technique applied to spectrograms. The spectrogram is transformed by warping it in the time direction, masking blocks of consecutive frequency channels, and masking blocks of consecutive time steps. We noticed that SpecAugment increased model performance without requiring any further model or training-parameter tweaks.
  – TimeMasking: t consecutive time steps [t_0, t_0 + t) are masked, where t is chosen from a uniform distribution from 0 to the time-mask parameter T, and t_0 is chosen from [0, τ − t), where τ is the number of time steps.
  – FrequencyMasking: f consecutive frequency channels [f_0, f_0 + f) are masked, where f is chosen from a uniform distribution from 0 to the frequency-mask parameter F, and f_0 is chosen from [0, ν − f), where ν is the number of frequency channels.
• SpecChannelShuffle: Shuffle the channels of a multichannel spectrogram (channels last). This can help combat positional bias.
• MixUp [10]: We applied mixup according to the primary labels; that is, we combined pairs of mel-spectrograms with a coefficient α drawn from a Beta distribution and took the weighted average of the target labels with the same α. Mixup helps reduce memorization of corrupt labels and acts as a good regularizer during training.

Image_i = α · Image_i + (1 − α) · Image_j
Target_i = α · Target_i + (1 − α) · Target_j

Here Image represents the raw input image array and Target represents the label (one-hot encoding) of the corresponding image. A short sketch follows.
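The sketch below implements the two equations above. Drawing one α per batch and the Beta(0.4, 0.4) concentration are our assumptions; the paper states only that α comes from a Beta distribution.

```python
import torch

def mixup(images: torch.Tensor, targets: torch.Tensor, beta: float = 0.4):
    """images: (B, 3, n_mels, frames) batch; targets: (B, n_classes) one-hot."""
    alpha = torch.distributions.Beta(beta, beta).sample().item()
    perm = torch.randperm(images.size(0))  # pair each sample i with a random j
    mixed_images = alpha * images + (1 - alpha) * images[perm]
    mixed_targets = alpha * targets + (1 - alpha) * targets[perm]
    return mixed_images, mixed_targets
```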
4.4. Training Procedure
The training procedure used for the two models is as follows.
Model A: The model was fed both Freefield1010 and competition data, with the above augmentations applied. Smaller models were trained for 15 epochs, while larger models were trained for 8 epochs. We used a linear learning-rate warmup for the first few epochs; after reaching its peak of 0.002, the learning rate was linearly decayed. The Adam [11] optimizer gave the best results for this model.
Model B: The model was fed competition data only, with augmentations similar to those of Model A. Smaller models were trained for 40 epochs, while larger models were trained for 25 epochs. A similar learning-rate schedule was used as for Model A, and the optimizer was again Adam. While training, we froze all but the last few layers for the initial epochs to help the model converge faster; then all layers were unfrozen and trained for the remaining epochs.

5. Model on Waveform Domain
In this approach, we trained models on raw audio samples in the waveform domain. Here too we train two types of models, Model A and Model B (Figure 1). Model A was trained to predict whether a bird is present in an audio clip, i.e., a binary classification model. Model B was trained to classify the bird species.

Figure 1: Pipeline of the spectrogram- and waveform-domain model training.

5.1. Data Preparation
We resampled the raw waveform to a 16,000 Hz sampling rate. Let max_l be the maximum audio length. If a clip was shorter than max_l, we zero-padded it at one end; if it was longer than max_l, we cut the audio from both sides. A short sketch follows.
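A minimal sketch of this preparation, under stated assumptions: expressing max_l in samples and loading with librosa are our choices; the paper gives only the 16,000 Hz rate, one-sided zero padding, and two-sided cropping.

```python
import numpy as np
import librosa

def prepare_waveform(path: str, max_l: int, sr: int = 16000) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)            # resample to 16 kHz
    if len(y) < max_l:
        y = np.pad(y, (0, max_l - len(y)))      # zero-pad at one end
    elif len(y) > max_l:
        extra = len(y) - max_l                  # cut equally from both sides
        y = y[extra // 2 : extra // 2 + max_l]
    return y

# Example: a hypothetical 5-second max length (80,000 samples at 16 kHz).
wave = prepare_waveform("XC12345.ogg", max_l=5 * 16000)
```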
5.2. Model Building
This model is highly motivated by ReSE-2-Multi [12]. With frame-level raw-waveform input, the bottom-layer filters must learn all conceivable phase variations of the (pseudo-)periodic waveforms that are likely to be present in audio signals. This requirement has hampered the use of raw waveforms as input compared to spectrogram-based representations, in which the phase fluctuation within a frame (i.e., the time shift of periodic waveforms) is removed by taking only the magnitude. So we added an attention layer between two fully connected (FC) layers (Figure 2). The result is a simple convolutional, long short-term memory, fully connected deep neural network (CLDNN) model [13] with residual connections, which acts on the high-level features of the audio data.

5.3. Augmentation
• AddImpulseResponse: Convolve the audio with a random impulse response.
• TimeMask: Make a randomly chosen part of the audio silent.
• AddGaussianSNR: Add Gaussian noise to the samples with a random signal-to-noise ratio (SNR) [14].
• AddGaussianNoise: Add Gaussian noise to the samples.
• We add pink noise at variable volumes, as well as random soundscapes.
• We also used a Butterworth filter with stochastic cutoffs (randomly low-pass, high-pass, band-pass, or band-stop).

Figure 2: ReSE-2-Multi with attention for waveform-domain model training.

5.4. Training Procedure
Model A used both Freefield1010 and competition data for training, whereas Model B used competition data only. The augmentations listed above were applied to the raw audio samples. The rest of the training procedure is very similar to that of the spectrogram ("wave-gram") models.

6. Multi-Domain Meta Training
After training on the whole dataset in both the spectrogram and waveform domains, we tested our hypothesis that combining the results from both domains lets each model benefit from the other model's domain knowledge. For training, we froze all but the last 5 layers of the wave-gram (spectrogram) model M_g, and all but the last 3 layers of the waveform-domain model M_f. We computed the loss as below, which back-propagates through both models:

O_g,i = M_g(Spec(X_i))
O_f,i = M_f(X_i)
Loss_i = Criterion(O_g,i, T_i) + Criterion(O_f,i, T_i)

This technique boosted our cross-validation score by 0.05. A minimal sketch of the joint step follows.
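In the sketch below, the BCE-with-logits criterion and freezing by parameter tensor (as a rough proxy for "layers") are our assumptions; the paper specifies only the summed two-domain loss.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

def freeze_all_but_last(model: nn.Module, n_last: int) -> None:
    """Freeze every parameter tensor except the last n_last."""
    params = list(model.parameters())
    for p in params[:-n_last]:
        p.requires_grad = False

def joint_step(spec_model: nn.Module, wave_model: nn.Module, to_spec,
               x: torch.Tensor, t: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    """spec_model is M_g, wave_model is M_f, and to_spec is the Spec(.)
    transform (e.g. a torchaudio MelSpectrogram); the summed loss
    back-propagates through the unfrozen layers of both models."""
    o_g = spec_model(to_spec(x))                  # O_g,i = M_g(Spec(X_i))
    o_f = wave_model(x)                           # O_f,i = M_f(X_i)
    loss = criterion(o_g, t) + criterion(o_f, t)  # Loss_i, summed criteria
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# As in the paper: keep the last 5 layers of M_g and last 3 of M_f trainable.
# freeze_all_but_last(spec_model, 5); freeze_all_but_last(wave_model, 3)
```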
7. ATDemucs

7.1. Motivation
In the test set and train soundscapes, an audio file can contain several different types of birds. We thought of separating them and then training the classification models, thereby introducing the music source separation concept into the multi-class classification task. The model is highly motivated by Demucs [15]. We provide the code in our GitHub repository (see footnote 1).

7.2. Data Preparation
We discovered that an audio sample in the train soundscapes data typically contains a maximum of 5 birds. So we took a hyper-parameter SepNo and mixed SepNo short bird recordings. In another experiment, we also mixed in nocall data from Freefield1010 and treated nocall as another source to be separated. We followed the same data-preparation steps as for the waveform-domain dataset, taking SepNo different short audio clips A_i and mixing them as

mixture = Σ_{i=1}^{SepNo} A_i

For the second-stage training of this model, we prepared the train soundscapes data by dividing it into chunks of length max_l and trained on the pseudo-labels predicted by the first-stage model.

7.3. Model Building
What is the difference between Demucs and ATDemucs? Demucs consists of a downsample block, then a BiLSTM [16] layer, then an upsample block. In ATDemucs (Figure 3) there is attention in place of the LSTM layer and in the upsample block: in our method, we apply cross-attention between the downsample output and the upsample output.
Downsample block: The downsample block is made up of a convolution with kernel size K = 8, stride S = 4, C_{i−1} input channels, C_i output channels, and ReLU activation, followed by a 1×1 convolution with GLU [17] activation. We doubled the number of channels in the 1×1 convolution, since a GLU outputs C/2 channels given C input channels.
Horizontal trans block: We replace the BiLSTM layer with a self-attention [18] layer with 8 heads, dropout 0.2, and hidden size C_L. This block outputs 2·C_L channels per time position, and we use a 1×1 convolution with ReLU activation to bring that number down to C_L.
Upsample block: The upsample block is nearly symmetrical to the downsample block. It is made up of a convolution with kernel size 3 and stride 1, input/output channels C_i, and ReLU [19] activation. Instead of the simple skip concatenation used in Demucs, we introduce a cross-attention layer in which the query comes from the downsample block and the key and value come from the upsample block (see the sketch after Figure 3). We then return to C_i channels with a 1×1 convolution with GLU activation. Finally, we employ a transposed convolution with kernel size K = 8, stride S = 4, C_{i−1} outputs, and ReLU activation. For the final layer, we output 4·C_0 channels and use no activation function.

Attention(Q_D, K_U, V_U) = Softmax(Q_D · K_U^T / √d_k) · V_U

where Q_D is the query from the corresponding downsample layer, and K_U and V_U are the key and value from the upsample layer.

Figure 3: ATDemucs (Attention + Demucs). It consists of three types of blocks: DownBlock, HorizontalTransBlock, and UpTransBlock. As the names suggest, we use attention in the HorizontalTransBlock and UpTransBlock.
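A minimal sketch of the cross-attention skip that replaces the plain skip concatenation of Demucs. Reusing 8 heads and dropout 0.2 from the horizontal trans block, and using nn.MultiheadAttention itself, are our assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionSkip(nn.Module):
    """Query from the downsample block; key and value from the upsample block."""
    def __init__(self, channels: int, num_heads: int = 8, dropout: float = 0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads,
                                          dropout=dropout, batch_first=True)

    def forward(self, down: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
        # down, up: (batch, channels, time) feature maps of matching shape.
        q = down.transpose(1, 2)        # Q_D
        kv = up.transpose(1, 2)         # K_U and V_U
        out, _ = self.attn(q, kv, kv)   # Softmax(Q_D K_U^T / sqrt(d_k)) V_U
        return out.transpose(1, 2)

# Usage: stands in for the `torch.cat([up, skip], dim=1)` of a Demucs-style decoder.
layer = CrossAttentionSkip(channels=64)
fused = layer(torch.randn(2, 64, 400), torch.randn(2, 64, 400))  # (2, 64, 400)
```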
7.4. Augmentation
• Shift: Randomly shift the audio in time by up to `shift` samples.
• FlipChannels: Flip the left and right channels.
• FlipSign: Randomly flip the sign of the waveform.
• Remix: Within a batch, shuffle the sources. Each batch is divided into groups of size group_size, and shuffling is done separately inside each group.

7.5. Training Procedure
We trained this model in two stages.
First stage: We first trained the model on mixtures of short audio, where each training input is a combination of 5 different birds' short recordings. We trained the model to separate these recordings from one another, for 150 epochs with a learning rate of 0.003. We used cosine annealing as the learning-rate scheduler, which starts with a large learning rate that is relatively rapidly decreased to a minimum value before being rapidly increased again. The AdamW [20] optimizer gave better results than the others.
Second stage: In the train soundscapes, we were given primary labels for the audio recordings at each 5-second timestamp. So, after the first stage, we ran inference on the train soundscapes and used pseudo-labeling to fine-tune the model. We trained the model for 5 epochs with a low learning rate, again with AdamW, freezing some of the initial layers during training.

7.6. Classification After Separation
Once our model has been trained to separate the different bird sounds from the main audio recording, we run a classification model on the separated sources to classify the bird species. For this, we used a ResNet-50 model with pretrained weights, trained for approximately 20 epochs with the Adam optimizer. We obtained a cross-validation score of 0.62.

8. Post Processing
We used two post-processing techniques; a sketch of both follows.
Scaling method: We noticed that different models have different best thresholds, so we decided to bring them onto a common scale before adding the logits. Let MinTh be the minimum of the best thresholds over all the models to be ensembled. We scale each model's outputs so that predictions below that model's best threshold map into the range [0, MinTh) and predictions above its best threshold map into the range [MinTh, 1]. We then average all the scaled outputs and predict every bird whose averaged score exceeds MinTh.
Voting ensemble: Let MinC be the minimum number of models, out of N models in total, in which a bird should be predicted. We predict every bird that is predicted by more than MinC of the N models.
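The piecewise-linear form of the rescaling below is our reading of the scaling method; the paper states only the target ranges [0, MinTh) and [MinTh, 1].

```python
import numpy as np

def rescale(probs: np.ndarray, best_th: float, min_th: float) -> np.ndarray:
    """Map [0, best_th) onto [0, min_th) and [best_th, 1] onto [min_th, 1]."""
    below = probs / best_th * min_th
    above = min_th + (probs - best_th) / (1 - best_th) * (1 - min_th)
    return np.where(probs < best_th, below, above)

def scaling_ensemble(model_probs, best_ths):
    """model_probs: list of (n_clips, n_species) arrays, one per model."""
    min_th = min(best_ths)
    scaled = [rescale(p, th, min_th) for p, th in zip(model_probs, best_ths)]
    return np.mean(scaled, axis=0) > min_th   # boolean predictions

def voting_ensemble(model_preds, min_count: int):
    """model_preds: list of boolean arrays; keep birds with > min_count votes."""
    return np.sum(model_preds, axis=0) > min_count
```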
We submitted three types of inference models:
• Spectrogram model + waveform model: we ensembled all the models with the scaling method above, which gave a cross-validation score of 0.732 and a leaderboard score of 0.6179.
• Multi-domain meta-trained model: we optimized the best threshold on cross-validation and obtained a cross-validation score of 0.705 and a leaderboard score of 0.6167 at a threshold of 0.15.
• ATDemucs: we obtained a cross-validation score of 0.623 and a leaderboard score of 0.59.
There are still many avenues for increasing model accuracy.

9. Results
Table 1 shows the cross-validation scores of the spectrogram-based models (Model B). After scaling all the models, we ensembled them with a threshold of 0.20 and obtained a score of 0.716. The direct-averaging method gave 0.708, whereas the voting classifier gave 0.699.

Table 1: Spectrogram-domain cross-validation scores
Model             Best Threshold   Scaling Method   Direct Averaging
EFF B3            0.10             0.666            0.668
EFF B2            0.25             0.676            0.678
NFNet             0.35             0.666            0.668
EFF B4            0.09             0.661            0.666
EFF B1            0.45             0.663            0.664
ResNeXt-101 WSL   0.40             0.676            0.675
Resnest50 32x4D   0.35             0.686            0.683
ResNet-50         0.25             0.690            0.690
EFF B0            0.30             0.667            0.666
AVG                                0.716            0.708

Table 2: Waveform-domain cross-validation scores
Model                Method                                         Cross-Validation Score
Wavenet Classifier   397-bird classification                        0.655
Wavenet Classifier   Top-200 most frequent bird classification      0.679
ReSE-2-Multi         Nocall binary classification (Freefield1010)   0.895 (AUC)
ReSE-2-Multi         397-bird classification                        0.693

Ensembling the spectrogram-domain results (Table 1) with the waveform-domain results (Table 2) gives a cross-validation score of 0.732 and a leaderboard score of 0.6179. We are still working out a good method, other than averaging, for ensembling all three methods together with ATDemucs. We will update our key findings in the source code (see footnote 1).

10. Conclusion and Future Work
We composed several approaches, specifically a spectrogram architecture, a raw-waveform architecture, and multi-domain meta training. In both the spectrogram model and the raw-waveform model, we used two downstream modules: one for predicting whether a bird is present and the other for multi-label classification of the birds. We then combined these approaches using a loss that back-propagates through both models. We also experimented with the Demucs model and extended its architecture by adding an attention layer in the upsampling block. Ensembling methods, including the voting and scaling methods, helped achieve better results than any individual model. The spectrogram model, together with the scaling method and the downstream modules, gave us our best result on the private leaderboard, placing us 67th in the competition.

1 GitHub repo: https://github.com/Luckygyana/Bird-Species-Audio-Identification-Ensembling-and-1D-2D-Signals

References
[1] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, R. Ruiz De Castañeda, I. Bolon, H. Glotin, R. Planqué, W.-P. Vellinga, A. Dorso, H. Klinck, T. Denton, I. Eggel, P. Bonnet, H. Müller, Overview of LifeCLEF 2021: a system-oriented evaluation of automated species identification and species distribution prediction, in: Proceedings of the Twelfth International Conference of the CLEF Association (CLEF 2021), 2021.
[2] D. Stowell, M. D. Plumbley, freefield1010 - an open dataset for research on audio field recording archives, in: Proceedings of the Audio Engineering Society 53rd Conference on Semantic Audio (AES53), Audio Engineering Society, 2014.
[3] S. Kahl, T. Denton, H. Klinck, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué, A. Joly, Overview of BirdCLEF 2021: Bird call identification in soundscape recordings, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, 2021.
[4] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2015. arXiv:1512.03385.
[5] M. Tan, Q. V. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, 2020. arXiv:1905.11946.
[6] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional networks, 2018. arXiv:1608.06993.
[7] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le, SpecAugment: A simple data augmentation method for automatic speech recognition, Interspeech 2019 (2019). doi:10.21437/Interspeech.2019-2680.
[8] A. Brock, S. De, S. L. Smith, K. Simonyan, High-performance large-scale image recognition without normalization, 2021. arXiv:2102.06171.
[9] D. K. Mahajan, R. B. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, L. van der Maaten, Exploring the limits of weakly supervised pretraining, in: ECCV, 2018.
[10] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, 2018. arXiv:1710.09412.
[11] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2017. arXiv:1412.6980.
[12] J. Lee, T. Kim, J. Park, J. Nam, Raw waveform-based audio classification using sample-level CNN architectures, 2017. arXiv:1712.00866.
[13] T. N. Sainath, O. Vinyals, A. Senior, H. Sak, Convolutional, long short-term memory, fully connected deep neural networks, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4580–4584. doi:10.1109/ICASSP.2015.7178838.
[14] N. Elkum, M. Shoukri, Signal-to-noise ratio (SNR) as a measure of reproducibility: Design, estimation, and application, Health Services and Outcomes Research Methodology 8 (2008) 119–133. doi:10.1007/s10742-008-0030-2.
[15] A. Défossez, N. Usunier, L. Bottou, F. Bach, Demucs: Deep extractor for music sources with extra unlabeled data remixed, 2019. arXiv:1909.01174.
[16] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, 2015. arXiv:1508.01991.
[17] Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, Language modeling with gated convolutional networks, 2017. arXiv:1612.08083.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. arXiv:1706.03762.
[19] A. F. Agarap, Deep learning using rectified linear units (ReLU), 2019. arXiv:1803.08375.
[20] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, 2019. arXiv:1711.05101.