<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Bird-Species Audio Identification, Ensembling 1D + 2D Signals</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Gyanendra</forename><surname>Das</surname></persName>
							<affiliation key="aff0">
<orgName type="institution">Indian Institute of Technology</orgName>
								<address>
									<settlement>Dhanbad</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Saksham</forename><surname>Aggarwal</surname></persName>
							<email>sakshamaggarwal20@gmail.com</email>
							<affiliation key="aff0">
<orgName type="institution">Indian Institute of Technology</orgName>
								<address>
									<settlement>Dhanbad</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Bird-Species Audio Identification, Ensembling 1D + 2D Signals</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">734DDD9029637EC30F354F55E92CF283</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:48+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Deep Learning</term>
					<term>Bird Species Classification</term>
					<term>Transfer Learning</term>
					<term>Attention Mechanism</term>
					<term>Sound Detection</term>
					<term>Audio Source Detection</term>
					<term>Demucs</term>
					<term>Resnet 50</term>
					<term>Efficient Net</term>
					<term>Ensembling</term>
					<term>Multi Domain Meta Training</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we describe a method for recognizing bird species in audio recordings. We experimented with four different approaches. The model on the spectrogram and waveform domains consists of two main models: 1) a binary classifier that predicts whether a bird call is present in the audio; 2) a multiclass classifier that predicts which bird is present. Combining these two approaches, 1D and 2D signals, gives strong results. We also experiment with ATDemucs, which extends Demucs by replacing the BiLSTM with self-attention. In this approach, we first perform source separation of the multiple birds along with noise separation, as in universal source separation. We then classify each separated source using both a 1D waveform model (ReSE-Multi with self-attention) and a 2D spectrogram model. We also discuss how we handle the different thresholds of different models with a post-processing technique. Ensembling techniques such as voting, scaling, and direct averaging gave a good boost to our results. Our combined architecture of 1D and 2D signals achieves a micro-averaged F1 of 0.6179 on the task of classifying 397 bird species.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>There are about 10,000 different bird species in the world, and they all play an important role in the natural world. They serve as good indicators of declining habitat quality and pollution, and it is often easier to hear birds than it is to see them. BirdCLEF 2021 <ref type="bibr" target="#b0">[1]</ref> - Birdcall Identification is a Kaggle competition organized by the Cornell Lab of Ornithology in collaboration with LifeCLEF 2021 <ref type="bibr" target="#b0">[1]</ref>, whose challenge is to identify which birds are calling in long recordings, given training data generated in meaningfully different contexts. This paper first gives details of the competition and the provided data, so that the challenges posed by the train and test data are clear. We then describe in detail the approaches we used for this challenge, including data preparation, augmentations, model building, training procedure, and post-processing techniques.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Data</head><p>This section gives a brief overview of the data provided in the competition. Training on the data posed a lot of challenges since the train and test data were of different types.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Training Data</head><p>The training data mainly comprises two types of audio recordings:</p><p>Train short audio: The bulk of the training data consists of short recordings of individual bird calls generously uploaded by users of xeno-canto.org. These files have been downsampled to 32 kHz where applicable to match the test set audio and converted to the ogg format. Information on 397 unique species is given. Along with the audio files, metadata is provided, consisting of primary label, secondary labels, type, latitude, longitude, scientific name, common name, author, date, filename, license, rating, time, and URL.</p><p>Train soundscapes: There is a distinct shift in acoustic domains between the training and test sets, so some examples of soundscape recordings from the test set were provided for training and validation purposes. These 20 recordings represent 2 of the 4 test recording locations and are 10 minutes long each. The metadata specifies which birds are present in each 5-second timestamp of the training soundscapes; the nocall label is assigned if no bird is present.</p><p>All labels for the train short audio had to be treated as weak labels: we knew which species were audible in a recording, but we did not know the exact timestamps of the vocalizations. Training with weakly labeled data was one of the core challenges of this competition. The secondary labels list the audible background species as annotated by the author; these lists might be incomplete and are not very reliable. The training data also had a long-tail distribution, making the dataset highly imbalanced: head classes contained species with more than 500 training recordings, whereas some classes in the tail had a mere 10-20.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Test Data</head><p>The test set has approximately 80 audio recordings similar to the train soundscapes, each 10 minutes long. We need to identify the birds present in each 5-second timestamp throughout the audio. These recordings come from 4 locations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Our Approach</head><p>We used 4 different approaches to train our models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Model on Spectrograms</head><p>In this approach, we trained the model on Mel-spectrograms. We trained two types of models: Model A and Model B. Model A was trained to predict whether a bird is present in an audio clip or not, i.e., it was a binary classification model. To train it, we used the external Freefield1010 <ref type="bibr" target="#b1">[2]</ref> dataset along with the competition data. Model B was trained to classify the bird species. Official competition data <ref type="bibr" target="#b2">[3]</ref> was used for this model, and we tried not to feed it any nocall case, making use of the weak labels generated by Model A when run on the competition dataset (Figure <ref type="figure" target="#fig_0">1</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Data Preparation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Model Building</head><p>We used transfer learning from state-of-the-art ImageNet models to sound classification.</p><p>For Model type A we took 3 pretrained models: EfficientNet-B0 <ref type="bibr" target="#b3">[4]</ref>, ResNet50 <ref type="bibr" target="#b4">[5]</ref> and DenseNet <ref type="bibr" target="#b5">[6]</ref>. We noticed that SpecAugment <ref type="bibr" target="#b6">[7]</ref> was not giving good results here, but SpecChannelShuffle increased model performance by 0.07. We obtained the highest F1 score, 0.91, with EfficientNet-B0, and by blending the three models we reached a 0.93 F1 score on the Freefield1010 <ref type="bibr" target="#b1">[2]</ref> dataset for classifying whether any bird is present.</p><p>For Model type B we experimented with many pretrained models, including EfficientNet B0, B1, B2, B3, B4, ResNet 50, NFNet <ref type="bibr" target="#b7">[8]</ref> and ResNet WSL <ref type="bibr" target="#b8">[9]</ref>. We report these results in the results section (Table <ref type="table">1</ref>). Here SpecAugment worked very well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Augmentation</head><p>We applied data augmentation during the training stage.</p><p>• SpecAugment: SpecAugment is a popular augmentation technique applied to spectrograms.</p><p>The spectrogram is transformed by warping it in the time direction, masking blocks of consecutive frequency channels, and masking blocks of time steps. We noticed that SpecAugment increased model performance without requiring any further model or training-parameter tweaks.</p><p>-TimeMasking: In time masking, t consecutive time steps [𝑡 0 , 𝑡 0 + t) are masked, where t is chosen from a uniform distribution from 0 to the time mask parameter T, and 𝑡 0 is chosen from [0, 𝜏 − t), where 𝜏 is the number of time steps.</p><p>-FrequencyMasking: In frequency masking, frequency channels [𝑓 0 , 𝑓 0 + f) are masked, where f is chosen from a uniform distribution from 0 to the frequency mask parameter F, and 𝑓 0 is chosen from [0, 𝜐 − f), where 𝜐 is the number of frequency channels.</p><p>• SpecChannelShuffle: Shuffle the channels of a multichannel spectrogram (channels last). This can help combat positional bias. • MixUp <ref type="bibr" target="#b9">[10]</ref>: We applied mixup according to primary labels; that is, we combined the mel-spectrograms according to a parameter alpha drawn from a beta distribution and took the weighted average of the target labels with the same alpha. Mixup helps reduce memorization of corrupt labels and acts as a good regularizer during training.</p><formula xml:id="formula_0">𝐼𝑚𝑎𝑔𝑒 𝑖 = 𝛼 * 𝐼𝑚𝑎𝑔𝑒 𝑖 + (1 − 𝛼) * 𝐼𝑚𝑎𝑔𝑒 𝑗 𝑇 𝑎𝑟𝑔𝑒𝑡 𝑖 = 𝛼 * 𝑇 𝑎𝑟𝑔𝑒𝑡 𝑖 + (1 − 𝛼) * 𝑇 𝑎𝑟𝑔𝑒𝑡 𝑗</formula><p>Here Image represents the raw input image array and Target represents the label (one-hot encoding) of the corresponding image.</p></div>
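The mixup and time-masking operations described above can be sketched in a few lines of NumPy. The Beta parameter and mask size below are illustrative assumptions, not values stated in the paper.

```python
import numpy as np

def mixup(image_i, image_j, target_i, target_j, beta_param=0.4, rng=None):
    """Blend two mel-spectrogram images and their one-hot targets with a
    shared alpha drawn from a Beta distribution (beta_param is assumed)."""
    rng = rng or np.random.default_rng()
    alpha = rng.beta(beta_param, beta_param)
    return (alpha * image_i + (1 - alpha) * image_j,
            alpha * target_i + (1 - alpha) * target_j)

def time_mask(spec, T=30, rng=None):
    """Zero out t consecutive time steps, with t ~ Uniform(0, T)
    and the start t0 chosen uniformly from the valid range."""
    rng = rng or np.random.default_rng()
    tau = spec.shape[1]
    t = int(rng.integers(0, T + 1))
    t0 = int(rng.integers(0, max(1, tau - t)))
    out = spec.copy()
    out[:, t0:t0 + t] = 0.0
    return out
```

Frequency masking is identical with the roles of the two axes swapped; because mixup averages the one-hot targets, the mixed target still sums to 1.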
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Training Procedure</head><p>The training procedure used for the two models is as follows:</p><p>Model A: The model was fed both Freefield1010 and competition data, with the above augmentations applied. Smaller models were trained for 15 epochs and larger models for 8 epochs. We linearly increased the learning rate for the first few epochs as warmup; after reaching its peak of 0.002, it was linearly reduced. The Adam <ref type="bibr" target="#b10">[11]</ref> optimizer gave the best results for this model.</p><p>Model B: The model was fed competition data only, with augmentations similar to those of Model A. Smaller models were trained for 40 epochs and larger models for 25 epochs. A learning-rate schedule similar to Model A's was used, again with the Adam optimizer. While training, we froze all but the last few layers for the initial epochs to help the model converge faster; then all layers were unfrozen and trained for the remaining epochs.</p></div>
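The warmup-then-linear-decay schedule used for both models can be sketched as follows; only the 0.002 peak comes from the text, while the step counts in the usage below are hypothetical.

```python
def warmup_linear_decay(step, total_steps, warmup_steps, peak_lr=0.002):
    """Linear warmup to peak_lr over warmup_steps, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

For example, with 10 warmup steps out of 100, the rate reaches 0.002 at step 9 and decays to half the peak at step 55.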
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Model on Waveform Domain</head><p>In this approach, we trained the model on raw audio samples in the waveform domain. Here we also train two types of models, Model A and Model B (Figure <ref type="figure" target="#fig_0">1</ref>). Model A was trained to predict whether a bird is present in an audio clip or not, i.e., it was a binary classification model. Model B was trained to classify the bird species.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Data Preparation</head><p>We resampled the raw wave to a 16000 Hz sampling rate. Let 𝑚𝑎𝑥 𝑙 be the maximum length of the audio. If the length was less than 𝑚𝑎𝑥 𝑙 , we padded it with zeros at one end, whereas if the length was greater than 𝑚𝑎𝑥 𝑙 , we cropped the audio from both sides.</p></div>
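The pad-or-crop step above can be sketched as a small NumPy helper (a minimal sketch, assuming a 1-D float array after resampling):

```python
import numpy as np

def fix_length(wave, max_l):
    """Zero-pad at one end if shorter than max_l;
    crop equally from both sides if longer."""
    n = len(wave)
    if n < max_l:
        return np.concatenate([wave, np.zeros(max_l - n, dtype=wave.dtype)])
    start = (n - max_l) // 2
    return wave[start:start + max_l]
```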
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Model Building</head><p>This model is highly motivated by ReSE-2-Multi <ref type="bibr" target="#b11">[12]</ref>. With frame-level raw waveform input, the bottom-layer filters must learn all conceivable phase variations of (pseudo-)periodic waveforms that are likely to be present in audio signals. This has hampered the use of raw waveforms as input compared to spectrogram-based representations, in which the phase fluctuation within a frame (i.e., the time shift of periodic waveforms) is removed by taking merely the magnitude. So we added an attention layer between two fully connected (FC) layers (Figure <ref type="figure" target="#fig_1">2</ref>). It is a simple Convolutional Long Short-Term Memory Deep Neural Network (CLDNN) model <ref type="bibr" target="#b12">[13]</ref> with residual connections, which helps capture high-level features of the audio data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Augmentation</head><p>• AddImpulseResponse: Convolve the audio with a random impulse response.</p><p>• TimeMask: Make a randomly chosen part of the audio silent.</p><p>• AddGaussianSNR: Add Gaussian noise to the samples with a random signal-to-noise ratio (SNR) <ref type="bibr" target="#b13">[14]</ref>.</p><p>• AddGaussianNoise: Add Gaussian noise to the samples.</p><p>• We add pink noise at variable volumes, as well as random soundscapes.</p><p>• We also used a Butterworth filter with stochastic cutoffs (randomly lowpass, highpass, bandpass, bandstop).</p></div>
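As an example, the AddGaussianSNR augmentation from the list above can be sketched in NumPy; the SNR range is a hypothetical choice, since the paper does not give exact values.

```python
import numpy as np

def add_gaussian_snr(wave, min_snr_db=5.0, max_snr_db=30.0, rng=None):
    """Add Gaussian noise at a random signal-to-noise ratio (in dB).
    Noise power is set relative to the measured signal power."""
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(min_snr_db, max_snr_db)
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise
```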
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Training Procedure</head><p>Model A used both Freefield1010 data and competition data for training, whereas Model B used competition data only. The augmentations stated above were applied to the raw audio samples. The rest of the training procedure is very similar to that of the Wave-gram model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Multi-Domain Meta Training</head><p>After training on the whole dataset in the spectrogram domain and the waveform domain, we tested our hypothesis of combining the results from both domains so that each model gains the other model's domain knowledge. For training, we froze all but the last 5 layers of the Wave-gram training model 𝑀 𝑔 , and all but the last 3 layers of the waveform-domain training model 𝑀 𝑓 . We calculated the loss as below, which back-propagates through both models.</p><p>𝑂 𝑔𝑖 = 𝑀 𝑔 (𝑆𝑝𝑒𝑐(𝑋 𝑖 ))</p><formula xml:id="formula_1">𝑂 𝑓 𝑖 = 𝑀 𝑓 (𝑋 𝑖 ) 𝐿𝑜𝑠𝑠 𝑖 = 𝐶𝑟𝑖𝑡𝑒𝑟𝑖𝑜𝑛(𝑂 𝑔𝑖 , 𝑇 𝑖 ) + 𝐶𝑟𝑖𝑡𝑒𝑟𝑖𝑜𝑛(𝑂 𝑓 𝑖 , 𝑇 𝑖 )</formula><p>We got a boost of 0.05 in Cross Validation Score with this technique.</p></div>
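One training step under the combined loss above can be sketched in PyTorch. This is a minimal sketch: `spec_fn` stands in for the waveform-to-spectrogram transform, the layer freezing is assumed to have been done beforehand, and the model shapes in the usage are placeholders.

```python
import torch
import torch.nn as nn

def meta_training_step(model_g, model_f, spec_fn, x, target,
                       criterion, optimizer):
    """Summed loss back-propagates through both the spectrogram model M_g
    and the waveform model M_f, as in the equations above."""
    optimizer.zero_grad()
    out_g = model_g(spec_fn(x))   # O_gi = M_g(Spec(X_i))
    out_f = model_f(x)            # O_fi = M_f(X_i)
    loss = criterion(out_g, target) + criterion(out_f, target)
    loss.backward()               # gradients flow into both models
    optimizer.step()
    return loss.item()
```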
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">ATDemucs</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Motivation</head><p>In the test set and train soundscapes, an audio file contains different types of birds. We thought of separating them and then training the classification models, introducing the music source separation concept into the multi-class classification task. The model is highly motivated by Demucs <ref type="bibr" target="#b14">[15]</ref>. We provide the code in our GitHub repository.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">Data Preparation</head><p>We discovered that an audio sample in the Train Soundscapes data typically contained a maximum of 5 birds, so we took a hyper-parameter 𝑆𝑒𝑝 𝑁 𝑜 and mixed 𝑆𝑒𝑝 𝑁 𝑜 short audios of birds. In another experiment, we mixed in the nocall data from Freefield1010 and considered nocall as another source that needs to be separated. We followed the same data-preparation steps as for the waveform-domain dataset. We took 𝑆𝑒𝑝 𝑁 𝑜 different short audio recordings 𝐴 𝑖 and mixed them according to ∑︀ 𝑆𝑒𝑝 𝑁 𝑜 𝑖=1 𝐴 𝑖 . For the second-stage training of this model, we prepared the train soundscapes data by dividing it into chunks of length 𝑚𝑎𝑥 𝑙 and trained with the pseudo labels predicted by the first-stage model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.">Model Building</head><p>What is the difference between Demucs and ATDemucs? In Demucs there is a downsample block, then a BiLSTM <ref type="bibr" target="#b15">[16]</ref> layer, then an upsample block. In ATDemucs (Figure <ref type="figure" target="#fig_3">3</ref>) there is attention in place of the LSTM layer and in the upsample block. In our method, we apply cross attention between the downsample output and the upsample output. Downsample Block: The downsample block is made up of a convolution with kernel size K=8, stride S=4, 𝐶 𝑖−1 input channels, 𝐶 𝑖 output channels, and ReLU activation, followed by a 1x1 convolution with GLU <ref type="bibr" target="#b16">[17]</ref> activation. We doubled the number of channels in the 1x1 convolution, since the GLU outputs C/2 channels given C input channels. Horizontal Trans Block: We replace the Bi-LSTM layer with a Self-Attention <ref type="bibr" target="#b17">[18]</ref> layer consisting of 8 heads, dropout 0.2, and hidden size 𝐶 𝐿 . This block outputs 2𝐶 𝐿 channels per time position; we use a 1x1 convolution with ReLU activation to take that number down to 𝐶 𝐿 . Upsample Block: The upsample block is nearly symmetrical to the downsample block. It is made up of a convolution with kernel size 3 and stride 1, input/output channels 𝐶 𝑖 , and ReLU <ref type="bibr" target="#b18">[19]</ref> activation. Instead of the simple concatenation used in Demucs, we introduce a cross-attention layer in which we take the query from the downsample block and the key and value from the upsample block. We then return to 𝐶 𝑖 channels with a 1x1 convolution using GLU activation. Finally, we employ a transposed convolution with kernel size K = 8, stride S = 4, 𝐶 𝑖−1 outputs, and ReLU activation. For the final layer, we output 4𝐶 0 channels without an activation function.</p></div>
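The cross-attention skip connection described above (query from the downsample path, key and value from the upsample path) can be sketched with PyTorch's built-in multi-head attention. The 8 heads and dropout 0.2 follow the text; the module name, projection details, and tensor layout are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionSkip(nn.Module):
    """Sketch of ATDemucs' replacement for Demucs' skip concatenation:
    the downsample-block output provides the query, the upsample-block
    output provides the key and value."""
    def __init__(self, channels, heads=8, dropout=0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, dropout=dropout,
                                          batch_first=True)

    def forward(self, down_feat, up_feat):
        # (batch, channels, time) -> (batch, time, channels) for attention
        q = down_feat.transpose(1, 2)
        kv = up_feat.transpose(1, 2)
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2)
```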
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.">Augmentation</head><p>• Shift: Randomly shift audio in time by up to 'shift' samples.</p><p>• FlipChannels: Flip left-right channels.</p><p>• FlipSign: Random sign flip.</p><p>• Remix: Within a batch, shuffle the sources. Each batch is divided into groups of size group size, and shuffling is done separately inside each group.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.5.">Training Procedure</head><p>We trained this model in two stages.</p><p>First Stage: First we trained the model on mixed short audio. Here the training input is a combination of 5 different birds' short audios, and we trained our model to differentiate between these different recordings and separate them. We trained the model for 150 epochs with a learning rate of 0.003. We used cosine annealing as the LR scheduler, which starts with a large learning rate that is relatively rapidly decreased to a minimum value before being rapidly increased again. The AdamW <ref type="bibr" target="#b19">[20]</ref> optimizer gave better results than the others.</p><p>Second Stage: In the train soundscapes, we were given primary labels for the audio recordings at each 5-second timestamp. So, after the first stage, we ran inference on the train soundscapes and used pseudo-labeling to fine-tune the model. We trained the model for 5 epochs with a low learning rate, again with AdamW as the optimizer. During training, we froze some of the initial layers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.6.">Classification After Separation</head><p>Once our model has been trained to separate the different bird sounds from the main audio recording, we run a classification model on the separated audios to identify the bird species. For this, we used a ResNet50 model with pre-trained weights, trained for approximately 20 epochs with the Adam optimizer. We got a Cross Validation Score of 0.62.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Pipeline Of Spectrograms And WaveForm Domain Model Training.</figDesc><graphic coords="5,175.95,84.19,240.90,157.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: ReSE-2-Multi With Attention for WaveForm Domain Model Training</figDesc><graphic coords="6,131.40,84.17,330.00,358.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄 𝐷 , 𝐾 𝑈 , 𝑉 𝑈 ) = 𝑆𝑜𝑓 𝑡𝑚𝑎𝑥(𝑄 𝐷 𝐾 𝑇 𝑈 / √︀ 𝑑 𝑘 )𝑉 𝑈 Where 𝑄 𝐷 is the corresponding downsample layer's value (the query), and 𝐾 𝑈 and 𝑉 𝑈 are the upsample layer's values (the key and value).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: ATDemucs (Attention + Demucs). It consists of three types of blocks: Downblock, HorizontalTransBlock and UpTransBlock. As the name suggests, we use attention in HorizontalTransBlock and UpTransBlock.</figDesc><graphic coords="8,135.52,238.20,321.75,159.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>• Resample the dataset to a 22050 Hz sampling rate. • Let 𝑡𝑖𝑚𝑒 𝑑 be the accepted minimum duration of an audio sample. We choose a random 𝑡𝑖𝑚𝑒 𝑑 -length chunk from the audio sample. • Let 𝑚𝑖𝑛 𝑠 be the accepted minimum duration of the subimage. If the duration is less than 𝑚𝑖𝑛 𝑠 , we pad it back to length 𝑚𝑖𝑛 𝑠 . • Compute three Mel-Spectrograms 𝑀 𝑖 (𝑥) with window sizes 𝑊 𝑖 ∈ (128, 512, 2048). • Concatenate the three 𝑀 𝑖 (𝑥) into one 3-channel RGB multiscale image I.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Github Repo https://github.com/Luckygyana/Bird-Species-Audio-Identification-Ensembling-and-1D-2D-Signals</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Post Processing</head><p>We used two post-processing techniques. Scaling Method: We noticed that different models have different best thresholds, so we decided to bring them onto a common scale before adding the logits. Let 𝑀 𝑖𝑛 𝑇 ℎ be the minimum of the best thresholds of all the models to be ensembled. We scaled the logits such that each model's predictions below its best threshold are mapped into the range 0 to 𝑀 𝑖𝑛 𝑇 ℎ , whereas its predictions above its best threshold are mapped into the range 𝑀 𝑖𝑛 𝑇 ℎ to 1. We then average all the logits thus obtained and predict all birds with probability above 𝑀 𝑖𝑛 𝑇 ℎ .</p><p>Voting Ensemble: Let 𝑀 𝑖𝑛 𝐶 be the minimum number of the N models in which a bird must be predicted. We predict all birds for which ⋂︀ 𝑁 𝑖=1 𝑀 𝑜𝑑𝑒𝑙 𝑖 &gt; 𝑀 𝑖𝑛 𝐶 .</p><p>We submitted three types of inference models:</p><p>• Spectrograms Model + Waveform Model: We ensemble all the models with the above scaling method, which gave us a Cross Validation Score of 0.732 and a LeaderBoard Score of 0.6179. • Multi-Domain Meta Trained Model: We optimize the best threshold on CV and get a Cross Validation Score of 0.705 and a LeaderBoard Score of 0.6167 with a 0.15 threshold. • ATDemucs: We get a Cross Validation Score of 0.623 and a LeaderBoard Score of 0.59. There are many avenues for further increasing the model accuracy.</p></div>
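The scaling method above amounts to a piecewise-linear remapping of each model's probabilities so that every model's own best threshold lands on the shared minimum threshold. A minimal sketch, with function names and the example thresholds being assumptions:

```python
import numpy as np

def scale_probs(probs, best_th, min_th):
    """Map [0, best_th) onto [0, min_th) and [best_th, 1] onto [min_th, 1],
    so all models agree on a single decision threshold min_th."""
    probs = np.asarray(probs, dtype=float)
    out = np.empty_like(probs)
    below = probs < best_th
    out[below] = probs[below] / best_th * min_th
    out[~below] = min_th + (probs[~below] - best_th) / (1 - best_th) * (1 - min_th)
    return out

def scaled_ensemble(model_probs, best_ths):
    """Scale each model's probabilities, average, then threshold at min_th."""
    min_th = min(best_ths)
    scaled = [scale_probs(p, t, min_th) for p, t in zip(model_probs, best_ths)]
    return np.mean(scaled, axis=0) > min_th
```

For instance, with per-bird probabilities [0.9, 0.1] from a model whose best threshold is 0.5 and [0.6, 0.2] from one whose best threshold is 0.3, only the first bird ends up above the shared 0.3 threshold.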
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Results</head><p>Table <ref type="table">1</ref> shows the Cross Validation Score of the spectrogram-based models (Model type B). After scaling all the models, we ensembled them with a threshold of 0.20 and got 0.716 accuracy. The ensemble achieved a Cross Validation Score of 0.732 and a LeaderBoard Score of 0.6179. We are still working out a good ensembling method, other than averaging, for combining all 3 methods along with ATDemucs. We will update all our key findings in the source code 1 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10.">Conclusion and future work</head><p>We combined several approaches, specifically a spectrogram architecture, a raw-waveform architecture, and multi-domain meta training. In both the spectrogram model and the raw-waveform model, we used two downstream modules: one for predicting whether a bird is present and the other for multi-label classification of the birds. We then combined both approaches using a loss that back-propagates through both models. We also experimented with the Demucs model and extended its architecture by adding an attention layer in the upsampling block. Ensembling methods, including voting and scaling, achieved better results than any individual model. The spectrogram model, along with scaling and the downstream modules, gave us the best result on the Private Leaderboard and helped us reach 67th position in the competition.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of lifeclef 2021: a system-oriented evaluation of automated species identification and species distribution prediction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Picek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lorieul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Deneu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Servajean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ruiz De Castañeda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bolon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Planqué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-P</forename><surname>Vellinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dorso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Klinck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Denton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Eggel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bonnet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twelfth International Conference of the CLEF Association</title>
				<meeting>the Twelfth International Conference of the CLEF Association<address><addrLine>CLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021. 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">freefield1010 -an open dataset for research on audio field recording archives</title>
		<author>
			<persName><forename type="first">D</forename><surname>Stowell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Plumbley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Audio Engineering Society 53rd Conference on Semantic Audio (AES53)</title>
				<meeting>the Audio Engineering Society 53rd Conference on Semantic Audio (AES53)</meeting>
		<imprint>
			<publisher>Audio Engineering</publisher>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Overview of birdclef 2021: Bird call identification in soundscape recordings</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Denton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Klinck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-P</forename><surname>Vellinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Planqué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2021 -Conference and Labs of the Evaluation Forum</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Efficientnet: Rethinking model scaling for convolutional neural networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1905.11946</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1512.03385</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1608.06993</idno>
		<title level="m">Densely connected convolutional networks</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">SpecAugment: A simple data augmentation method for automatic speech recognition</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Chiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">D</forename><surname>Cubuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno type="DOI">10.21437/interspeech.2019-2680</idno>
		<ptr target="http://dx.doi.org/10.21437/Interspeech.2019-2680" />
	</analytic>
	<monogr>
		<title level="j">Interspeech</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Brock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>De</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">L</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2102.06171</idno>
		<title level="m">High-performance large-scale image recognition without normalization</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Exploring the limits of weakly supervised pretraining</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">K</forename><surname>Mahajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">B</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ramanathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Paluri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bharambe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECCV</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cisse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">N</forename><surname>Dauphin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lopez-Paz</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1710.09412</idno>
		<title level="m">mixup: Beyond empirical risk minimization</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6980</idno>
		<title level="m">Adam: A method for stochastic optimization</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Nam</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1712.00866</idno>
		<title level="m">Raw waveform-based audio classification using sample-level CNN architectures</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Convolutional, long short-term memory, fully connected deep neural networks</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">N</forename><surname>Sainath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Senior</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Sak</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICASSP.2015.7178838</idno>
	</analytic>
	<monogr>
		<title level="m">2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="4580" to="4584" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Signal-to-noise ratio (SNR) as a measure of reproducibility: Design, estimation, and application</title>
		<author>
			<persName><forename type="first">N</forename><surname>Elkum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shoukri</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10742-008-0030-2</idno>
	</analytic>
	<monogr>
		<title level="j">Health Services and Outcomes Research Methodology</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="119" to="133" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Défossez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Usunier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bach</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1909.01174</idno>
		<title level="m">Demucs: Deep extractor for music sources with extra unlabeled data remixed</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Bidirectional LSTM-CRF models for sequence tagging</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1508.01991</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Language modeling with gated convolutional networks</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">N</forename><surname>Dauphin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Grangier</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1612.08083</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1706.03762</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Agarap</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1803.08375</idno>
		<title level="m">Deep learning using rectified linear units (ReLU)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Loshchilov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1711.05101</idno>
		<title level="m">Decoupled weight decay regularization</title>
				<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
